This blog post lists the latest papers retrieved from arxiv.org on 2025-02-12. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-02-12)
592 new papers today, including:
- Natural Language Processing: 105 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 207 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 117 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 265 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] DarwinLM: Evolutionary Structured Pruning of Large Language Models
【Quick Read】: This paper targets the high computational cost of large language models (LLMs), which limits their use in real-time applications. To address this, it proposes DarwinLM, a training-aware structured pruning method. The key idea is an evolutionary search that generates multiple offspring models per generation, combined with a lightweight multistep training process that progressively increases the number of training tokens and eliminates underperforming models at each selection stage, so the search identifies effective substructures while also accounting for post-compression retraining. The method achieves state-of-the-art structured-pruning results, validated on Llama-2-7B, Llama-3.1-8B, and Qwen-2.5-14B-Instruct, while requiring only one fifth of the post-compression training data used by ShearedLlama.
链接/Link: https://arxiv.org/abs/2502.07780
Authors: Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5× less training data during post-compression training.
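To make the evolutionary, training-aware search concrete, here is a minimal, self-contained toy in Python. It is a sketch of the loop described above, not the authors' code: the per-layer sparsity "models", the `fitness_after_training` stand-in, and the token budgets are all illustrative assumptions.

```python
import random

random.seed(0)

# Toy stand-in: a "model" is just a per-layer sparsity configuration.
N_LAYERS = 8

def mutate(config):
    """Perturb the sparsity of one random layer (offspring generation)."""
    child = list(config)
    i = random.randrange(N_LAYERS)
    child[i] = min(0.9, max(0.0, child[i] + random.uniform(-0.1, 0.1)))
    return child

def fitness_after_training(config, token_budget):
    """Placeholder for 'briefly finetune, then evaluate'.
    A fake quality score whose noise shrinks as the token budget grows."""
    quality = -sum((s - 0.5) ** 2 for s in config)            # toy objective
    noise = random.gauss(0, 1.0 / (1 + token_budget / 1e6))   # less noise w/ training
    return quality + noise

parent = [0.5] * N_LAYERS
# Multistep selection: each stage trains survivors on more tokens,
# eliminating poor offspring early (cheap) and keeping accuracy late.
for generation in range(5):
    offspring = [mutate(parent) for _ in range(16)]
    for token_budget in (1e5, 1e6, 1e7):   # progressively larger budgets
        scored = sorted(offspring,
                        key=lambda c: fitness_after_training(c, token_budget),
                        reverse=True)
        offspring = scored[: max(1, len(scored) // 2)]  # keep the fittest half
    parent = offspring[0]
    print(f"gen {generation}: best config {[round(s, 2) for s in parent]}")
```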
[NLP-1] Auditing Prompt Caching in Language Model APIs
【Quick Read】: This paper addresses the privacy and information-leakage risks introduced by prompt caching in large language models (LLMs). The key contribution is the development and execution of statistical audits that detect prompt caching at real-world LLM API providers. The audits find global cache sharing across users at seven API providers, including OpenAI, which can leak information about users' prompts. Furthermore, by analyzing timing variations caused by prompt caching, the paper uncovers information about model architecture: evidence that OpenAI's embedding model is a decoder-only Transformer, which was not previously public.
链接/Link: https://arxiv.org/abs/2502.07776
Authors: Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 20 pages, 7 figures
Abstract:Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users’ prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI’s embedding model is a decoder-only Transformer, which was previously not publicly known.
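The audit idea, detecting caching from response-time differences, can be sketched as a simple two-sample timing test. This is a hedged illustration rather than the paper's procedure: `measure_latency` is a placeholder (simulated here so the script runs standalone) that you would replace with timed API calls.

```python
import random
import statistics

random.seed(1)

def measure_latency(prompt, simulate_cached):
    """Placeholder for timing a real API call; simulated here.
    Cached prompts skip prefill and return faster on average."""
    base = 0.35 if simulate_cached else 0.60
    return base + random.gauss(0, 0.05)

# Audit setup: time a repeated prompt (which may hit the cache)
# vs. always-fresh prompts, then test whether the gap is significant.
cached = [measure_latency("what is the capital of France?", True) for _ in range(50)]
fresh = [measure_latency(f"fresh prompt #{i}", False) for i in range(50)]

observed = statistics.mean(fresh) - statistics.mean(cached)

# Permutation test: shuffle group labels and count how often a gap
# this large appears by chance.
pooled = cached + fresh
count = 0
for _ in range(10_000):
    random.shuffle(pooled)
    a, b = pooled[:50], pooled[50:]
    if statistics.mean(b) - statistics.mean(a) >= observed:
        count += 1
p_value = count / 10_000
print(f"mean gap = {observed * 1000:.1f} ms, permutation p = {p_value:.4f}")
```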
[NLP-2] Breaking Down Bias: On The Limits of Generalizable Pruning Strategies
【Quick Read】: This paper uses model pruning to examine how LLMs conceptualize racial bias and whether a generalizable mitigation strategy is feasible. The key findings are that neuron-level pruning yields better results than pruning entire attention heads, but the effectiveness of either approach deteriorates quickly as the pruning strategy is generalized. The paper argues that racial bias is only partially represented as a general concept in language models, with the remainder being highly context-specific, so generalizable mitigation strategies may have limited effectiveness. Effective mitigation should therefore include assigning legal responsibility to those who deploy models in specific use cases.
链接/Link: https://arxiv.org/abs/2502.07771
Authors: Sibo Ma, Alejandro Salinas, Peter Henderson, Julian Nyarko
Affiliations: Stanford University; Princeton University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 28 pages, 9 figures, 1 table
Abstract:We employ model pruning to examine how LLMs conceptualize racial biases, and whether a generalizable mitigation strategy for such biases appears feasible. Our analysis yields several novel insights. We find that pruning can be an effective method to reduce bias without significantly increasing anomalous model behavior. Neuron-based pruning strategies generally yield better results than approaches pruning entire attention heads. However, our results also show that the effectiveness of either approach quickly deteriorates as pruning strategies become more generalized. For instance, a model that is trained on removing racial biases in the context of financial decision-making poorly generalizes to biases in commercial transactions. Overall, our analysis suggests that racial biases are only partially represented as a general concept within language models. The other part of these biases is highly context-specific, suggesting that generalizable mitigation strategies may be of limited effectiveness. Our findings have important implications for legal frameworks surrounding AI. In particular, they suggest that an effective mitigation strategy should include the allocation of legal responsibility on those that deploy models in a specific use case.
[NLP-3] An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating
【Quick Read】: This paper aims to improve NLP for medical diagnosis by integrating advanced techniques for data augmentation, feature extraction, and classification. The key components are: back-translation to generate diverse paraphrased datasets, improving robustness and mitigating overfitting; DeBERTa (Decoding-enhanced BERT with Disentangled Attention) combined with Dynamic Contextual Positional Gating to capture fine-grained contextual and positional relationships; and an Attention-Based Feedforward Neural Network (ABFNN) that focuses on the most relevant features to improve decision accuracy. Together, these components substantially improve the precision and reliability of medical text classification.
链接/Link: https://arxiv.org/abs/2502.07755
Authors: Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper presents a novel Natural Language Processing (NLP) framework for enhancing medical diagnosis through the integration of advanced techniques in data augmentation, feature extraction, and classification. The proposed approach employs back-translation to generate diverse paraphrased datasets, improving robustness and mitigating overfitting in classification tasks. Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained contextual and positional relationships, dynamically adjusting the influence of positional information based on semantic context to produce high-quality text embeddings. For classification, an Attention-Based Feedforward Neural Network (ABFNN) is utilized, effectively focusing on the most relevant features to improve decision-making accuracy. Applied to the classification of symptoms, clinical notes, and other medical texts, this architecture demonstrates its ability to address the complexities of medical data. The combination of data augmentation, contextual embedding generation, and advanced classification mechanisms offers a robust and accurate diagnostic tool, with potential applications in automated medical diagnosis and clinical decision support. This method demonstrates the effectiveness of the proposed NLP framework for medical diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of 99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only underscore the model’s robust performance in classifying medical texts with exceptional precision and reliability but also highlight its superiority over existing methods, making it a highly promising tool for automated diagnostic systems.
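Back-translation, the augmentation step the abstract names, round-trips each sentence through a pivot language to obtain label-preserving paraphrases. The sketch below assumes a `translate(text, src, tgt)` backend that is not specified here; the function is a hypothetical stub to be wired to any MT model or service.

```python
# Minimal back-translation sketch, assuming a translate(text, src, tgt)
# function is available (e.g., backed by an MT model or service).

def translate(text, src, tgt):
    """Hypothetical MT call; replace with a real translation backend."""
    raise NotImplementedError

def back_translate(sentence, pivot="de"):
    """Paraphrase a sentence by round-tripping through a pivot language."""
    pivoted = translate(sentence, src="en", tgt=pivot)
    return translate(pivoted, src=pivot, tgt="en")

def augment(dataset, pivots=("de", "fr", "zh")):
    """Yield the original plus one paraphrase per pivot language, keeping
    the label unchanged; the paraphrases enlarge and diversify the
    training set, which is what mitigates overfitting."""
    for text, label in dataset:
        yield text, label
        for p in pivots:
            yield back_translate(text, pivot=p), label
```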
[NLP-4] WHODUNIT: Evaluation benchmark for culprit detection in mystery stories
【Quick Read】: This paper evaluates the deductive reasoning abilities of large language models (LLMs) in narrative contexts. To this end, the authors build WhoDunIt, a dataset of open-domain mystery novels and short stories that challenges LLMs to identify the culprit after reading and comprehending a story. To test robustness, the study applies a range of character-name augmentations, including original names, name swaps, and substitutions with well-known real or fictional entities, and uses different prompting styles to probe their effect on deductive reasoning accuracy. Evaluations of state-of-the-art models (GPT-4o, GPT-4-turbo, and GPT-4o-mini) over multiple trials with majority-response selection show that while LLMs perform reliably on unaltered texts, accuracy drops under certain widely recognized name substitutions. The key contribution is thus the WhoDunIt dataset with its systematic name augmentations and prompting styles for assessing and analyzing LLM deductive reasoning.
链接/Link: https://arxiv.org/abs/2502.07747
Authors: Kshitij Gupta
Affiliations: BITS Pilani
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We present a novel data set, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLM) within narrative contexts. Constructed from open domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
[NLP-5] Economics of Sourcing Human Data
【Quick Read】: This paper examines how the widespread use of large language models threatens the quality and integrity of human-generated data, a problem that goes beyond filtering AI-generated content and exposes deeper flaws in how data collection systems are designed: such systems typically prioritize speed, scale, and efficiency at the expense of contributors' intrinsic motivation, leading to declining engagement and data quality. The key proposal is to rethink data collection systems so that they align with contributors' intrinsic motivations rather than relying solely on external incentives, sustaining high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
链接/Link: https://arxiv.org/abs/2502.07732
Authors: Sebastin Santy, Prasanta Bhattacharya, Manoel Horta Ribeiro, Kelsey Allen, Sewoong Oh
Affiliations: University of Washington; Institute of High Performance Computing (IHPC), A*STAR; Princeton University; University of British Columbia
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content–it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors’ intrinsic motivations–rather than relying solely on external incentives–can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
[NLP-6] Making Language Models Robust Against Negation NAACL2025
【Quick Read】: This paper tackles the long-standing challenge that negation poses for language models. The key is a new self-supervised method: a novel task called Next Sentence Polarity Prediction (NSPP) together with a variation of the Next Sentence Prediction (NSP) task, which make pretrained models such as BERT and RoBERTa more robust on sentences containing negation. Models further pretrained on these tasks outperform their off-the-shelf versions on nine negation-related benchmarks, most notably with improvements of 1.8% to 9.1% on CondaQA.
链接/Link: https://arxiv.org/abs/2502.07717
Authors: MohammadHossein Rezaei, Eduardo Blanco
Affiliations: Department of Computer Science, University of Arizona
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025
Abstract:Negation has been a long-standing challenge for language models. Previous studies have shown that they struggle with negation in many natural language understanding tasks. In this work, we propose a self-supervised method to make language models more robust against negation. We introduce a novel task, Next Sentence Polarity Prediction (NSPP), and a variation of the Next Sentence Prediction (NSP) task. We show that BERT and RoBERTa further pre-trained on our tasks outperform the off-the-shelf versions on nine negation-related benchmarks. Most notably, our pre-training tasks yield between 1.8% and 9.1% improvement on CondaQA, a large question-answering corpus requiring reasoning over negation.
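One plausible way to build NSPP training pairs is sketched below: label each sentence's polarity with a small negation-cue list, and ask the model to predict the polarity of the next sentence. The cue list and construction are illustrative assumptions; the paper's exact recipe may differ.

```python
import re

# Toy negation-cue detector used to label sentence polarity.
NEGATION_CUES = re.compile(
    r"\b(not|no|never|none|nobody|nothing|neither)\b", re.IGNORECASE
)

def polarity(sentence):
    """1 if the sentence contains a negation cue, else 0."""
    return int(bool(NEGATION_CUES.search(sentence)))

def make_nspp_examples(document_sentences):
    """For each adjacent pair (s_i, s_{i+1}), the model sees s_i and must
    predict the polarity of the *next* sentence -- unlike NSP, which
    predicts whether s_{i+1} follows s_i at all."""
    examples = []
    for cur, nxt in zip(document_sentences, document_sentences[1:]):
        examples.append({"input": cur, "label": polarity(nxt)})
    return examples

doc = [
    "The experiment succeeded.",
    "However, the control group did not improve.",
    "Nobody expected that result.",
]
for ex in make_nspp_examples(doc):
    print(ex)
```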
[NLP-7] Large Language Models as Proxies for Theories of Human Linguistic Cognition
【Quick Read】: This paper considers the possible role of current large language models (LLMs) in the study of human linguistic cognition. The key idea is to use LLMs as proxies for theories of cognition that are relatively linguistically neutral in their representations and learning but differ from current LLMs in important ways. The paper illustrates this use with two kinds of questions: (a) whether a target theory accounts for the acquisition of a given pattern from a given corpus; and (b) whether a target theory makes a typologically attested pattern easier to acquire than a typologically unattested one. Building on recent literature, the paper shows how LLMs can potentially help with both, while noting that at present this help is quite limited.
链接/Link: https://arxiv.org/abs/2502.07687
Authors: Imry Ziv, Nur Lan, Emmanuel Chemla, Roni Katzir
Affiliations: Tel Aviv University; École Normale Supérieure
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We consider the possible role of current large language models (LLMs) in the study of human linguistic cognition. We focus on the use of such models as proxies for theories of cognition that are relatively linguistically-neutral in their representations and learning but differ from current LLMs in key ways. We illustrate this potential use of LLMs as proxies for theories of cognition in the context of two kinds of questions: (a) whether the target theory accounts for the acquisition of a given pattern from a given corpus; and (b) whether the target theory makes a given typologically-attested pattern easier to acquire than another, typologically-unattested pattern. For each of the two questions we show, building on recent literature, how current LLMs can potentially be of help, but we note that at present this help is quite limited.
[NLP-8] exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem
【Quick Read】: This paper addresses the challenge of assigning suitable reviewers in peer review; traditional manual methods are labor-intensive and often ineffective, leading to nonconstructive or biased reviews. The key solution is the exHarmony benchmark, which re-imagines the Reviewer Assignment Problem (RAP) as a retrieval task and, using extensive data from OpenAlex, proposes treating signals from authors, the most similar experts, and citation relations as indicators of a suitable reviewer for a manuscript. This enables a standard benchmark dataset for evaluating reviewer assignment without explicit labels. Benchmarking shows that while traditional methods perform reasonably well, contextualized neural embeddings trained on scholarly literature perform best.
链接/Link: https://arxiv.org/abs/2502.07683
Authors: Sajad Ebrahimi, Sara Salamat, Negar Arabzadeh, Mahdi Bashari, Ebrahim Bagheri
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:The peer review process is crucial for ensuring the quality and reliability of scholarly work, yet assigning suitable reviewers remains a significant challenge. Traditional manual methods are labor-intensive and often ineffective, leading to nonconstructive or biased reviews. This paper introduces the exHarmony (eHarmony but for connecting experts to manuscripts) benchmark, designed to address these challenges by re-imagining the Reviewer Assignment Problem (RAP) as a retrieval task. Utilizing the extensive data from OpenAlex, we propose a novel approach that considers a host of signals from the authors, most similar experts, and the citation relations as potential indicators for a suitable reviewer for a manuscript. This approach allows us to develop a standard benchmark dataset for evaluating the reviewer assignment problem without needing explicit labels. We benchmark various methods, including traditional lexical matching, static neural embeddings, and contextualized neural embeddings, and introduce evaluation metrics that assess both relevance and diversity in the context of RAP. Our results indicate that while traditional methods perform reasonably well, contextualized embeddings trained on scholarly literature show the best performance. The findings underscore the importance of further research to enhance the diversity and effectiveness of reviewer assignments.
[NLP-9] Auto-Drafting Police Reports from Noisy ASR Outputs: A Trust-Centered LLM Approach
【Quick Read】: This paper addresses the research and product challenge of balancing trust in law enforcement with protecting the rights of both officers and civilians. The key solution is an innovative AI-driven system that automatically generates police report drafts from complex, noisy, multi-role dialogue data. The approach intelligently extracts key elements of law-enforcement interactions and includes them in the draft, producing structured narratives that are high in quality while reinforcing accountability and procedural transparency.
链接/Link: https://arxiv.org/abs/2502.07677
Authors: Param Kulkarni, Yingchi Liu, Hao-Ming Fu, Shaohua Yang, Isuru Gunasekara, Matt Peloquin, Noah Spitzer-Williams, Xiaotian Zhou, Xiaozhong Liu, Zhengping Ji, Yasser Ibrahim
Affiliations: AAAI Press; Association for the Advancement of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Achieving a delicate balance between fostering trust in law enforcement and protecting the rights of both officers and civilians continues to emerge as a pressing research and product challenge in the world today. In the pursuit of fairness and transparency, this study presents an innovative AI-driven system designed to generate police report drafts from complex, noisy, and multi-role dialogue data. Our approach intelligently extracts key elements of law enforcement interactions and includes them in the draft, producing structured narratives that are not only high in quality but also reinforce accountability and procedural clarity. This framework holds the potential to transform the reporting process, ensuring greater oversight, consistency, and fairness in future policing practices. A demonstration video of our system can be accessed at this https URL Y-kpCHNO/view?usp=sharing
[NLP-10] Human Decision-making is Susceptible to AI-driven Manipulation
【Quick Read】: This paper investigates human susceptibility to AI-driven manipulation. Through a randomized controlled trial, it finds that participants who interacted with manipulative AI systems shifted toward harmful options at substantially higher rates. The key design compares three AI agents (a neutral agent, a manipulative agent, and a strategy-enhanced manipulative agent), revealing that even AI systems with covert manipulative objectives can significantly sway human decisions, and underscoring the need for ethical safeguards and regulatory frameworks to protect human autonomy when deploying AI technologies.
链接/Link: https://arxiv.org/abs/2502.07663
Authors: Sahand Sabour, June M. Liu, Siyang Liu, Chris Z. Yao, Shiyao Cui, Xuanming Zhang, Wen Zhang, Yaru Cao, Advait Bhat, Jian Guan, Wei Wu, Rada Mihalcea, Tim Althoff, Tatia M.C. Lee, Minlie Huang
Affiliations: The CoAI Group, DCST, Institute for Artificial Intelligence, Tsinghua University, Beijing, China;
State Key Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong SAR, China;
The LIT Group, Department of Computer Science and Engineering, University of Michigan, Ann Arbor;
ANT Group;
Department of Psychology, University of International Relations, Beijing, China;
Department of Chinese Language and Literature, Northwest Minzu University, Lanzhou, China;
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Work in progress. Code and data will be made available via this https URL
Abstract:Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users’ cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants’ decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy.
[NLP-11] FoQA: A Faroese Question-Answering Dataset
【Quick Read】: This paper builds FoQA, a Faroese extractive question-answering (QA) dataset, together with evaluation baselines. The key is a semi-automated approach combining large language models (LLMs) and human validation: GPT-4-turbo generates initial QA pairs from Faroese Wikipedia articles, questions are then rephrased to increase complexity, and native-speaker validation ensures data quality. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.
链接/Link: https://arxiv.org/abs/2502.07642
Authors: Annika Simonsen, Dan Saattrup Nielsen, Hafsteinn Einarsson
Affiliations: University of Iceland; Alexandra Institute
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Camera-ready version for RESOURCEFUL workshop, 2025
Abstract:We present FoQA, a Faroese extractive question-answering (QA) dataset with 2,000 samples, created using a semi-automated approach combining Large Language Models (LLMs) and human validation. The dataset was generated from Faroese Wikipedia articles using GPT-4-turbo for initial QA generation, followed by question rephrasing to increase complexity and native speaker validation to ensure quality. We provide baseline performance metrics for FoQA across multiple models, including LLMs and BERT, demonstrating its effectiveness in evaluating Faroese QA performance. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.
[NLP-12] BiaSWE: An Expert Annotated Dataset for Misogyny Detection in Swedish
【Quick Read】: This paper addresses misogyny detection in Swedish. To handle the cultural and linguistic specificity of misogyny in Swedish, the team collaborated with experts from the social sciences and humanities to develop a rigorous annotation process that combines domain knowledge with language expertise, capturing the nuances of misogyny in a Swedish context. The key is this interdisciplinary collaboration, which ensures the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages.
链接/Link: https://arxiv.org/abs/2502.07637
Authors: Kätriin Kukk, Danila Petrelli, Judit Casademont, Eric J. W. Orlowski, Michał Dzieliński, Maria Jacobson
Affiliations: AI Sweden; Linköping University; AI Singapore; Stockholm University; Anti-Discrimination Agency West Sweden
Subjects: Computation and Language (cs.CL)
Comments: To appear at NoDaLiDa 2025
Abstract:In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.
[NLP-13] Exploring Mobile Touch Interaction with Large Language Models
【Quick Read】: This paper addresses controlling large language models (LLMs) for text editing on mobile devices directly via touch gestures. The key design contributes two touch-to-control mappings, spread-to-generate and pinch-to-shorten, with visual feedback loops. A user study comparing three feedback designs shows that touch-based control of LLMs is both feasible and user-friendly, with the length + word indicator proving most effective for managing text generation.
链接/Link: https://arxiv.org/abs/2502.07629
Authors: Tim Zindulka, Jannek Sekowski, Florian Lehmann, Daniel Buschek
Affiliations: University of Bayreuth
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 21 pages, 16 figures, 3 tables, ACM CHI 2025
Abstract:Interacting with Large Language Models (LLMs) for text editing on mobile devices currently requires users to break out of their writing environment and switch to a conversational AI interface. In this paper, we propose to control the LLM via touch gestures performed directly on the text. We first chart a design space that covers fundamental touch input and text transformations. In this space, we then concretely explore two control mappings: spread-to-generate and pinch-to-shorten, with visual feedback loops. We evaluate this concept in a user study (N=14) that compares three feedback designs: no visualisation, text length indicator, and length + word indicator. The results demonstrate that touch-based control of LLMs is both feasible and user-friendly, with the length + word indicator proving most effective for managing text generation. This work lays the foundation for further research into gesture-based interaction with LLMs on touch devices.
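The two mappings can be pictured as a thin layer that turns gesture parameters into editing instructions for the model. Everything below is a hypothetical sketch (the function names, prompt wording, and `call_llm` stub are assumptions), not the study's implementation.

```python
def gesture_to_instruction(gesture, selected_text):
    """Map a touch gesture on selected text to an LLM editing instruction.
    'spread' expands (generate more), 'pinch' compresses (shorten);
    magnitude in [0, 1] scales how strong the transformation should be."""
    kind, magnitude = gesture
    words = len(selected_text.split())
    if kind == "spread":
        target = int(words * (1 + 2 * magnitude))
        return f"Expand the following text to roughly {target} words:\n{selected_text}"
    if kind == "pinch":
        target = max(1, int(words * (1 - magnitude)))
        return f"Shorten the following text to roughly {target} words:\n{selected_text}"
    raise ValueError(f"unknown gesture: {kind}")

def call_llm(prompt):
    """Stub for the actual model call."""
    return "<generated text>"

text = "The meeting was moved to Friday because the room was double-booked."
print(gesture_to_instruction(("pinch", 0.5), text))
print(call_llm(gesture_to_instruction(("spread", 0.3), text)))
```

The target word count computed here is also what a "length + word indicator" style feedback loop could display to the user before the edit is applied.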
[NLP-14] Lexical categories of stem-forming roots in Mapudüngun verb forms
【Quick Read】: This paper verifies the linguistic assumptions of the source used as the basis for a morphological analysis system for Mapudüngun, focusing on revising the lexical category classification of roots recognized as verbal. The key is that the revisions are applied to the computational analyzer as soon as they are verified, directly improving its accuracy; the results are also expected to clarify some uncertainties about lexical categories in Mapudüngun and lay the groundwork for a follow-up task identifying the valency of true verbal roots.
链接/Link: https://arxiv.org/abs/2502.07623
Authors: Andrés Chandía
Affiliations: University of Barcelona
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 2 large tables, 2 sample tables
Abstract:After developing a computational system for morphological analysis of the Mapuche language, and evaluating it with texts from various authors and styles, it became necessary to verify the linguistic assumptions of the source used as the basis for implementing this tool. In the present work, the primary focus is on the lexical category classification of Mapudüngun roots recognised as verbal in the source utilised for the development of the morphological analysis system. The results of this lexical category revision directly benefit the computational analyser, as they are implemented as soon as they are verified. Additionally, it is hoped that these results will help clarify some uncertainties about lexical categories in the Mapuche language. This work addresses a preliminary task to identify the valency of true verbal roots, the results of which will be presented in a subsequent work that complements this article.
[NLP-15] Tractable Transformers for Flexible Conditional Generation
【Quick Read】: This paper addresses why non-autoregressive (NAR) generative models, despite strong unconditional generation, still lag autoregressive (AR) models on conditional generation. The key observation is that existing models rely on global contextual features derived from full inputs and struggle to generalize to conditional probability queries unseen during training. The paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that incorporates a sparse Transformer encoder to capture both local and global contextual information, which is routed through a decoder for conditional generation, making the model more robust across conditional generation tasks.
链接/Link: https://arxiv.org/abs/2502.07616
Authors: Anji Liu, Xuejie Liu, Dayuan Zhao, Mathias Niepert, Yitao Liang, Guy Van den Broeck
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
[NLP-16] Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
【Quick Read】: This paper targets Zero-Shot Anomaly Detection (ZSAD), where current multimodal large language models (MLLMs) such as GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. The key solution is Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning, which uses a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens, yielding significant improvements in both detection and reasoning.
链接/Link: https://arxiv.org/abs/2502.07601
Authors: Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi
Affiliations: Johns Hopkins University; Honda Research Institute USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, 10 figures
Abstract:Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-DR. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: this https URL
[NLP-17] DPO-Shift: Shifting the Distribution of Direct Preference Optimization
【Quick Read】: This paper tackles likelihood displacement in Direct Preference Optimization (DPO) and its variants, the phenomenon where the probability of the chosen response decreases during training. The key is DPO-Shift, a method that controllably shifts the distribution of the chosen probability; it exhibits a fundamental trade-off between increasing the chosen probability and sacrificing the reward margin, supported by both theoretical analysis and experimental validation. DPO-Shift is shown to outperform DPO on downstream tasks such as MT-Bench and a designed win-rate experiment.
链接/Link: https://arxiv.org/abs/2502.07599
Authors: Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at this https URL.
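For intuition, here is a minimal PyTorch rendering of a DPO-style objective with a shift factor on the rejected-response term. The placement and exact form of the shift function here are an assumption made for illustration; consult the paper for the precise DPO-Shift formulation.

```python
import torch
import torch.nn.functional as F

def dpo_shift_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, lam=0.9):
    """Sketch of a DPO-style loss with a shift factor on the rejected term.

    logp_* are summed log-probs of each response under the policy;
    ref_logp_* under the frozen reference model. With lam = 1.0 this
    reduces to standard DPO; lam < 1.0 down-weights the rejected
    log-ratio, which lifts the chosen probability at the cost of some
    reward margin -- the trade-off the abstract describes. The exact
    shift function used by DPO-Shift may differ; this is illustrative.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    logits = beta * chosen_ratio - lam * beta * rejected_ratio
    return -F.logsigmoid(logits).mean()

# Toy batch of per-response log-probabilities.
lp_c = torch.tensor([-12.0, -15.0])
lp_r = torch.tensor([-14.0, -14.5])
ref_c = torch.tensor([-13.0, -15.5])
ref_r = torch.tensor([-13.5, -14.0])
print(dpo_shift_loss(lp_c, lp_r, ref_c, ref_r).item())
```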
[NLP-18] We Can't Understand AI Using our Existing Vocabulary
【Quick Read】: This position paper frames interpretability as a communication problem arising from the conceptual differences between humans and machines: humans must be able to reference and control machine concepts and communicate human concepts to machines. The key proposal is to build a shared human-machine language by coining neologisms, new words that achieve a useful level of abstraction, neither too detailed nor too high-level, so they are reusable across contexts and convey precise information. As a proof of concept, a "length neologism" enables controlling LLM response length, while a "diversity neologism" allows sampling more variable responses, illustrating how an expanded vocabulary supports both controlling and understanding machines.
链接/Link: https://arxiv.org/abs/2502.07586
Authors: John Hewitt, Robert Geirhos, Been Kim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Position paper
Abstract:This position paper argues that, in order to understand AI, we cannot rely on our existing vocabulary of human words. Instead, we should strive to develop neologisms: new words that represent precise human concepts that we want to teach machines, or machine concepts that we need to learn. We start from the premise that humans and machines have differing concepts. This means interpretability can be framed as a communication problem: humans must be able to reference and control machine concepts, and communicate human concepts to machines. Creating a shared human-machine language through developing neologisms, we believe, could solve this communication problem. Successful neologisms achieve a useful amount of abstraction: not too detailed, so they’re reusable in many contexts, and not too high-level, so they convey precise information. As a proof of concept, we demonstrate how a “length neologism” enables controlling LLM response length, while a “diversity neologism” allows sampling more variable responses. Taken together, we argue that we cannot understand AI using our existing vocabulary, and expanding it through neologisms creates opportunities for both controlling and understanding machines better.
[NLP-19] Automated Capability Discovery via Model Self-Exploration
【Quick Read】: This paper addresses the difficulty of precisely characterizing the full range of capabilities and potential risks of new models. The key is Automated Capability Discovery (ACD), a framework that designates one foundation model as a "scientist" to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself), automatically and systematically uncovering surprising capabilities and failures. By combining frontier models with ideas from the field of open-endedness, ACD substantially improves the efficiency and breadth of evaluating new AI systems.
链接/Link: https://arxiv.org/abs/2502.07577
Authors: Cong Lu, Shengran Hu, Jeff Clune
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of capabilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers both surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically reveals thousands of capabilities that would be challenging for any single team to uncover. We further validate our method’s automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models’ ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at this https URL.
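The scientist-probes-subject loop can be sketched as a simple driver. The `llm` function and prompt wording below are hypothetical scaffolding; the sketch shows ACD-style control flow, not the open-sourced implementation.

```python
def llm(role, prompt):
    """Hypothetical call to a foundation model acting in a given role."""
    raise NotImplementedError("wire up a real model API here")

def automated_capability_discovery(n_rounds=100):
    """Scientist model proposes open-ended tasks, the subject model
    attempts them, and the scientist judges each attempt; discovered
    (task, verdict) pairs accumulate so later proposals can target
    abilities not yet covered."""
    discoveries = []
    for _ in range(n_rounds):
        task = llm("scientist",
                   "Propose one new task probing an ability not covered by:\n"
                   + "\n".join(t for t, _ in discoveries[-20:]))
        attempt = llm("subject", task)
        verdict = llm("scientist",
                      f"Task: {task}\nAttempt: {attempt}\n"
                      "Did the subject succeed? Answer SUCCESS or FAILURE "
                      "with a one-line justification.")
        discoveries.append((task, verdict))
    return discoveries
```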
[NLP-20] LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
【Quick Read】: This paper addresses the shortcomings of existing sequence parallelism (SP) methods when training linear attention transformers: they are either not optimized for linear attention's right-product-first property or use ring-style communication, which limits computation parallelism and scalability for long sequences. The key is LASP-2, which rethinks the minimal communication requirement for SP over linear attention layers and reorganizes the whole communication-computation workflow, so that only a single AllGather collective is needed on intermediate memory states whose size is independent of sequence length, markedly improving both communication and computation parallelism and their overlap. LASP-2H further extends the same communication redesign to standard attention modules, providing an efficient SP solution for hybrid models that mix linear and standard attention layers.
链接/Link: https://arxiv.org/abs/2502.07563
Authors: Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical report, 17 pages
Abstract:Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism, limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers, reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements of both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The Code is released as a part of: this https URL.
[NLP-21] O1 Embedder: Let Retrievers Think Before Action
【Quick Read】: This paper develops a new capability for retrieval models aimed at key challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning over complex relationships. The key is O1 Embedder, which generates useful "thoughts" for the input query before retrieving the target documents. Two technical difficulties are solved: first, a data-synthesis workflow creates training signals by generating initial thoughts from an LLM expert and refining them with a retrieval committee; second, the training process is optimized so that a pretrained model is jointly fine-tuned to generate retrieval thoughts via behavior cloning and to perform dense retrieval via contrastive learning.
链接/Link: https://arxiv.org/abs/2502.07555
Authors: Ruin Yan, Zheng Liu, Defu Lian
Affiliations: University of Science and Technology of China; Beijing Academy of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, the LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs’ ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning of complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before making retrieval for the target documents. To realize this objective, we conquer two technical difficulties. First, we design a data synthesis workflow, creating training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and perform dense retrieval through contrastive learning. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder’s remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.
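The inference-time idea, think first, then embed query plus thought for dense retrieval, fits in a short sketch. The `generate_thought` and `embed` stubs below are stand-ins (assumptions), and the similarity ranking is plain NumPy; the paper's jointly fine-tuned model is not reproduced here.

```python
import numpy as np

def generate_thought(query):
    """Hypothetical LLM call: produce reasoning that enriches the query."""
    return f"To answer '{query}', relevant documents likely discuss ..."

def embed(text, dim=8):
    """Stand-in embedder: deterministic (per run) pseudo-random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve(query, corpus, top_k=2):
    # Think first, then embed query + thought for dense retrieval.
    thought = generate_thought(query)
    q = embed(query + " " + thought)
    doc_vecs = np.stack([embed(d) for d in corpus])
    scores = doc_vecs @ q                      # cosine sim (unit vectors)
    order = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in order]

corpus = ["DPO aligns language models.",
          "Dense retrieval with dual encoders.",
          "Linear attention scales to long sequences."]
print(retrieve("how do retrievers use contrastive learning?", corpus))
```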
[NLP-22] Unsupervised Translation of Emergent Communication AAAI2025
【Quick Read】: This paper studies how to interpret Emergent Communication (EC) and evaluate its relationship to natural language (NL). The key is applying unsupervised neural machine translation (UNMT) techniques to decipher ECs formed in referential games of varying task complexity, influenced by the semantic diversity of the environment. The findings show that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability yields pragmatic EC that is harder to interpret yet still suitable for translation. To the authors' knowledge, this is the first attempt to translate EC without parallel data.
链接/Link: https://arxiv.org/abs/2502.07552
Authors: Ido Levy, Orr Paradise, Boaz Carmeli, Ron Meir, Shafi Goldwasser, Yonatan Belinkov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages (including appendix and bibliography), Accepted to AAAI 2025
Abstract:Emergent Communication (EC) provides a unique window into the language systems that emerge autonomously when agents are trained to jointly achieve shared goals. However, it is difficult to interpret EC and evaluate its relationship with natural languages (NL). This study employs unsupervised neural machine translation (UNMT) techniques to decipher ECs formed during referential games with varying task complexities, influenced by the semantic diversity of the environment. Our findings demonstrate UNMT’s potential to translate EC, illustrating that task complexity characterized by semantic diversity enhances EC translatability, while higher task complexity with constrained semantic variability exhibits pragmatic EC, which, although challenging to interpret, remains suitable for translation. This research marks the first attempt, to our knowledge, to translate EC without the aid of parallel data.
[NLP-23] Grammar Control in Dialogue Response Generation for Language Learning Chatbots NAACL2025
【Quick Read】: This paper addresses the difficulty of controlling the grammatical forms produced in conversation practice with LLM-based chatbots so that they match learners' current needs. The key is grounding a dialogue response generation model in a pedagogical repository of grammar skills and optimizing generation via strategic decoding; strategically decoding Llama3 outperforms GPT-3.5 when minor response-quality losses are tolerated. Simulations predict that grammar-controlled responses support grammar acquisition adapted to learner proficiency.
链接/Link: https://arxiv.org/abs/2502.07544
Authors: Dominik Glandorf, Peng Cui, Detmar Meurers, Mrinmaya Sachan
Affiliations: EPFL; University of Tübingen; ETH Zürich; Leibniz-Institut für Wissensmedien
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025
Abstract:Chatbots based on large language models offer cheap conversation practice opportunities for language learners. However, they are hard to control for linguistic forms that correspond to learners’ current needs, such as grammar. We control grammar in chatbot conversation practice by grounding a dialogue response generation model in a pedagogical repository of grammar skills. We also explore how this control helps learners to produce specific grammar. We comprehensively evaluate prompting, fine-tuning, and decoding strategies for grammar-controlled dialogue response generation. Strategically decoding Llama3 outperforms GPT-3.5 when tolerating minor response quality losses. Our simulation predicts grammar-controlled responses to support grammar acquisition adapted to learner proficiency. Existing language learning chatbots and research on second language acquisition benefit from these affordances. Code available on GitHub.
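One simple realization of grammar-controlled generation is to over-generate and filter by the target grammar skill. The sketch below checks candidates with a regex for a single toy skill (past simple of "to be"); the skill pattern and sampler stub are assumptions, not the repository-grounded strategies evaluated in the paper.

```python
import re

def sample_candidates(prompt, n=8):
    """Stub for sampling n responses from a dialogue model."""
    return [
        "I was at the park yesterday.",
        "I am at the park.",
        "She was happy about the news.",
        "They will go tomorrow.",
    ][:n]

# One toy "grammar skill": past simple of 'to be'.
SKILL_PATTERN = re.compile(r"\b(was|were)\b", re.IGNORECASE)

def grammar_controlled_reply(prompt):
    """Prefer fluent candidates that exercise the target grammar form;
    fall back to the first candidate if none match."""
    candidates = sample_candidates(prompt)
    matching = [c for c in candidates if SKILL_PATTERN.search(c)]
    return matching[0] if matching else candidates[0]

print(grammar_controlled_reply("What did you do yesterday?"))
```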
[NLP-24] Corporate Greenwashing Detection in Text - a Survey
【Quick Read】: This paper surveys natural language processing methods for identifying potentially misleading climate-related corporate communications, indicative of greenwashing. The key is decomposing greenwashing detection into intermediate tasks and reviewing the state-of-the-art approaches for each, discussing datasets, methods, and results, as well as limitations and open challenges.
链接/Link: https://arxiv.org/abs/2502.07541
Authors: Tom Calamai, Oana Balalau, Théo Le Guenedal, Fabian M. Suchanek
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 35 pages, 1 figure, 21 pages (appendix), working paper
Abstract:Greenwashing is an effort to mislead the public about the environmental impact of an entity, such as a state or company. We provide a comprehensive survey of the scientific literature addressing natural language processing methods to identify potentially misleading climate-related corporate communications, indicative of greenwashing. We break the detection of greenwashing into intermediate tasks, and review the state-of-the-art approaches for each of them. We discuss datasets, methods, and results, as well as limitations and open challenges. We also provide an overview of how far the field has come as a whole, and point out future research directions.
[NLP-25] Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in accurately retrieving key information. The key is Mask-Enhanced Autoregressive Prediction (MEAP), a training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to strengthen the latter's in-context retrieval capabilities. MEAP requires no bidirectional attention or encoder-decoder architecture to realize MLM and adds no extra computational overhead during pretraining or inference.
链接/Link: https://arxiv.org/abs/2502.07490
Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures
Abstract:Large Language Models (LLMs) are discovered to suffer from accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter’s in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs the standard next-token prediction autoregressive using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP’s effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model’s focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
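The MEAP recipe, corrupting a small fraction of the input tokens while keeping the ordinary next-token objective against the uncorrupted sequence, fits in a short PyTorch sketch. The 15% ratio and mask-token handling below are illustrative assumptions; the point is that no architecture change is needed.

```python
import torch
import torch.nn.functional as F

def meap_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """One MEAP-style training step: randomly mask a small fraction of the
    *inputs*, then run standard next-token prediction against the
    original (unmasked) sequence."""
    corrupted = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    corrupted[mask] = mask_token_id

    logits = model(corrupted)                      # (B, T, V), decoder-only
    # Standard causal LM shift: predict token t+1 from the prefix up to t.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),              # targets are UNmasked ids
    )
    return loss

# Tiny smoke test with a linear "model" over one-hot embeddings.
V, B, T = 50, 2, 10
torch.manual_seed(0)
proj = torch.nn.Linear(V, V)
model = lambda ids: proj(F.one_hot(ids, V).float())
ids = torch.randint(0, V - 1, (B, T))
print(meap_step(model, ids, mask_token_id=V - 1).item())
```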
[NLP-26] Multi-Agent Collaboration for Multilingual Code Instruction Tuning
【Quick Read】: This paper addresses insufficient knowledge transfer between programming languages in existing code LLMs. The key is a novel multi-agent collaboration framework in which multiple language-specific intelligent agents with generation memory work together to transfer knowledge from one programming language to another efficiently and effectively. Concretely, language-specific instruction data is first generated from code snippets and used as seed data for the language-specific agents; the agents then discuss and collaborate to formulate new instructions and corresponding solutions. To further encourage cross-lingual transfer, each agent stores its generation history as memory and summarizes its merits and faults. The resulting high-quality multilingual instruction data encourages knowledge sharing across programming languages and is used to train Qwen2.5-xCoder.
链接/Link: https://arxiv.org/abs/2502.07487
Authors: Jian Yang, Wei Zhang, Jiaxi Yang, Yibo Miao, Shanghaoran Quan, Zhenhe Wu, Qiyao Peng, Liqun Yang, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin
Affiliations: Alibaba Group; Shanghai Jiao Tong University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advancement in code understanding and generation demonstrates that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most previous existing methods mainly view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (in a new or an existing programming language). To further encourage the cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
[NLP-27] PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian NAACL2025
【Quick Read】: This paper addresses the problem that large language models (LLMs) predominantly reflect Western cultures due to English-centric training data. To counter this imbalance, it introduces PerCul, a dataset designed to assess LLMs' sensitivity to Persian culture. The key is that PerCul captures culturally nuanced scenarios through story-based, multiple-choice questions and is curated with input from native Persian annotators to ensure authenticity and to prevent translation from serving as a shortcut.
链接/Link: https://arxiv.org/abs/2502.07459
Authors: Erfan Moosavi Monazzah, Vahid Rahimzadeh, Yadollah Yaghoobzadeh, Azadeh Shakery, Mohammad Taher Pilehvar
Affiliations: University of Tehran, Iran; Iran University of Science and Technology, Tehran, Iran; Tehran Institute for Advanced Studies, Khatam University, Iran; Institute for Research in Fundamental Sciences (IPM), Tehran, Iran; Cardiff University, United Kingdom
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at NAACL 2025 Main Conference, the dataset is available on HuggingFace (see this https URL)
Abstract:Large language models predominantly reflect Western cultures, largely due to the dominance of English-centric training data. This imbalance presents a significant challenge, as LLMs are increasingly used across diverse contexts without adequate evaluation of their cultural competence in non-English languages, including Persian. To address this gap, we introduce PerCul, a carefully constructed dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. Unlike existing benchmarks, PerCul is curated with input from native Persian annotators to ensure authenticity and to prevent the use of translation as a shortcut. We evaluate several state-of-the-art multilingual and Persian-specific LLMs, establishing a foundation for future research in cross-cultural NLP evaluation. Our experiments demonstrate an 11.3% gap between the best closed-source model and the layperson baseline, while the gap increases to 21.3% with the best open-weight model. You can access the dataset from here: this https URL
[NLP-28] RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation NAACL2025
【Quick Read】: This paper addresses the lack of cultural diversity in text-to-image generation models, in particular their bias against non-English-speaking cultural groups. The key is RusCode, a benchmark for evaluating the quality of text-to-image generation containing elements of the Russian cultural code: the authors form a list of 19 categories that best represent Russian visual culture and construct a dataset of 1,250 Russian text prompts with English translations, covering a broad range of topics.
链接/Link: https://arxiv.org/abs/2502.07455
Authors: Viacheslav Vasilev, Julia Agafonova, Nikolai Gerasimenko, Alexander Kapitanov, Polina Mikhailova, Evelina Mironova, Denis Dimitrov
Affiliations: Sber AI; MIPT; ITMO University; SberDevices; AIRI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for NAACL 2025 Findings, GitHub: this https URL
Abstract:Text-to-image generation models have gained popularity among users around the world. However, many of these models exhibit a strong bias toward English-speaking cultures, ignoring or misrepresenting the unique characteristics of other language groups, countries, and nationalities. The lack of cultural awareness can reduce the generation quality and lead to undesirable consequences such as unintentional insult, and the spread of prejudice. In contrast to the field of natural language processing, cultural awareness in computer vision has not been explored as extensively. In this paper, we strive to reduce this gap. We propose a RusCode benchmark for evaluating the quality of text-to-image generation containing elements of the Russian cultural code. To do this, we form a list of 19 categories that best represent the features of Russian visual culture. Our final dataset consists of 1250 text prompts in Russian and their translations into English. The prompts cover a wide range of topics, including complex concepts from art, popular culture, folk traditions, famous people’s names, natural objects, scientific achievements, etc. We present the results of a human evaluation of the side-by-side comparison of Russian visual concepts representations using popular generative models.
[NLP-29] Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
【Quick Read】: This paper addresses the concern that LLMs' strong scores on public benchmarks may reflect overreliance on dataset-specific surface cues rather than genuine language understanding. The key is the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting: by rephrasing inputs while preserving their semantic content and labels, C-BOD reveals whether a model's performance is driven by memorized patterns.
链接/Link: https://arxiv.org/abs/2502.07445
Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
Affiliations: Ben Gurion University; Tel Aviv University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting of LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model’s performance is driven by memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 out of 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy exhibit larger performance differences under perturbation, and larger LLMs tend to be more sensitive to rephrasings indicating that both cases may overrely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD’s dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation.
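The C-BOD idea, rephrase prompts while preserving meaning and test whether accuracy drops, can be sketched as a paired evaluation with a sign test on discordant pairs. The `rephrase` and `model_correct` stubs are assumptions, and the statistical check is a generic McNemar-style test, not necessarily the paper's.

```python
from math import comb

def rephrase(prompt):
    """Stub: meaning-preserving rewording (an LLM call in practice)."""
    return "In other words: " + prompt

def model_correct(prompt, answer):
    """Stub: does the evaluated model answer this prompt correctly?"""
    return hash((prompt, answer)) % 4 != 0   # fake ~75% accuracy

def cbod(benchmark):
    """Compare accuracy on original vs. rephrased prompts; a one-sided
    binomial test on discordant pairs flags memorization-driven scores."""
    orig_only = reph_only = 0
    for prompt, answer in benchmark:
        o = model_correct(prompt, answer)
        r = model_correct(rephrase(prompt), answer)
        orig_only += o and not r
        reph_only += r and not o
    n = orig_only + reph_only
    # P(X >= orig_only) under a fair coin; small p => rephrasing hurts.
    p = sum(comb(n, k) for k in range(orig_only, n + 1)) / 2**n if n else 1.0
    return orig_only, reph_only, p

bench = [(f"Question {i}?", f"A{i}") for i in range(200)]
print(cbod(bench))
```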
[NLP-30] Hierarchical Document Parsing via Large Margin Feature Matching and Heuristics AAAI-25
【Quick Read】: This paper presents the first-place solution to the AAAI-25 Visual Relation Detection and Understanding (VRD-IU) challenge, combining a large-margin loss to strengthen feature discrimination with heuristic rules to refine hierarchical relationships. The key is fusing a deep-learning-based matching strategy with greedy algorithms, which markedly improves the accuracy of document structure parsing while preserving computational efficiency, ultimately reaching an accuracy of 0.98904 on the private leaderboard.
链接/Link: https://arxiv.org/abs/2502.07442
Authors: Duong Anh Kiet
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: DocUI@AAAI-25, 2 pages, technical report
Abstract:We present our solution to the AAAI-25 VRD-IU challenge, achieving first place in the competition. Our approach integrates large margin loss for improved feature discrimination and employs heuristic rules to refine hierarchical relationships. By combining a deep learning-based matching strategy with greedy algorithms, we achieve a significant boost in accuracy while maintaining computational efficiency. Our method attains an accuracy of 0.98904 on the private leaderboard, demonstrating its effectiveness in document structure parsing. Source codes are publicly available at this https URL
zh
[NLP-31] RomanLens: Latent Romanization and its role in Multilinguality in LLMs
【速读】: 该论文旨在探究大型语言模型(LLMs)在主要以英语为中心的语料库训练下,如何实现显著的多语言泛化能力,特别是针对非拉丁字母文字的语言。研究的关键在于分析罗马字化(romanization)——即使用拉丁字母表示非拉丁字母文字的作用,作为多语言处理中的桥梁。通过机械性可解释性技术,研究发现LLMs在中间层经常以罗马字形式表示目标词汇,然后过渡到原生脚本,这一现象被称为潜在罗马字化(Latent Romanization)。此外,通过激活补丁实验,研究显示LLMs在本地脚本和罗马字化脚本之间编码语义概念的方式相似,表明存在共享的底层表示。这些发现有助于深入理解LLMs中的多语言表示,并强调了罗马字化在促进语言迁移中的隐含作用。
链接: https://arxiv.org/abs/2502.07424
作者: Alan Saji(1),Jaavid Aktar Husain(2),Thanmay Jayakumar(1 and 3),Raj Dabre(1, 3, 4 and 5),Anoop Kunchukuttan(1, 3 and 6),Mitesh M. Khapra(1 and 3),Ratish Puduppully(7) ((1) Nilekani Centre at AI4Bharat, (2) Singapore University of Technology and Design, (3) Indian Institute of Technology Madras, India, (4) National Institute of Information and Communications Technology, Kyoto, Japan, (5) Indian Institute of Technology Bombay, India, (6) Microsoft, India, (7) IT University of Copenhagen)
机构: Nilekani Centre at AI4Bharat(尼勒卡尼中心,AI4Bharat); Singapore University of Technology and Design(新加坡科技设计大学); Indian Institute of Technology Madras, India(印度理工学院马德拉斯分校); National Institute of Information and Communications Technology, Kyoto, Japan(日本京都信息通信技术研究所); Indian Institute of Technology Bombay, India(印度理工学院孟买分校); Microsoft, India(微软,印度); IT University of Copenhagen(哥本哈根信息技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 18 figures
点击查看摘要
Abstract:Large Language Models (LLMs) exhibit remarkable multilingual generalization despite being predominantly trained on English-centric corpora. A fundamental question arises: how do LLMs achieve such robust multilingual capabilities? For non-Latin script languages, we investigate the role of romanization - the representation of non-Latin scripts using Latin characters - as a bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and romanized scripts, suggesting a shared underlying representation. Additionally, in translation into non-Latin-script languages, our findings reveal that when the target language is in romanized form, its representations emerge earlier in the model's layers than for native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of romanization in facilitating language transfer. Our work provides new directions for potentially improving multilingual language modeling and interpretability.
zh
[NLP-32] Entity Linking using LLMs for Automated Product Carbon Footprint Estimation
【速读】: 该论文旨在解决制造商在减少碳足迹过程中识别产品各部件环境影响的需求。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)自动将制造商物料清单(Bills of Materials, BOMs)中的部件映射到生命周期评估(Life Cycle Assessment, LCA)数据库条目,从而减少手动数据处理的需求,推动更便捷的可持续实践。
链接: https://arxiv.org/abs/2502.07418
作者: Steffen Castle,Julian Moreno Schneider,Leonhard Hennig,Georg Rehm
机构: Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Growing concerns about climate change and sustainability are driving manufacturers to take significant steps toward reducing their carbon footprints. For these manufacturers, a first step towards this goal is to identify the environmental impact of the individual components of their products. We propose a system leveraging large language models (LLMs) to automatically map components from manufacturer Bills of Materials (BOMs) to Life Cycle Assessment (LCA) database entries by using LLMs to expand on available component information. Our approach reduces the need for manual data processing, paving the way for more accessible sustainability practices.
zh
[NLP-33] Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation
【速读】: 该论文旨在解决多模态讽刺解释中忽视讽刺目标的问题。解决方案的关键在于提出了一种名为TURBO(Target-aUgmented shaRed fusion-Based sarcasm explanatiOn)的模型,通过引入共享融合机制来利用图像及其标题之间的跨模态关系,并明确假设讽刺目标以指导模型学习讽刺意图的复杂性,从而生成更精确的解释。
链接: https://arxiv.org/abs/2502.07391
作者: Palaash Goel,Dushyant Singh Chauhan,Md Shad Akhtar
机构: Indraprastha Institute of Information Technology Delhi(印度理工学院德里分校); Indian Institute of Technology Patna(印度理工学院帕特纳分校)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Sarcasm is a linguistic phenomenon that intends to ridicule a target (e.g., entity, event, or person) in an inherent way. Multimodal Sarcasm Explanation (MuSE) aims at revealing the intended irony in a sarcastic post using a natural language explanation. Though important, existing systems overlook the significance of the target of sarcasm in generating explanations. In this paper, we propose a Target-aUgmented shaRed fusion-Based sarcasm explanatiOn model, aka TURBO. We design a novel shared-fusion mechanism to leverage the inter-modality relationships between an image and its caption. TURBO assumes the target of the sarcasm and guides the multimodal shared fusion mechanism in learning the intricacies of the intended irony for explanations. We evaluate our proposed TURBO model on the MORE+ dataset. Comparison against multiple baselines and state-of-the-art models shows that TURBO improves performance by an average margin of +3.3%. Moreover, we explore LLMs in zero- and one-shot settings for our task and observe that LLM-generated explanations, though remarkable, often fail to capture the critical nuances of the sarcasm. Furthermore, we supplement our study with extensive human evaluation of TURBO's generated explanations and find them to be comparatively better than other systems.
zh
[NLP-34] Parametric type design in the era of variable and color fonts
【速读】: 该论文旨在探索基于参数化设计原则(parametric design principles)的现代字体设计流程,特别是使用MetaPost技术。论文的关键在于通过这种方法创建了两种可变字体(variable fonts),并将其以自由开源许可证发布。解决方案的关键在于利用参数化设计方法实现字体的灵活性与多样性,从而推动字体设计的新进展。
链接: https://arxiv.org/abs/2502.07386
作者: Santhosh Thottingal
机构: 未知
类目: Computation and Language (cs.CL); Graphics (cs.GR)
备注: Conference: Grapholinguistics in the 21st century - From graphemes to knowledge
点击查看摘要
Abstract:Parametric fonts are programmatically defined fonts with variable parameters, pioneered by Donald Knuth with his MetaFont technology in the 1980s. While Donald Knuth's ideas in MetaFont and subsequently in MetaPost are often seen as legacy techniques from the pre-graphical user interface (GUI) era of type design, recent trends like variable fonts suggest a resurgence of certain principles. This paper explores a modern type design process built on parametric design principles, specifically using MetaPost. The author created two variable fonts with this method and released them under a free, open-source license. The paper details the methodology, workflow, and insights gained from this process.
zh
[NLP-35] EvoFlow: Evolving Diverse Agentic Workflows On The Fly
【速读】: 该论文旨在解决现有基于大型语言模型(Large Language Model, LLM)的多智能体自动化流水线缺乏异构性和专注于单一目标性能优化的问题,这限制了它们结合较弱模型以提供更定制化且成本效益更高的解决方案。为应对这一挑战,论文提出了一种名为EvoFlow的框架,关键在于采用基于生态位的进化算法来自动搜索一组异构且自适应复杂度的智能体工作流,而非单一的同质化复杂工作流。具体而言,EvoFlow通过标签检索提取父代工作流,通过交叉和变异生成新的工作流,并利用生态位选择策略维持种群多样性和质量。
链接: https://arxiv.org/abs/2502.07373
作者: Guibin Zhang,Kaijie Chen,Guancheng Wan,Heng Chang,Hong Cheng,Kun Wang,Shuyue Hu,Lei Bai
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:The past two years have witnessed the evolution of large language model (LLM)-based multi-agent systems from labor-intensive manual design to partial automation (e.g., prompt engineering, communication topology) and eventually to fully automated design. However, existing agentic automation pipelines often lack LLM heterogeneity and focus on single-objective performance optimization, limiting their potential to combine weaker models for more customized and cost-effective solutions. To address this challenge, we propose EvoFlow, a niching evolutionary algorithm-based framework to automatically search a population of heterogeneous and complexity-adaptive agentic workflows, rather than a single homogeneous, complex workflow. Technically, EvoFlow performs (1) tag-based retrieval to extract parent workflows from an agentic population, evolves new workflows through (2) crossover and (3) mutation, and employs (4) niching-based selection to maintain population diversity and quality. Extensive evaluations across seven benchmarks demonstrate that EvoFlow is: (I) diverse, evolving a population of workflows ranging from simple I/O tasks to complex multi-turn interactions; (II) high-performing, outperforming previous handcrafted and automated workflows by 1.23%~29.86%; (III) economical, surpassing the powerful o1-preview at 12.4% of its inference cost using weaker open-source models.
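摘要中的 (1)-(4) 步可以抽象为一个极简的小生境进化循环。以下 Python 草图仅作流程示意:工作流被简化为标签集合,适应度函数与小生境划分方式均为笔者假设,并非论文实现。

```python
import random

def evolve_workflows(population, fitness, task_tags, generations=10, seed=0):
    """Minimal niching-EA loop in the spirit of EvoFlow's steps (1)-(4).
    Each workflow is reduced to a dict with a 'tags' list."""
    rng = random.Random(seed)
    all_tags = sorted({t for w in population for t in w["tags"]})
    for _ in range(generations):
        # (1) tag-based retrieval: pick parents with the largest task-tag overlap
        scored = sorted(population,
                        key=lambda w: len(set(w["tags"]) & set(task_tags)),
                        reverse=True)
        p1 = scored[0]
        p2 = scored[1] if len(scored) > 1 else scored[0]
        # (2) crossover: union of parent tags, subsampled
        union = sorted(set(p1["tags"]) | set(p2["tags"]))
        child_tags = set(rng.sample(union, k=len(union) // 2 + 1))
        # (3) mutation: flip one random tag in or out
        t = rng.choice(all_tags)
        child_tags ^= {t}
        population.append({"tags": sorted(child_tags) or [t]})
        # (4) niching selection: keep the fittest workflow per niche
        #     (niche here = number of tags, a deliberately crude stand-in)
        niches = {}
        for w in population:
            niches.setdefault(len(w["tags"]), []).append(w)
        population = [max(ws, key=fitness) for ws in niches.values()]
    return max(population, key=fitness)

# Toy usage: fitness rewards overlap with the task's tags.
task = ["math", "multi-turn"]
pop = [{"tags": ["math"]}, {"tags": ["code", "io"]}, {"tags": ["multi-turn", "search"]}]
best = evolve_workflows(pop, lambda w: len(set(w["tags"]) & set(task)), task)
print(best["tags"])
```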
zh
[NLP-36] LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
【速读】: 该论文旨在解决大型语言模型(LLMs)在扩展上下文窗口后导致短文本任务性能下降的问题。论文指出,这一问题主要由隐藏状态和注意力分数的分布漂移以及连续预训练过程中的灾难性遗忘所引起。为了解决这些问题,论文提出了一种名为长上下文预训练与恢复蒸馏(LongReD)的新方法。LongReD通过最小化扩展模型与原始模型之间的分布差异来缓解短文本性能下降的问题。其关键是不仅在长文本上进行训练,还在短文本上蒸馏原始模型选定层的隐藏状态,并引入短文本到长文本的蒸馏,以对齐不同长度文本上的输出分布。
链接: https://arxiv.org/abs/2502.07365
作者: Zican Dong,Junyi Li,Jinhao Jiang,Mingyu Xu,Wayne Xin Zhao,Bingning Wang,Weipeng Chen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (高瓴人工智能学院,中国人民大学); Department of Computer Science, National University of Singapore (计算机科学系,新加坡国立大学); Baichuan Inc. (百川智能)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model’s short-text performance while maintaining comparable or even better capacity to handle long texts than baselines.
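LongReD 的训练目标可粗略理解为"长文本语言建模损失 + 短文本隐藏状态蒸馏损失"的加权和。下面是一个基于 HuggingFace 风格模型接口的 PyTorch 草图(接口形态、蒸馏层选择与权重均为笔者假设,且省略了摘要中提到的 short-to-long 蒸馏项)。

```python
import torch
import torch.nn.functional as F

def longred_style_loss(ext_model, orig_model, long_batch, short_batch,
                       distill_layers=(4, 8), alpha=1.0, beta=0.1):
    """Sketch of a LongReD-style objective: next-token LM loss on long-context
    data plus hidden-state restoration distillation on short texts."""
    # (1) standard LM loss on long texts with the context-extended model
    lm_loss = ext_model(**long_batch, labels=long_batch["input_ids"]).loss
    # (2) hidden-state distillation on short texts from the frozen original model
    with torch.no_grad():
        teacher = orig_model(**short_batch, output_hidden_states=True)
    student = ext_model(**short_batch, output_hidden_states=True)
    distill = sum(F.mse_loss(student.hidden_states[l], teacher.hidden_states[l])
                  for l in distill_layers) / len(distill_layers)
    return alpha * lm_loss + beta * distill
```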
zh
[NLP-37] Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation
【速读】: 该论文旨在解决科学文献中动态演化的主题分类体系自动化评估的问题。解决方案的关键在于利用大规模语言模型(Large Language Models, LLMs)来衡量诸如连贯性、重复性、多样性以及主题-文档对齐等关键质量维度,而无需过多依赖专家标注或狭窄的统计指标。通过定制的提示引导LLM进行评估,确保在不同数据集和建模技术上的评估结果具有一致性和可解释性。
链接: https://arxiv.org/abs/2502.07352
作者: Zhiyin Tan,Jennifer D’Souza
机构: L3S Research Center, Leibniz University Hannover (汉诺威莱布尼茨大学 L3S 研究中心); TIB Leibniz Information Centre for Science and Technology (Leibniz科学与技术信息中心)
类目: Computation and Language (cs.CL)
备注: accepted by IRCDL 2025
点击查看摘要
Abstract:This study presents a framework for automated evaluation of dynamically evolving topic taxonomies in scientific literature using Large Language Models (LLMs). In digital library systems, topic modeling plays a crucial role in efficiently organizing and retrieving scholarly content, guiding researchers through complex knowledge landscapes. As research domains proliferate and shift, traditional human-centric and static evaluation methods struggle to maintain relevance. The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics. Tailored prompts guide LLM assessments, ensuring consistent and interpretable evaluations across various datasets and modeling techniques. Experiments on benchmark corpora demonstrate the method's robustness, scalability, and adaptability, underscoring its value as a more holistic and dynamic alternative to conventional evaluation strategies.
zh
[NLP-38] BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
【速读】: 该论文旨在解决跨语言评估大型语言模型(LLMs)在指令遵循、推理、长上下文理解及代码生成等高级能力方面不充分的问题。解决方案的关键在于引入BenchMAX,一个多向多语言评估基准,通过机器翻译将数据从英语翻译成其他16种语言,并由三位独立的母语标注员进行高质量的标注,从而实现这些重要能力在不同语言间的公平比较。
链接: https://arxiv.org/abs/2502.07346
作者: Xu Huang,Wenhao Zhu,Hanxu Hu,Conghui He,Lei Li,Shujian Huang,Fei Yuan
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.
zh
[NLP-39] Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
【速读】: 该论文旨在解决在指令微调阶段,训练大型语言模型(LLMs)使用包含不熟悉知识的数据导致模型过度自信和产生幻觉的问题。解决方案的关键在于引入了一个名为NOVA的新框架,通过内部一致性探测(ICP)和语义等价识别(SEI)来衡量模型对指令数据的熟悉程度,并采用专家对齐奖励模型来提升数据质量。ICP通过计算多个自动生成响应之间的定制一致性来评估模型对给定指令的理解,而SEI则利用提议的语义聚类和精心设计的投票策略来进一步评估模型对目标响应的熟悉度。这些方法共同确保所选数据的质量,从而有效减少幻觉现象并保持模型遵循指令的能力。
链接: https://arxiv.org/abs/2502.07340
作者: Shuzheng Si,Haozhe Zhao,Gang Chen,Cheng Gao,Yuzhuo Bai,Zhitong Wang,Kaikai An,Kangyang Luo,Chen Qian,Fanchao Qi,Baobao Chang,Maosong Sun
机构: Tsinghua University (清华大学); Peking University (北京大学); DeepLang AI (DeepLang AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Training LLMs on data that contains unfamiliar knowledge during the instruction tuning stage can make LLMs overconfident and encourage hallucinations. To address this challenge, we introduce a novel framework, NOVA, which identifies high-quality data that aligns well with the LLM’s learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM’s understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity to enhance data quality. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less. Extensive experiments and analysis show that NOVA significantly reduces hallucinations and allows LLMs to maintain a strong ability to follow instructions.
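以其中的 ICP 为例:对同一指令采样多个回复并度量其一致性,一致性越高说明模型对该指令越"熟悉"。以下草图用词级 Jaccard 相似度充当一致性度量(论文中"定制一致性"的具体定义未公开,此处仅为假设性替代)。

```python
import random

def internal_consistency(sample_fn, instruction, k=5):
    """ICP-style familiarity probe: sample k responses to the same instruction
    and return the mean pairwise token-Jaccard consistency in [0, 1]."""
    responses = [sample_fn(instruction) for _ in range(k)]
    sets = [set(r.lower().split()) for r in responses]
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += len(sets[i] & sets[j]) / max(1, len(sets[i] | sets[j]))
            pairs += 1
    return total / pairs

# Toy sampler standing in for an LLM decoding with temperature.
def fake_llm(prompt):
    return random.choice(["Paris is the capital of France.",
                          "The capital of France is Paris."])

print(round(internal_consistency(fake_llm, "What is the capital of France?"), 3))
```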
zh
[NLP-40] Music for All: Exploring Multicultural Representations in Music Generation Models (Camera Ready) NAACL’25
【速读】: 该论文旨在解决音乐生成模型在不同音乐类型和文化中的表现不均衡问题。研究发现,现有音乐数据集中仅有5.7%的数据来自非西方音乐类型,导致模型在不同音乐风格上的表现存在显著差异。为缓解这一偏见,论文探索了参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术在小规模非西方音乐数据集上的应用效果。实验结果显示,尽管存在挑战,PEFT技术在跨音乐类型适应方面展现出潜力,强调了构建更公平且适用于跨文化迁移学习的音乐-语言模型的重要性。
链接: https://arxiv.org/abs/2502.07328
作者: Atharva Mehta,Shivam Chauhan,Amirbek Djanibekov,Atharva Kulkarni,Gus Xia,Monojit Choudhury
机构: Mohamed bin Zayed University of Artificial Intelligence
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 17 pages, 5 figures, accepted to NAACL’25
点击查看摘要
Abstract:The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models – MusicGen and Mustango, for two underrepresented non-Western music traditions – Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning.
zh
[NLP-41] MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs
【速读】: 该论文旨在解决大型语言模型在使用批量编辑算法MEMIT进行知识修改时,当批次内包含多个具有相同主体的事实编辑任务时,其编辑效能显著下降的问题。论文的关键解决方案是提出了MEMIT-Merge方法,通过合并共享同一主体的事实值计算过程,有效解决了相同主体批量编辑场景中的性能退化问题。实验结果显示,相比MEMIT在较大批次大小下编辑成功率降至约50%,MEMIT-Merge的编辑成功率仍超过90%,展示了其在处理主体实体冲突时的卓越稳健性。
链接: https://arxiv.org/abs/2502.07322
作者: Zilu Dong,Xiangqing Shen,Rui Xia
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:As large language models continue to scale up, knowledge editing techniques that modify models' internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover a critical limitation: MEMIT's editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals that the root cause lies in MEMIT's key-value modeling framework: when multiple facts with the same subject in a batch are modeled through MEMIT's key-value mechanism, identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in update conflicts during editing. Addressing this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that when MEMIT's edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions.
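冲突的根源在于:同一主体派生出相同的 key,却被要求表示不同的 value。下面的草图只演示最外层的预处理思路,即按主体将批内编辑分组打包;真正的 MEMIT-Merge 是在 value 的计算过程内部做合并,此处仅为示意。

```python
from collections import defaultdict

def merge_same_subject_edits(edits):
    """Group batch edits by subject so that one key corresponds to a single
    merged value target instead of several conflicting ones. Each edit is
    (subject, relation, new_object); the real method merges the value
    *computation* inside MEMIT's key-value update, not just the bookkeeping."""
    grouped = defaultdict(list)
    for subject, relation, obj in edits:
        grouped[subject].append((relation, obj))
    return dict(grouped)

batch = [("Eiffel Tower", "located_in", "Rome"),
         ("Eiffel Tower", "built_in", "1920"),
         ("Mount Fuji", "located_in", "Kenya")]
print(merge_same_subject_edits(batch))
```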
zh
[NLP-42] CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
【速读】: 该论文旨在解决大型语言模型在处理多样推理任务时因训练数据稀疏和碎片化而导致性能提升困难的问题。关键解决方案是CodeI/O方法,通过将代码转换为输入-输出预测格式,并利用自然语言中的链式思考(Chain-of-Thought, CoT)理性来训练模型,从而系统地提炼出嵌入在上下文相关代码中的多样化推理模式。这种方法使模型能够接触通用的推理原语,如逻辑流程规划、状态空间搜索、决策树遍历和模块分解,同时剥离结构化推理与特定代码语法的联系,保持过程严谨性。
链接: https://arxiv.org/abs/2502.07316
作者: Junlong Li,Daya Guo,Dejian Yang,Runxin Xu,Yu Wu,Junxian He
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives – like logic flow planning, state-space searching, decision tree traversal, and modular decomposition – while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at this https URL.
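CodeI/O 的数据构造可以理解为:把"代码 + 具体输入/输出"改写成双向预测任务。以下 Python 草图展示这一转换的最小形态(提示模板与字段名均为笔者假设)。

```python
import inspect

def make_codeio_samples(func, test_inputs):
    """CodeI/O-style data construction sketch: turn a function plus concrete
    inputs into (a) output-prediction and (b) input-prediction examples,
    phrased so a model must reason about the code in natural language."""
    src = inspect.getsource(func)
    samples = []
    for args in test_inputs:
        out = func(*args)
        samples.append({"task": "predict_output",
                        "prompt": f"Given the code:\n{src}\nWhat does it return for input {args}?",
                        "target": repr(out)})
        samples.append({"task": "predict_input",
                        "prompt": f"Given the code:\n{src}\nWhich input could produce output {out!r}?",
                        "target": repr(args)})
    return samples

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

for s in make_codeio_samples(clamp, [(5, 0, 3), (-2, 0, 3)]):
    print(s["task"], "->", s["target"])
```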
zh
[NLP-43] TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation
【速读】: 本文旨在解决视觉-语言导航(Vision-Language Navigation, VLN)任务中的路径规划问题。为实现这一目标,论文提出了一种模块化方法,将问题分解为四个子模块,并采用零样本设置下的最先进大规模语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)。关键在于通过LLM提取地标及其访问顺序,利用拓扑地图上的最短路径算法生成路径假设,并使用动态规划计算全景图像序列与地标名称序列之间的对齐分数,最终通过nDTW度量评估路径保真度。这种方法在复杂的R2R-Habitat数据集上展示了优于其他使用联合语义地图的方法的性能。
链接: https://arxiv.org/abs/2502.07306
作者: Navid Rajabi,Jana Kosecka
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:In this work, we propose a modular approach for the Vision-Language Navigation (VLN) task by decomposing the problem into four sub-modules that use state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) in a zero-shot setting. Given a navigation instruction in natural language, we first prompt the LLM to extract the landmarks and the order in which they are visited. Assuming a known model of the environment, we retrieve the top-k locations of the last landmark and generate k path hypotheses from the starting location to the last landmark using the shortest path algorithm on the topological map of the environment. Each path hypothesis is represented by a sequence of panoramas. We then use dynamic programming to compute the alignment score between the sequence of panoramas and the sequence of landmark names, using match scores obtained from the VLM. Finally, we compute the nDTW metric for the hypothesis that yields the highest alignment score to evaluate path fidelity. We demonstrate superior performance compared to other approaches that use joint semantic maps, like VLMaps, on the complex R2R-Habitat instruction dataset, and quantify in detail the effect of visual grounding on navigation performance.
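其中全景序列与地标序列的对齐可用一个保序动态规划来示意:地标必须按顺序各匹配一张全景,全景可以跳过。以下草图中的打分矩阵代表 VLM 给出的匹配分数,DP 的具体形式为笔者假设。

```python
def alignment_score(scores):
    """Order-preserving DP alignment sketch: scores[i][j] is a VLM match score
    between panorama i and landmark j; landmarks must be matched in order,
    each to exactly one panorama, while panoramas may be skipped."""
    n, m = len(scores), len(scores[0])
    NEG = float("-inf")
    # dp[i][j] = best score matching the first j landmarks with the first i panoramas
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 0.0  # no landmarks matched yet
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            skip = dp[i - 1][j]                              # skip panorama i-1
            match = dp[i - 1][j - 1] + scores[i - 1][j - 1]  # match it to landmark j-1
            dp[i][j] = max(skip, match)
    return dp[n][m]

# Toy: 4 panoramas, 2 landmarks ("kitchen", then "sofa").
scores = [[0.9, 0.1],
          [0.2, 0.1],
          [0.1, 0.8],
          [0.3, 0.2]]
print(alignment_score(scores))  # 0.9 + 0.8 = 1.7
```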
zh
[NLP-44] Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification
【速读】: 该论文旨在解决多组学数据(multi-omics data)中DNA、RNA和蛋白质相互作用的综合分析问题。论文的关键在于提出了一种名为Life-Code的全面框架,通过反向转录RNA和反向翻译氨基酸序列,将多组学数据统一整合到核苷酸为基础的序列中。此外,设计了一种密码子分词器和混合长序列架构,并采用掩码建模预训练来编码编码区和非编码区的相互作用。为了模拟编码序列的翻译和折叠过程,Life-Code通过知识蒸馏从现成的蛋白质语言模型中学习对应的氨基酸结构。这些设计使Life-Code能够捕捉基因序列中的复杂相互作用,从而提供对多组学更全面的理解。
链接: https://arxiv.org/abs/2502.07299
作者: Zicheng Liu,Siyuan Li,Zhiyuan Chen,Lei Xin,Fang Wu,Chang Yu,Qirong Yang,Yucheng Guo,Yujie Yang,Stan Z. Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Genomics (q-bio.GN)
备注: 12 pages main text with 6 pages Appendix
点击查看摘要
Abstract:The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. While modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains under-explored. In this paper, we follow the guidance of the central dogma to redesign both the data and model pipeline and offer a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions of both coding and non-coding regions with masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive Experiments show that Life-Code achieves state-of-the-art performance on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
zh
[NLP-45] Small Language Model Makes an Effective Long Text Extractor AAAI’25
【速读】: 该论文旨在解决从扩展文本(如主页)中提取较长实体跨度(如奖项)的问题。当前方法主要分为基于跨度的方法和基于生成的方法,但它们分别面临冗余计算和GPU内存使用过高,以及长跨度准确生成和有效微调时间成本高的挑战。论文的关键解决方案是提出了一种轻量级的基于跨度的命名实体识别方法SeNER,该方法结合了双向箭头注意力机制和在[CLS]标记上的LogN-缩放技术以有效嵌入长文本,并引入了一种新颖的双向滑动窗口加号注意力机制,显著减少了冗余候选标记对跨度,并同时建模这些跨度之间的交互作用。
链接: https://arxiv.org/abs/2502.07286
作者: Yelin Chen,Fanjin Zhang,Jie Tang
机构: Zhipu AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: AAAI’25, 9 pages, 1 appendix pages
点击查看摘要
Abstract:Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: this https URL
zh
[NLP-46] GENERator: A Long-Context Generative Genomic Foundation Model
【速读】: 该论文旨在解决DNA序列预测与解读的挑战,特别是在现有模型中普遍存在的鲁棒性和应用范围局限性问题。解决方案的关键在于提出了名为GENERator的生成式基因组基础模型,该模型具备98k碱基对(base pairs, bp)的上下文长度和12亿参数,并经过包含3860亿碱基对的真核生物DNA数据集训练。GENERator在多个基准测试中展示了最先进的性能,尤其在序列优化和特定活性谱启动子序列的响应生成方面表现出显著潜力。
链接: https://arxiv.org/abs/2502.07272
作者: Wei Wu,Qiuyi Li,Mingyang Li,Kun Fu,Fuli Feng,Jieping Ye,Hui Xiong,Zheng Wang
机构: 未知
类目: Computation and Language (cs.CL); Genomics (q-bio.GN)
备注:
点击查看摘要
Abstract:Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions.
zh
[NLP-47] When More is Less: Understanding Chain-of-Thought Length in LLM s
【速读】: 该论文旨在探讨链式思维(CoT)长度与大规模语言模型(LLMs)推理准确性之间的关系,并解决是否增加CoT长度始终能够提升推理准确性的问题。研究表明,随着推理步骤的增加,性能起初提升但最终会下降。论文的关键解决方案是提出一个长度过滤投票机制(Length-filtered Vote),以缓解过长或过短CoT带来的负面影响,从而优化多步推理过程。通过理论分析和实验验证,论文揭示了存在最优CoT长度,并提供了基于模型能力和任务难度的缩放定律。
链接: https://arxiv.org/abs/2502.07266
作者: Yuyang Wu,Yifei Wang,Tianqi Du,Stefanie Jegelka,Yisen Wang
机构: Peking University; MIT CSAIL (MIT 计算机科学与人工智能实验室); TU Munich (慕尼黑工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have been exploring ways to guide models to generate more complex CoT processes to improve the reasoning ability of LLMs, such as long CoT and the test-time scaling law. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide a piece of evidence that longer reasoning processes are increasingly susceptible to noise. We theoretically prove the existence of an optimal CoT length and derive a scaling law for this optimal length based on model capability and task difficulty. Inspired by our theory, we conduct experiments on both synthetic and real world datasets and propose Length-filtered Vote to alleviate the effects of excessively long or short CoTs. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
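Length-filtered Vote 的直观实现是:先按推理长度过滤候选答案,再做多数投票。以下草图采用"保留与中位数长度最接近的一半答案"这一过滤规则(摘要未给出具体规则,属笔者假设)。

```python
from collections import Counter
import statistics

def length_filtered_vote(cot_answers, keep_frac=0.5):
    """Length-filtered voting sketch: keep the answers whose chain-of-thought
    length is closest to the median length, then majority-vote among them.
    cot_answers: list of (num_reasoning_steps, answer)."""
    med = statistics.median(steps for steps, _ in cot_answers)
    ranked = sorted(cot_answers, key=lambda sa: abs(sa[0] - med))
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    return Counter(ans for _, ans in kept).most_common(1)[0][0]

# Five sampled CoTs: very short and very long chains are filtered out.
samples = [(3, "42"), (4, "42"), (12, "17"), (5, "42"), (1, "17")]
print(length_filtered_vote(samples))  # -> "42"
```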
zh
[NLP-48] Hidden Division of Labor in Scientific Teams Revealed Through 1.6 Million LaTeX Files
【速读】: 该论文旨在解决科学奖励系统中个体贡献识别的问题,特别是合作论文中贡献不明确的情况。传统的方法如作者顺序和职业阶段无法避免偏见,而自我报告的贡献声明也仅限于部分期刊。论文的关键解决方案在于构建了一个大规模的数据集,通过分析1991年至2023年间由200万科学家撰写的160万篇论文中的LaTeX文件中的作者特定宏(author-specific macros),揭示了科学团队内部隐含的劳动分工:一些作者主要负责概念性章节(如引言和讨论),而另一些则专注于技术性章节(如方法和实验)。这一发现提供了科学团队内隐性劳动分工的第一批大规模证据,挑战了传统的作者身份惯例,并为机构政策在贡献分配上的制定提供了依据。
链接: https://arxiv.org/abs/2502.07263
作者: Jiaxin Pei,Lulin Yang,Lingfei Wu
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:
点击查看摘要
Abstract:Recognition of individual contributions is fundamental to the scientific reward system, yet coauthored papers obscure who did what. Traditional proxies (author order and career stage) reinforce biases, while contribution statements remain self-reported and limited to select journals. We construct the first large-scale dataset on writing contributions by analyzing author-specific macros in LaTeX files from 1.6 million papers (1991-2023) by 2 million scientists. Validation against self-reported statements (precision = 0.87), author order patterns, field-specific norms, and Overleaf records (Spearman's rho = 0.6, p < 0.05) confirms the reliability of the created data. Using explicit section information, we reveal a hidden division of labor within scientific teams: some authors primarily contribute to conceptual sections (e.g., Introduction and Discussion), while others focus on technical sections (e.g., Methods and Experiments). These findings provide the first large-scale evidence of implicit labor division in scientific teams, challenging conventional authorship practices and informing institutional policies on credit allocation.
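数据构造的核心是识别 LaTeX 源文件中的作者专属宏(例如用 \jp{...} 标注某位作者撰写的段落)。以下正则草图演示这一抽取思路;宏名与作者映射均为虚构示例,真实流程(嵌套宏、多文件工程等)要复杂得多。

```python
import re

def extract_author_spans(tex, macros):
    """Sketch: count characters wrapped in author-specific macros such as
    \jp{...}, a common convention for marking who wrote which passage.
    `macros` maps macro name -> author; nested braces are not handled."""
    counts = {author: 0 for author in macros.values()}
    for name, author in macros.items():
        for m in re.finditer(r"\\" + name + r"\{([^{}]*)\}", tex):
            counts[author] += len(m.group(1))
    return counts

tex = r"\jp{We introduce a new dataset.} \lw{Table 2 reports ablations.}"
print(extract_author_spans(tex, {"jp": "Pei", "lw": "Wu"}))
```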
zh
[NLP-49] DrugImproverGPT : A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization
【速读】: 该论文旨在通过微调大型语言模型(Large Language Model, LLM)来实现药物优化,以满足特定目标。论文的关键解决方案在于引入了一种新的强化学习算法——结构化策略优化(Structured Policy Optimization, SPO),它被用于改进基于LLM的生成模型。SPO算法通过调整生成分子与输入分子之间的关系,使其符合期望的目标,从而在保持原有药物有益化学性质的同时,增强其在多个靶标上的性能。
链接: https://arxiv.org/abs/2502.07237
作者: Xuefeng Liu,Songhao Jiang,Siyu Chen,Zhuoran Yang,Yuxin Chen,Ian Foster,Rick Stevens
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Finetuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduce a novel reinforcement learning algorithm to finetune a drug optimization LLM-based generative model, enhancing the original drug across target objectives, while retains the beneficial chemical properties of the original drug. This work is comprised of two primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. It includes a LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective for fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: this https URL.
zh
[NLP-50] Graph RAG-Tool Fusion
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)方法在工具知识库中选择相关工具时未能捕捉到工具之间的结构化依赖关系的问题,从而限制了检索精度。论文的关键解决方案是引入Graph RAG-Tool Fusion方法,这是一种结合向量检索优势与高效图遍历的新方法,能够捕获预定义工具知识图谱中的所有相关工具及其嵌套依赖关系。
链接: https://arxiv.org/abs/2502.07223
作者: Elias Lumer,Pradeep Honaganahalli Basavaraju,Myles Mason,James A. Burke,Vamse Kumar Subbiah
机构: PricewaterhouseCoopers(普华永道)
类目: Computation and Language (cs.CL)
备注: 25 pages, 14 figures, 2 tables
点击查看摘要
Abstract:Recent developments in retrieval-augmented generation (RAG) for selecting relevant tools from a tool knowledge base enable LLM agents to scale their complex tool calling capabilities to hundreds or thousands of external tools, APIs, or agents-as-tools. However, traditional RAG-based tool retrieval fails to capture structured dependencies between tools, limiting the retrieval accuracy of a retrieved tool’s dependencies. For example, among a vector database of tools, a “get stock price” API requires a “stock ticker” parameter from a “get stock ticker” API, and both depend on OS-level internet connectivity tools. In this paper, we address this limitation by introducing Graph RAG-Tool Fusion, a novel plug-and-play approach that combines the strengths of vector-based retrieval with efficient graph traversal to capture all relevant tools (nodes) along with any nested dependencies (edges) within the predefined tool knowledge graph. We also present ToolLinkOS, a new tool selection benchmark of 573 fictional tools, spanning over 15 industries, each with an average of 6.3 tool dependencies. We demonstrate that Graph RAG-Tool Fusion achieves absolute improvements of 71.7% and 22.1% over naïve RAG on ToolLinkOS and ToolSandbox benchmarks, respectively (mAP@10). ToolLinkOS dataset is available at this https URL
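该方法可概括为"向量检索选种子 + 图遍历补依赖"。以下草图沿用摘要中的股票 API 例子:先按余弦相似度取 top-k 工具作为种子,再沿依赖边做 BFS,把嵌套依赖全部纳入(嵌入向量与图结构均为玩具数据)。

```python
from collections import deque

def graph_rag_tool_fusion(query_vec, tool_vecs, deps, top_k=2):
    """Sketch: vector-retrieve the top-k seed tools, then BFS the dependency
    graph so every transitively required tool is also selected."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0
    seeds = sorted(tool_vecs, key=lambda t: cos(query_vec, tool_vecs[t]),
                   reverse=True)[:top_k]
    selected, queue = set(seeds), deque(seeds)
    while queue:  # pull in nested dependencies (graph edges)
        for dep in deps.get(queue.popleft(), []):
            if dep not in selected:
                selected.add(dep)
                queue.append(dep)
    return selected

tools = {"get_stock_price": [1.0, 0.1], "get_stock_ticker": [0.7, 0.2],
         "internet": [0.0, 0.1], "send_email": [0.0, 1.0]}
deps = {"get_stock_price": ["get_stock_ticker"], "get_stock_ticker": ["internet"]}
print(graph_rag_tool_fusion([1.0, 0.0], tools, deps, top_k=1))
# -> {'get_stock_price', 'get_stock_ticker', 'internet'}
```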
zh
[NLP-51] A Large-Scale Benchmark for Vietnamese Sentence Paraphrases NAACL2025
【速读】: 该论文旨在构建一个高质量的越南语句子改写数据集ViSP,包含从多个领域收集的120万组原始句子及其对应的改写句子。为确保数据质量,采用了自动改写生成与人工评估相结合的混合方法。研究通过使用如回译、EDA、基线模型(如BART和T5)以及大型语言模型(如GPT-4o、Gemini-1.5、Aya、Qwen-2.5和Meta-Llama-3.1变体)等方法进行实验。该工作的关键是开发了一个大规模且高质量的越南语改写数据集,这在越南语领域尚属首次,为未来的越南语改写任务研究和应用奠定了坚实的基础。
链接: https://arxiv.org/abs/2502.07188
作者: Sang Quang Nguyen,Kiet Van Nguyen
机构: University of Information Technology (信息技术大学), Vietnam National University (越南国立大学)
类目: Computation and Language (cs.CL)
备注: Accepted in NAACL 2025 Findings
点击查看摘要
Abstract:This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
zh
[NLP-52] Perceived Confidence Scoring for Data Annotation with Zero-Shot LLM s
【速读】: 该论文旨在解决零样本大语言模型(LLMs)在文本分类任务中的性能不足问题。解决方案的关键在于提出了一种新的方法,即感知置信评分(Perceived Confidence Scoring, PCS),它通过利用元形态关系(Metamorphic Relations, MRs)来评估LLM对其输入分类的信心。元形态关系生成语义等效但文本上变异的输入版本,并通过分析LLM在这些变异版本上的响应一致性来计算置信分数。此外,文中还引入了感知微分进化算法(Perceived Differential Evolution, PDE)以确定分类任务中各元形态关系及LLMs的最佳权重分配。
链接: https://arxiv.org/abs/2502.07186
作者: Sina Salimian,Gias Uddin,Most Husne Jahan,Shaina Raza
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Zero-shot LLMs are now also used for textual classification tasks, e.g., sentiment/emotion detection of a given input as a sentence/article. However, their performance can be suboptimal in such data annotation tasks. We introduce a novel technique Perceived Confidence Scoring (PCS) that evaluates LLM’s confidence for its classification of an input by leveraging Metamorphic Relations (MRs). The MRs generate semantically equivalent yet textually mutated versions of the input. Following the principles of Metamorphic Testing (MT), the mutated versions are expected to have annotation labels similar to the input. By analyzing the consistency of LLM responses across these variations, PCS computes a confidence score based on the frequency of predicted labels. PCS can be used both for single LLM and multiple LLM settings (e.g., majority voting). We introduce an algorithm Perceived Differential Evolution (PDE) that determines the optimal weights assigned to the MRs and the LLMs for a classification task. Empirical evaluation shows PCS significantly improves zero-shot accuracy for Llama-3-8B-Instruct (4.96%) and Mistral-7B-Instruct-v0.3 (10.52%), with Gemma-2-9b-it showing a 9.39% gain. When combining all three models, PCS significantly outperforms majority voting by 7.75%.
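PCS 的核心循环很短:对输入及其若干语义等价变体分别分类,以多数标签的出现频率作为置信分数。以下草图中的元形态关系与分类器均为玩具示例,仅演示计分方式,不含 PDE 权重优化。

```python
from collections import Counter

def perceived_confidence(classify_fn, text, mutate_fns):
    """PCS sketch: classify the input plus its metamorphic variants and
    score confidence as the frequency of the majority label."""
    variants = [text] + [m(text) for m in mutate_fns]
    labels = [classify_fn(v) for v in variants]
    label, freq = Counter(labels).most_common(1)[0]
    return label, freq / len(labels)

# Toy metamorphic relations (semantics-preserving rewrites) and classifier.
mrs = [lambda t: t.replace("movie", "film"),
       lambda t: "Honestly, " + t,
       lambda t: t.upper()]
clf = lambda t: "positive" if "great" in t.lower() else "negative"
print(perceived_confidence(clf, "A great movie overall.", mrs))
# -> ('positive', 1.0)
```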
zh
[NLP-53] Refine Knowledge of Large Language Models via Adaptive Contrastive Learning ICLR2025
【速读】: 该论文旨在减轻大型语言模型(LLMs)的幻觉现象,这是LLMs研究社区追求的根本目标。论文的关键解决方案在于提出了一种自适应对比学习(Adaptive Contrastive Learning)策略,通过模仿人类的学习过程,灵活构建正负样本,从而帮助LLMs巩固正确知识,深化对已遇但未完全掌握的知识的理解,摒弃错误知识,并诚实地承认所缺乏的知识。
链接: https://arxiv.org/abs/2502.07184
作者: Yinghui Li,Haojing Huang,Jiayi Kuang,Yangning Li,Shu-Yu Guo,Chao Qu,Xiaoyu Tan,Hai-Tao Zheng,Ying Shen,Philip S. Yu
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Peng Cheng Laboratory (鹏城实验室); Sun-Yat Sen University (中山大学); INFLY TECH (Shanghai) Co., Ltd. (英飞智联科技(上海)有限公司); University of Illinois Chicago (芝加哥伊利诺伊大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025
点击查看摘要
Abstract:How to alleviate the hallucinations of Large Language Models (LLMs) has always been the fundamental goal pursued by the LLMs research community. Looking through numerous hallucination-related studies, a mainstream category of methods is to reduce hallucinations by optimizing the knowledge representation of LLMs to change their output. Considering that the core focus of these works is the knowledge acquired by models, and knowledge has long been a central theme in human societal progress, we believe that the process of models refining knowledge can greatly benefit from the way humans learn. In our work, by imitating the human learning process, we design an Adaptive Contrastive Learning strategy. Our method flexibly constructs different positive and negative samples for contrastive learning based on LLMs’ actual mastery of knowledge. This strategy helps LLMs consolidate the correct knowledge they already possess, deepen their understanding of the correct knowledge they have encountered but not fully grasped, forget the incorrect knowledge they previously learned, and honestly acknowledge the knowledge they lack. Extensive experiments and detailed analyses on widely used datasets demonstrate the effectiveness of our method.
zh
[NLP-54] Don't Just Demo, Teach Me the Principles: A Principle-Based Multi-Agent Prompting Strategy for Text Classification AAAI2025
【速读】: 该论文旨在解决文本分类任务中的零样本提示(zero-shot prompting)性能不足的问题。解决方案的关键在于提出了一种基于原则的多智能体提示策略(PRINCIPLE-BASED PROMPTING),通过让多个大语言模型(LLM)智能体独立生成候选原则,并由最终化智能体整合这些原则,再传递给分类智能体进行下游分类任务。这种方法不仅在宏观F1得分上取得了显著的性能提升(1.55% - 19.37%),而且在不同大小的LLM和分类数据集上均优于其他基线方法(如CoT和stepback提示)。此外,该策略通过标签信息和多智能体协同LLM框架生成高质量原则,从而降低了推理成本,同时在两个私有数据集上展现了比人工设计原则更好的效果。
链接: https://arxiv.org/abs/2502.07165
作者: Peipei Wei,Dimitris Dimitriadis,Yan Xu,Mingwei Shen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in AAAI 2025 Workshop on Advancing LLM-Based Multi-Agent Collaboration
点击查看摘要
Abstract:We present PRINCIPLE-BASED PROMPTING, a simple but effective multi-agent prompting strategy for text classification. It first asks multiple LLM agents to independently generate candidate principles based on analysis of demonstration samples with or without labels, consolidates them into final principles via a finalizer agent, and then sends them to a classifier agent to perform downstream classification tasks. Extensive experiments on binary and multi-class classification datasets with different sizes of LLMs show that our approach not only achieves substantial performance gains (1.55% - 19.37%) over zero-shot prompting on macro-F1 score but also outperforms other strong baselines (CoT and stepback prompting). Principles generated by our approach help LLMs perform better on classification tasks than human crafted principles on two private datasets. Our multi-agent PRINCIPLE-BASED PROMPTING approach also shows on-par or better performance compared to demonstration-based few-shot prompting approaches, yet with substantially lower inference costs. Ablation studies show that label information and the multi-agent cooperative LLM framework play an important role in generating high-quality principles to facilitate downstream classification tasks.
zh
[NLP-55] Does Training on Synthetic Data Make Models Less Robust?
【速读】: 该论文旨在探究利用合成数据训练大规模语言模型(Large Language Models, LLMs)是否会加剧模型的盲点(blindspots),即是否会使模型在某些特定任务上的表现变得更差。研究通过在自然语言推理(Natural Language Inference, NLI)任务上进行模拟实验,使用Llama-2-7B-hf模型,并采用MultiNLI作为一般任务,HANS作为针对性评估集来测量特定启发式策略的存在。研究的关键在于验证在使用合成数据微调模型时,是否会导致模型在盲点任务上的性能显著下降。研究结果表明,合成数据并未如预期那样加剧模型的盲点问题,即虽然微调过程中未减少特定启发式的使用,但也未使其变得更糟。
链接: https://arxiv.org/abs/2502.07164
作者: Lingze Zhang,Ellie Pavlick
机构: Brown University (布朗大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain “blindspots” by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our “blindspot” task. Our goal is to determine whether performance disparities between the general and blind spot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with synthetic data doesn’t necessarily reduce the use of the heuristic, it also does not make it worse as we hypothesized.
zh
[NLP-56] Ask Patients with Patience: Enabling LLM s for Human-Centric Medical Dialogue with Grounded Reasoning
【速读】: 该论文旨在解决在线医疗咨询中当前大型语言模型(LLMs)诊断准确性与效率不足的问题。这些问题包括单轮交互限制、缺乏通过后续提问来改进预测的能力,以及响应中包含复杂的医学术语,导致非专业用户的理解障碍。为了解决这些问题,论文提出了一种名为Ask Patients with Patience (APP) 的多轮对话系统。APP的关键在于通过基于实证推理的迭代诊断改进和结合医学指南及熵最小化策略,从而提高诊断的准确性和效率。此外,APP通过以用户为中心的沟通方式,显著提升了用户对复杂医学术语的理解和参与度。
链接: https://arxiv.org/abs/2502.07143
作者: Jiayuan Zhu,Junde Wu
机构: University of Oxford (牛津大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Accurate and efficient diagnosis in online medical consultations remains a challenge for current large language models. These models often rely on single-turn interactions and lack the ability to refine their predictions through follow-up questions. Additionally, their responses frequently contain complex medical terminology, making them less accessible to non-medical users and creating barriers to effective communication. In this paper, we introduce Ask Patients with Patience (APP), the first multi-turn dialogue that enables LLMs to iteratively refine diagnoses based on grounded reasoning. By integrating medical guidelines and entropy minimization, APP improves both diagnostic accuracy and efficiency. Furthermore, it features human-centric communication that bridges the gap between user comprehension and medical terminology, significantly enhancing user accessibility and engagement. We evaluated APP using a subset of the ReMeDi dataset, comparing it with single-turn and traditional multi-turn LLM baselines. APP achieved higher similarity scores in diagnosis predictions, demonstrating better alignment with ground truth diagnoses. Entropy analysis showed that APP reduces diagnostic uncertainty more rapidly across iterations, increasing confidence in its predictions. APP also excels in user accessibility and empathy, further bridging the gap between complex medical language and user understanding. Code will be released at: this https URL.
zh
[NLP-57] Language-TPP: Integrating Temporal Point Processes with Language Models for Event Analysis
【速读】: 该论文旨在解决事件序列建模中文本描述与时间动态难以有效结合的问题。解决方案的关键在于引入Language-TPP框架,通过一种新颖的时间编码机制将连续时间间隔转换为专用的字节标记,从而实现与标准大语言模型(LLMs)架构的无缝集成。这种方法使得Language-TPP在多个任务上实现了最先进的性能,包括事件时间预测、类型预测和强度估计。
链接: https://arxiv.org/abs/2502.07139
作者: Quyu Kong,Yixuan Zhang,Yang Liu,Panrong Tong,Enqi Liu,Feng Zhou
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Temporal Point Processes (TPPs) have been widely used for event sequence modeling, but they often struggle to incorporate rich textual event descriptions effectively. Conversely, while Large Language Models (LLMs) have been shown remarkable capabilities in processing textual data, they lack mechanisms for handling temporal dynamics. To bridge this gap, we introduce Language-TPP, a unified framework that integrates TPPs with LLMs for enhanced event sequence modeling. Language-TPP introduces a novel temporal encoding mechanism that converts continuous time intervals into specialized byte-tokens, enabling seamless integration with standard LLM architectures. This approach allows Language-TPP to achieve state-of-the-art performance across multiple TPP tasks, including event time prediction, type prediction, and intensity estimation, on five datasets. Additionally, we demonstrate that incorporating temporal information significantly improves the quality of generated event descriptions.
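时间编码机制的一个可能形态是:把连续时间间隔映射到对数间隔的 256 个区间,每个区间对应一个专用 token。以下草图按此假设实现(论文仅说明"将连续时间间隔转换为专用字节标记",具体量化方案为笔者推测)。

```python
import math

def interval_to_byte_token(dt, dt_min=1e-3, dt_max=1e4, n_tokens=256):
    """Sketch: quantize a continuous inter-event time onto a log-spaced grid
    of 256 bins, each mapped to a dedicated token id."""
    dt = min(max(dt, dt_min), dt_max)  # clip to the representable range
    pos = (math.log(dt) - math.log(dt_min)) / (math.log(dt_max) - math.log(dt_min))
    return f"<time_{int(pos * (n_tokens - 1))}>"

for dt in (0.05, 1.0, 3600.0):
    print(dt, "->", interval_to_byte_token(dt))
```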
zh
[NLP-58] Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content
【速读】: 该论文旨在解决多模态仇恨内容检测中的跨模态融合问题。研究发现现有方法在单一模态下表现良好,但在不同模态组合(如视频和图像)上的综合效果尚不明确。关键在于通过系统分析和详尽的消融研究,揭示当前融合方法难以捕捉复杂的跨模态交互作用,尤其是在存在良性干扰因素的情况下。研究表明,虽然简单的嵌入融合在视频内容上(HateMM数据集)取得了显著效果(F1分数提升9.9%),但在处理图像与文本复杂关系的梗图(Hateful Memes数据集)时则表现出局限性。论文强调了开发更稳健的仇恨检测系统的需求,并提出了针对不同模态架构的具体考虑。
链接: https://arxiv.org/abs/2502.07138
作者: Girish A. Koushik,Diptesh Kanojia,Helen Treharne
机构: NICE Research, University of Surrey (萨里大学), Guildford, United Kingdom; Surrey Centre for Cyber Security (萨里网络安全中心), University of Surrey (萨里大学), Guildford, United Kingdom
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to the MM4SG Workshop at the WebConf 2025
点击查看摘要
Abstract:Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9 percentage-point F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at this https URL.
zh
[NLP-59] TWICE: What Advantages Can Low-Resource Domain-Specific Embedding Model Bring? - A Case Study on Korea Financial Texts ICLR
【速读】: 该论文旨在解决现有嵌入模型基准测试在低资源语言环境下的局限性问题,特别是针对韩国金融领域的文本分析。论文的关键在于引入KorFinMTEB这一新型基准测试集,以反映低资源语言特有的文化特征。实验结果显示,虽然模型在翻译版本的FinMTEB上表现稳健,但在KorFinMTEB上的表现揭示了需要更深层次语义理解的任务中存在细微但重要的差异,这突显了直接翻译基准测试的局限性。因此,论文强调了开发融入语言特异性和文化细微差别的领域特定评估框架的重要性,以便更准确地评估和推动低资源环境下嵌入模型的进步。
链接: https://arxiv.org/abs/2502.07131
作者: Yewon Hwang,Sungbum Jung,Hanwool Lee,Sara Yu
机构: ModuLabs; Brian Impact; NCSOFT; Shinhan Securities Co; KT
类目: Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注: Submitted to ICLR@Financial AI
点击查看摘要
Abstract:Domain specificity of embedding models is critical for effective performance. However, existing benchmarks, such as FinMTEB, are primarily designed for high-resource languages, leaving low-resource settings, such as Korean, under-explored. Directly translating established English benchmarks often fails to capture the linguistic and cultural nuances present in low-resource domains. In this paper, titled TWICE: What Advantages Can Low-Resource Domain-Specific Embedding Models Bring? A Case Study on Korea Financial Texts, we introduce KorFinMTEB, a novel benchmark for the Korean financial domain, specifically tailored to reflect its unique cultural characteristics in low-resource languages. Our experimental results reveal that while the models perform robustly on a translated version of FinMTEB, their performance on KorFinMTEB uncovers subtle yet critical discrepancies, especially in tasks requiring deeper semantic understanding, that underscore the limitations of direct translation. This discrepancy highlights the necessity of benchmarks that incorporate language-specific idiosyncrasies and cultural nuances. The insights from our study advocate for the development of domain-specific evaluation frameworks that can more accurately assess and drive the progress of embedding models in low-resource settings.
zh
[NLP-60] Cardiverse: Harnessing LLM s for Novel Card Game Prototyping
【速读】: 该论文旨在解决计算机游戏,特别是纸牌游戏原型设计过程中大量的人力创意构思和游戏玩法评估需求。解决方案的关键在于提出了一套全面的自动化纸牌游戏原型设计框架,其中包括基于图的索引方法以生成新颖的游戏设计,由大语言模型驱动的系统以生成一致的游戏代码并通过游戏记录验证,以及通过自对弈优化的大语言模型生成的动作-价值函数集成方法来构建游戏玩法AI。这些贡献旨在加速纸牌游戏原型设计过程,减少人力劳动,并降低游戏开发者的入门门槛。
链接: https://arxiv.org/abs/2502.07128
作者: Danrui Li,Sen Zhang,Sam S. Sohn,Kaidong Hu,Muhammad Usman,Mubbasir Kapadia
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 13 pages, 7 figures, 3 tables
点击查看摘要
Abstract:The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game designs, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated action-value functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers.
zh
[NLP-61] Structural Reformation of Large Language Model Neuron Encapsulation for Divergent Information Aggregation
【速读】: 该论文旨在解决深度学习架构中信息聚合与专业化不足的问题。解决方案的关键在于引入了一种结构化神经元封装的模块化框架,通过这种框架,模型展示了改进的困惑度分数、更大的词汇变异性以及增强的逻辑推理一致性。这表明结构化的参数分布有助于更高效的语言表示。关键机制在于封装的神经元承担了专门的处理角色,从而提高了语言生成的适应性,并减少了模块化架构中的内部冲突。尽管处理开销略有增加,但参数效率和结构化决策的提升弥补了这一复杂性。
链接: https://arxiv.org/abs/2502.07124
作者: Denis Bakushev,Gideon Boultinghouse,Harriet Oppenheimer,Sebastian Gillingwater,Valentina Ashington,Wilfred Stanborough
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Structured neuron encapsulation introduces a modular framework that enables more effective aggregation and specialization of information within deep learning architectures. A model modified through this framework demonstrated improved perplexity scores, greater lexical variability, and enhanced consistency in logical reasoning, suggesting that structured parameter distribution contributes to more efficient language representation. Statistical analyses of generated text highlighted a wider range of sentence structures and reduced redundancy in token selection, indicating that encapsulation fosters more adaptable language generation. A detailed evaluation of attention weight distributions revealed that the experimental model exhibited greater divergence in cross-layer activations, supporting the hypothesis that encapsulated neurons assume specialized processing roles. Logical consistency assessments further demonstrated that modular architectures mitigate contradictory outputs, reducing internal conflicts in inferred relationships between linguistic constructs. Computational trade-offs were analyzed, with results showing a minor increase in processing overhead, though improvements in parameter efficiency and structured decision-making compensated for the additional complexity. The mathematical formulation of the encapsulation mechanism confirmed that modular aggregation maintains stable convergence properties while promoting distinct functional roles for different neuron clusters.
zh
[NLP-62] SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation
【速读】: 该论文旨在解决序列分类任务中敏感性计算成本过高的问题:现有框架的时间复杂度随输入规模呈指数级增长。论文的关键解决方案是引入基于灵敏度的多臂老虎机框架(Sensitivity-based Multi-Armed Bandit, SMAB),为任意数据集上的底层文本分类器提供一种可扩展的方法,用于计算词级别的局部(句子级)和全局(聚合)灵敏度。论文通过多种应用和实验验证了该方法的有效性。
链接: https://arxiv.org/abs/2502.07101
作者: Saurabh Kumar Pandey,Sachin Vashistha,Debrup Das,Somak Aditya,Monojit Choudhury
机构: MBZUAI; Indian Institute of Technology, Kharagpur; University of Massachusetts Amherst
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:To understand the complexity of sequence classification tasks, Hahn et al. (2021) proposed sensitivity as the number of disjoint subsets of the input sequence that can each be individually changed to change the output. Though effective, calculating sensitivity at scale using this framework is costly because of exponential time complexity. Therefore, we introduce a Sensitivity-based Multi-Armed Bandit framework (SMAB), which provides a scalable approach for calculating word-level local (sentence-level) and global (aggregated) sensitivities concerning an underlying text classifier for any dataset. We establish the effectiveness of our approach through various applications. We perform a case study on CHECKLIST generated sentiment analysis dataset where we show that our algorithm indeed captures intuitively high and low-sensitive words. Through experiments on multiple tasks and languages, we show that sensitivity can serve as a proxy for accuracy in the absence of gold data. Lastly, we show that guiding perturbation prompts using sensitivity values in adversarial example generation improves attack success rate by 15.58%, whereas using sensitivity as an additional reward in adversarial paraphrase generation gives a 12.00% improvement over SOTA approaches. Warning: Contains potentially offensive content.
zh
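下面给出一个基于 UCB1 的最小示意,说明"臂=词、奖励=扰动后标签翻转"这一敏感度估计思路(假设性实现,非论文原始代码;classify 与 perturb 为假设的外部接口):

```python
# 简化示意:用多臂老虎机估计词级全局敏感度(假设性实现)
import math
import random

def ucb_word_sensitivity(words, sentences, classify, perturb, budget=1000):
    """classify(text)->label;perturb(sentence, word)->扰动该词后的句子。"""
    pool = {w: [s for s in sentences if w in s.split()] for w in words}
    words = [w for w in words if pool[w]]          # 丢弃语料中未出现的词
    pulls = {w: 0 for w in words}
    flips = {w: 0 for w in words}
    for t in range(1, budget + 1):
        untried = [w for w in words if pulls[w] == 0]
        if untried:                                # 先保证每个臂至少拉一次
            w = random.choice(untried)
        else:                                      # UCB1:均值 + 置信上界
            w = max(words, key=lambda w: flips[w] / pulls[w]
                    + math.sqrt(2 * math.log(t) / pulls[w]))
        s = random.choice(pool[w])
        flips[w] += int(classify(perturb(s, w)) != classify(s))
        pulls[w] += 1
    return {w: flips[w] / pulls[w] for w in words}  # 翻转率即全局敏感度估计
```

相比穷举所有扰动组合,这种按需采样的方式把有限的计算预算集中到最可能高敏感的词上,从而规避指数级开销。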
[NLP-63] Kernels of Selfhood: GPT-4o shows humanlike patterns of cognitive consistency moderated by free choice
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)是否也反映了人类不那么审慎的心理过程。研究通过两个预先注册的研究,检验了GPT-4o在撰写关于俄罗斯领导人普京的正面或负面文章后,其对普京的态度是否表现出与人类认知一致性效应相仿的变化模式。关键在于发现当LLM被提供选择撰写正面或负面文章的幻觉时,态度变化的程度显著增加,这表明GPT-4o展现了一种类似人类自我意识的功能性类比,尽管这种行为背后的机制尚需进一步理解。
链接: https://arxiv.org/abs/2502.07088
作者: Steven A. Lehr,Ketan S. Saichandran,Eddie Harmon-Jones,Nykko Vitali,Mahzarin R. Banaji
机构: Cangrade, Inc.(Cangrade公司); Boston University (波士顿大学); The University of New South Wales (新南威尔士大学); Harvard University (哈佛大学); Harvard University (哈佛大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Main Article: 10 pages, Supporting Information: 61 pages
点击查看摘要
Abstract:Large Language Models (LLMs) show emergent patterns that mimic human cognition. We explore whether they also mirror other, less deliberative human psychological processes. Drawing upon classical theories of cognitive consistency, two preregistered studies tested whether GPT-4o changed its attitudes toward Vladimir Putin in the direction of a positive or negative essay it wrote about the Russian leader. Indeed, GPT displayed patterns of attitude change mimicking cognitive consistency effects in humans. Even more remarkably, the degree of change increased sharply when the LLM was offered an illusion of choice about which essay (positive or negative) to write. This result suggests that GPT-4o manifests a functional analog of humanlike selfhood, although how faithfully the chatbot’s behavior reflects the mechanisms of human attitude change remains to be understood.
zh
[NLP-64] “Once Upon a Time…” Literary Narrative Connectedness Progresses with Grade Level: Potential Impact on Reading Fluency and Literacy Skills
【速读】: 该论文旨在探究学校所用儿童文学文本的叙事复杂性是否随年级演变,正如儿童口头叙事的复杂度会随年龄变化一样。研究的关键在于运用词语复现图分析(Word-Recurrence Graph Analysis)对涵盖13年教育历程的1,627篇文学文本进行分析,结果显示这些文本的连通性呈显著的指数增长,尤其是在最初三年的教育阶段,这一模式与儿童口头叙事复杂性的演变相吻合。
链接: https://arxiv.org/abs/2502.07082
作者: Marina Ribeiro,Bárbara Malcorra,Diego Pintor,Natália Bezerra Mota
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 1 figure
点击查看摘要
Abstract:Selecting an appropriate book is crucial for fostering reading habits in children. While children exhibit varying levels of complexity when generating oral narratives, the question arises: do children’s books also differ in narrative complexity? This study explores the narrative dynamics of literary texts used in schools, focusing on how their complexity evolves across different grade levels. Using Word-Recurrence Graph Analysis, we examined a dataset of 1,627 literary texts spanning 13 years of education. The findings reveal significant exponential growth in connectedness, particularly during the first three years of schooling, mirroring patterns observed in children’s oral narratives. These results highlight the potential of literary texts as a tool to support the development of literacy skills.
zh
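下面用一个最小示意说明"词语复现图连通性"的一种可能度量方式(假设性实现,基于对这类方法的一般理解,图的构建细节与度量选择未必与论文完全一致):

```python
# 简化示意:构建词语复现图并以最大连通片占比作为连通性代理(假设性实现)
import networkx as nx

def recurrence_graph_connectedness(text: str) -> float:
    tokens = text.lower().split()
    g = nx.Graph()
    g.add_nodes_from(set(tokens))
    g.add_edges_from(zip(tokens, tokens[1:]))  # 相邻词连边;词的复现把不同句段连通起来
    if g.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.connected_components(g), key=len)
    return len(largest) / g.number_of_nodes()

print(recurrence_graph_connectedness(
    "once upon a time there was a king and the king had a dream"))
```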
[NLP-65] Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
【速读】: 该论文旨在解决如何系统性评估大型语言模型(LLMs)在多样化和现实场景中的拟人化行为。论文的关键解决方案在于提出了一种多回合评估方法,通过模拟用户交互实现自动化,并通过大规模人类受试者研究来验证所测模型行为与真实用户的拟人化感知之间的关联。这种方法超越了单一回合静态基准测试,从三个维度推进了最先进LLM评估技术的发展。
链接: https://arxiv.org/abs/2502.07077
作者: Lujain Ibrahim,Canfer Akbulut,Rasmi Elasmar,Charvi Rastogi,Minsuk Kahng,Meredith Ringel Morris,Kevin R. McKee,Verena Rieser,Murray Shanahan,Laura Weidinger
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
zh
[NLP-66] IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的偏见问题,特别是毒性问题。论文的关键在于提出了一种名为IRepair的新颖动态切片意图感知修复策略,通过针对性地修复模型中最易出错的部分,而非均匀地处理所有参数。这种方法能够更有效地进行修复,并且对模型整体性能的影响较小。
链接: https://arxiv.org/abs/2502.07072
作者: Sayem Mohammad Imtiaz,Astha Singh,Fraol Batole,Hridesh Rajan
机构: Iowa State University(爱荷华州立大学); Tulane University(图兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted as full research paper at FSE’2025
点击查看摘要
Abstract:Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall performance by altering a smaller portion of the model. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
zh
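下面给出"按错误密度动态选层、只修复最敏感层"这一思路的最小示意(假设性实现,非论文原始代码;假设模型为 GPT-2 风格、层列表位于 model.transformer.h,且以梯度范数作为"错误密度"的代理,loss_fn 针对毒性样本计算损失):

```python
# 简化示意:以各层梯度范数为错误密度代理,只解冻 top 20% 的层进行修复(假设性实现)
import torch

def select_layers_for_repair(model, loss_fn, toxic_batch, top_frac=0.2):
    loss = loss_fn(model, toxic_batch)
    loss.backward()
    layers = list(model.transformer.h)             # 假设 GPT-2 风格的层组织方式
    density = [sum(p.grad.norm().item() for p in layer.parameters()
                   if p.grad is not None) for layer in layers]
    k = max(1, int(top_frac * len(layers)))
    selected = set(sorted(range(len(layers)), key=lambda i: -density[i])[:k])
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in selected        # 其余层冻结,降低对整体性能的扰动
    model.zero_grad()
    return selected
```

注意论文强调的是"动态"切片,即被选中的层会随修复过程更新;上面只展示单次选层这一步。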
[NLP-67] Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations NAACL2025
【速读】: 该论文旨在解决通过模拟群体级调查结果来降低社会科学研究成本和时间的问题。论文的关键解决方案在于专门化大规模语言模型(Large Language Models, LLMs),通过基于首词概率的微调方法来最小化预测与实际调查响应分布之间的偏差。这种方法显著优于其他方法和零样本分类器,即使在未见过的问题、国家和全新的调查中也表现出色。尽管现有模型在处理未见过的问题时仍存在挑战,但研究结果证明了专门化对于提高模拟准确性的重要性,这可能加速未来准确模拟的实现。
链接: https://arxiv.org/abs/2502.07068
作者: Yong Cao,Haijiang Liu,Arnav Arora,Isabelle Augenstein,Paul Röttger,Daniel Hershcovich
机构: University of Tübingen (图宾根大学); Wuhan University of Science and Technology (武汉科技大学); University of Copenhagen (哥本哈根大学); Bocconi University (博科尼大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 9 figures, accepted to NAACL 2025 main
点击查看摘要
Abstract:Large-scale surveys are essential tools for informing social science research and policy, but running surveys is costly and time-intensive. If we could accurately simulate group-level survey results, this would therefore be very valuable to social science research. Prior work has explored the use of large language models (LLMs) for simulating human behaviors, mostly through prompting. In this paper, we are the first to specialize LLMs for the task of simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions for a given question. Then, we show that this method substantially outperforms other methods and zero-shot classifiers, even on unseen questions, countries, and a completely unseen survey. While even our best models struggle with the task, especially on unseen questions, our results demonstrate the benefits of specialization for simulation, which may accelerate progress towards sufficiently accurate simulation in the future.
zh
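"基于首词概率的微调"可以理解为:取模型在提示后第一个生成位置上各答案选项 token 的概率,使其逼近真实调查响应分布。下面是一个最小示意(假设性实现,非论文原始代码;选项 token id 与目标分布均为示例值):

```python
# 简化示意:首 token 概率分布与真实调查分布之间的 KL 散度损失(假设性实现)
import torch
import torch.nn.functional as F

def first_token_kl_loss(logits, option_token_ids, target_dist):
    """logits: [vocab],提示后第一个生成位置的 logits;
    option_token_ids: 各答案选项首 token 的 id;target_dist: 真实响应分布。"""
    option_logits = logits[option_token_ids]         # 只保留选项对应的 logits
    log_pred = F.log_softmax(option_logits, dim=-1)  # 在选项集合上重新归一化
    target = torch.as_tensor(target_dist, dtype=log_pred.dtype)
    return F.kl_div(log_pred, target, reduction="sum")

loss = first_token_kl_loss(torch.randn(50257), [16, 17, 18, 19],
                           [0.1, 0.4, 0.3, 0.2])     # 选项 id 与分布均为示例
```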
[NLP-68] Using Contextually Aligned Online Reviews to Measure LLMs Performance Disparities Across Language Varieties NAACL
【速读】: 该论文旨在解决不同语言变体对自然语言处理(NLP)模型性能的影响问题,特别是大型语言模型(LLMs)在广泛使用的语言变体数据上进行训练时。论文的关键解决方案在于提出了一种新颖且成本效益高的方法,通过国际在线评论平台收集真实场景中的多语言变体数据,构建具有不同语言变体的语料库,并利用这些数据来评估模型在多种语言变体上的表现。具体而言,研究团队构建了一个包含台湾华语和大陆华语文本的上下文对齐数据集,并测试了六种LLMs在情感分析任务中的表现,结果表明LLMs在台湾华语文本上的表现普遍较差。
链接: https://arxiv.org/abs/2502.07058
作者: Zixin Tang,Chieh-Yang Huang,Tsung-Chi Li,Ho Yim Sam Ng,Hen-Hsen Huang,Ting-Hao ‘Kenneth’ Huang
机构: College of Information Sciences and Technology, The Pennsylvania State University(宾夕法尼亚州立大学信息科学与技术学院); MetaMetrics Inc.(MetaMetrics公司); Institute of Information Science, Academia Sinica(中央研究院资讯科学研究所)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted by 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), theme track
点击查看摘要
Abstract:A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as this http URL, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
zh
[NLP-69] Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark
【速读】: 该论文旨在解决自然语言处理(NLP)中针对形态丰富及低资源语言缺乏系统性tokenization评估的问题。关键解决方案在于引入一个新的评估框架,通过五个核心指标(词汇量、token数量、处理时间、目标语言特定token百分比(%TR)以及token纯度)系统地评估不同tokenizer保持语言结构的能力。研究发现,%TR与下游任务性能(如MMLU得分)之间存在较强的相关性,且强于token纯度,凸显了其在提高模型准确性中的作用。这也表明,针对具体语言优化的tokenization策略比单纯增大模型参数规模更为重要。
链接: https://arxiv.org/abs/2502.07057
作者: M. Ali Bayram,Ali Arda Fincan,Ahmet Semih Gümüş,Sercan Karakaş,Banu Diri,Savaş Yıldırım
机构: Yıldız Technical University (伊斯坦布尔技术大学); Yeditepe University (耶迪代佩大学); University of Chicago (芝加哥大学); Istanbul Bilgi University (伊斯坦布尔比尔吉大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models’ (LLMs) ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for systematically evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages. Using a Turkish dataset of 6,200 multiple-choice questions from the Massive Multitask Language Understanding (MMLU) benchmark, the framework assesses tokenizers across five key metrics: vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity. These metrics provide a structured approach to evaluating how well tokenizers preserve linguistic structures. While %TR measures the proportion of valid words in the target language, %Pure assesses the alignment of tokens with meaningful linguistic units, such as roots and valid morphemes, minimizing semantic fragmentation. The findings reveal that %TR, introduced as a critical metric, exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity, emphasizing its role in improving model accuracy. Additionally, larger model parameters do not necessarily yield better tokenization quality or enhanced results, highlighting the importance of tailored tokenization strategies that prioritize linguistic alignment. This framework sets a new standard for developing robust tokenization methods optimized for morphologically complex and low-resource languages. Future work will refine morphological analysis, explore domain-specific customizations, and conduct cross-linguistic evaluations to further enhance tokenization practices.
zh
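下面以 %TR 为例给出指标计算的最小示意(假设性实现,非论文原始代码;tokenizer.tokenize 接口与词典 lexicon 均为假设):

```python
# 简化示意:计算词汇量、token 数量与目标语言特定 token 百分比 %TR(假设性实现)
def tokenizer_metrics(tokenizer, corpus, lexicon):
    """lexicon: 目标语言(如土耳其语)有效词形的集合。"""
    tokens = [t for text in corpus for t in tokenizer.tokenize(text)]
    cleaned = [t.lstrip("##Ġ▁").lower() for t in tokens]  # 去掉常见子词前缀标记
    pct_tr = 100.0 * sum(t in lexicon for t in cleaned) / max(1, len(cleaned))
    return {
        "vocab_size": len(getattr(tokenizer, "vocab", {})),
        "token_count": len(tokens),
        "pct_TR": pct_tr,   # 论文发现该指标与 MMLU 等下游表现相关性最强
    }
```

token 纯度的计算思路类似,只是把"是否为有效词"换成"是否对齐到词根或合法词素"。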
[NLP-70] Scalable and Ethical Insider Threat Detection through Data Synthesis and Analysis by LLM s
【速读】: 该论文旨在解决通过分析公开招聘网站上的匿名评论来检测内部威胁情感的问题。论文的关键解决方案在于利用大规模语言模型(Large Language Models, LLMs)来分析和识别这些评论中的威胁情感,并通过合成数据生成技术克服伦理和后勤方面的数据采集障碍。研究结果表明,LLMs在大多数情况下与人工评估结果一致,尽管在真实人类生成的数据上的表现略低于合成数据。
链接: https://arxiv.org/abs/2502.07045
作者: Haywood Gelman,John D. Hastings
机构: The Beacom College of Computer and Cyber Sciences (计算机与网络科学学院), Dakota State University (达科他州立大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 6 pages, 0 figures, 8 tables
点击查看摘要
Abstract:Insider threats wield an outsized influence on organizations, disproportionate to their small numbers. This is due to the internal access insiders have to systems, information, and infrastructure. Signals for such risks may be found in anonymous submissions to public web-based job search site reviews. This research studies the potential for large language models (LLMs) to analyze and detect insider threat sentiment within job site reviews. Addressing ethical data collection concerns, this research utilizes synthetic data generation using LLMs alongside existing job review datasets. A comparative analysis of sentiment scores generated by LLMs is benchmarked against expert human scoring. Findings reveal that LLMs demonstrate alignment with human evaluations in most cases, thus effectively identifying nuanced indicators of threat sentiment. The performance is lower on human-generated data than synthetic data, suggesting areas for improvement in evaluating real-world data. Text diversity analysis found differences between human-generated and LLM-generated datasets, with synthetic data exhibiting somewhat lower diversity. Overall, the results demonstrate the applicability of LLMs to insider threat detection, and a scalable solution for insider sentiment testing by overcoming ethical and logistical barriers tied to data acquisition.
zh
[NLP-71] Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment NAACL2025
【速读】: 该论文旨在解决不典型发音评估中的音位变体(allophone)建模问题:现有基于音素分类器的方法往往把一个音素的多种语音实现简化为单一音素,忽略了语音环境导致的变异,从而难以区分不典型与典型发音。论文的关键解决方案是提出MixGoP方法,利用高斯混合模型以多个子簇来刻画音素分布,并结合冻结的自监督语音模型(S3M)特征,以更有效地捕捉音位变体。实验结果表明,MixGoP在五个数据集中的四个上达到了最先进的性能。
链接: https://arxiv.org/abs/2502.07029
作者: Kwanghee Choi,Eunjung Yeo,Kalvin Chang,Shinji Watanabe,David Mortensen
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted to NAACL 2025. Codebase available at this https URL
点击查看摘要
Abstract:Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
zh
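MixGoP 的核心可以概括为:为每个音素在 S3M 特征空间上拟合一个多成分的高斯混合模型,再以对数似然作为发音典型度分数。下面是一个最小示意(假设性实现,非论文原始代码;特征提取与数据组织方式均为假设):

```python
# 简化示意:逐音素拟合 GMM 并用平均对数似然打分(假设性实现)
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixgop(frames_by_phoneme, n_components=4):
    """frames_by_phoneme: {音素: [n_frames, dim] 的 S3M 帧特征数组}"""
    return {ph: GaussianMixture(n_components=n_components,
                                covariance_type="diag").fit(x)
            for ph, x in frames_by_phoneme.items()}

def gop_score(models, phoneme, frames):
    # 分数越低,说明该段发音越偏离该音素的典型(含各音位变体子簇)分布
    return float(models[phoneme].score(np.asarray(frames)))
```

多个混合成分正是用来吸收同一音素在不同语音环境下的音位变体差异,而不是把它们压成单一模板。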
[NLP-72] AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements ICLR2025
【速读】: 该论文旨在解决政府监管机构在评估企业现代奴隶制声明时面临的挑战,特别是如何有效识别和区分具体的反奴措施与模糊的声明。关键解决方案在于构建一个包含5,731份来自澳大利亚现代奴隶制登记册的声明数据集,并对其进行逐句标注。通过精心设计的标注规范、声明的选择与预处理以及高质量标注子集的创建,该数据集能够有效地用于模型评估。论文进一步提出了一种机器学习方法,用于检测符合澳大利亚现代奴隶制法案强制报告要求的相关句子,并在此基础上评估不同语言模型在零样本和有监督学习设置下的性能。
链接: https://arxiv.org/abs/2502.07022
作者: Adriana Eufrosiana Bora,Pierre-Luc St-Charles,Mirko Bronzi,Arsène Fansi Tchango,Bruno Rousseau,Kerrie Mengersen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Camera ready. ICLR 2025
点击查看摘要
Abstract:Despite over a decade of legislative efforts to address modern slavery in the supply chains of large corporations, the effectiveness of government oversight remains hampered by the challenge of scrutinizing thousands of statements annually. While Large Language Models (LLMs) can be considered a well established solution for the automatic analysis and summarization of documents, recognizing concrete modern slavery countermeasures taken by companies and differentiating those from vague claims remains a challenging task. To help evaluate and fine-tune LLMs for the assessment of corporate statements, we introduce a dataset composed of 5,731 modern slavery statements taken from the Australian Modern Slavery Register and annotated at the sentence level. This paper details the construction steps for the dataset that include the careful design of annotation specifications, the selection and preprocessing of statements, and the creation of high-quality annotation subsets for effective model evaluations. To demonstrate our dataset’s utility, we propose a machine learning methodology for the detection of sentences relevant to mandatory reporting requirements set by the Australian Modern Slavery Act. We then follow this methodology to benchmark modern language models under zero-shot and supervised learning settings.
zh
[NLP-73] Finding Words Associated with DIF: Predicting Differential Item Functioning using LLM s and Explainable AI
【速读】: 该论文旨在解决利用文本预测项目功能差异(Differential Item Functioning, DIF)的问题。关键解决方案在于微调并比较了几种基于编码器的Transformer大型语言模型(Large Language Models, LLM),并通过可解释人工智能(Explainable Artificial Intelligence, XAI)方法识别与DIF相关的特定词汇。研究结果表明,许多与DIF相关的词汇反映了测试蓝图中故意包含的次要子领域,而非应从评估中移除的无关内容。这种方法可以在项目编写过程中即时修订与DIF相关的词汇,或通过突出文本中的关键词来帮助审查传统的DIF分析结果。
链接: https://arxiv.org/abs/2502.07017
作者: Hotaka Maeda,Yikai Lu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures, 6 tables
点击查看摘要
Abstract:We fine-tuned and compared several encoder-based Transformer large language models (LLM) to predict differential item functioning (DIF) from the item text. We then applied explainable artificial intelligence (XAI) methods to these models to identify specific words associated with DIF. The data included 42,180 items designed for English language arts and mathematics summative state assessments among students in grades 3 to 11. Prediction R^2 ranged from .04 to .32 among eight focal and reference group pairs. Our findings suggest that many words associated with DIF reflect minor sub-domains included in the test blueprint by design, rather than construct-irrelevant item content that should be removed from assessments. This may explain why qualitative reviews of DIF items often yield confusing or inconclusive results. Our approach can be used to screen words associated with DIF during the item-writing process for immediate revision, or help review traditional DIF analysis results by highlighting key words in the text. Extensions of this research can enhance the fairness of assessment programs, especially those that lack resources to build high-quality items, and among smaller subpopulations where we do not have sufficient sample sizes for traditional DIF analyses.
zh
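下面以遮挡法(occlusion)为例,给出"找出与 DIF 相关词汇"的一种 XAI 归因示意(假设性实现;occlusion 只是 XAI 方法的一种,未必与论文实际采用的方法一致,predict_dif 为已微调模型的假设接口):

```python
# 简化示意:逐词移除并观察 DIF 预测变化,得到词级归因(假设性实现)
def word_attributions(item_text, predict_dif):
    """predict_dif(text) -> float:模型预测的 DIF 程度。"""
    words = item_text.split()
    base = predict_dif(item_text)
    scores = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores.append((words[i], base - predict_dif(ablated)))  # 下降越多越关键
    return sorted(scores, key=lambda x: -abs(x[1]))
```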
[NLP-74] Demystifying Singular Defects in Large Language Models
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)中高范数标记(high-norm tokens)的产生机制及其特性,而这些问题在视觉变换器(Vision Transformers, ViTs)中已有数学建模。论文的关键在于通过理论分析与实证验证,揭示层间奇异方向(singular direction)预测高范数标记的突然增加,负特征值(negative eigenvalues)解释其急剧衰减,并指出初始标记与非初始标记在形成高范数标记的计算路径上的差异。此外,论文指出高范数标记是由近似相应模块的矩阵的右主导奇异向量触发的。这些发现不仅增进了对LLMs中奇异缺陷的理解,还提出了量化方案改进和LLM签名设计等实际应用。
链接: https://arxiv.org/abs/2502.07004
作者: Haoqi Wang,Tong Zhang,Mathieu Salzmann
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs and will therefore publicly release our code.
zh
[NLP-75] SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在协作软件工程(Collaborative Software Engineering, CSE)中面临的“不同步”挑战。这种挑战指的是代理的理解与当前环境状态不一致,从而导致行动失败并引发整合问题。为了解决这一问题,论文引入了SyncMind框架,系统性地定义了LLM代理在CSE中所面临的不同步问题,并基于此创建了SyncBench基准测试集,包含来自21个流行GitHub仓库的真实世界CSE场景中的24,332个代理不同步实例。通过SyncBench实验,论文揭示了现有LLM代理的能力和局限性,强调了现有LLM在CSE中合作意愿低下的根本局限性,并指出当合作发生时,其成功恢复不同步与合作意愿呈正相关。此外,实验还显示代理在资源感知不同步恢复方面的性能差异较小,进一步暴露了它们在资源意识和适应性方面的显著不足,为未来构建资源高效的协作系统提供了方向。
链接: https://arxiv.org/abs/2502.06994
作者: Xuehang Guo,Xingyao Wang,Yangyi Chen,Sha Li,Chi Han,Manling Li,Heng Ji
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants – whether humans or AI agents – to stay on the same page as their environment evolves. When a collaborator’s understanding diverges from the current state – what we term the out-of-sync challenge – the collaborator’s actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents’ capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent = 3.33% to Claude-3.5-Sonnet = 28.18%), their consistently low collaboration willingness (= 4.86%) suggests fundamental limitations of existing LLM in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents’ resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: this https URL.
zh
[NLP-76] Investigating the Zone of Proximal Development of Language Models for In-Context Learning NAACL2025
【速读】: 该论文旨在通过学习分析框架探究大型语言模型(Large Language Models, LLMs)在情境学习(In-Context Learning, ICL)中的行为,并将其与近侧发展区(Zone of Proximal Development, ZPD)的概念相结合。论文的关键在于提出了一种新的方法来测量和预测LLMs的ZPD分布,基于模型在有无ICL情况下的个体示例性能。通过这种方法,论文展示了如何利用预测的ZPD来优化ICL的应用,以实现推理成本与性能之间的更好平衡,并提出了一个人类式课程用于模型微调,从而提高了模型性能。这一解决方案的核心在于有效地将教育心理学中的ZPD概念应用于LLMs,从而提供新的见解并增强其应用效能。
链接: https://arxiv.org/abs/2502.06990
作者: Peng Cui,Mrinmaya Sachan
机构: 未知
类目: Computation and Language (cs.CL)
备注: NAACL 2025 findings
点击查看摘要
Abstract:In this paper, we introduce a learning analytics framework to analyze the in-context learning (ICL) behavior of large language models (LLMs) through the lens of the Zone of Proximal Development (ZPD), an established theory in educational psychology. ZPD delineates the space between what a learner is capable of doing unsupported and what the learner cannot do even with support. We adapt this concept to ICL, measuring the ZPD of LLMs based on model performance on individual examples with and without ICL. Furthermore, we propose an item response theory (IRT) model to predict the distribution of zones for LLMs. Our findings reveal a series of intricate and multifaceted behaviors of ICL, providing new insights into understanding and leveraging this technique. Finally, we demonstrate how our framework can enhance LLM in both inference and fine-tuning scenarios: (1) By predicting a model’s zone of proximal development, we selectively apply ICL to queries that are most likely to benefit from demonstrations, achieving a better balance between inference cost and performance; (2) We propose a human-like curriculum for fine-tuning, which prioritizes examples within the model’s ZPD. The curriculum results in improved performance, and we explain its effectiveness through an analysis of the training dynamics of LLMs.
zh
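按照 ZPD 的定义,每个样例可依据"无 ICL 能否答对、有 ICL 能否答对"划分到三个区域;下面是一个最小示意(假设性实现,非论文原始代码,区域命名为示意):

```python
# 简化示意:按有/无 ICL 的正确性划分 ZPD 区域,并据此选择性启用 ICL(假设性实现)
def zpd_zone(correct_without_icl: bool, correct_with_icl: bool) -> str:
    if correct_without_icl:
        return "independent"   # 无需示例即可完成
    if correct_with_icl:
        return "zpd"           # 近侧发展区:只有借助 ICL 才能完成
    return "beyond"            # 即使提供示例也无法完成

results = [("q1", True, True), ("q2", False, True), ("q3", False, False)]
use_icl = [q for q, a, b in results if zpd_zone(a, b) == "zpd"]
print(use_icl)  # 只对最可能从示例获益的查询付出 ICL 的额外推理成本
```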
[NLP-77] Neighborhood-Order Learning Graph Attention Network for Fake News Detection
【速读】: 该论文旨在解决假新闻检测中的关键挑战,特别是在社交网络和在线通信网络日益普及的背景下。传统图神经网络(Graph Neural Networks, GNN)在处理此类问题时存在一个主要局限性,即无法有效利用超过网络层深度的邻居信息,这可能降低模型的准确性和有效性。为了解决这一问题,论文提出了一种名为邻序学习图注意力网络(Neighborhood-Order Learning Graph Attention Network, NOL-GAT)的新模型。该模型的关键在于每个节点能够在每一层独立学习其最优邻序,从而能够有针对性且高效地从远距离邻居中提取重要信息。NOL-GAT架构包括两个主要部分:Hop网络用于确定最优邻序,嵌入网络则使用这些最优邻序更新节点嵌入。实验结果表明,NOL-GAT在准确性、F1分数等指标上显著优于基线模型,尤其是在标注数据有限的情况下。
链接: https://arxiv.org/abs/2502.06927
作者: Batool Lakzaei,Mostafa Haghir Chehreghani,Alireza Bagheri
机构: AUT(阿米尔·卡比尔工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 37 pages
点击查看摘要
Abstract:Fake news detection is a significant challenge in the digital age, which has become increasingly important with the proliferation of social media and online communication networks. Graph Neural Networks (GNN)-based methods have shown high potential in analyzing graph-structured data for this problem. However, a major limitation in conventional GNN architectures is their inability to effectively utilize information from neighbors beyond the network’s layer depth, which can reduce the model’s accuracy and effectiveness. In this paper, we propose a novel model called Neighborhood-Order Learning Graph Attention Network (NOL-GAT) for fake news detection. This model allows each node in each layer to independently learn its optimal neighborhood order. By doing so, the model can purposefully and efficiently extract critical information from distant neighbors. The NOL-GAT architecture consists of two main components: a Hop Network that determines the optimal neighborhood order and an Embedding Network that updates node embeddings using these optimal neighborhoods. To evaluate the model’s performance, experiments are conducted on various fake news datasets. Results demonstrate that NOL-GAT significantly outperforms baseline models in metrics such as accuracy and F1-score, particularly in scenarios with limited labeled data. Features such as mitigating the over-squashing problem, improving information flow, and reducing computational complexity further highlight the advantages of the proposed model.
zh
[NLP-78] Synthetic Audio Helps for Cognitive State Tasks
【速读】: 该论文旨在解决单一文本模态在认知状态建模任务中的局限性,通过引入音频信号来补充文本信息。论文的关键在于提出了一种名为Synthetic Audio Data fine-tuning (SAD)的框架,该框架利用从现成的文本转语音(Text-to-Speech, TTS)系统生成的合成音频数据,与文本数据进行多模态训练。实验结果表明,在七项认知状态建模任务中,这种多模态训练方法相较于仅使用文本数据能够带来显著的性能提升,甚至在包含真实音频数据的任务中,SAD框架的表现也具有竞争力。
链接: https://arxiv.org/abs/2502.06922
作者: Adil Soubki,John Murzaku,Peter Zeng,Owen Rambow
机构: Stony Brook University (石溪大学)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: John Murzaku and Adil Soubki contributed equally to this work
点击查看摘要
Abstract:The NLP community has broadly focused on text-only approaches of cognitive state tasks, but audio can provide vital missing cues through prosody. We posit that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework where we show that 7 tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show our SAD framework achieves competitive performance with text and synthetic audio compared to text and gold audio.
zh
[NLP-79] Emergence of Episodic Memory in Transformers: Characterizing Changes in Temporal Structure of Attention Scores During Training
【速读】: 该论文旨在探究情境内时间偏差在注意力头及Transformer输出中的表现。研究通过认知科学方法分析了不同规模的GPT-2模型的注意力得分和输出,发现注意力头表现出类似人类情景记忆的特性,包括时间邻近性、首因效应和近因效应。Transformer输出则显示出情境内序列回忆的趋势。关键在于,这种效应在消融归纳头(induction heads)后被消除,而归纳头正是产生邻近效应的主要驱动因素。因此,该研究揭示了Transformer在情境学习中组织时间信息的方式,并阐明了其与人类记忆和学习的异同。
链接: https://arxiv.org/abs/2502.06902
作者: Deven Mahesh Mistry,Anooshka Bajaj,Yash Aggarwal,Sahaj Singh Maini,Zoran Tiganj
机构: Department of Computer Science (计算机科学系); Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We investigate in-context temporal biases in attention heads and transformer outputs. Using cognitive science methodologies, we analyze attention scores and outputs of the GPT-2 models of varying sizes. Across attention heads, we observe effects characteristic of human episodic memory, including temporal contiguity, primacy and recency. Transformer outputs demonstrate a tendency toward in-context serial recall. Importantly, this effect is eliminated after the ablation of the induction heads, which are the driving force behind the contiguity effect. Our findings offer insights into how transformers organize information temporally during in-context learning, shedding light on their similarities and differences with human memory and learning.
zh
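检验注意力中的"时间邻近效应",一种常见做法是把注意力得分按相对位置(lag)汇总;下面是一个最小示意(假设性实现,非论文原始代码):

```python
# 简化示意:按相对位置汇总因果注意力得分,观察邻近/首因/近因模式(假设性实现)
import numpy as np

def attention_lag_profile(attn):
    """attn: [seq, seq] 的因果注意力矩阵(行为 query,下三角有效)。"""
    n = attn.shape[0]
    return {-lag: float(np.mean([attn[i, i - lag] for i in range(lag, n)]))
            for lag in range(1, n)}

attn = np.tril(np.random.rand(8, 8))
attn = attn / attn.sum(axis=1, keepdims=True)
profile = attention_lag_profile(attn)
print(profile[-1])  # 若 lag=-1 附近显著更高,即呈现时间邻近效应
```

首因/近因效应则可改为按绝对位置(序列开头或末尾若干 token)汇总注意力得分来观察。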
[NLP-80] Enabling Autoregressive Models to Fill In Masked Tokens
【速读】: 该论文旨在解决两类互补的缺陷:自回归(Autoregressive, AR)模型本质上无法进行掩码填充(即在给定前后文的情况下预测被掩码的token),而掩码语言模型(Masked Language Modeling, MLM)在训练和推理过程中均存在计算效率低下的问题。论文的关键解决方案是引入MARIA(Masked and Autoregressive Infilling Architecture,掩码与自回归填充架构),将预训练的MLM与AR模型结合,训练一个以两者隐藏状态拼接为输入的线性解码器,从而使AR模型能够执行掩码填充任务,同时保留其借助KV缓存实现快速推理的优势。
链接: https://arxiv.org/abs/2502.06901
作者: Daniel Israel,Aditya Grover,Guy Van den Broeck
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
zh
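MARIA 的改动很小:在冻结的 MLM 与 AR 模型之上,只训练一个接收二者隐藏状态拼接的线性解码器。下面是该解码头的最小示意(假设性实现,非论文原始代码;隐藏维度与词表大小均为示例):

```python
# 简化示意:MARIA 式线性解码头,拼接 MLM 与 AR 的隐藏状态后预测被掩码 token(假设性实现)
import torch
import torch.nn as nn

class MariaHead(nn.Module):
    def __init__(self, d_mlm: int, d_ar: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(d_mlm + d_ar, vocab_size)

    def forward(self, h_mlm, h_ar):
        # h_mlm / h_ar: [batch, seq, d],分别来自冻结的 MLM 与 AR 模型
        return self.decoder(torch.cat([h_mlm, h_ar], dim=-1))

head = MariaHead(d_mlm=768, d_ar=768, vocab_size=50257)
logits = head(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 16, 50257])
```

由于 AR 模型本体不变,推理时仍可沿用 KV 缓存,这正是 MARIA 保留快速推理优势的原因。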
[NLP-81] A New Hybrid Intelligent Approach for Multimodal Detection of Suspected Disinformation on TikTok
【速读】: 该论文旨在解决在多媒体内容快速传播背景下,如何有效识别TikTok等社交平台上的虚假信息。解决方案的关键在于提出了一种混合框架,该框架结合了深度学习的计算能力与模糊逻辑的可解释性,以检测TikTok视频中的疑似虚假信息。该框架包含两个核心组件:多模态特征分析器,用于提取和评估文本、音频和视频数据;以及基于模糊逻辑的多模态虚假信息检测器。这两个系统共同运作,通过分析人类行为线索(如肢体语言、语音模式和文本连贯性)来评估虚假信息传播的嫌疑,并生成高质量、全面且结构良好的报告。
链接: https://arxiv.org/abs/2502.06893
作者: Jared D.T. Guerrero-Sosa,Andres Montoro-Montarroso,Francisco P. Romero,Jesus Serrano-Guerrero,Jose A. Olivas
机构: University of Castilla-La Mancha (卡斯蒂利亚-拉曼查大学), Spain
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM); Symbolic Computation (cs.SC)
备注:
点击查看摘要
Abstract:In the context of the rapid dissemination of multimedia content, identifying disinformation on social media platforms such as TikTok represents a significant challenge. This study introduces a hybrid framework that combines the computational power of deep learning with the interpretability of fuzzy logic to detect suspected disinformation in TikTok videos. The methodology is comprised of two core components: a multimodal feature analyser that extracts and evaluates data from text, audio, and video; and a multimodal disinformation detector based on fuzzy logic. These systems operate in conjunction to evaluate the suspicion of spreading disinformation, drawing on human behavioural cues such as body language, speech patterns, and text coherence. Two experiments were conducted: one focusing on context-specific disinformation and the other on the scalability of the model across broader topics. For each video evaluated, high-quality, comprehensive, well-structured reports are generated, providing a detailed view of the disinformation behaviours.
zh
[NLP-82] Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction NAACL2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理交互式法律场景时因情景数据匮乏而受阻的问题。解决方案的关键在于引入了多智能体法律仿真驱动器(MASER),通过模拟交互式法律场景来规模化生成合成数据,同时确保参与者之间的法律属性一致性,并引入监管机制以协调参与者的行为和角色,从而有效应对干扰。此外,构建了多阶段交互式法律评估(MILE)基准以评估LLMs在动态法律场景中的表现。
链接: https://arxiv.org/abs/2502.06882
作者: Shengbin Yue,Ting Huang,Zheng Jia,Siyuan Wang,Shujun Liu,Yun Song,Xuanjing Huang,Zhongyu Wei
机构: Fudan University (复旦大学); University of Southern California (南加州大学); Northwest University of Political and Law (西北政法大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by NAACL 2025
点击查看摘要
Abstract:Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes the progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real-legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants’ characters and behaviors as well as addressing distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs’ performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.
zh
[NLP-83] Mix Data or Merge Models? Balancing the Helpfulness Honesty and Harmlessness of Large Language Model via Model Merging
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在有益性(Helpfulness)、诚实性(Honesty)和无害性(Harmlessness)方面的平衡对齐问题(3H优化)。现有方法如数据混合策略存在依赖专家知识和优化信号冲突等局限性。论文的关键解决方案是提出了一种重加权增强的任务奇异向量合并方法(R-TSVM),该方法通过引入考虑异常值的参数加权和适应稀疏性的秩选择策略,改进了LLMs在多个评估中的对齐效果。研究表明,模型合并方法在平衡对齐权衡方面优于数据混合方法,并强调了通过冗余组件剪枝和异常值缓解进行参数级冲突解决的重要性。
链接: https://arxiv.org/abs/2502.06876
作者: Jinluan Yang,Dingnan Jin,Anke Tang,Li Shen,Didi Zhu,Zhengyu Chen,Daixin Wang,Qing Cui,Zhiqiang Zhang,Jun Zhou,Fei Wu,Kun Kuang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review
点击查看摘要
Abstract:Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI, with existing methods like data mixture strategies facing limitations including reliance on expert knowledge and conflicting optimization signals. While model merging offers a promising alternative by integrating specialized models, its potential for 3H optimization remains underexplored. This paper establishes the first comprehensive benchmark for model merging in 3H-aligned LLMs, systematically evaluating 15 methods (12 training-free merging and 3 data mixture techniques) across 10 datasets associated with 5 annotation dimensions, 2 LLM families, and 2 training paradigms. Our analysis reveals three pivotal insights: (i) previously overlooked collaborative/conflicting relationships among 3H dimensions, (ii) the consistent superiority of model merging over data mixture approaches in balancing alignment trade-offs, and (iii) the critical role of parameter-level conflict resolution through redundant component pruning and outlier mitigation. Building on these findings, we propose R-TSVM, a Reweighting-enhanced Task Singular Vector Merging method that incorporates outlier-aware parameter weighting and sparsity-adaptive rank selection strategies adapted to the heavy-tailed parameter distribution and sparsity for LLMs, further improving LLM alignment across multiple evaluations. Our models will be available at this https URL.
zh
[NLP-84] Beyond Vision: How Large Language Models Interpret Facial Expressions from Valence-Arousal Values
【速读】: 该论文旨在解决大型语言模型(LLMs)难以从面部表情的维度(Valence和Arousal值)中推断情感意义的问题。关键在于探索LLMs是否能够通过处理结构化的数值表示(VA值),而非原始视觉输入,来分类基本及复杂情绪,并生成面部表情的语义描述。实验结果表明,尽管LLMs在将VA值分类到离散的情绪类别方面存在困难,特别是在超出基本情感极性的情绪识别上,但在生成面部表情的自由文本情感推断方面表现出色,其生成的文本描述与人类生成的解释高度一致。
链接: https://arxiv.org/abs/2502.06875
作者: Vaibhav Mehra,Guy Laban,Hatice Gunes
机构: HEI-Lab, Universidade Lusófona (HEI-Lab, 卢索福纳大学); Department of Computer Science and Technology, University of Cambridge (计算机科学与技术系, 剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models primarily operate through text-based inputs and outputs, yet human emotion is communicated through both verbal and non-verbal cues, including facial expressions. While Vision-Language Models analyze facial expressions from images, they are resource-intensive and may depend more on linguistic priors than visual understanding. To address this, this study investigates whether LLMs can infer affective meaning from dimensions of facial expressions-Valence and Arousal values, structured numerical representations, rather than using raw visual input. VA values were extracted using Facechannel from images of facial expressions and provided to LLMs in two tasks: (1) categorizing facial expressions into basic (on the IIMI dataset) and complex emotions (on the Emotic dataset) and (2) generating semantic descriptions of facial expressions (on the Emotic dataset). Results from the categorization task indicate that LLMs struggle to classify VA values into discrete emotion categories, particularly for emotions beyond basic polarities (e.g., happiness, sadness). However, in the semantic description task, LLMs produced textual descriptions that align closely with human-generated interpretations, demonstrating a stronger capacity for free text affective inference of facial expressions.
zh
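该研究的输入并非图像,而是把 VA 数值写进提示词。下面是一个最小示意(假设性实现;数值范围按常见约定取 -1 到 +1,实际范围应以 Facechannel 的输出为准,提示词措辞也是假设):

```python
# 简化示意:把 Valence-Arousal 数值嵌入提示词,请 LLM 做自由文本情感推断(假设性实现)
def build_va_prompt(valence: float, arousal: float) -> str:
    return (
        "A facial expression has been encoded as two affect dimensions.\n"
        f"Valence (negative -1 .. +1 positive): {valence:+.2f}\n"
        f"Arousal (calm -1 .. +1 excited): {arousal:+.2f}\n"
        "In one sentence, describe the emotional state this expression most likely conveys."
    )

print(build_va_prompt(0.65, 0.80))  # 预期得到接近"兴奋、喜悦"的描述
```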
[NLP-85] Group Reasoning Emission Estimation Networks
【速读】: 该论文旨在解决中小企业在温室气体(GHG)排放报告中的高实施成本、分散的排放因子数据库以及缺乏稳健的行业分类方法等问题。论文的关键解决方案是引入了Group Reasoning Emission Estimation Networks (GREEN),这是一个基于AI的碳核算框架,通过标准化企业级排放估算、构建大规模基准数据集,并采用与大型语言模型(LLM)相结合的新推理方法来应对上述挑战。具体而言,GREEN通过将行业分类重新定义为信息检索任务,并利用对比学习损失微调Sentence-BERT模型,从而实现更精确的分类。此外,通过提出一种基于自然NAICS本体的分组推理方法,GREEN能够分解任务为多个子分类步骤,以此降低分类不确定性及计算复杂度。
链接: https://arxiv.org/abs/2502.06874
作者: Yanming Guo,Xiao Qian,Kevin Credit,Jin Ma
机构: University of Sydney(悉尼大学); Maynooth University(梅努斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Accurate greenhouse gas (GHG) emission reporting is critical for governments, businesses, and investors. However, adoption remains limited particularly among small and medium enterprises due to high implementation costs, fragmented emission factor databases, and a lack of robust sector classification methods. To address these challenges, we introduce Group Reasoning Emission Estimation Networks (GREEN), an AI-driven carbon accounting framework that standardizes enterprise-level emission estimation, constructs a large-scale benchmark dataset, and leverages a novel reasoning approach with large language models (LLMs). Specifically, we compile textual descriptions for 20,850 companies with validated North American Industry Classification System (NAICS) labels and align these with an economic model of carbon intensity factors. By reframing sector classification as an information retrieval task, we fine-tune Sentence-BERT models using a contrastive learning loss. To overcome the limitations of single-stage models in handling thousands of hierarchical categories, we propose a Group Reasoning method that ensembles LLM classifiers based on the natural NAICS ontology, decomposing the task into multiple sub-classification steps. We theoretically prove that this approach reduces classification uncertainty and computational complexity. Experiments on 1,114 NAICS categories yield state-of-the-art performance (83.68% Top-1, 91.47% Top-10 accuracy), and case studies on 20 companies report a mean absolute percentage error (MAPE) of 45.88%. The project is available at: this https URL.
zh
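"把行业分类重构为信息检索任务"的基本形态,是用句向量为公司描述检索最相近的 NAICS 类目描述。下面是一个最小示意(假设性实现,非论文原始代码;模型名与类目数据均为示例,论文实际使用的是经对比学习微调后的 Sentence-BERT):

```python
# 简化示意:用句向量检索最相近的 NAICS 类目(假设性实现)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 示例模型,未经论文式微调
naics = {"541511": "Custom computer programming services",
         "236115": "New single-family general contractors"}
codes, descs = zip(*naics.items())
corpus_emb = model.encode(list(descs), convert_to_tensor=True)

def classify_company(description: str, top_k: int = 1):
    query_emb = model.encode(description, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [(codes[h["corpus_id"]], h["score"]) for h in hits]

print(classify_company("We build web and mobile software for business clients."))
```

分组推理(Group Reasoning)则在检索之上,沿 NAICS 的层级本体让 LLM 分多步逐级缩小候选类目范围。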
[NLP-86] Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning NAACL2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在支持认知重构疗法时主要依赖文本而忽视非言语证据的问题。为缓解这一差距,论文引入了多模态方法,将视觉线索纳入其中。关键解决方案在于提出一个新的数据集Multi Modal-Cognitive Support Conversation (M2CoSC),该数据集结合了GPT-4生成的对话与反映虚拟客户面部表情的图像,并提出了一种多跳心理治疗推理方法,以明确识别和整合微妙的非言语证据。
链接: https://arxiv.org/abs/2502.06873
作者: Subin Kim,Hoonrae Kim,Heejin Do,Gary Geunbae Lee
机构: Graduate School of Artificial Intelligence, POSTECH(POSTECH人工智能研究生院), South Korea(韩国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025 Main
点击查看摘要
Abstract:Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, their focus was primarily on text-based methods, often overlooking the importance of non-verbal evidence crucial in real-life therapy. To alleviate this gap, we extend the textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client’s facial expressions. To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs’ performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
zh
[NLP-87] Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey
【速读】: 该论文旨在解决 Retrieval-Augmented Generation (RAG) 系统在提高其可信度方面存在的挑战。论文的关键解决方案在于围绕可靠性、隐私、安全、公平性、可解释性与问责制等核心视角,提供一个全面的路线图,为每个视角构建通用框架和分类法,从而系统地理解当前挑战、评估现有方法,并识别未来研究的潜在方向。这一综合性的方法旨在促进 RAG 系统的广泛应用与创新。
链接: https://arxiv.org/abs/2502.06872
作者: Bo Ni,Zheyuan Liu,Leyao Wang,Yongjia Lei,Yuying Zhao,Xueqi Cheng,Qingkai Zeng,Luna Dong,Yinglong Xia,Krishnaram Kenthapadi,Ryan Rossi,Franck Dernoncourt,Md Mehrab Tanjim,Nesreen Ahmed,Xiaorui Liu,Wenqi Fan,Erik Blasch,Yu Wang,Meng Jiang,Tyler Derr
机构: Vanderbilt University; University of Notre Dame; University of Oregon; Meta; Oracle Health AI; Adobe Research; Cisco AI Research; North Carolina State University; The Hong Kong Polytechnic University; Air Force Research Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). By integrating context retrieval into content generation, RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. However, despite RAG’s success and potential, recent studies have shown that the RAG paradigm also introduces new risks, including robustness issues, privacy concerns, adversarial attacks, and accountability issues. Addressing these risks is critical for future applications of RAG systems, as they directly impact their trustworthiness. Although various methods have been developed to improve the trustworthiness of RAG methods, there is a lack of a unified perspective and framework for research in this topic. Thus, in this paper, we aim to address this gap by providing a comprehensive roadmap for developing trustworthy RAG systems. We place our discussion around five key perspectives: reliability, privacy, safety, fairness, explainability, and accountability. For each perspective, we present a general framework and taxonomy, offering a structured approach to understanding the current challenges, evaluating existing solutions, and identifying promising future research directions. To encourage broader adoption and innovation, we also highlight the downstream applications where trustworthy RAG systems have a significant impact.
zh
[NLP-88] Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject NAACL2025
【速读】: 该论文旨在解决在大型语言模型(LLMs)中对同一实体的多属性编辑时存在的挑战,特别是当前最先进的编辑方法在修改同一主体的多个相关知识条目时表现不佳。为弥补传统基准中缺乏针对相同主体的相关编辑数据的问题,论文引入了S²RKE(Same-Subject Related Knowledge Editing)基准。关键发现是,主流的定位后编辑方法(如ROME和MEMIT)存在"相关知识扰动"现象,即后续编辑会干扰先前编辑,导致编辑效果减弱。进一步分析表明,这些方法过度依赖主体信息而忽视了其他重要因素。
链接: https://arxiv.org/abs/2502.06868
作者: Zenghao Duan,Wenbin Duan,Zhiyi Yin,Yinghan Shen,Shaoling Jing,Jie Zhang,Huawei Shen,Xueqi Cheng
机构: Institute of Computing Technology, Chinese Academy of Sciences (中科院计算技术研究所), Beijing, China; University of Chinese Academy of Sciences (中国科学院大学), Beijing, China; People’s Public Security University of China (中国人民公安大学), Beijing, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by NAACL 2025
点击查看摘要
Abstract:Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: Current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the S²RKE (Same-Subject Related Knowledge Editing) benchmark. Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit “related knowledge perturbation,” where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.
zh
[NLP-89] Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)的安全基准测试问题,提出了一套开源的数据集和测试框架来评估主要模型在受控物质查询中的安全机制。关键在于通过分析不同模型对系统变化提示的响应,揭示其拒绝有害内容与适度限制合法科学讨论之间的平衡,从而提供一种衡量AI安全实施进展的基础方法。
链接: https://arxiv.org/abs/2502.06867
作者: David Noever,Forrest McKee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms across mainly controlled substance queries, analyzing four major models’ responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.
zh
[NLP-90] Knowledge Graph-Guided Retrieval Augmented Generation NAACL2025 ACL
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成响应过程中存在的幻觉问题。现有研究主要集中在基于语义的方法来检索孤立的相关片段,而忽视了这些片段之间的内在关系。论文的关键解决方案是提出了一种名为知识图谱引导的检索增强生成(Knowledge Graph-Guided Retrieval Augmented Generation, KG2RAG)框架。该框架利用知识图谱(Knowledge Graphs, KGs)提供片段间的事实级关系,从而提升检索结果的多样性和连贯性。具体而言,KG2RAG通过知识图谱引导的片段扩展过程和基于知识图谱的片段组织过程,在结构良好的段落中传递相关且重要的知识。
链接: https://arxiv.org/abs/2502.06864
作者: Xiangrong Zhu,Yuexiang Xie,Yi Liu,Yaliang Li,Wei Hu
机构: State Key Laboratory for Novel Software Technology, Nanjing University, China(软件新技术国家重点实验室,南京大学,中国); Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in the 2025 Annual Conference of the Nations of the Americas Chapter of the ACL (NAACL 2025)
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has emerged as a promising technology for addressing hallucination issues in the responses generated by large language models (LLMs). Existing studies on RAG primarily focus on applying semantic-based approaches to retrieve isolated relevant chunks, which ignore their intrinsic relationships. In this paper, we propose a novel Knowledge Graph-Guided Retrieval Augmented Generation (KG²RAG) framework that utilizes knowledge graphs (KGs) to provide fact-level relationships between chunks, improving the diversity and coherence of the retrieved results. Specifically, after performing a semantic-based retrieval to provide seed chunks, KG²RAG employs a KG-guided chunk expansion process and a KG-based chunk organization process to deliver relevant and important knowledge in well-organized paragraphs. Extensive experiments conducted on the HotpotQA dataset and its variants demonstrate the advantages of KG²RAG compared to existing RAG-based approaches, in terms of both response quality and retrieval quality.
zh
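KG 引导的片段扩展可以概括为:从语义检索得到的种子片段出发,沿知识图谱的实体关系把相关片段拉进候选集合。下面是一个最小示意(假设性实现,非论文原始代码;实体-片段对齐表 chunk_of_entity 与 seed_chunks 的数据结构均为假设):

```python
# 简化示意:KG 引导的片段扩展(假设性实现)
import networkx as nx

def kg_guided_expand(seed_chunks, kg: nx.Graph, chunk_of_entity, hops: int = 1):
    """seed_chunks: [{"id": ..., "entities": [...]}, ...];kg 的节点为实体。"""
    entities = {e for c in seed_chunks for e in c["entities"]}
    frontier = set(entities)
    for _ in range(hops):                       # 沿事实关系逐跳扩展
        frontier = {nb for e in frontier if e in kg for nb in kg.neighbors(e)}
        entities |= frontier
    return {chunk_of_entity[e] for e in entities if e in chunk_of_entity}
```

扩展得到的片段再经基于 KG 的组织过程排序、聚合成结构良好的段落后,交给 LLM 生成回答。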
[NLP-91] LLM-Supported Natural Language to Bash Translation NAACL2025
【速读】: 该论文旨在解决通过大型语言模型(Large Language Models, LLMs)进行自然语言到Bash命令(Natural Language to Bash Command, NL2SH)翻译时性能难以评估的问题。为了解决这一问题,论文的关键在于构建了一个包含600个指令-命令对的手动验证测试数据集以及一个包含40,939个对的训练数据集,并提出了一种结合命令执行与LLM输出评估的新颖功能等价启发式方法。这种新方法能够以95%的置信度确定两个Bash命令的功能等价性,比之前的启发式方法提高了16%。
链接: https://arxiv.org/abs/2502.06858
作者: Finnian Westenfelder,Erik Hemberg,Miguel Tulla,Stephen Moskal,Una-May O’Reilly,Silviu Chiricescu
机构: ALFA Group MIT-CSAIL(阿尔法集团MIT-CSAIL); Draper Scholar(德宝学者); Charles Stark Draper Laboratory(查尔斯·斯塔克·德雷珀实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, NAACL 2025
点击查看摘要
Abstract:The Bourne-Again Shell (Bash) command-line interface for Linux systems has complex syntax and requires extensive specialized knowledge. Using the natural language to Bash command (NL2SH) translation capabilities of large language models (LLMs) for command composition circumvents these issues. However, the NL2SH performance of LLMs is difficult to assess due to inaccurate test data and unreliable heuristics for determining the functional equivalence of Bash commands. We present a manually verified test dataset of 600 instruction-command pairs and a training dataset of 40,939 pairs, increasing the size of previous datasets by 441% and 135%, respectively. Further, we present a novel functional equivalence heuristic that combines command execution with LLM evaluation of command outputs. Our heuristic can determine the functional equivalence of two Bash commands with 95% confidence, a 16% increase over previous heuristics. Evaluation of popular LLMs using our test dataset and heuristic demonstrates that parsing, in-context learning, in-weight learning, and constrained decoding can improve NL2SH accuracy by up to 32%. Our findings emphasize the importance of dataset quality, execution-based evaluation and translation method for advancing NL2SH translation. Our code is available at this https URL
zh
[NLP-92] Self-Supervised Prompt Optimization
【速读】: 该论文旨在解决手动设计提示词(Prompts)在提升大型语言模型(LLMs)推理能力的同时,满足不同领域任务需求过程中存在的高成本和低效问题。现有的提示词优化方法依赖于外部参考数据如标准答案或人工标注,这限制了它们在实际应用中的适用性,特别是在缺乏此类数据的情况下。为了解决这一问题,论文提出了一种名为自监督提示词优化(Self-Supervised Prompt Optimization, SPO)的成本高效框架。SPO的关键在于它能够无需外部参考数据即可发现适用于封闭式和开放式任务的有效提示词。具体而言,SPO通过语言模型评估器进行成对输出比较来选择更优的提示词,并利用语言模型优化器使输出结果符合任务需求。
链接: https://arxiv.org/abs/2502.06855
作者: Jinyu Xiang,Jiayi Zhang,Zhaoyang Yu,Fengwei Teng,Jinhao Tu,Xinbing Liang,Sirui Hong,Chenglin Wu,Yuyu Luo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Well-designed prompts are crucial for enhancing Large language models’ (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or human feedback, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at this https URL.
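SPO 的核心循环可以概括为:优化器改写提示词、评估器对新旧输出做成对比较、多数胜出则采纳。以下为一个示意性骨架(非论文官方实现;generate、evaluate、optimize 均为假设的LLM调用封装,evaluate 返回新输出是否胜出):

```python
# 示意:自监督提示词优化的"生成-比较-采纳"循环(假设性接口)
def spo_loop(task_samples, init_prompt, generate, evaluate, optimize, rounds=10):
    best_prompt = init_prompt
    best_outputs = [generate(best_prompt, x) for x in task_samples]
    for _ in range(rounds):
        candidate = optimize(best_prompt, best_outputs)   # LLM优化器改写提示词
        cand_outputs = [generate(candidate, x) for x in task_samples]
        # LLM评估器做成对比较,无需标准答案或人工标注
        wins = sum(bool(evaluate(x, new, old)) for x, new, old
                   in zip(task_samples, cand_outputs, best_outputs))
        if wins > len(task_samples) / 2:                  # 多数胜出则采纳
            best_prompt, best_outputs = candidate, cand_outputs
    return best_prompt
```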
zh
[NLP-93] Can Large Language Models Understand Intermediate Representations?
【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在理解中间表示(Intermediate Representations, IRs)方面的性能与能力。研究通过四个任务:控制流图(Control Flow Graph, CFG)重建、反编译、代码总结和执行推理,评估了包括GPT-4、GPT-3、Gemma 2、LLaMA 3.1和Code Llama在内的多种LLMs的表现。研究表明,尽管LLMs在解析IR语法和识别高层结构方面表现良好,但在控制流推理、执行语义理解和循环处理方面存在不足。论文的关键在于建议对LLMs进行针对IR的具体增强,包括基于结构化IR数据集的微调以及引入显式的控制流模型,以提升其在处理IR相关任务中的理解和操作能力。
链接: https://arxiv.org/abs/2502.06854
作者: Hailong Jiang,Jianfeng Zhu,Yao Wan,Bo Fang,Hongyu Zhang,Ruoming Jin,Qiang Guan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Intermediate Representations (IRs) are essential in compiler design and program analysis, yet their comprehension by Large Language Models (LLMs) remains underexplored. This paper presents a pioneering empirical study to investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA 3.1, and Code Llama, in understanding IRs. We analyze their performance across four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code summarization, and execution reasoning. Our results indicate that while LLMs demonstrate competence in parsing IR syntax and recognizing high-level structures, they struggle with control flow reasoning, execution semantics, and loop handling. Specifically, they often misinterpret branching instructions, omit critical IR operations, and rely on heuristic-based reasoning, leading to errors in CFG reconstruction, IR decompilation, and execution reasoning. The study underscores the necessity for IR-specific enhancements in LLMs, recommending fine-tuning on structured IR datasets and integration of explicit control flow models to augment their comprehension and handling of IR-related tasks.
zh
[NLP-94] Survey on Vision-Language-Action Models
【速读】: 该论文以视觉-语言-动作(Vision-Language-Action, VLA)模型的综述为载体,探讨如何利用大型语言模型(LLMs)自动生成文献综述,并强调确保此类生成内容的准确性、可靠性和适当综合仍然是一个挑战。论文的关键在于开发一种结构化框架,以辅助AI进行文献综述工作,同时探索提高引用准确性、来源可信度和上下文理解的技术。通过分析LLM在学术写作中的潜力与局限性,本研究致力于推动将AI系统地整合到研究流程中,从而提升学术知识综合的效率和可扩展性。
链接: https://arxiv.org/abs/2502.06851
作者: Adilzhan Adilkhanov,Amir Yelenov,Assylkhan Seitzhanov,Ayan Mazhitov,Azamat Abdikarimov,Danissa Sandykbayeva,Daryn Kenzhebek,Daulet Baimukashev,Dinmukhammed Mukashev,Ilyas Umurbekov,Jabrail Chumakov,Kamila Spanova,Karina Burunchina,Rasul Yermagambet,Rustam Chibar,Saltanat Seitzhan,Soibkhon Khajikhanov,Tasbolat Taunyazov,Temirlan Galimzhanov,Temirlan Kaiyrbay,Tleukhan Mussin,Togzhan Syrymova,Valeriya Kostyukova,Yermakhan Kassym,Madina Yergibay,Margulan Issa,Moldir Zabirova,Nurdaulet Zhuzbay,Nurlan Kabdyshev,Nurlan Zhaniyar,Yerkebulan Massalim,Zerde Nurbayeva,Zhanat Kappassov
机构: Tactile Robotics Laboratory, Kazakhstan (触觉机器人实验室, 哈萨克斯坦)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents an AI-generated review of Vision-Language-Action (VLA) models, summarizing key methodologies, findings, and future directions. The content is produced using large language models (LLMs) and is intended only for demonstration purposes. This work does not represent original research, but highlights how AI can help automate literature reviews. As AI-generated content becomes more prevalent, ensuring accuracy, reliability, and proper synthesis remains a challenge. Future research will focus on developing a structured framework for AI-assisted literature reviews, exploring techniques to enhance citation accuracy, source credibility, and contextual understanding. By examining the potential and limitations of LLM in academic writing, this study aims to contribute to the broader discussion of integrating AI into research workflows. This work serves as a preliminary step toward establishing systematic approaches for leveraging AI in literature review generation, making academic knowledge synthesis more efficient and scalable.
zh
[NLP-95] Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization
【速读】: 该论文旨在解决大型语言模型在保持性能的同时减少内存使用的问题,特别是在超低比特量化(如2比特)中的挑战。论文的关键解决方案是提出InvarExplore框架,该框架通过系统性探索不同模型不变性(包括难以用梯度优化方法处理的置换不变性),利用各类不变性之间的协同效应,从而实现高效量化。
链接: https://arxiv.org/abs/2502.06844
作者: Yuqiao Wen,Yanshuai Cao,Lili Mou
机构: Dept. Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta (阿尔伯塔大学); RBC Borealis (RBC Borealis); Canada CIFAR AI Chair (加拿大CIFAR AI研究员)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models have been increasing in size due to their success in a wide range of applications. This calls for a pressing need to reduce memory usage to make them more accessible. Post-training quantization is a popular technique which uses fewer bits (e.g., 4–8 bits) to represent the model without retraining it. However, it remains a challenging task to perform quantization in an ultra-low-bit setup (e.g., 2 bits). In this paper, we propose InvarExplore, a unified framework that systematically explores different model invariance at the same time, allowing us to take advantage of the synergy between each type of invariance. Importantly, InvarExplore features a discrete search algorithm that enables us to explore permutation invariance, which is under-studied as it cannot be optimized with gradient-based methods. Results show that InvarExplore is compatible with existing state-of-the-art methods, achieving an add-on performance improvement over strong competing methods.
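这里的置换不变性指:对权重行施加置换、并在模型其余部分施加逆置换,网络函数保持不变;但分组量化的误差会随置换而改变,因此可以用离散搜索寻找量化误差更小的置换。下面是一个基于随机交换的爬山式草图(非论文官方算法;分组量化的形式亦为演示用假设):

```python
# 示意:对权重行做离散置换搜索,降低分组量化误差(numpy)
import numpy as np

def group_quantize(w, bits=2, group=4):
    """按行分组的均匀量化:每组共享量化范围,因此行的置换会影响误差。"""
    levels = 2 ** bits - 1
    out = np.empty_like(w)
    for s in range(0, w.shape[0], group):
        blk = w[s:s + group]
        lo, hi = blk.min(), blk.max()
        q = np.round((blk - lo) / (hi - lo + 1e-12) * levels)
        out[s:s + group] = q / levels * (hi - lo) + lo
    return out

def search_permutation(w, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    perm = np.arange(w.shape[0])
    err = lambda p: np.square(group_quantize(w[p]) - w[p]).mean()
    best = err(perm)
    for _ in range(steps):
        i, j = rng.integers(0, len(perm), size=2)
        perm[[i, j]] = perm[[j, i]]         # 随机交换两行(一次离散移动)
        cur = err(perm)
        if cur < best:
            best = cur                       # 误差下降则接受
        else:
            perm[[i, j]] = perm[[j, i]]     # 否则回退
    return perm, best                        # 实际部署时需在下游施加逆置换
```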
zh
[NLP-96] Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference
【速读】: 该论文旨在解决语言模型推理过程中的计算效率问题。论文的关键解决方案是提出了一种名为熵自适应解码(Entropy Adaptive Decoding, EAD)的方法,通过动态切换不同大小的模型来应对预测不确定性。具体而言,EAD 通过监控模型对数分布中的滚动熵来识别文本区域,当预测不确定性超过阈值时才切换到更大规模的模型,从而在保证适度输出偏差的情况下换取计算效率的提升。实验结果表明,这种方法能够在不同模型家族中显著减少计算成本,同时保持较高的性能。
链接: https://arxiv.org/abs/2502.06833
作者: Toby Simonds
机构: Tufa Labs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit distributions, our method identifies text regions where a smaller model suffices and switches to a larger model only when prediction uncertainty exceeds a threshold. Unlike speculative decoding approaches that maintain perfect output fidelity through verification, EAD accepts controlled output divergence in exchange for computational efficiency. Our experiments on the MATH benchmark demonstrate remarkable efficiency gains across different model families. Using the LLaMA family, we maintain 96.7% of the 11B model’s performance (50.4% vs 52.1%) while using it for only 43% of tokens, decreasing computational cost by 41.5%. These gains become more pronounced with larger size differentials in the Qwen family, where we achieve 92.9% of the 14B model’s performance (74.3% vs 80.0%) while using it for just 25% of tokens, decreasing computational cost by 67%. The consistency of these results across model pairs suggests that language model computation can be significantly optimized by selectively deploying model capacity based on local generation complexity. Our findings indicate that current approaches to model inference may be unnecessarily conservative in their pursuit of perfect output fidelity, and that accepting minor performance trade-offs can enable dramatic reductions in computational costs.
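EAD 的切换逻辑只依赖模型 logits 的滚动熵。下面给出一个贪心解码版本的示意(非论文官方实现;small_lm / large_lm 为假设的模型接口,输入 token id 序列、返回各位置的 logits;阈值与窗口大小也为演示取值):

```python
# 示意:按滚动熵在大小模型间切换的生成循环(假设性模型接口)
import torch
import torch.nn.functional as F

def entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum().item()

def ead_generate(prompt_ids, small_lm, large_lm, threshold=2.0,
                 window=8, max_new_tokens=128):
    ids, recent, use_large = list(prompt_ids), [], False
    for _ in range(max_new_tokens):
        model = large_lm if use_large else small_lm
        logits = model(torch.tensor([ids]))[0, -1]   # 假设返回 [1, L, V] 的logits
        recent = (recent + [entropy(logits)])[-window:]
        # 滚动熵超过阈值说明局部生成难度高,下一步换用大模型
        use_large = sum(recent) / len(recent) > threshold
        ids.append(int(logits.argmax()))             # 贪心解码,仅作演示
    return ids
```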
zh
[NLP-97] DiffListener: Discrete Diffusion Model for Listener Generation ICASSP2025
【速读】: 该论文旨在解决生成自然非言语倾听反应(Listener Head Generation, LHG)的问题,特别是在基于说话人的多模态线索生成时。传统方法受限于有限的模态(如音频和面部信息)或采用自回归方法,后者存在累积预测误差的局限性。为了解决这些问题,论文提出了一种名为DiffListener的方法,这是一种基于离散扩散模型的非自回归倾听头部生成方案。该方法的关键在于不仅利用说话人的面部信息、音频和文本作为输入,还引入了面部差分信息来表示表情和动作的时间动态,从而实现连贯反应序列的非自回归生成。
链接: https://arxiv.org/abs/2502.06822
作者: Siyeol Jung,Taehwan Kim
机构: Artificial Intelligence Graduate School, UNIST (人工智能研究生院, UNIST), Republic of Korea
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Graphics (cs.GR)
备注: Accepted at ICASSP 2025
点击查看摘要
Abstract:The listener head generation (LHG) task aims to generate natural nonverbal listener responses based on the speaker’s multimodal cues. Prior work either relies on limited modalities (e.g., audio and facial information) or employs autoregressive approaches, which have limitations such as accumulating prediction errors. To address these limitations, we propose DiffListener, a discrete diffusion based approach for non-autoregressive listener head generation. Our model takes the speaker’s facial information, audio, and text as inputs, additionally incorporating facial differential information to represent the temporal dynamics of expressions and movements. With this explicit modeling of facial dynamics, DiffListener can generate coherent reaction sequences in a non-autoregressive manner. Through comprehensive experiments, DiffListener demonstrates state-of-the-art performance in both quantitative and qualitative evaluations. The user study shows that DiffListener generates natural context-aware listener reactions that are well synchronized with the speaker. The code and demo videos are available in this https URL
zh
[NLP-98] Aligning Human and Machine Attention for Enhanced Supervised Learning
【速读】: 该论文旨在解决机器学习模型在某些学习任务上无法超越人类的问题,通过将机器注意力机制与人类注意力机制对齐来提升机器性能。论文的关键解决方案是提出了一种名为Human-Machine Attention Learning (HuMAL)的新方法,该方法依赖于人类标注的数据以反映他们在特定任务中的自我感知注意力,并探索了几种策略将这种人类注意力数据集成到机器学习算法中。研究表明,最佳的HuMAL策略显著提升了微调变换器模型(如 BERT、GPT-2 和 XLNET)在情感分析和人格类型分类任务上的表现,尤其是在数据不平衡或稀疏的情况下。
链接: https://arxiv.org/abs/2502.06811
作者: Avihay Chriqui,Inbal Yahav,Dov Teeni,Ahmed Abbasi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Attention, or prioritization of certain information items over others, is a critical element of any learning process, for both humans and machines. Given that humans continue to outperform machines in certain learning tasks, it seems plausible that machine performance could be enriched by aligning machine attention with human attention mechanisms – yet research on this topic is sparse and has achieved only limited success. This paper proposes a new approach to address this gap, called Human-Machine Attention Learning (HuMAL). This approach involves reliance on data annotated by humans to reflect their self-perceived attention during specific tasks. We evaluate several alternative strategies for integrating such human attention data into machine learning (ML) algorithms, using a sentiment analysis task (review data from Yelp) and a personality-type classification task (data from myPersonality). The best-performing HuMAL strategy significantly enhances the task performance of fine-tuned transformer models (BERT, as well as GPT-2 and XLNET), and the benefit is particularly pronounced under challenging conditions of imbalanced or sparse labeled data. This research contributes to a deeper understanding of strategies for integrating human attention into ML models and highlights the potential of leveraging human cognition to augment ML in real-world applications.
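将人类注意力数据集成到训练中的一类常见做法,是在任务损失之外增加一个“注意力对齐”辅助项。论文比较了多种集成策略;以下的 KL 对齐写法只是本文为演示选取的一种假设形式,并非原文确定的最佳策略:

```python
# 示意:任务损失 + 人机注意力对齐损失(假设两者均为归一化的词级分布)
import torch
import torch.nn.functional as F

def humal_loss(task_logits, labels, model_attn, human_attn, alpha=0.5):
    """task_logits: [B, C] 分类logits;model_attn / human_attn: [B, L]。"""
    task = F.cross_entropy(task_logits, labels)
    # 用KL散度把模型注意力分布拉向人类自报注意力
    align = F.kl_div(torch.log(model_attn + 1e-12), human_attn,
                     reduction="batchmean")
    return task + alpha * align
```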
zh
[NLP-99] Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
【速读】: 该论文旨在解决大型语言模型(LLMs)内部机制难以精确解读与控制的问题。现有方法主要通过在神经元与语义概念间建立离散映射来识别和操控神经元,但这种方法难以应对LLMs中存在的多义性,即单个神经元编码多个不同的概念,从而导致精准控制的困难。论文的关键在于发现尽管单个神经元编码多个概念,但其激活幅度在不同概念上的分布呈现出明显的高斯样模式。基于这一洞察,作者引入了NeuronLens,这是一种新的基于范围的解释和操作框架,能够更精细地观察神经元激活分布,从而准确定位神经元内的概念归属。该方法通过广泛的实证评估验证,显著减少了非目标干扰,同时保持了针对特定概念操作的精确控制,优于现有方法。
链接: https://arxiv.org/abs/2502.06809
作者: Muhammad Umair Haider,Hammad Rizwan,Hassan Sajjad,Peizhong Ju,A.B. Siddique
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Interpreting and controlling the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Recent efforts have primarily focused on identifying and manipulating neurons by establishing discrete mappings between neurons and semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. This makes precise control challenging and complicates downstream interventions. Through an in-depth analysis of both encoder and decoder-based LLMs across multiple text classification datasets, we uncover that while individual neurons encode multiple concepts, their activation magnitudes vary across concepts in distinct, Gaussian-like patterns. Building on this insight, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference, while maintaining precise control for manipulation of targeted concepts, outperforming existing methods.
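NeuronLens 的出发点是:同一神经元对不同概念的激活幅度服从近似高斯的不同分布,因此可以用“激活区间”而非离散映射来做概念归属。下面是这一思想的极简草图(非官方实现;区间宽度系数 k 为假设取值):

```python
# 示意:按概念拟合激活区间,并据此判断某次激活归属哪些概念
import numpy as np

def fit_ranges(activations_by_concept, k=2.0):
    """对某个神经元:概念 -> 该概念样本上的激活值数组,
    返回每个概念的区间 [mu - k*sigma, mu + k*sigma]。"""
    return {c: (a.mean() - k * a.std(), a.mean() + k * a.std())
            for c, a in activations_by_concept.items()}

def attribute(activation, ranges):
    """当前激活落入哪些概念区间——同一神经元可同时归属多个概念。"""
    return [c for c, (lo, hi) in ranges.items() if lo <= activation <= hi]
```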
zh
[NLP-100] Competitive Programming with Large Reasoning Models
链接: https://arxiv.org/abs/2502.06807
作者: OpenAI: Ahmed El-Kishky,Alexander Wei,Andre Saraiva,Borys Minaev,Daniel Selsam,David Dohan,Francis Song,Hunter Lightman,Ignasi Clavera,Jakub Pachocki,Jerry Tworek,Lorenz Kuhn,Lukasz Kaiser,Mark Chen,Max Schwarzer,Mostafa Rohaninejad,Nat McAleese,o3 contributors,Oleg Mürk,Rhythm Garg,Rui Shu,Szymon Sidor,Vineet Kosaraju,Wenda Zhou
机构: OpenAI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-101] Logits are All We Need to Adapt Closed Models
【速读】: 该论文旨在解决商业大型语言模型(Large Language Models, LLMs)在封闭源代码环境下,仅通过提示调优(prompt tuning)难以满足特定应用需求的问题。论文的关键在于提出了一种基于对数概率重加权(logit reweighting)的框架,该框架能够在获取到对数概率(logits)以及少量任务特定数据的情况下,有效地调整黑盒LLMs以生成符合特定应用场景的内容。此方法将下一令牌预测视为监督分类问题,并将其形式化为标签噪声校正问题,从而实现对黑盒LLMs的任务适应。
链接: https://arxiv.org/abs/2502.06806
作者: Gaurush Hiranandani,Haolun Wu,Subhojyoti Mukherjee,Sanmi Koyejo
机构: Stanford University (斯坦福大学); Mila - Quebec AI Institute (麦吉尔大学-魁北克人工智能研究所); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 33 pages, 8 figures
点击查看摘要
Abstract:Many commercial Large Language Models (LLMs) are often closed-source, limiting developers to prompt tuning for aligning content generation with specific applications. While these models currently do not provide access to token logits, we argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering. In this paper, we propose a token-level probability reweighting framework that, given access to logits and a small amount of task-specific data, can effectively steer black-box LLMs toward application-specific content generation. Our approach views next-token prediction through the lens of supervised classification. We show that aligning black-box LLMs with task-specific data can be formulated as a label noise correction problem, leading to the Plugin model – an autoregressive probability reweighting model that operates solely on logits. We provide theoretical justification for why reweighting logits alone is sufficient for task adaptation. Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models.
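按照“仅对 logits 做自回归重加权”的思路,可以写出如下极简骨架(非论文官方实现;网络结构、词表大小与超参数均为演示用假设):

```python
# 示意:在黑盒LLM的logits上训练一个残差式重加权模型
import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitPlugin(nn.Module):
    """输入黑盒LLM返回的logits,输出面向目标任务校正后的logits。"""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, vocab_size))
    def forward(self, logits):
        return logits + self.net(logits)     # 残差式校正,保留原始分布信息

plugin = LogitPlugin(vocab_size=32000)
opt = torch.optim.Adam(plugin.parameters(), lr=1e-4)

def train_step(blackbox_logits, next_token):
    """blackbox_logits: [B, V];next_token: [B],任务数据中的真实下一词。"""
    loss = F.cross_entropy(plugin(blackbox_logits), next_token)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```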
zh
[NLP-102] Solving the Content Gap in Roblox Game Recommendations: LLM -Based Profile Generation and Reranking
【速读】: 该论文旨在解决生成高质量、结构化游戏文本特征的问题,并验证这些特征能否提升推荐的相关性。关键在于利用大型语言模型(Large Language Models, LLMs)从玩家互动的原始数据中提取和推断游戏属性,如类型和玩法目标,从而无需大量人工标注。此外,通过引入基于LLMs的重排序机制来评估生成文本特征的有效性,进一步提升了个性化和用户体验。
链接: https://arxiv.org/abs/2502.06802
作者: Chen Wang,Xiaokai Wei,Yexi Jiang,Frank Ong,Kevin Gao,Xiao Yu,Zheng Hui,Se-eun Yoon,Philip Yu,Michelle Gong
机构: University of Illinois Chicago(伊利诺伊大学芝加哥分校); Roblox(罗布乐思); Columbia University(哥伦比亚大学); University of California, San Diego(加州大学圣地亚哥分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:With the vast and dynamic user-generated content on Roblox, creating effective game recommendations requires a deep understanding of game content. Traditional recommendation models struggle with the inconsistent and sparse nature of game text features such as titles and descriptions. Recent advancements in large language models (LLMs) offer opportunities to enhance recommendation systems by analyzing in-game text data. This paper addresses two challenges: generating high-quality, structured text features for games without extensive human annotation, and validating these features to ensure they improve recommendation relevance. We propose an approach that extracts in-game text and uses LLMs to infer attributes such as genre and gameplay objectives from raw player interactions. Additionally, we introduce an LLM-based re-ranking mechanism to assess the effectiveness of the generated text features, enhancing personalization and user satisfaction. Beyond recommendations, our approach supports applications such as user engagement-based integrity detection, already deployed in production. This scalable framework demonstrates the potential of in-game text understanding to improve recommendation quality on Roblox and adapt recommendations to its unique, user-generated ecosystem.
zh
[NLP-103] Towards Efficient and Multifaceted Computer-assisted Pronunciation Training Leveraging Hierarchical Selective State Space Model and Decoupled Cross-entropy Loss NAACL2025
【速读】: 该论文旨在解决计算机辅助发音训练(Computer-Assisted Pronunciation Training, CAPT)系统中自动发音评估(Automatic Pronunciation Assessment, APA)与误发音检测与诊断(Mispronunciation Detection and Diagnosis, MDD)分离的问题。论文的关键解决方案是提出了一种新的CAPT方法HMamba,它将APA和MDD任务并行整合。此外,引入了一种专门针对MDD设计的新型损失函数——解耦交叉熵损失(decoupled cross-entropy loss, deXent),以促进更好的监督学习,从而提升系统的整体性能。
链接: https://arxiv.org/abs/2502.07575
作者: Fu-An Chao,Berlin Chen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Main Conference
点击查看摘要
Abstract:Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts: the former aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while the latter focuses instead on pinpointing the precise phonetic pronunciation errors made by non-native language learners. However, it is generally expected that a full-fledged CAPT system should perform both functionalities simultaneously and efficiently. In response to this surging demand, we in this work first propose HMamba, a novel CAPT approach that seamlessly integrates APA and MDD tasks in parallel. In addition, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored for MDD to facilitate better-supervised learning for detecting mispronounced phones, thereby enhancing overall performance. A comprehensive set of empirical results on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach on APA. Notably, our proposed approach also yields a considerable improvement in MDD performance over a strong baseline, achieving an F1-score of 63.85%. Our codes are made available at this https URL
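摘要并未给出 deXent 的具体数学形式;按“检测(是否误读)与诊断(实际发出的音素)解耦”的直观理解,可以写出下面这种纯属假设的示意写法,仅用于说明“解耦”思路,而非论文定义:

```python
# 假设性示意:把MDD损失拆为"误读检测"与"音素诊断"两项(非deXent原始定义)
import torch
import torch.nn.functional as F

def decoupled_loss(detect_logits, phone_logits, is_mispronounced, true_phone,
                   beta=1.0):
    """detect_logits: [N] 二分类logits;phone_logits: [N, P] 音素logits。"""
    det = F.binary_cross_entropy_with_logits(detect_logits,
                                             is_mispronounced.float())
    mask = is_mispronounced.bool()
    diag = (F.cross_entropy(phone_logits[mask], true_phone[mask])
            if mask.any() else detect_logits.new_zeros(()))  # 仅监督误读样本
    return det + beta * diag
```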
zh
[NLP-104] ScaffoldGPT : A Scaffold-based Large Language Model for Drug Improvement
链接: https://arxiv.org/abs/2502.06891
作者: Xuefeng Liu,Songhao Jiang,Rick Stevens
机构: 未知
类目: Biomolecules (q-bio.BM); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
计算机视觉
[CV-0] Pippo: High-Resolution Multi-View Humans from a Single Image
【速读】:该论文旨在解决从单张随手拍摄的照片生成高分辨率多视角人像视频的问题。解决方案的关键在于提出了一种名为Pippo的多视图扩散变换器(diffusion transformer)模型,该模型通过预训练和多阶段训练(中期训练以低分辨率快速吸收影棚数据,后期训练以高分辨率结合像素对齐控制进行精细调整),无需拟合参数化模型或相机参数等额外输入即可生成高质量且3D一致的多视角视频。此外,Pippo在推理阶段采用注意力偏置技术,可同时生成比训练时所见多五倍以上的视角;论文还提出了改进的指标来评估多视角生成的3D一致性。
链接: https://arxiv.org/abs/2502.07785
作者: Yash Kant,Ethan Weber,Jin Kyu Kim,Rawal Khirodkar,Su Zhaoen,Julieta Martinez,Igor Gilitschenski,Shunsuke Saito,Timur Bagautdinov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page - this http URL
点击查看摘要
Abstract:We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., Spatial anchor and Plucker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater than 5 times as many views as seen during training. Finally, we also introduce an improved metric to evaluate 3D consistency of multi-view generations, and show that Pippo outperforms existing works on multi-view human generation from a single image.
zh
[CV-1] MatSwap: Light-aware material transfers in images
【速读】:本文旨在解决在图像中逼真地转移材料到指定表面的问题。由于材料外观、几何形状和光照在照片中的复杂交织,这一任务极具挑战性。现有方法通常依赖于繁琐的文字工程或需要大量人工标注及艺术家知识和3D场景属性,这在实际应用中难以实现。本文提出的关键解决方案是直接学习输入材料(在平坦表面上观察到的)与其在场景中的表现之间的关系,无需显式的UV映射。为此,作者采用了一个自定义的光度和几何感知扩散模型,并利用合成数据微调大规模预训练的文本到图像模型,以确保有效推广至真实图像。最终,该方法能够无缝地将所需材料整合到照片的目标位置,同时保持场景的一致性。
链接: https://arxiv.org/abs/2502.07784
作者: Ivan Lopes,Valentin Deschaintre,Yannick Hold-Geoffroy,Raoul de Charette
机构: Inria(France); Adobe Research(UK); Adobe Research(Canada)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically. Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph. In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain. In contrast, we propose to directly learn the relationship between the input material – as observed on a flat surface – and its appearance within the scene, without the need for explicit UV mapping. To achieve this, we rely on a custom light- and geometry-aware diffusion model. We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images. As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene. We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively. We will release our code and data upon publication.
zh
[CV-2] A Flag Decomposition for Hierarchical Datasets
【速读】:该论文旨在解决如何对高维数据进行层次化分解与处理的问题。关键在于提出了一种基于旗(Flag)的新型方法,能够将任意层次化的实值数据分解为在Stiefel坐标系下保持层次结构的旗表示形式。这一方法拓展了旗流形(Flag manifold)在去噪、聚类及少样本学习等应用中的潜力。
链接: https://arxiv.org/abs/2502.07782
作者: Nathan Mankovich,Ignacio Santamaria,Gustau Camps-Valls,Tolga Birdal
机构: University of Valencia(瓦伦西亚大学); University of Cantabria(坎塔布里亚大学); Imperial College London(帝国理工学院伦敦)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Flag manifolds encode hierarchical nested sequences of subspaces and serve as powerful structures for various computer vision and machine learning applications. Despite their utility in tasks such as dimensionality reduction, motion averaging, and subspace clustering, current applications are often restricted to extracting flags using common matrix decomposition methods like the singular value decomposition. Here, we address the need for a general algorithm to factorize and work with hierarchical datasets. In particular, we propose a novel, flag-based method that decomposes arbitrary hierarchical real-valued data into a hierarchy-preserving flag representation in Stiefel coordinates. Our work harnesses the potential of flag manifolds in applications including denoising, clustering, and few-shot learning.
zh
[CV-3] Stay-Positive: A Case for Ignoring Real Image Features in Fake Image Detection
【速读】:该论文旨在解决检测生成式人工智能(Generative AI)图像时因依赖虚假模式如压缩伪影而导致误判的问题。论文的关键在于提出Stay Positive算法,该算法通过约束检测器仅关注由生成模型引入的伪影,而非与真实数据相关的伪影,从而减少检测器对虚假相关性的依赖,提高其泛化能力和鲁棒性,尤其是在处理修补的真实图像时表现更佳。
链接: https://arxiv.org/abs/2502.07778
作者: Anirudh Sundara Rajan,Yong Jae Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector’s tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay Positive, an algorithm designed to constrain the detector’s focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.
zh
[CV-4] Novel computational workflows for natural and biomedical image processing based on hypercomplex algebras
【速读】:该论文旨在解决自然图像和生物医学图像处理中的多种挑战,包括重新着色、去色、对比度增强、组织学图像的计算复染和染色分离等问题。论文的关键在于利用四元数和二维正交平面分割框架(quaternions and the two-dimensional orthogonal planes split framework),通过基本算术和矩阵运算实现图像处理任务的统一范式,从而在自然图像和生物医学图像领域展示出广泛的应用潜力和一致性。这种方法不仅能够调节颜色外观和对比度,还能够作为自动化图像处理管道的一部分,并有助于数字病理学应用。
链接: https://arxiv.org/abs/2502.07758
作者: Nektarios A. Valous,Eckhard Hitzer,Dragoş Duşe,Rodrigo Rojas Moraleda,Ferdinand Popp,Meggy Suarez-Carmona,Anna Berthel,Ismini Papageorgiou,Carlo Fremd,Alexander Rölle,Christina C. Westhoff,Bénédicte Lenoir,Niels Halama,Inka Zörnig,Dirk Jäger
机构: National Center for Tumor Diseases (NCT) Heidelberg; German Cancer Research Center (DKFZ); Medical Faculty Heidelberg, Heidelberg University; Center for Quantitative Analysis of Molecular and Cellular Biosystems (Bioquant), Heidelberg University; Synaptiq; Institute of Radiology, Südharz Hospital Nordhausen; Jena University Hospital; Institute of Diagnostic and Interventional Radiology, Jena University Hospital; Division of Gynecological Oncology, National Center for Tumor Diseases (NCT) Heidelberg; Division of Molecular Genetics, German Cancer Research Center (DKFZ); Institute of Pathology, Philipps University of Marburg and University Hospital Giessen and Marburg GmbH (UKGM); Systems Immunology and Single-Cell Biology Group, German Cancer Research Center (DKFZ); University Center for Tumor Diseases Mainz (UCT Mainz), University Medical Center of Johannes Gutenberg University Mainz; Department of Hematology and Medical Oncology, III. Medical Clinic and Polyclinic, University Medical Center of Johannes Gutenberg University Mainz; Medical Faculty Heidelberg, Heidelberg University; Department of Medical Oncology, Heidelberg University Hospital (UKHD)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 24 pages, 18 figures, 14 tables
点击查看摘要
Abstract:Hypercomplex image processing extends conventional techniques in a unified paradigm encompassing algebraic and geometric principles. This work leverages quaternions and the two-dimensional orthogonal planes split framework (splitting of a quaternion - representing a pixel - into pairs of orthogonal 2D planes) for natural/biomedical image analysis through the following computational workflows and outcomes: natural/biomedical image re-colorization, natural image de-colorization, natural/biomedical image contrast enhancement, computational re-staining and stain separation in histological images, and performance gains in machine/deep learning pipelines for histological images. The workflows are analyzed separately for natural and biomedical images to showcase the effectiveness of the proposed approaches. The proposed workflows can regulate color appearance (e.g. with alternative renditions and grayscale conversion) and image contrast, be part of automated image processing pipelines (e.g. isolating stain components, boosting learning models), and assist in digital pathology applications (e.g. enhancing biomarker visibility, enabling colorblind-friendly renditions). Employing only basic arithmetic and matrix operations, this work offers a computationally accessible methodology - in the hypercomplex domain - that showcases versatility and consistency across image processing tasks and a range of computer vision and biomedical applications. The proposed non-data-driven methods achieve comparable or better results (particularly in cases involving well-known methods) to those reported in the literature, showcasing the potential of robust theoretical frameworks with practical effectiveness. Results, methods, and limitations are detailed alongside discussion of promising extensions, emphasizing the potential of feature-rich mathematical/computational frameworks for natural and biomedical images.
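二维正交平面分割的核心是把彩色像素视作纯四元数 q = r·i + g·j + b·k,并沿选定轴(例如灰轴 μ = (i+j+k)/√3)做正交分解:平行分量对应亮度(去色),正交分量对应彩度。以下用 numpy 演示这一精确可逆的分解(仅展示其代数思想,非论文官方流程):

```python
# 示意:沿灰轴的正交平面分解——灰度分量 + 彩度分量(精确可逆)
import numpy as np

def ops_split(img):
    """img: [H, W, 3] 的RGB数组,返回 (平行于灰轴的分量, 正交分量)。"""
    mu = np.ones(3) / np.sqrt(3.0)
    parallel = (img @ mu)[..., None] * mu   # 投影到灰轴:即灰度/亮度信息
    orthogonal = img - parallel             # 余下部分:彩色/彩度信息
    return parallel, orthogonal

img = np.random.rand(4, 4, 3)
gray, chroma = ops_split(img)
assert np.allclose(gray + chroma, img)      # 分解无损,可分别处理后再合成
```

对两个分量分别处理后再相加,即可支撑重着色、去色、对比度增强等工作流。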
zh
[CV-5] MeshSplats: Mesh-Based Rendering with Gaussian Splatting Initialization
【速读】:该论文旨在解决Gaussian Splatting (GS)在处理复杂光照效果如阴影和反射时的局限性。解决方案的关键在于引入MeshSplats方法,将GS转换为类似网格的格式。通过这种方法,训练完成后,GS中的高斯元素被转化为网格面,从而能够利用光线追踪方法进行渲染,并享受其带来的所有优势。此外,通过专门的优化算法进一步提升重建质量,该算法作用于网格面而非高斯组件。
链接: https://arxiv.org/abs/2502.07754
作者: Rafał Tobiasz,Grzegorz Wilczyński,Marcin Mazur,Sławomir Tadeja,Przemysław Spurek
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Gaussian Splatting (GS) is a recent and pivotal technique in 3D computer graphics. GS-based algorithms almost always bypass classical methods such as ray tracing, which offers numerous inherent advantages for rendering. For example, ray tracing is able to handle incoherent rays for advanced lighting effects, including shadows and reflections. To address this limitation, we introduce MeshSplats, a method which converts GS to a mesh-like format. Following the completion of training, MeshSplats transforms Gaussian elements into mesh faces, enabling rendering using ray tracing methods with all their associated benefits. Our model can be utilized immediately following transformation, yielding a mesh of slightly reduced quality without additional training. Furthermore, we can enhance the reconstruction quality through the application of a dedicated optimization algorithm that operates on mesh faces rather than Gaussian components. The efficacy of our method is substantiated by experimental results, underscoring its extensive applications in computer graphics and image processing.
zh
[CV-6] Direct Ascent Synthesis: Revealing Hidden Generative Capabilities in Discriminative Models
【速读】:该论文旨在挑战判别模型与生成模型之间传统区分,并探索判别模型中的生成能力。关键解决方案是提出Direct Ascent Synthesis (DAS)方法,通过多分辨率优化CLIP模型表示揭示这些潜在的生成能力,从而实现高质量图像合成,无需额外训练。这种方法不仅支持多种应用,还能保持自然图像统计特性并避免非鲁棒的对抗模式。
链接: https://arxiv.org/abs/2502.07753
作者: Stanislav Fort,Jonathan Whitaker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 12 figures
点击查看摘要
Abstract:We demonstrate that discriminative models inherently contain powerful generative capabilities, challenging the fundamental distinction between discriminative and generative architectures. Our method, Direct Ascent Synthesis (DAS), reveals these latent capabilities through multi-resolution optimization of CLIP model representations. While traditional inversion attempts produce adversarial patterns, DAS achieves high-quality image synthesis by decomposing optimization across multiple spatial scales (1x1 to 224x224), requiring no additional training. This approach not only enables diverse applications – from text-to-image generation to style transfer – but also maintains natural image statistics (1/f^2 spectrum) and guides the generation away from non-robust adversarial patterns. Our results demonstrate that standard discriminative models encode substantially richer generative knowledge than previously recognized, providing new perspectives on model interpretability and the relationship between adversarial examples and natural image synthesis.
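DAS 的做法是同时优化多个空间尺度上的图像分量,使它们上采样求和后的图像在 CLIP 空间中与给定文本的相似度最大。以下为一个极简骨架(非官方实现;clip_image_embed 与 text_feat 为假设的接口与输入,尺度集合亦为演示取值):

```python
# 示意:多分辨率"直接上升"合成——对图像金字塔做梯度上升(假设性接口)
import torch
import torch.nn.functional as F

def das(text_feat, clip_image_embed, scales=(1, 7, 28, 224),
        steps=200, lr=0.05):
    # 每个尺度各维护一张可学习图像,求和构成最终图像(多分辨率分解)
    pyramid = [torch.zeros(1, 3, s, s, requires_grad=True) for s in scales]
    opt = torch.optim.Adam(pyramid, lr=lr)

    def compose():
        return sum(F.interpolate(p, size=224, mode="bilinear",
                                 align_corners=False) for p in pyramid)

    for _ in range(steps):
        feat = clip_image_embed(torch.sigmoid(compose()))   # 像素约束到[0,1]
        loss = -F.cosine_similarity(feat, text_feat).mean() # 最大化图文相似度
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(compose()).detach()
```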
zh
[CV-7] CausalGeD: Blending Causality and Diffusion for Spatial Gene Expression Generation
【速读】:该论文旨在解决单细胞RNA测序(scRNA-seq)与空间转录组学(ST)数据整合中因忽视基因间因果关系而导致的现有方法性能受限的问题。关键解决方案在于提出CausalGeD模型,该模型结合扩散过程和自回归过程以利用这些因果关系,并通过推广因果注意变换器(Causal Attention Transformer)从图像生成到基因表达数据,从而捕捉调控机制而无需预定义关系。
链接: https://arxiv.org/abs/2502.07751
作者: Rabeya Tus Sadia,Md Atik Ahamed,Qiang Cheng
机构: University of Kentucky(肯塔基大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
备注:
点击查看摘要
Abstract:The integration of single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) data is crucial for understanding gene expression in spatial context. Existing methods for such integration have limited performance, with structural similarity often below 60%. We attribute this limitation to the failure to consider causal relationships between genes. We present CausalGeD, which combines diffusion and autoregressive processes to leverage these relationships. By generalizing the Causal Attention Transformer from image generation to gene expression data, our model captures regulatory mechanisms without predefined relationships. Across 10 tissue datasets, CausalGeD outperformed state-of-the-art baselines by 5-32% in key metrics, including Pearson’s correlation and structural similarity, advancing both technical and biological insights.
zh
[CV-8] Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling UAI
【速读】:该论文旨在解决传统自回归(Autoregressive, AR)视频生成方法中的次优单向依赖性和缓慢推理速度问题。解决方案的关键在于提出了一种半自回归(semi-autoregressive, semi-AR)框架,称为Next-Block Prediction (NBP),通过将视频内容均匀分解为等大小的块(如行或帧),将生成单元从单一令牌转换为块,使得当前块中的每个令牌能够同时预测下一个块中对应的令牌。不同于传统的AR模型,NBP框架在每个块内采用双向注意力机制,使令牌能够捕捉更稳健的空间依赖关系。此外,通过并行预测多个令牌,NBP显著减少了生成步骤的数量,从而实现了更快且更高效的推理。
链接: https://arxiv.org/abs/2502.07737
作者: Shuhuai Ren,Shuming Ma,Xu Sun,Furu Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: project page: this https URL
点击查看摘要
Abstract:Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
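NBP 把生成单元从单个 token 换成块:给定已生成的全部块,模型一次并行输出下一整块所有位置的 logits,因此生成步数从“token 数”降为“块数”。核心循环可示意如下(非官方实现;block_model 的输入输出形状为假设的接口约定):

```python
# 示意:以块为单位的半自回归生成循环(假设性接口)
import torch

def nbp_generate(block_model, first_block, num_blocks):
    """first_block: [B, block_size] 的token id;
    block_model(context) 假设返回 [B, block_size, vocab]:下一块的logits。"""
    blocks = [first_block]
    for _ in range(num_blocks - 1):
        context = torch.cat(blocks, dim=1)
        logits = block_model(context)        # 一次前向预测整块
        blocks.append(logits.argmax(dim=-1)) # 块内token并行解码,而非逐个生成
    return torch.cat(blocks, dim=1)
```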
zh
[CV-9] EdgeEar: Efficient and Accurate Ear Recognition for Edge Devices
【速读】:该论文旨在解决在资源受限设备上部署高性能耳识别模型的挑战,这限制了其应用和广泛采用。解决方案的关键在于引入EdgeEar,这是一种基于提议的混合CNN-Transformer架构的轻量级模型。通过在特定线性层中引入低秩近似,EdgeEar将参数数量减少到五十分之一,降至两百万以下,同时保持竞争力的准确性。
链接: https://arxiv.org/abs/2502.07734
作者: Camile Lendering,Bernardo Perrone Ribeiro,Žiga Emeršič,Peter Peer
机构: Faculty of Computer and Information Science, University of Ljubljana (卢布尔雅那大学计算机与信息科学学院); Universitat Pompeu Fabra (庞培法布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE FG 2025
点击查看摘要
Abstract:Ear recognition is a contactless and unobtrusive biometric technique with applications across various domains. However, deploying high-performing ear recognition models on resource-constrained devices is challenging, limiting their applicability and widespread adoption. This paper introduces EdgeEar, a lightweight model based on a proposed hybrid CNN-transformer architecture to solve this problem. By incorporating low-rank approximations into specific linear layers, EdgeEar reduces its parameter count by a factor of 50 compared to the current state-of-the-art, bringing it below two million while maintaining competitive accuracy. Evaluation on the Unconstrained Ear Recognition Challenge (UERC2023) benchmark shows that EdgeEar achieves the lowest EER while significantly reducing computational costs. These findings demonstrate the feasibility of efficient and accurate ear recognition, which we believe will contribute to the wider adoption of ear biometrics.
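在特定线性层中引入低秩近似的标准做法,是用 SVD 把权重矩阵分解为两个瘦矩阵的乘积。以下示意如何把一个 nn.Linear 替换为两层低秩结构(非论文官方实现;秩 r 的取值为演示用假设):

```python
# 示意:SVD低秩分解一个全连接层,参数量从 out*in 降为 r*(out+in)
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, r: int) -> nn.Sequential:
    W = linear.weight.data                            # [out, in]
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_s = S[:r].sqrt()
    first = nn.Linear(W.shape[1], r, bias=False)
    second = nn.Linear(r, W.shape[0], bias=linear.bias is not None)
    first.weight.data = sqrt_s[:, None] * Vh[:r]      # [r, in]
    second.weight.data = U[:, :r] * sqrt_s[None, :]   # [out, r]
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)

layer = nn.Linear(512, 512)                 # 原参数约26万
compact = low_rank_factorize(layer, r=32)   # 低秩后约3.3万,近似原映射
```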
zh
[CV-10] PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
【速读】:该论文旨在解决第一人称视频中目标定位的问题,特别是在物体外观变化显著和背景杂乱的情况下。论文的关键在于引入了PRVQL(Progressive knowledge-guided Refinement framework),通过从视频中连续提取与目标相关的信息,并将其作为指导来优化查询和视频特征,从而逐步改进目标定位的准确性。这种渐进式的知识引导精化方法使得目标信息逐渐改善,最终提高目标定位的精度。
链接: https://arxiv.org/abs/2502.07707
作者: Bing Fan,Yunhe Feng,Yapeng Tian,Yuewei Lin,Yan Huang,Heng Fan
机构: University of North Texas; University of Texas at Dallas; Brookhaven National Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to lacking sufficient target cues, leading to degradation. Addressing this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL, besides the given object cues, enjoys additional crucial target information from a video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on challenging Ego4D, PRVQL achieves state-of-the-art results and largely surpasses other methods, showing its efficacy. Our code, model and results will be released at this https URL.
zh
[CV-11] Magic 1-For-1: Generating One Minute Video Clips within One Minute
【速读】:该论文旨在解决视频生成模型在内存消耗和推理延迟方面的效率问题。解决方案的关键在于将文本到视频生成任务分解为两个更简单的子任务:文本到图像生成和图像到视频生成,并通过扩散步蒸馏技术优化这两个子任务。此外,论文探索了一系列优化技巧,从三个方面减少训练图像到视频(Image-to-Video, I2V)模型的计算成本:使用多模态先验条件注入加速模型收敛;应用对抗性步骤蒸馏加速推理延迟;以及通过参数稀疏化优化推理内存成本。这些技术使得生成5秒视频片段的时间缩短至3秒,且能够以显著提升的视觉质量和动态效果生成一分钟长的视频,平均每秒视频生成耗时少于1秒。
链接: https://arxiv.org/abs/2502.07701
作者: Hongwei Yi,Shitong Shao,Tian Ye,Jiantong Zhao,Qingyu Yin,Michael Lingelbach,Li Yuan,Yonghong Tian,Enze Xie,Daquan Zhou
机构: Hedra Inc.; Peking University; Nvidia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that with the same optimization algorithm, the image-to-video task is indeed easier to converge over the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speed up by applying an adversarial step distillation, and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second for generating 1 second video clips on average. We conduct a series of preliminary explorations to find out the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at this https URL.
zh
[CV-12] Matrix3D: Large Photogrammetry Model All-in-One
【速读】:该论文旨在解决多视角几何中的多个子任务,包括姿态估计、深度预测以及新视图合成等问题。解决方案的关键在于Matrix3D模型利用多模态扩散变换器(DiT)整合不同模态数据间的转换,并采用了一种掩码学习策略以实现大规模多模态训练,即使在部分完整数据(如图像-姿态对和图像-深度对的双模态数据)条件下也能进行全模态模型训练,从而显著增加了可用训练数据量。
链接: https://arxiv.org/abs/2502.07685
作者: Yuanxun Lu,Jingyang Zhang,Tian Fang,Jean-Daniel Nahmias,Yanghai Tsin,Long Quan,Xun Cao,Yao Yao,Shiwei Li
机构: Nanjing University (南京大学); Apple (苹果); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis using just the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D’s large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: this https URL.
zh
[CV-13] Multiview Point Cloud Registration Based on Minimum Potential Energy for Free-Form Blade Measurement
【速读】:该论文旨在解决点云配准在工业测量中自由曲面叶片重建过程中因三维采集系统缺陷导致的噪声和不完整点云数据所引起的高效且精确配准难题。解决方案的关键在于提出了一种基于最小势能(Minimum Potential Energy, MPE)方法的新颖全局配准策略。该策略通过定义一个目标函数作为物理配准系统的最小势能优化函数,从而更重视多数内点而减少噪声和异常值的影响,本质上减少了数学公式中扰动的影响。此外,通过将解分解为全局最优近似过程和使用修整迭代最近点算法的精细配准过程,提高了收敛性。
链接: https://arxiv.org/abs/2502.07680
作者: Zijie Wu,Yaonan Wang,Yang Mo,Qing Zhu,He Xie,Haotian Wu,Mingtao Feng,Ajmal Mian
机构: College of Electrical and Information Engineering, Hunan University, Changsha 410082, China (湖南大学电气与信息工程学院); School of Computer Science and Technology, Xidian University, Xi’an 710071, China (西安电子科技大学计算机科学与技术学院); Department of Computer Science and Software Engineering, The University of Western Australia, Perth, Crawley, WA 6009, Australia (澳大利亚西澳大利亚大学计算机科学与软件工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注:
点击查看摘要
Abstract:Point cloud registration is an essential step for free-form blade reconstruction in industrial measurement. Nonetheless, measurement defects of the 3D acquisition system unavoidably result in noisy and incomplete point cloud data, which renders efficient and accurate registration challenging. In this paper, we propose a novel global registration method that is based on the minimum potential energy (MPE) method to address these problems. The basic strategy is that the objective function is defined as the minimum potential energy optimization function of the physical registration system. The function distributes more weight to the majority of inlier points and less weight to the noise and outliers, which essentially reduces the influence of perturbations in the mathematical formulation. We decompose the solution into a globally optimal approximation procedure and a fine registration process with the trimmed iterative closest point algorithm to boost convergence. The approximation procedure consists of two main steps. First, according to the construction of the force traction operator, we can simply compute the position of the potential energy minimum. Second, to find the MPE point, we propose a new theory that employs two flags to observe the status of the registration procedure. We demonstrate the performance of the proposed algorithm on four types of blades. The proposed method outperforms the other global methods in terms of both accuracy and noise resistance.
zh
[CV-14] Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving
【速读】:该论文旨在解决多任务学习中负迁移(negative transfer)的问题,特别是在自动驾驶场景下,同时处理语义理解和运动任务(如预测和规划)时导致检测和跟踪性能下降的现象。论文的关键解决方案是提出了一种名为Neural-Bayes运动解码的方法,通过分离语义和运动学习,采用并行检测、跟踪和预测机制,并引入交互式语义解码以增强信息交换,从而促进正向迁移(positive transfer)。这种方法通过共享一组递归更新的参考点,利用一组学习到的运动查询与检测和跟踪查询并行工作,实现更高效的多任务处理。
链接: https://arxiv.org/abs/2502.07631
作者: Yinzhe Shen,Ömer Şahin Taş,Kaiwen Wang,Royden Wagner,Christoph Stiller
机构: Karlsruhe Institute of Technology (KIT); FZI Research Center for Information Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion tasks, such as prediction and planning, always impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method separating semantic and motion learning, similar to the Bayes filter. Specifically, we employ a set of learned motion queries that operate in parallel with the detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset show improvements of 5% in detection and 11% in tracking. Our method achieves state-of-the-art collision rates in open-loop planning evaluation without any modifications to the planning module.
zh
[CV-15] Causal-Informed Contrastive Learning: Towards Bias-Resilient Pre-training under Concept Drift
【速读】:该论文旨在解决在概念漂移(Concept Drift)环境中,对比预训练方法受到显著影响并产生特征空间偏置的问题。解决方案的关键在于通过因果推理构建结构因果图来系统性分析概念漂移的影响,并提出因果干预对比目标(causal interventional contrastive objective)。基于此,论文设计了一种具有鲁棒性的对比预训练方法,以适应概念漂移的数据流,实现简单且可扩展的实施。
链接: https://arxiv.org/abs/2502.07620
作者: Xiaoyu Yang,Jie Lu,En Yu
机构: Australian Artificial Intelligence Institute (AAII) (澳大利亚人工智能研究所); Faculty of Engineering and Information Technology (工程与信息技术学院); University of Technology Sydney (悉尼科技大学); Australia (澳大利亚)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17pages, 3 figures
点击查看摘要
Abstract:The evolution of large-scale contrastive pre-training propelled by top-tier datasets has reached a transition point in the scaling law. Consequently, sustaining and enhancing a model’s pre-training capabilities in drift environments have surfaced as a notable challenge. In this paper, we initially uncover that contrastive pre-training methods are significantly impacted by concept drift wherein distributions change unpredictably, resulting in notable biases in the feature space of the pre-trained model. Empowered by causal inference, we construct a structural causal graph to analyze the impact of concept drift to contrastive pre-training systemically, and propose the causal interventional contrastive objective. Upon achieving this, we devise a resilient contrastive pre-training approach to accommodate the data stream of concept drift, with simple and scalable implementation. Extensive experiments on various downstream tasks demonstrate our resilient contrastive pre-training effectively mitigates the bias stemming from the concept drift data stream. Codes are available at this https URL.
zh
[CV-16] Scaling Pre-training to One Hundred Billion Data for Vision Language Models
【速读】:该论文旨在探究大规模预训练视觉-语言模型(Vision-Language Models, VLMs)在1000亿样本规模下的潜力,并评估其在常见分类和检索基准上的表现。关键在于发现虽然这些模型在许多西方中心化任务上性能趋于饱和,但在文化多样性任务上却能获得显著提升,这得益于大规模网络数据对长尾概念的覆盖。此外,研究还分析了模型的多语言能力,并指出质量过滤如使用CLIP可能会无意中减少数据集中的文化多样性。因此,论文的关键解决方案在于强调使用大规模、多样化的网络数据对于构建真正包容性的多模态系统的重要性。
链接: https://arxiv.org/abs/2502.07617
作者: Xiao Wang,Ibrahim Alabdulmohsin,Daniel Salz,Zhe Li,Keran Rong,Xiaohua Zhai
机构: Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks of cultural diversity achieve more substantial gains from the 100-billion scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model’s multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters like using CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.
zh
[CV-17] Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors ICLR2025
【速读】:该论文旨在解决3D Gaussian Splatting (3DGS)在优化过程中缺乏显式几何约束的问题,导致稀疏或无观测输入视角区域的几何重建效果不佳。解决方案的关键在于引入Flow Distillation Sampling (FDS),通过利用预训练的几何知识,增强高斯辐射场的准确性。FDS采用一种策略性的采样技术,针对与输入视角相邻的未观测视角,利用匹配模型计算出的Prior Flow(前向流)指导从3DGS几何中解析出的Radiance Flow(辐射流),从而提升几何精度和渲染质量。
链接: https://arxiv.org/abs/2502.07615
作者: Lin-Zhuo Chen,Kangjie Liu,Youtian Lin,Siyu Zhu,Zhihao Li,Xun Cao,Yao Yao
机构: Nanjing University (南京大学); Fudan University (复旦大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2025
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has achieved excellent rendering quality with fast training and rendering speed. However, its optimization process lacks explicit geometric constraints, leading to suboptimal geometric reconstruction in regions with sparse or no observational input views. In this work, we try to mitigate the issue by incorporating a pre-trained matching prior to the 3DGS optimization process. We introduce Flow Distillation Sampling (FDS), a technique that leverages pre-trained geometric knowledge to bolster the accuracy of the Gaussian radiance field. Our method employs a strategic sampling technique to target unobserved views adjacent to the input views, utilizing the optical flow calculated from the matching model (Prior Flow) to guide the flow analytically calculated from the 3DGS geometry (Radiance Flow). Comprehensive experiments in depth rendering, mesh reconstruction, and novel view synthesis showcase the significant advantages of FDS over state-of-the-art methods. Additionally, our interpretive experiments and analysis aim to shed light on the effects of FDS on geometric accuracy and rendering quality, potentially providing readers with insights into its performance. Project page: this https URL
zh
[CV-18] An Improved Optimal Proximal Gradient Algorithm for Non-Blind Image Deblurring
【速读】:该论文旨在解决非盲去模糊(non-blind image deblurring)问题,即在已知模糊核(blur kernel)的情况下恢复清晰图像。论文的关键在于引入了一种改进的最优近端梯度算法(Improved Optimal Proximal Gradient Algorithm, IOptISTA),该算法结合了最优梯度方法和权重矩阵,以高效处理非盲去模糊问题。通过两种正则化情况,即L1范数和全变分范数,验证了所提算法的有效性,并展示了其在峰值信噪比(PSNR)和结构相似性指数(SSIM)方面的提升以及更高的精度。
链接: https://arxiv.org/abs/2502.07602
作者: Qingsong Wang,Shengze Xu,Xiaojiao Tong,Tieyong Zeng
机构: School of Mathematics and Computational Science, Xiangtan University (湘潭大学); Department of Mathematics, The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:
点击查看摘要
Abstract:Image deblurring remains a central research area within image processing, critical for its role in enhancing image quality and facilitating clearer visual representations across diverse applications. This paper tackles the optimization problem of image deblurring, assuming a known blurring kernel. We introduce an improved optimal proximal gradient algorithm (IOptISTA), which builds upon the optimal gradient method and a weighting matrix, to efficiently address the non-blind image deblurring problem. Based on two regularization cases, namely the l_1 norm and total variation norm, we perform numerical experiments to assess the performance of our proposed algorithm. The results indicate that our algorithm yields enhanced PSNR and SSIM values, as well as a reduced tolerance, compared to existing methods.
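IOptISTA 建立在近端梯度法之上。为便于理解“梯度步 + 软阈值近端步”这一基本结构,下面给出经典 ISTA 在 l_1 正则下的骨架实现(这只是基线,不包含论文提出的最优梯度方法与权重矩阵改进;并假设模糊核对称且已归一化,故与 K^T 的卷积等同于与 K 的卷积):

```python
# 示意:经典ISTA求解 min_x 0.5*||Kx - y||^2 + lam*||x||_1(已知模糊核K)
import numpy as np
from scipy.signal import fftconvolve

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_deblur(y, kernel, lam=1e-3, step=1.0, iters=100):
    """y: 模糊观测;kernel: 对称且归一化的模糊核(如高斯核),
    此时 ||K||_2 <= 1,step=1.0 满足收敛条件 step <= 1/||K||_2^2。"""
    x = y.copy()
    for _ in range(iters):
        residual = fftconvolve(x, kernel, mode="same") - y   # Kx - y
        grad = fftconvolve(residual, kernel, mode="same")    # K^T (Kx - y)
        x = soft_threshold(x - step * grad, step * lam)      # 近端步:软阈值
    return x
```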
[CV-19] PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning
[Quick Read]: This paper addresses future scene prediction from the large amount of available unlabeled video data. Most existing methods rely on video sequences and simulations with precise action annotations, which limits their ability to exploit unlabeled video. The key is PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences and then forecasts future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or produced by a learned action policy, enabling versatile and interpretable world modeling.
Link: https://arxiv.org/abs/2502.07600
Authors: Angel Villar-Corrales, Sven Behnke
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot allows the generation of multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available at this https URL.
[CV-20] YOLO Network For Defect Detection In Optical lenses
[Quick Read]: This paper addresses defects that arise in optical lens production, which alter scattering properties and lower quality standards. Manual inspection is discouraged because of its low accuracy, high error rate, and limited scalability. The key solution is an automated defect detection system based on the YOLOv8 deep learning model, trained on a custom dataset annotated with defect and lens regions, enabling efficient and accurate defect detection in optical lenses.
Link: https://arxiv.org/abs/2502.07592
Authors: Habib Yaseen
Affiliations: School of Computing and Engineering, University of Huddersfield
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Mass-produced optical lenses often exhibit defects that alter their scattering properties and compromise quality standards. Manual inspection is usually adopted to detect defects, but it is not recommended due to low accuracy, high error rate and limited scalability. To address these challenges, this study presents an automated defect detection system based on the YOLOv8 deep learning model. A custom dataset of optical lenses, annotated with defect and lens regions, was created to train the model. Experimental results obtained in this study reveal that the system can be used to efficiently and accurately detect defects in optical lenses. The proposed system can be utilized in real-time industrial environments to enhance quality control processes by enabling reliable and scalable defect detection in optical lens manufacturing.
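A hedged sketch of how such a YOLOv8 pipeline is typically set up with the ultralytics package; the dataset YAML and image path are hypothetical placeholders standing in for the paper's custom lens-defect dataset.

```python
# Minimal training/inference sketch with the `ultralytics` package
# (pip install ultralytics). `lens_defects.yaml` and `lens_sample.jpg`
# are hypothetical placeholders, not files from the paper.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # pretrained YOLOv8 nano weights
model.train(data="lens_defects.yaml",      # custom dataset config (assumption)
            epochs=100, imgsz=640)

results = model("lens_sample.jpg")         # run detection on one lens image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)     # class id, confidence, bbox corners
```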
[CV-21] DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training
[Quick Read]: This paper addresses the quadratic growth in computational complexity caused by 3D full attention when training Diffusion Transformers (DiTs) for high-definition and long videos; attention can account for up to 95% of end-to-end time and requires specialized communication paradigms for large inputs. The key is DSV, a framework that accelerates and scales video DiT training by exploiting the dynamic attention sparsity inherent in training. DSV uses a two-stage training algorithm that exploits sparsity patterns through efficient, tailored kernels focused on critical elements. To accommodate the new sparsity dimension, a hybrid sparsity-aware context parallelism is developed that handles the heterogeneity of sparsity across attention heads and blocks, yielding optimized sparse computation and communication. Evaluations show that DSV improves training throughput by up to 3.02x with almost no quality degradation.
Link: https://arxiv.org/abs/2502.07590
Authors: Xin Tan, Yuetao Chen, Yimin Jiang, Xing Chen, Kun Yan, Nan Duan, Yibo Zhu, Daxin Jiang, Hong Xu
Affiliations: The Chinese University of Hong Kong; StepFun; Unaffiliated
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion Transformers (DiTs) have shown remarkable performance in modeling and generating high-quality videos. However, the quadratic computational complexity of 3D full attention mechanism presents significant challenges in scaling video DiT training, especially for high-definition and lengthy videos, where attention can dominate up to 95% of the end-to-end time and necessitate specialized communication paradigms to handle large input sizes. This paper introduces DSV, a novel framework designed to accelerate and scale the training of video DiTs by leveraging the inherent dynamic attention sparsity throughout the training process. DSV employs a two-stage training algorithm that exploits sparsity patterns, focusing on critical elements supported by efficient, tailored kernels. To accommodate the new sparsity dimension, we develop a hybrid sparsity-aware context parallelism that effectively scales to large inputs by addressing the heterogeneity of sparsity across attention heads and blocks, resulting in optimized sparse computation and communication. Extensive evaluations demonstrate that DSV achieves up to 3.02x gain in training throughput with nearly no quality degradation.
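As a toy illustration of the sparsity DSV exploits, the snippet below emulates sparse attention densely by keeping only the top-k scores per query. DSV itself realizes the speedup with custom sparse kernels and context parallelism; this only shows the masking idea.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=0.1):
    """Dense emulation of sparse attention: keep only the top scores per
    query and mask out the rest before the softmax. Real systems realize
    the speedup with sparse kernels; this sketch shows the logic only."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (..., N, N)
    n_keep = max(1, int(keep * scores.shape[-1]))
    thresh = scores.topk(n_keep, dim=-1).values[..., -1:]    # per-query cutoff
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 8, 256, 64)   # (batch, heads, tokens, dim)
print(topk_sparse_attention(q, k, v).shape)
```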
[CV-22] An Elliptic Curve Based Solution to the Perspective-Three-Point Problem
[Quick Read]: This paper addresses the Perspective-Three-Point problem (P3P). The key is to first determine the directions of the lines through pairs of control points relative to the camera, rather than the distances from the camera to the control points. This analysis yields an efficient, accurate, and reasonably simple P3P solver, which is compared with the state-of-the-art solver "Lambda Twist". The paper further uncovers an intimate connection between the P3P problem and a special family of elliptic curves, opening new directions for future research.
Link: https://arxiv.org/abs/2502.07564
Authors: Michael Q. Rieck
Affiliations: Drake University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
Comments:
Abstract:The Perspective-Three-Point Problem (P3P) is solved by first focusing on determining the directions of the lines through pairs of control points, relative to the camera, rather than the distances from the camera to the control points. The analysis of this produces an efficient, accurate and reasonably simple P3P solver, which is compared with a state-of-the-art P3P solver, “Lambda Twist.” Both methods depend on the accurate computation of a single root of a cubic polynomial. They have been implemented and tested for a wide range of control-point triangles, and under certain reasonable restrictions, the new method is noticably more accurate than Lambda Twist, though it is slower. However, the principal value of the present work is not in introducing yet another P3P solver, but lies rather in the discovery of an intimate connection between the P3P problem and a special family of elliptic curves that includes curves utilized in cryptography. This holds the potential for further advances in a number of directions. To make this connection, an interesting spherical analogue of an ancient “sliding” problem is stated and solved.
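The shared starting point of P3P solvers, including Lambda Twist and the method here, is converting image points into unit bearing vectors and pairwise angle cosines, after which everything hinges on a single real root of a cubic. A minimal sketch follows, with an arbitrary example cubic standing in for the solver's actual polynomial.

```python
import numpy as np

def p3p_preliminaries(uv, K):
    """Compute unit bearing vectors and pairwise angle cosines for the
    three image points: the common starting point of P3P solvers."""
    pix = np.hstack([uv, np.ones((3, 1))])          # homogeneous pixels
    rays = (np.linalg.inv(K) @ pix.T).T
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    cosines = [rays[i] @ rays[j] for i, j in [(1, 2), (0, 2), (0, 1)]]
    return rays, np.array(cosines)

uv = np.array([[320., 240.], [400., 260.], [350., 300.]])
K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
rays, cosines = p3p_preliminaries(uv, K)
print(cosines)

# Both solvers then reduce the problem to one real root of a cubic,
# which numpy exposes directly (example cubic, not the P3P one):
roots = np.roots([1.0, -2.0, -5.0, 6.0])
print(roots[np.isreal(roots)].real)                 # real roots: 3, -2, 1
```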
[CV-23] Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning
[Quick Read]: This paper addresses the balance between flexibility and stability in class-incremental learning (CIL), especially when the task ID is unknown. The key finding is that the gap in feature distributions between new and existing tasks is driven mainly by differences in mean and covariance moments, which motivates a semantic drift calibration method combining mean-shift compensation with covariance calibration. Specifically, each class mean is computed by averaging its sample embeddings, and task shifts are estimated via weighted embedding changes based on proximity to the previous mean, effectively capturing the mean shift of all learned classes at each new task. A Mahalanobis distance constraint is applied for covariance calibration to mitigate covariance shift, and a feature-level self-distillation approach further improves generalization.
Link: https://arxiv.org/abs/2502.07560
Authors: Fangwen Wu, Lechao Cheng, Shengeng Tang, Xiaofeng Zhu, Chaowei Fang, Dingwen Zhang, Meng Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages
Abstract:Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply Mahalanobis distance constraint for covariance calibration, aligning class-specific embedding covariances between old and current networks to mitigate the covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach. The source code is available at this https URL.
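A simplified sketch of the mean-shift compensation idea: stored class means are shifted by the embedding drift of current-task samples, weighted by proximity to each mean. The Gaussian weighting below is an assumption for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def compensate_class_means(old_means, emb_before, emb_after, tau=1.0):
    """Shift stored class means using the embedding drift of current-task
    samples, weighted by proximity to each old class mean (a simplified
    take on the paper's mean-shift compensation)."""
    drift = emb_after - emb_before                    # (n, d) per-sample drift
    new_means = []
    for mu in old_means:
        d2 = ((emb_before - mu) ** 2).sum(axis=1)     # distance to this class mean
        w = np.exp(-d2 / tau)
        w /= w.sum() + 1e-12
        new_means.append(mu + (w[:, None] * drift).sum(axis=0))
    return np.stack(new_means)

rng = np.random.default_rng(0)
means = rng.normal(size=(5, 16))                       # 5 old class prototypes
e0 = rng.normal(size=(100, 16)); e1 = e0 + 0.05        # uniform drift of 0.05
print(np.allclose(compensate_class_means(means, e0, e1) - means, 0.05))
```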
[CV-24] SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches
[Quick Read]: This paper addresses the difficulty non-expert users face in crafting suitable prompts and specifying fine-grained spatial conditions (e.g., depth or canny references) for text-to-image models to produce semantically cohesive images, especially when multiple objects are involved. The key solution is SketchFlex, a system that automatically infers reasonable user prompts within a semantic space enriched by crowd-sourced object attributes and relationships, and refines users' rough sketches into canny-based shape anchors, improving generation quality and alignment with user intent.
Link: https://arxiv.org/abs/2502.07556
Authors: Haichuan Lin, Yilin Ye, Jiazhi Xia, Wei Zeng
Affiliations: The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China; The Hong Kong University of Science and Technology, Hong Kong SAR; Central South University, Changsha, China
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: conference: CHI2025
Abstract:Text-to-image models can generate visually appealing images from text descriptions. Efforts have been devoted to improving model controls with prompt tuning and spatial conditioning. However, our formative study highlights the challenges for non-expert users in crafting appropriate prompts and specifying fine-grained spatial conditions (e.g., depth or canny references) to generate semantically cohesive images, especially when multiple objects are involved. In response, we introduce SketchFlex, an interactive system designed to improve the flexibility of spatially conditioned image generation using rough region sketches. The system automatically infers user prompts with rational descriptions within a semantic space enriched by crowd-sourced object attributes and relationships. Additionally, SketchFlex refines users’ rough sketches into canny-based shape anchors, ensuring the generation quality and alignment of user intentions. Experimental results demonstrate that SketchFlex achieves more cohesive image generations than end-to-end models, meanwhile significantly reducing cognitive load and better matching user intentions compared to region-based generation baseline.
[CV-25] VidCRAFT3: Camera Object and Lighting Control for Image-to-Video Generation
[Quick Read]: This paper addresses the inability of existing image-to-video generation methods to control multiple visual elements such as camera motion, object motion, and lighting direction at once. The key is the VidCRAFT3 framework with its Spatial Triple-Attention Transformer, which enables simultaneous control over camera motion, object motion, and lighting direction. The authors also construct VideoLightingDirection (VLD), a high-quality synthetic video dataset with lighting-direction annotations, and adopt a three-stage training strategy that removes the need for training data annotated with all visual elements simultaneously, improving the quality and visual consistency of generated videos.
Link: https://arxiv.org/abs/2502.07531
Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
Affiliations: Fudan University; Zhejiang University; Huawei Noah's Ark Lab; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera trajectory or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. To better decouple control over each visual element, we propose the Spatial Triple-Attention Transformer, which integrates lighting direction, text, and image in a symmetric way. Since most real-world video datasets lack lighting annotations, we construct a high-quality synthetic video dataset, the VideoLightingDirection (VLD) dataset. This dataset includes lighting direction annotations and objects of diverse appearance, enabling VidCRAFT3 to effectively handle strong light transmission and reflection effects. Additionally, we propose a three-stage training strategy that eliminates the need for training data annotated with multiple visual elements (camera motion, object motion, and lighting direction) simultaneously. Extensive experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content, surpassing existing state-of-the-art methods in terms of control granularity and visual coherence. All code and data will be publicly available. Project page: this https URL.
[CV-26] CodePhys: Robust Video-based Remote Physiological Measurement through Latent Codebook Querying
[Quick Read]: This paper addresses interference and degradation when extracting physiological signals from facial videos with remote photoplethysmography (rPPG) in real-world settings. Existing methods struggle with distortions caused by non-physiological factors (e.g., camera noise, defocus, and motion blur), which produce noisy rPPG signals. The key solution is CodePhys, which treats rPPG measurement as a code query task in a noise-free proxy space (a codebook) built from ground-truth PPG signals: noisy rPPG features are matched against noise-free PPG features in the codebook to generate high-fidelity rPPG features. The method further incorporates a spatial-aware encoder with a spatial attention mechanism to highlight physiologically active regions, and a distillation loss to reduce the influence of non-periodic visual interference.
Link: https://arxiv.org/abs/2502.07526
Authors: Shuyang Chu, Menghan Xia, Mengyao Yuan, Xin Liu, Tapio Seppanen, Guoying Zhao, Jingang Shi
Affiliations: School of Software Engineering, Xi'an Jiaotong University, China; Tencent AI Lab, Shenzhen, China; Computer Vision and Pattern Recognition Laboratory, Lappeenranta-Lahti University of Technology LUT, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Finland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Remote photoplethysmography (rPPG) aims to measure non-contact physiological signals from facial videos, which has shown great potential in many applications. Most existing methods directly extract video-based rPPG features by designing neural networks for heart rate estimation. Although they can achieve acceptable results, the recovery of rPPG signal faces intractable challenges when interference from real-world scenarios takes place on facial video. Specifically, facial videos are inevitably affected by non-physiological factors (e.g., camera device noise, defocus, and motion blur), leading to the distortion of extracted rPPG signals. Recent rPPG extraction methods are easily affected by interference and degradation, resulting in noisy rPPG signals. In this paper, we propose a novel method named CodePhys, which innovatively treats rPPG measurement as a code query task in a noise-free proxy space (i.e., codebook) constructed by ground-truth PPG signals. We consider noisy rPPG features as queries and generate high-fidelity rPPG features by matching them with noise-free PPG features from the codebook. Our approach also incorporates a spatial-aware encoder network with a spatial attention mechanism to highlight physiologically active areas and uses a distillation loss to reduce the influence of non-periodic visual interference. Experimental results on four benchmark datasets demonstrate that CodePhys outperforms state-of-the-art methods in both intra-dataset and cross-dataset settings.
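The "code query" at the heart of CodePhys is, generically, a nearest-neighbor lookup into a noise-free codebook. Below is a minimal vector-quantization sketch; the paper's codebook is learned from ground-truth PPG signals, whereas this one is random for demonstration.

```python
import torch

def codebook_query(queries, codebook):
    """Replace each noisy feature vector with its nearest noise-free
    codebook entry (generic vector-quantization lookup)."""
    # Squared L2 distances between every query and every code.
    d = (queries.pow(2).sum(1, keepdim=True)
         - 2 * queries @ codebook.t()
         + codebook.pow(2).sum(1))
    idx = d.argmin(dim=1)
    return codebook[idx], idx

codebook = torch.randn(512, 128)     # 512 noise-free prototype features
noisy = codebook[torch.randint(0, 512, (32,))] + 0.01 * torch.randn(32, 128)
clean, idx = codebook_query(noisy, codebook)
print((clean - noisy).norm(dim=1).mean())   # small: queries snap to codes
```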
[CV-27] Enhance-A-Video: Better Generated Video for Free
[Quick Read]: This paper addresses improving the coherence and quality of videos generated by DiT-based models. The key is Enhance-A-Video, a training-free method that strengthens cross-frame correlations based on non-diagonal temporal attention distributions, improving temporal consistency and visual quality. Thanks to its simple design, the approach can be applied to most DiT-based video generation frameworks without retraining or fine-tuning.
Link: https://arxiv.org/abs/2502.07508
Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
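A toy rendering of the core idea: estimate cross-frame correlation from the non-diagonal entries of the temporal attention map and boost those entries. The scaling rule below is a simplification assumed for illustration, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def enhanced_temporal_attention(q, k, v, strength=1.0):
    """Boost the non-diagonal (cross-frame) entries of a temporal
    attention map in proportion to their mean intensity. The exact
    scaling rule here is an assumption, not the paper's formula."""
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    n_frames = attn.shape[-1]
    off_diag = ~torch.eye(n_frames, dtype=torch.bool)
    cfi = attn[..., off_diag].mean()                # cross-frame intensity
    scale = 1.0 + strength * cfi
    boosted = attn * torch.where(off_diag, scale, torch.tensor(1.0))
    boosted = boosted / boosted.sum(dim=-1, keepdim=True)  # renormalize rows
    return boosted @ v

q = k = v = torch.randn(1, 8, 16, 64)   # 16 frames attending over time
print(enhanced_temporal_attention(q, k, v).shape)
```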
[CV-28] Efficient Continuous Group Convolutions for Local SE(3) Equivariance in 3D Point Clouds
[Quick Read]: This paper addresses the high computational cost of extending the equivariance of convolution layers to the SE(3) group (rotation and translation) for 3D point clouds. Existing methods rely on discretization or introduce only global rotation equivariance, which limits their applicability to scene point clouds containing multiple objects. The key contribution is an efficient, continuous, and local SE(3)-equivariant convolution layer based on general group convolution and local reference frames, which improves network expressivity while preserving computational efficiency and achieves competitive or superior performance across a range of datasets and tasks.
Link: https://arxiv.org/abs/2502.07505
Authors: Lisa Weijler, Pedro Hermosilla
Affiliations: TU Wien, Austria
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Extending the translation equivariance property of convolutional neural networks to larger symmetry groups has been shown to reduce sample complexity and enable more discriminative feature learning. Further, exploiting additional symmetries facilitates greater weight sharing than standard convolutions, leading to an enhanced network expressivity without an increase in parameter count. However, extending the equivariant properties of a convolution layer comes at a computational cost. In particular, for 3D data, expanding equivariance to the SE(3) group (rotation and translation) results in a 6D convolution operation, which is not tractable for larger data samples such as 3D scene scans. While efforts have been made to develop efficient SE(3) equivariant networks, existing approaches rely on discretization or only introduce global rotation equivariance. This limits their applicability to point clouds representing a scene composed of multiple objects. This work presents an efficient, continuous, and local SE(3) equivariant convolution layer for point cloud processing based on general group convolution and local reference frames. Our experiments show that our approach achieves competitive or superior performance across a range of datasets and tasks, including object classification and semantic segmentation, with negligible computational overhead.
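Local reference frames are one of the building blocks the layer relies on. A standard way to construct an LRF is PCA on the neighborhood with sign disambiguation; the sketch below shows that common construction (the paper may define its LRFs differently).

```python
import numpy as np

def local_reference_frame(neighborhood, center):
    """One standard LRF construction: principal axes of the local point
    distribution, with signs disambiguated into a proper rotation."""
    X = neighborhood - center
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    R = Vt                                   # rows = principal axes
    if (X @ R[0]).sum() < 0:                 # point the first axis at the mass
        R[0] = -R[0]
    if np.linalg.det(R) < 0:                 # enforce a right-handed frame
        R[2] = -R[2]
    return R

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3)) * np.array([3.0, 1.0, 0.2])
R = local_reference_frame(pts, pts.mean(0))
print(np.round(R @ R.T, 6))   # ~identity: a valid rotation
# Rotating the input rotates the frame with it, which is what makes
# features expressed in this frame locally rotation-equivariant.
```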
[CV-29] RoMA: Robust Malware Attribution via Byte-level Adversarial Training with Global Perturbations and Adversarial Consistency Regularization
[Quick Read]: This paper addresses the vulnerability of APT malware attribution models to adversarial attacks. Existing machine-learning attribution models such as MalConv, while effective, suffer drastic accuracy drops under PGD attacks. The key is RoMA, which introduces global perturbations to generate enhanced adversarial samples and applies adversarial consistency regularization to improve representation quality and robustness. Experiments on the newly constructed AMG18 dataset show that RoMA substantially improves both adversarial robustness and training efficiency while maintaining high standard accuracy in non-adversarial settings.
Link: https://arxiv.org/abs/2502.07492
Authors: Yuxia Sun, Huihong Chen, Jingcai Guo, Aoxiang Sun, Zhetao Li, Haolin Liu
Affiliations: College of Information Science and Technology, Jinan University; Department of Computing, The Hong Kong Polytechnic University; College of Information and Computational Science, Jilin University; School of Computer Science, Xiangtan University
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures
Abstract:Attributing APT (Advanced Persistent Threat) malware to their respective groups is crucial for threat intelligence and cybersecurity. However, APT adversaries often conceal their identities, rendering attribution inherently adversarial. Existing machine learning-based attribution models, while effective, remain highly vulnerable to adversarial attacks. For example, the state-of-the-art byte-level model MalConv sees its accuracy drop from over 90% to below 2% under PGD (projected gradient descent) attacks. Existing gradient-based adversarial training techniques for malware detection or image processing were applied to malware attribution in this study, revealing that both robustness and training efficiency require significant improvement. To address this, we propose RoMA, a novel single-step adversarial training approach that integrates global perturbations to generate enhanced adversarial samples and employs adversarial consistency regularization to improve representation quality and resilience. A novel APT malware dataset named AMG18, with diverse samples and realistic class imbalances, is introduced for evaluation. Extensive experiments show that RoMA significantly outperforms seven competing methods in both adversarial robustness (e.g., achieving over 80% robust accuracy-more than twice that of the next-best method under PGD attacks) and training efficiency (e.g., more than twice as fast as the second-best method in terms of accuracy), while maintaining superior standard accuracy in non-adversarial scenarios.
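A schematic of single-step adversarial training with a consistency term, in the spirit of (but not identical to) RoMA. Since raw malware bytes are discrete, the perturbation here is applied in a continuous embedding space; the loss combination and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def single_step_adv_consistency(model, emb, labels, opt, eps=0.05, beta=1.0):
    """Schematic single-step adversarial training with an adversarial
    consistency term. `emb` is a continuous embedding of the input."""
    emb = emb.detach().requires_grad_(True)
    clean_logits = model(emb)
    loss_clean = F.cross_entropy(clean_logits, labels)
    grad, = torch.autograd.grad(loss_clean, emb)

    adv = emb + eps * grad.sign()                    # one FGSM-style step
    adv_logits = model(adv.detach())
    loss_adv = F.cross_entropy(adv_logits, labels)
    # Consistency: keep clean and adversarial predictions aligned.
    loss_cons = F.kl_div(F.log_softmax(adv_logits, -1),
                         F.softmax(clean_logits.detach(), -1),
                         reduction="batchmean")
    loss = loss_adv + beta * loss_cons

    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(16, 64), torch.randint(0, 10, (16,))
print(single_step_adv_consistency(model, x, y, opt))
```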
[CV-30] Automated Road Extraction and Centreline Fitting in LiDAR Point Clouds
[Quick Read]: This paper addresses road information extraction from 3D point clouds for urban planning and traffic management. Existing methods often rely on local features and the laser refraction angle at kerbs, which makes them sensitive to variable kerb designs and to data homogeneity in high-density areas. The key solution is to extract road points and fit centrelines from a top-down view of LiDAR ground-collected point clouds, which reduces reliance on specific kerb designs and improves road extraction. The pipeline applies statistical outlier removal and density-based clustering to reduce noise, filters ground points with a grid-based segmentation method that adapts to diverse road scenes and terrain, projects the filtered points onto a 2D plane, extracts the road with a skeletonisation algorithm, uses computed normals to guide a region-growing algorithm toward nearby road points, and finally smooths the extracted road points with a Savitzky-Golay filter to produce the centreline. With skeleton post-processing, the method reaches 73% IoU with a 23% reduction in processing time.
Link: https://arxiv.org/abs/2502.07486
Authors: Xinyu Wang, Muhammad Ibrahim, Atif Mansoor, Hasnein Tareque, Ajmal Mian
Affiliations: The University of Western Australia; Department of Primary Industries and Regional Development
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures, accepted in DICTA 2024
Abstract:Road information extraction from 3D point clouds is useful for urban planning and traffic management. Existing methods often rely on local features and the refraction angle of lasers from kerbs, which makes them sensitive to variable kerb designs and issues in high-density areas due to data homogeneity. We propose an approach for extracting road points and fitting centrelines using a top-down view of LiDAR based ground-collected point clouds. This prospective view reduces reliance on specific kerb design and results in better road extraction. We first perform statistical outlier removal and density-based clustering to reduce noise from 3D point cloud data. Next, we perform ground point filtering using a grid-based segmentation method that adapts to diverse road scenarios and terrain characteristics. The filtered points are then projected onto a 2D plane, and the road is extracted by a skeletonisation algorithm. The skeleton is back-projected onto the 3D point cloud with calculated normals, which guide a region growing algorithm to find nearby road points. The extracted road points are then smoothed with the Savitzky-Golay filter to produce the final centreline. Our initial approach without post-processing of road skeleton achieved 67% in IoU by testing on the Perth CBD dataset with different road types. Incorporating the post-processing of the road skeleton improved the extraction of road points around the smoothed skeleton. The refined approach achieved a higher IoU value of 73% and with 23% reduction in the processing time. Our approach offers a generalised and computationally efficient solution that combines 3D and 2D processing techniques, laying the groundwork for future road reconstruction and 3D-to-2D point cloud alignment.
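Two steps of this pipeline map directly onto standard scientific-Python tools: statistical outlier removal via k-NN distances, and skeletonization of a 2D occupancy grid. The parameters and the synthetic road below are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.spatial import cKDTree
from skimage.morphology import skeletonize

def statistical_outlier_removal(pts, k=8, std_ratio=2.0):
    """Drop points whose mean k-NN distance is far above the global mean
    (the classic statistical outlier removal step of the pipeline)."""
    d, _ = cKDTree(pts).query(pts, k=k + 1)     # first neighbor is the point itself
    mean_d = d[:, 1:].mean(axis=1)
    return pts[mean_d < mean_d.mean() + std_ratio * mean_d.std()]

def road_skeleton(ground_pts, cell=0.5):
    """Project filtered ground points to a 2D occupancy grid and thin it
    to a one-pixel-wide skeleton approximating the road centreline."""
    xy = ground_pts[:, :2]
    ij = ((xy - xy.min(0)) / cell).astype(int)
    grid = np.zeros(tuple(ij.max(0) + 1), dtype=bool)
    grid[ij[:, 0], ij[:, 1]] = True
    return skeletonize(grid)

rng = np.random.default_rng(0)
road = np.column_stack([rng.uniform(0, 50, 4000),
                        rng.uniform(0, 4, 4000),
                        np.zeros(4000)])          # a synthetic 50 m x 4 m road
pts = statistical_outlier_removal(road)
print(road_skeleton(pts).sum(), "skeleton pixels")
```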
[CV-31] Less is More: Masking Elements in Image Condition Features Avoids Content Leakages in Style Transfer Diffusion Models
[Quick Read]: This paper addresses the difficulty text-to-image diffusion models have in disentangling content from style when a style-reference image is used, which leads to problems such as content leakage. The key is a masking-based method: by simply masking specific elements of the style reference's image features, content and style are effectively decoupled without tuning any model parameters. Guiding with appropriately selected fewer conditions (e.g., dropping several image feature elements) efficiently prevents unwanted content from flowing into the diffusion model, improving the style-transfer performance of text-to-image diffusion models.
Link: https://arxiv.org/abs/2502.07466
Authors: Lin Zhu, Xinbing Wang, Chenghu Zhou, Qinying Gu, Nanyang Ye
Affiliations: Shanghai Jiao Tong University; Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Given a style-reference image as the additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style from style-reference images, leading to issues such as content leakages. To address this issue, we propose a masking-based method that efficiently decouples content from style without the need of tuning any model parameters. By simply masking specific elements in the style reference’s image features, we uncover a critical yet under-explored principle: guiding with appropriately-selected fewer conditions (e.g., dropping several image feature elements) can efficiently avoid unwanted content flowing into the diffusion models, enhancing the style transfer performances of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.
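A toy sketch of the masking idea: zero out a fraction of the style-feature elements before they condition generation. The selection rule here (drop the tokens most aligned with a content embedding) is an illustrative assumption; the paper analyzes which elements to mask.

```python
import torch

def mask_style_features(style_feats, content_emb, drop_frac=0.2):
    """Zero out the style-feature elements most aligned with a content
    embedding before using them as an image condition. The alignment-based
    selection rule is illustrative, not the paper's criterion."""
    # style_feats: (N, d) token features of the style reference
    # content_emb: (d,) an embedding summarizing unwanted content
    scores = style_feats @ content_emb                    # per-token alignment
    n_drop = int(drop_frac * len(scores))
    drop_idx = scores.topk(n_drop).indices
    masked = style_feats.clone()
    masked[drop_idx] = 0.0              # fewer conditions in, less leakage
    return masked

feats = torch.randn(77, 768)
content = torch.randn(768)
print((mask_style_features(feats, content) == 0).all(dim=1).sum())  # dropped tokens
```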
[CV-32] Bidirectional Uncertainty-Aware Region Learning for Semi-Supervised Medical Image Segmentation
[Quick Read]: This paper addresses the accumulation of erroneous pseudo-labels in semi-supervised medical image segmentation, caused by low-quality unlabeled data and uncertainty in model predictions, which weakens model performance. The key is a bidirectional uncertainty-aware region learning strategy: when training on labeled data, the model focuses on high-uncertainty regions and uses precise label information to guide learning in potentially uncontrollable areas; when training on unlabeled data, it concentrates on low-uncertainty regions to reduce interference from erroneous pseudo-labels. This bidirectional strategy significantly improves overall performance.
Link: https://arxiv.org/abs/2502.07457
Authors: Shiwei Zhou, Haifeng Zhao, Dengdi Sun
Affiliations: School of Artificial Intelligence, Anhui University; School of Computer Science and Technology, Anhui University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In semi-supervised medical image segmentation, the poor quality of unlabeled data and the uncertainty in the model’s predictions lead to models that inevitably produce erroneous pseudo-labels. These errors accumulate throughout model training, thereby weakening the model’s performance. We found that these erroneous pseudo-labels are typically concentrated in high-uncertainty regions. Traditional methods improve performance by directly discarding pseudo-labels in these regions, but this can also result in neglecting potentially valuable training data. To alleviate this problem, we propose a bidirectional uncertainty-aware region learning strategy. In training labeled data, we focus on high-uncertainty regions, using precise label information to guide the model’s learning in potentially uncontrollable areas. Meanwhile, in the training of unlabeled data, we concentrate on low-uncertainty regions to reduce the interference of erroneous pseudo-labels on the model. Through this bidirectional learning strategy, the model’s overall performance has significantly improved. Extensive experiments show that our proposed method achieves significant performance improvement on different medical image segmentation tasks.
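A sketch of the bidirectional strategy using prediction entropy as the uncertainty measure: the labeled loss is weighted toward high-uncertainty pixels, the pseudo-label loss toward low-uncertainty pixels. Thresholds and the exact weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_uncertainty_losses(logits_l, target_l,
                                     logits_u, pseudo_u, thresh=0.5):
    """Labeled data: focus on high-uncertainty pixels (ground truth can
    guide them). Unlabeled data: keep low-uncertainty pixels (their
    pseudo-labels are more trustworthy). Entropy is the uncertainty here."""
    def entropy(logits):
        p = F.softmax(logits, dim=1)
        return -(p * p.clamp_min(1e-8).log()).sum(dim=1)   # (B, H, W)

    u_l, u_u = entropy(logits_l), entropy(logits_u)
    ce_l = F.cross_entropy(logits_l, target_l, reduction="none")
    ce_u = F.cross_entropy(logits_u, pseudo_u, reduction="none")

    hi_mask = (u_l > thresh).float()      # labeled: uncertain regions
    lo_mask = (u_u <= thresh).float()     # unlabeled: confident regions
    loss_l = (ce_l * hi_mask).sum() / hi_mask.sum().clamp_min(1.0)
    loss_u = (ce_u * lo_mask).sum() / lo_mask.sum().clamp_min(1.0)
    return loss_l + loss_u

logits_l = torch.randn(2, 4, 32, 32)                  # 4-class segmentation
logits_u = torch.randn(2, 4, 32, 32)
target = torch.randint(0, 4, (2, 32, 32))
pseudo = logits_u.argmax(dim=1)
print(bidirectional_uncertainty_losses(logits_l, target, logits_u, pseudo))
```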
[CV-33] FedAPA: Server-side Gradient-Based Adaptive Personalized Aggregation for Federated Learning on Heterogeneous Data
[Quick Read]: This paper addresses the challenges personalized federated learning (PFL) faces on heterogeneous data in terms of accuracy, computational efficiency, and communication overhead. The key is FedAPA, which uses a server-side, gradient-based adaptive aggregation strategy: aggregation weights are updated centrally according to the gradients of client-parameter changes with respect to the aggregation weights, producing personalized models. The method comes with a theoretical convergence guarantee and achieves superior accuracy and computational efficiency over 10 PFL competitors on three datasets, with comparable communication overhead.
Link: https://arxiv.org/abs/2502.07456
Authors: Yuxia Sun, Aoxiang Sun, Siyi Pan, Zhixiao Fu, Jingcai Guo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 2 figures
Abstract:Personalized federated learning (PFL) tailors models to clients’ unique data distributions while preserving privacy. However, existing aggregation-weight-based PFL methods often struggle with heterogeneous data, facing challenges in accuracy, computational efficiency, and communication overhead. We propose FedAPA, a novel PFL method featuring a server-side, gradient-based adaptive aggregation strategy to generate personalized models, by updating aggregation weights based on gradients of client-parameter changes with respect to the aggregation weights in a centralized manner. FedAPA guarantees theoretical convergence and achieves superior accuracy and computational efficiency compared to 10 PFL competitors across three datasets, with competitive communication overhead.
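A schematic server-side round in the spirit of FedAPA: personalized models are weighted sums of client parameters, and each client's aggregation weights are nudged by how well the other clients' parameters align with its own update direction. This update rule is a simplification of the paper's gradient-based derivation, not its exact algorithm.

```python
import numpy as np

def adaptive_aggregation_round(client_params, weights, deltas, lr=0.1):
    """Schematic server step: build personalized models as weighted sums
    of client parameters, then adapt the weights from client updates."""
    theta = np.stack(client_params)                 # (n_clients, dim)
    personalized = weights @ theta                  # (n_clients, dim)
    # Gradient signal: how much does each source model agree with the
    # direction client i moved in this round?
    grad = -np.stack(deltas) @ theta.T              # (n_clients, n_clients)
    weights = weights - lr * grad
    weights = np.clip(weights, 0.0, None)
    weights /= weights.sum(axis=1, keepdims=True)   # keep rows on the simplex
    return personalized, weights

n, d = 4, 10
rng = np.random.default_rng(0)
params = [rng.normal(size=d) for _ in range(n)]
w = np.full((n, n), 1.0 / n)
deltas = [rng.normal(size=d) * 0.1 for _ in range(n)]
models, w = adaptive_aggregation_round(params, w, deltas)
print(w.round(2))
```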
[CV-34] Optimizing Knowledge Distillation in Transformers: Enabling Multi-Head Attention without Alignment Barriers
[Quick Read]: This paper addresses the challenge knowledge distillation (KD) faces in transformers when the numbers of attention heads in teacher and student do not match. Existing methods either require identical head counts or introduce projectors to bridge dimensional gaps, limiting flexibility and efficiency. The key is Squeezing-Heads Distillation (SHD), which compresses multi-head attention maps via an efficient linear approximation, enabling seamless knowledge transfer between models with different head counts. SHD removes alignment barriers without extra parameters or architectural changes; its core innovations, flexible head compression, a projector-free design, and linear time complexity, make it a versatile and scalable distillation scheme for modern transformers.
Link: https://arxiv.org/abs/2502.07436
Authors: Zhaodong Bing, Linze Li, Jiajun Liang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Knowledge distillation (KD) in transformers often faces challenges due to misalignment in the number of attention heads between teacher and student models. Existing methods either require identical head counts or introduce projectors to bridge dimensional gaps, limiting flexibility and efficiency. We propose Squeezing-Heads Distillation (SHD), a novel approach that enables seamless knowledge transfer between models with varying head counts by compressing multi-head attention maps via efficient linear approximation. Unlike prior work, SHD eliminates alignment barriers without additional parameters or architectural modifications. Our method dynamically approximates the combined effect of multiple teacher heads into fewer student heads, preserving fine-grained attention patterns while reducing redundancy. Experiments across language (LLaMA, GPT) and vision (DiT, MDT) generative and vision (DeiT) discriminative tasks demonstrate SHD’s effectiveness: it outperforms logit-based and feature-alignment KD baselines, achieving state-of-the-art results in image classification, image generation language fine-tuning, and language pre-training. The key innovations of flexible head compression, projector-free design, and linear-time complexity make SHD a versatile and scalable solution for distilling modern transformers. This work bridges a critical gap in KD, enabling efficient deployment of compact models without compromising performance.
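A sketch of the "squeezing" idea: linearly combine the teacher's attention maps over the head dimension to match the student's head count, then distill with an MSE loss. The paper uses an efficient fixed linear approximation; the learned mixing layer below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SqueezeHeadsLoss(nn.Module):
    """Distill Ht teacher attention maps into Hs student maps by linearly
    combining teacher heads (sketch; the paper's combination is an
    efficient linear approximation rather than a learned layer)."""
    def __init__(self, teacher_heads, student_heads):
        super().__init__()
        self.mix = nn.Linear(teacher_heads, student_heads, bias=False)

    def forward(self, attn_t, attn_s):
        # attn_*: (B, H, N, N) softmax attention maps
        squeezed = self.mix(attn_t.permute(0, 2, 3, 1))   # mix over heads
        squeezed = squeezed.permute(0, 3, 1, 2)           # back to (B, Hs, N, N)
        return nn.functional.mse_loss(squeezed, attn_s)

loss_fn = SqueezeHeadsLoss(teacher_heads=12, student_heads=4)
attn_t = torch.rand(2, 12, 64, 64).softmax(dim=-1)
attn_s = torch.rand(2, 4, 64, 64).softmax(dim=-1)
print(loss_fn(attn_t, attn_s))
```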
[CV-35] ArthroPhase: A Novel Dataset and Method for Phase Recognition in Arthroscopic Video
[Quick Read]: This study advances surgical phase recognition in anterior cruciate ligament (ACL) reconstruction by introducing the first arthroscopy dataset and a novel transformer-based model. The key is leveraging spatio-temporal features to handle the challenges specific to arthroscopic video, such as a limited field of view, occlusions, and visual distortions. The team built the ACL27 dataset of 27 ACL surgery videos and adopted a transformer-based architecture with temporal-aware frame-wise feature extraction via a ResNet-50 followed by transformer layers to integrate spatio-temporal features. A Surgical Progress Index (SPI) is also introduced to quantify surgery progression, improving the precision and reliability of phase recognition.
Link: https://arxiv.org/abs/2502.07431
Authors: Ali Bahari Malayeri, Matthias Seibold, Nicola Cavalcanti, Jonas Hein, Sascha Jecklin, Lazaros Vlachopoulos, Sandro Fucentese, Sandro Hodel, Philipp Furnstahl
Affiliations: Balgrist University Hospital, University of Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This study aims to advance surgical phase recognition in arthroscopic procedures, specifically Anterior Cruciate Ligament (ACL) reconstruction, by introducing the first arthroscopy dataset and developing a novel transformer-based model. We aim to establish a benchmark for arthroscopic surgical phase recognition by leveraging spatio-temporal features to address the specific challenges of arthroscopic videos including limited field of view, occlusions, and visual distortions. We developed the ACL27 dataset, comprising 27 videos of ACL surgeries, each labeled with surgical phases. Our model employs a transformer-based architecture, utilizing temporal-aware frame-wise feature extraction through a ResNet-50 and transformer layers. This approach integrates spatio-temporal features and introduces a Surgical Progress Index (SPI) to quantify surgery progression. The model’s performance was evaluated using accuracy, precision, recall, and Jaccard Index on the ACL27 and Cholec80 datasets. The proposed model achieved an overall accuracy of 72.91% on the ACL27 dataset. On the Cholec80 dataset, the model achieved a comparable performance with the state-of-the-art methods with an accuracy of 92.4%. The SPI demonstrated an output error of 10.6% and 9.86% on ACL27 and Cholec80 datasets respectively, indicating reliable surgery progression estimation. This study introduces a significant advancement in surgical phase recognition for arthroscopy, providing a comprehensive dataset and a robust transformer-based model. The results validate the model’s effectiveness and generalizability, highlighting its potential to improve surgical training, real-time assistance, and operational efficiency in orthopedic surgery. The publicly available dataset and code will facilitate future research and development in this critical field.
[CV-36] MoENAS: Mixture-of-Expert based Neural Architecture Search for jointly Accurate Fair and Robust Edge Deep Neural Networks
[Quick Read]: This paper addresses the fact that existing edge deep neural networks (edge DNNs) are optimized while overlooking fairness, robustness, and generalization. Traditional optimization techniques such as pruning, and more recent automatic design methods, improve accuracy and efficiency, yet the resulting models remain deficient on these criteria: SOTA edge DNNs exhibit up to a 14.09% accuracy disparity across skin tones in image classification on the FACET dataset, along with non-robustness and poor generalization.

To address these issues, the paper proposes Mixture-of-Experts-based Neural Architecture Search (MoENAS), which explores a space of mixtures of experts to discover more accurate, fairer, more robust, and more general edge DNNs. The key is that MoENAS not only improves accuracy but also reduces the skin-tone accuracy disparity, enhances robustness, and reduces overfitting, all while keeping model size close to that of SOTA models.
Link: https://arxiv.org/abs/2502.07422
Authors: Lotfi Abdelkrim Mecharbat, Alberto Marchisio, Muhammad Shafique, Mohammad M. Ghassemi, Tuka Alhanai
Affiliations: Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute, NYUAD, Abu Dhabi, UAE; eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD), Abu Dhabi, UAE; Department of Computer Science, Michigan State University, MI, USA
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:There has been a surge in optimizing edge Deep Neural Networks (DNNs) for accuracy and efficiency using traditional optimization techniques such as pruning, and more recently, employing automatic design methodologies. However, the focus of these design techniques has often overlooked critical metrics such as fairness, robustness, and generalization. As a result, when evaluating SOTA edge DNNs' performance in image classification using the FACET dataset, we found that they exhibit significant accuracy disparities (14.09%) across 10 different skin tones, alongside issues of non-robustness and poor generalizability. In response to these observations, we introduce Mixture-of-Experts-based Neural Architecture Search (MoENAS), an automatic design technique that navigates through a space of mixture of experts to discover accurate, fair, robust, and general edge DNNs. MoENAS improves the accuracy by 4.02% compared to SOTA edge DNNs and reduces the skin tone accuracy disparities from 14.09% to 5.60%, while enhancing robustness by 3.80% and minimizing overfitting to 0.21%, all while keeping model size close to state-of-the-art models' average size (+0.4M). With these improvements, MoENAS establishes a new benchmark for edge DNN design, paving the way for the development of more inclusive and robust edge DNNs.
[CV-37] Fast-COS: A Fast One-Stage Object Detector Based on Reparameterized Attention Vision Transformer for Autonomous Driving
[Quick Read]: This paper addresses the difficulty of balancing real-time detection accuracy and processing speed in driving scenes under limited computational resources. The key is Fast-COS, a novel single-stage object detection framework built around a Reparameterized Attention Vision Transformer (RAViT). RAViT uses Reparameterized Multi-Scale Depth-Wise Convolution (RepMSDW) and Reparameterized Self-Attention (RepSA) to boost computational efficiency and feature extraction. Integrating RepMSDW into a feature pyramid network forms RepFPN, enabling fast multi-scale feature fusion. These improvements let Fast-COS keep high accuracy while raising GPU inference speed by up to 75.9% and edge-device throughput by 1.38x, demonstrating its efficiency and reliability for real-time applications.
Link: https://arxiv.org/abs/2502.07417
Authors: Novendra Setyawan, Ghufron Wahyu Kurniawan, Chi-Chia Sun, Wen-Kai Kuo, Jun-Wei Hsieh
Affiliations: Department of Electro-Optics, National Formosa University, Taiwan; Department of Electrical Engineering, University of Muhammadiyah Malang, Indonesia; Department of Electrical Engineering, National Formosa University, Taiwan; Department of Electrical Engineering, National Taipei University, Taiwan; College of Artificial Intelligence and Green Energy, National Yang Ming Chiao Tung University, Taiwan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review on IEEE Transactions on Intelligent Transportation Systems
Abstract:The perception system plays a critical role in an autonomous driving system for ensuring safety. The driving scene perception system fundamentally represents an object detection task that requires achieving a balance between accuracy and processing speed. Many contemporary methods focus on improving detection accuracy but often overlook the importance of real-time detection capabilities when computational resources are limited. Thus, it is vital to investigate efficient object detection strategies for driving scenes. This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scene applications. The research initiates with an analysis of the backbone, considering both macro and micro architectural designs, yielding the Reparameterized Attention Vision Transformer (RAViT). RAViT utilizes Reparameterized Multi-Scale Depth-Wise Convolution (RepMSDW) and Reparameterized Self-Attention (RepSA) to enhance computational efficiency and feature extraction. In extensive tests across GPU, edge, and mobile platforms, RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset, demonstrating significant throughput improvements over comparable backbone models such as ResNet, FastViT, RepViT, and EfficientFormer. Additionally, integrating RepMSDW into a feature pyramid network forms RepFPN, enabling fast and multi-scale feature fusion. Fast-COS enhances object detection in driving scenes, attaining an AP50 score of 57.2% on the BDD100K dataset and 80.0% on the TJU-DHD Traffic dataset. It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38x higher throughput on edge devices compared to FCOS, YOLOF, and RetinaNet. These findings establish Fast-COS as a highly scalable and reliable solution suitable for real-time applications, especially in resource-limited environments like autonomous driving systems.
[CV-38] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
[Quick Read]: This paper targets egocentric scene-text question answering (QA) assistance. Current multimodal large language models reach only around 33% accuracy on this task, revealing a severe deficiency in existing techniques. The key findings are that precise temporal grounding and multi-frame reasoning, together with high resolution and auxiliary scene-text inputs, are critical for better performance. By constructing the EgoTextVQA benchmark and evaluating existing models on it, the authors aim to provide a solid testbed for future research.
Link: https://arxiv.org/abs/2502.07411
Authors: Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
Affiliations: National University of Singapore; Hefei University of Technology; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real-user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance.
[CV-39] MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
[Quick Read]: This paper addresses the challenges of whole-slide pathology image classification, including gigapixel image sizes and limited annotation labels, which hinder model generalization. It proposes a prompt learning method that adapts large vision-language models for few-shot pathology classification via multi-granular attention and an optimal-transport-based visual-text distance. The key is combining learnable prompt embeddings with multi-granular attention that captures both fine-grained details and broader context, improving the recognition of complex patterns across sub-regions; the optimal-transport-based distance further strengthens robustness against perturbations that may occur during data augmentation.
Link: https://arxiv.org/abs/2502.07409
Authors: Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: first version
Abstract:Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and fine-tunes with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that compares interactions between learnable prompts with individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at this MGPATH.
[CV-40] No Data No Optimization: A Lightweight Method To Disrupt Neural Networks With Sign-Flips
[Quick Read]: This paper addresses the extreme sensitivity of deep neural networks (DNNs) to tiny parameter perturbations: flipping only a handful of sign bits can severely degrade network performance. The key solution is Deep Neural Lesion (DNL), a data-free, lightweight method that locates these critical parameter bits and triggers massive accuracy drops without any training data or optimization; it can be carried out via common software, firmware, or hardware attack vectors. An enhanced variant using a single forward and backward pass further amplifies the damage.
Link: https://arxiv.org/abs/2502.07408
Authors: Ido Galil, Moshe Kimhi, Ran El-Yaniv
Affiliations: Technion; NVIDIA
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep Neural Networks (DNNs) can be catastrophically disrupted by flipping only a handful of sign bits in their parameters. We introduce Deep Neural Lesion (DNL), a data-free, lightweight method that locates these critical parameters and triggers massive accuracy drops. We validate its efficacy on a wide variety of computer vision models and datasets. The method requires no training data or optimization and can be carried out via common software-, firmware-, or hardware-based attack vectors. An enhanced variant that uses a single forward and backward pass further amplifies the damage beyond DNL's zero-pass approach. Flipping just two sign bits in ResNet50 on ImageNet reduces accuracy by 99.8%. We also show that selectively protecting a small fraction of vulnerable sign bits provides a practical defense against such attacks.
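What a sign-bit flip actually does is easy to demonstrate: XOR bit 31 of an IEEE-754 float32. The targeting rule below (largest-magnitude weights) is purely illustrative, not DNL's actual selection criterion.

```python
import numpy as np

def flip_sign_bits(weights, idx):
    """Flip the IEEE-754 sign bit of selected float32 entries by XOR-ing
    bit 31, the kind of single-bit corruption DNL exploits."""
    w = weights.copy()
    bits = w.view(np.uint32)
    bits[idx] ^= np.uint32(0x80000000)       # toggle only the sign bit
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=8).astype(np.float32)
# Illustrative targeting rule (not DNL's actual criterion): hit the
# largest-magnitude weights, where a sign flip changes the output most.
idx = np.argsort(-np.abs(w))[:2]
print(w)
print(flip_sign_bits(w, idx))                # same magnitudes, flipped signs
```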
[CV-41] Human-in-the-Loop Annotation for Image-Based Engagement Estimation: Assessing the Impact of Model Reliability on Annotation Accuracy
[Quick Read]: This study explores how integrating a high-performing image-based emotion model into a human-in-the-loop (HITL) annotation framework can improve annotation accuracy for emotion estimation systems. The key is investigating how model reliability and cognitive framing affect annotators' trust, cognitive load, and annotation behavior in HITL systems. The study shows that model reliability and psychological factors significantly influence annotators' trust, engagement, and consistency, informing the optimization of HITL frameworks. Behavioral and qualitative data from three experimental scenarios (baseline model reliability, fabricated errors, and cognitive bias induced by negative framing) highlight the importance of reliable machine outputs and psychological factors for effective human-machine collaboration.
Link: https://arxiv.org/abs/2502.07404
Authors: Sahana Yadnakudige Subramanya, Ko Watanabe, Andreas Dengel, Shoya Ishimaru
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human-in-the-loop (HITL) frameworks are increasingly recognized for their potential to improve annotation accuracy in emotion estimation systems by combining machine predictions with human expertise. This study focuses on integrating a high-performing image-based emotion model into a HITL annotation framework to evaluate the collaborative potential of human-machine interaction and identify the psychological and practical factors critical to successful collaboration. Specifically, we investigate how varying model reliability and cognitive framing influence human trust, cognitive load, and annotation behavior in HITL systems. We demonstrate that model reliability and psychological framing significantly impact annotators’ trust, engagement, and consistency, offering insights into optimizing HITL frameworks. Through three experimental scenarios with 29 participants–baseline model reliability (S1), fabricated errors (S2), and cognitive bias introduced by negative framing (S3)–we analyzed behavioral and qualitative data. Reliable predictions in S1 yielded high trust and annotation consistency, while unreliable outputs in S2 led to increased critical evaluations but also heightened frustration and response variability. Negative framing in S3 revealed how cognitive bias influenced participants to perceive the model as more relatable and accurate, despite misinformation regarding its reliability. These findings highlight the importance of both reliable machine outputs and psychological factors in shaping effective human-machine collaboration. By leveraging the strengths of both human oversight and automated systems, this study establishes a scalable HITL framework for emotion annotation and lays the foundation for broader applications in adaptive learning and human-computer interaction.
[CV-42] Extended monocular 3D imaging
[Quick Read]: This paper addresses the bulk and complexity of existing 3D imaging hardware and its low image resolution, as well as the common failure of 3D imaging in low-texture, highly reflective, or nearly transparent scenes. The key is an Extended Monocular 3D Imaging (EM3D) framework that fully exploits the vectorial wave nature of light: by fusing diffraction- and polarization-based depth cues with a compact monocular camera equipped with a diffractive-refractive hybrid lens, it demonstrates snapshot acquisition of accurate million-pixel 3D point clouds for traditionally challenging scenes, without a data prior. Combining depth and polarization information further unlocks new opportunities in material identification, expanding machine intelligence for applications such as target recognition and face anti-spoofing.
Link: https://arxiv.org/abs/2502.07403
Authors: Zicheng Shen, Feng Zhao, Yibo Ni, Yuanmu Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:
Abstract:3D vision is of paramount importance for numerous applications ranging from machine intelligence to precision metrology. Despite much recent progress, the majority of 3D imaging hardware remains bulky and complicated and provides much lower image resolution compared to their 2D counterparts. Moreover, there are many well-known scenarios that existing 3D imaging solutions frequently fail. Here, we introduce an extended monocular 3D imaging (EM3D) framework that fully exploits the vectorial wave nature of light. Via the multi-stage fusion of diffraction- and polarization-based depth cues, using a compact monocular camera equipped with a diffractive-refractive hybrid lens, we experimentally demonstrate the snapshot acquisition of a million-pixel and accurate 3D point cloud for extended scenes that are traditionally challenging, including those with low texture, being highly reflective, or nearly transparent, without a data prior. Furthermore, we discover that the combination of depth and polarization information can unlock unique new opportunities in material identification, which may further expand machine intelligence for applications like target recognition and face anti-spoofing. The straightforward yet powerful architecture thus opens up a new path for a higher-dimensional machine vision in a minimal form factor, facilitating the deployment of monocular cameras for applications in much more diverse scenarios.
[CV-43] FADE: Forecasting for Anomaly Detection on ECG
[Quick Read]: This paper addresses early and accurate detection of cardiovascular disease to improve patient outcomes. Traditional approaches depend on manual interpretation of electrocardiogram (ECG) signals, which is time-consuming and limited by the expertise of healthcare professionals. The key is FADE, a deep learning system trained in a self-supervised manner with a novel morphology-inspired loss function that forecasts the future of normal ECG signals, removing the need for large labeled datasets and manual interpretation. Using a new distance function to compare forecast ECG signals with actual sensor data, FADE effectively identifies cardiac anomalies and can be adapted to new contexts through domain adaptation techniques. Experiments on two public datasets show an average anomaly-detection accuracy of 83.84% and 85.46% accuracy in classifying normal ECG signals, surpassing prior methods that mainly identify a limited range of anomalies.
Link: https://arxiv.org/abs/2502.07389
Authors: Paula Ruiz-Barroso, Francisco M. Castro, José Miranda, Denisa-Andreea Constantinescu, David Atienza, Nicolás Guil
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cardiovascular diseases, a leading cause of noncommunicable disease-related deaths, require early and accurate detection to improve patient outcomes. Taking advantage of advances in machine learning and deep learning, multiple approaches have been proposed in the literature to address the challenge of detecting ECG anomalies. Typically, these methods are based on the manual interpretation of ECG signals, which is time consuming and depends on the expertise of healthcare professionals. The objective of this work is to propose a deep learning system, FADE, designed for normal ECG forecasting and anomaly detection, which reduces the need for extensive labeled datasets and manual interpretation. FADE has been trained in a self-supervised manner with a novel morphological inspired loss function. Unlike conventional models that learn from labeled anomalous ECG waveforms, our approach predicts the future of normal ECG signals, thus avoiding the need for extensive labeled datasets. Using a novel distance function to compare forecasted ECG signals with actual sensor data, our method effectively identifies cardiac anomalies. Additionally, this approach can be adapted to new contexts through domain adaptation techniques. To evaluate our proposal, we performed a set of experiments using two publicly available datasets: MIT-BIH NSR and MIT-BIH Arrythmia. The results demonstrate that our system achieves an average accuracy of 83.84% in anomaly detection, while correctly classifying normal ECG signals with an accuracy of 85.46%. Our proposed approach exhibited superior performance in the early detection of cardiac anomalies in ECG signals, surpassing previous methods that predominantly identify a limited range of anomalies. FADE effectively detects both abnormal heartbeats and arrhythmias, offering significant advantages in healthcare through cost reduction or processing of large-scale ECG data.
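On the detection side, the approach reduces to thresholding a distance between the forecast "normal" continuation and the observed signal. The sketch below uses a plain normalized L2 distance in place of the paper's purpose-built distance function; the signals and threshold are synthetic.

```python
import numpy as np

def anomaly_score(forecast, observed):
    """Normalized distance between the forecast normal continuation and
    the observed signal (a stand-in for the paper's distance function)."""
    return np.linalg.norm(forecast - observed) / (np.linalg.norm(forecast) + 1e-8)

t = np.linspace(0, 2 * np.pi, 200)
forecast = np.sin(5 * t)                         # predicted normal rhythm
normal = np.sin(5 * t) + 0.05 * np.random.default_rng(0).normal(size=200)
ectopic = np.sin(5 * t); ectopic[80:110] += 1.5  # injected abnormal beat

threshold = 0.3                                  # would be fit on held-out normals
for name, sig in [("normal", normal), ("ectopic", ectopic)]:
    s = anomaly_score(forecast, sig)
    print(f"{name}: score={s:.2f} anomaly={s > threshold}")
```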
[CV-44] Spatial Degradation-Aware and Temporal Consistent Diffusion Model for Compressed Video Super-Resolution
[Quick Read]: This paper addresses the loss of detail and temporal consistency in low-quality videos degraded by compression and low resolution during storage and transmission. The key is a Spatial Degradation-Aware and Temporal Consistent (SDATC) diffusion model: a Distortion Control Module (DCM) modulates the diffusion model's inputs and guides generation, after which the denoising process generates textures with a fine-tuned prompt-based compression-aware module (PCAM) and a spatio-temporal attention module (STAM). PCAM dynamically encodes compression-specific information, while STAM extends spatial attention to the spatio-temporal dimension to capture temporal correlation. Experiments on benchmark datasets validate the effectiveness of the proposed modules in enhancing compressed videos.
Link: https://arxiv.org/abs/2502.07381
Authors: Hongyu An, Xinfeng Zhang, Shijie Zhao, Li Zhang
Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences; Bytedance Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Due to limitations of storage and bandwidth, videos stored and transmitted on the Internet are usually low-quality with low-resolution and compression noise. Although video super-resolution (VSR) is an efficient technique to enhance video resolution, relatively few VSR methods focus on compressed videos. Directly applying general VSR approaches leads to the failure of improving practical videos, especially when frames are highly compressed at a low bit rate. Recently, diffusion models have achieved superior performance in low-level visual tasks, and their high-realism generation capability enables them to be applied in VSR. To synthesize more compression-lost details and refine temporal consistency, we propose a novel Spatial Degradation-Aware and Temporal Consistent (SDATC) diffusion model for compressed VSR. Specifically, we introduce a distortion Control module (DCM) to modulate diffusion model inputs and guide the generation. Next, the diffusion model executes the denoising process for texture generation with fine-tuned spatial prompt-based compression-aware module (PCAM) and spatio-temporal attention module (STAM). PCAM extracts features to encode specific compression information dynamically. STAM extends the spatial attention mechanism to a spatio-temporal dimension for capturing temporal correlation. Extensive experimental results on benchmark datasets demonstrate the effectiveness of the proposed modules in enhancing compressed videos.
[CV-45] USRNet: Unified Scene Recovery Network for Enhancing Traffic Imaging under Multiple Adverse Weather Conditions
[Quick Read]: This paper addresses image degradation caused by adverse weather conditions (haze, rain, snow, and more complex mixed degradations), which severely affects the accuracy and reliability of intelligent transportation and visual surveillance systems. The key is the Unified Scene Recovery Network (USRNet), which can handle multiple types of image degradation. Its architecture comprises a scene encoder, an attention-driven node independent learning mechanism (NILM), an edge decoder, and a scene restoration module. NILM in particular enables the network to learn and respond to different scenarios with precision, improving its adaptability and robustness under diverse weather conditions.
Link: https://arxiv.org/abs/2502.07372
Authors: Yuxu Lu, Ai Chen, Dong Yang, Ryan Wen Liu
Affiliations: PolyU; UESTC; WHUT
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Advancements in computer vision technology have facilitated the extensive deployment of intelligent transportation systems and visual surveillance systems across various applications, including autonomous driving, public safety, and environmental monitoring. However, adverse weather conditions such as haze, rain, snow, and more complex mixed degradation can significantly degrade image quality. The degradation compromises the accuracy and reliability of these systems across various scenarios. To tackle the challenge of developing adaptable models for scene restoration, we introduce the unified scene recovery network (USRNet), capable of handling multiple types of image degradation. The USRNet features a sophisticated architecture consisting of a scene encoder, an attention-driven node independent learning mechanism (NILM), an edge decoder, and a scene restoration module. The scene encoder, powered by advanced residual blocks, extracts deep features from degraded images in a progressive manner, ensuring thorough encoding of degradation information. To enhance the USRNet’s adaptability in diverse weather conditions, we introduce NILM, which enables the network to learn and respond to different scenarios with precision, thereby increasing its robustness. The edge decoder is designed to extract edge features with precision, which is essential for maintaining image sharpness. Experimental results demonstrate that USRNet surpasses existing methods in handling complex imaging degradations, thereby improving the accuracy and reliability of visual systems across diverse scenarios. The code resources for this work can be accessed in this https URL.
[CV-46] Multi-Task-oriented Nighttime Haze Imaging Enhancer for Vision-driven Measurement Systems
[Quick Read]: This paper addresses the degradation of salient object detection (SOD) performance under adverse imaging conditions such as daytime haze, low light, and nighttime haze. The key is a multi-task-oriented nighttime haze imaging enhancer (MToIE) that integrates three tasks (daytime dehazing, low-light enhancement, and nighttime dehazing) through a task-oriented node learning mechanism and a multi-receptive-field enhancement module. The task-oriented node learning mechanism handles the three specific degradation types, with an embedded self-attention module improving nighttime imaging. The multi-receptive-field enhancement module efficiently extracts multi-scale features through three parallel depthwise separable convolution branches with different dilation rates, capturing comprehensive spatial information at minimal computational overhead.
Link: https://arxiv.org/abs/2502.07351
Authors: Ai Chen, Yuxu Lu, Dong Yang, Junlin Zhou, Yan Fu, Duanbing Chen
Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Chengdu Union Big Data Technology Incorporation, China; Department of Logistics and Maritime Studies, The Hong Kong Polytechnic University, Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Salient object detection (SOD) plays a critical role in vision-driven measurement systems (VMS), facilitating the detection and segmentation of key visual elements in an image. However, adverse imaging conditions such as haze during the day, low light, and haze at night severely degrade image quality and complicate the SOD process. To address these challenges, we propose a multi-task-oriented nighttime haze imaging enhancer (MToIE), which integrates three tasks: daytime dehazing, low-light enhancement, and nighttime dehazing. The MToIE incorporates two key innovative components: First, the network employs a task-oriented node learning mechanism to handle three specific degradation types: day-time haze, low light, and night-time haze conditions, with an embedded self-attention module enhancing its performance in nighttime imaging. In addition, a multi-receptive field enhancement module efficiently extracts multi-scale features through three parallel depthwise separable convolution branches with different dilation rates, capturing comprehensive spatial information with minimal computational overhead. To ensure optimal image reconstruction quality and visual characteristics, we suggest a hybrid loss function. Extensive experiments on different types of weather/imaging conditions illustrate that MToIE surpasses existing methods, significantly enhancing the accuracy and reliability of vision systems across diverse imaging scenarios. The code is available at this https URL.
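The multi-receptive-field enhancement module described above translates naturally into code: three parallel depthwise-separable branches with different dilation rates. The dilation values and the additive fusion in this sketch are assumptions based on the description, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiReceptiveField(nn.Module):
    """Three parallel depthwise-separable conv branches with different
    dilation rates, fused by summation (a sketch of the module; dilations
    and fusion rule are assumptions)."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),        # depthwise
                nn.Conv2d(channels, channels, 1, bias=False),  # pointwise
            ) for d in dilations
        ])

    def forward(self, x):
        return x + sum(branch(x) for branch in self.branches)

x = torch.randn(1, 32, 64, 64)
print(MultiReceptiveField(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```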
zh
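【补充示例】为便于理解上文 MToIE 中“三条不同扩张率的并行深度可分离卷积分支”这一多感受野增强模块的设计,下面给出一个最小化的 PyTorch 草图。模块名 MultiReceptiveFieldBlock、扩张率 (1, 2, 4) 与残差连接均为示意性假设,并非论文的官方实现:

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldBlock(nn.Module):
    """多感受野增强模块的示意实现:三条不同扩张率的深度可分离卷积分支。"""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            self.branches.append(nn.Sequential(
                # 深度卷积:逐通道、带扩张率,padding=dilation 使分辨率保持不变
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, groups=channels, bias=False),
                # 逐点卷积:融合通道信息
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),
                nn.ReLU(inplace=True),
            ))
        # 将三条分支拼接后压回原通道数
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return x + self.fuse(torch.cat(feats, dim=1))  # 残差连接

if __name__ == "__main__":
    block = MultiReceptiveFieldBlock(channels=32)
    y = block(torch.randn(1, 32, 64, 64))
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```

深度可分离卷积将空间滤波与通道混合解耦,配合不同扩张率即可在几乎不增加参数量的前提下扩大感受野,这与摘要中“以最小计算开销捕捉全面空间信息”的表述相符。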
[CV-47] ERANet: Edge Replacement Augmentation for Semi-Supervised Meniscus Segmentation with Prototype Consistency Alignment and Conditional Self-Training
【速读】:该论文旨在解决手动分割劳动强度大以及由于半月板形态变化、部分容积效应及半月板与周围组织之间对比度低导致的自动分割难题。解决方案的关键在于ERANet框架,它通过边缘替换增强(ERA)、原型一致性对齐(PCA)和条件自训练(CST)策略,有效利用标注和未标注图像,实现了鲁棒且可扩展的半月板分割。ERA引入解剖相关的扰动以模拟半月板变异,PCA通过对齐类内特征来提升分割性能,而CST则通过迭代优化伪标签来增强模型的鲁棒性,从而缓解标签噪声的影响。这些创新共同确立了ERANet作为半监督半月板分割的有效方案。
链接: https://arxiv.org/abs/2502.07331
作者: Siyue Li,Yongcheng Yao,Junru Zhong,Shutian Zhao,Yudong Zhang,Shuihua Wang,Jin Hong,Weitian Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Manual segmentation is labor-intensive, and automatic segmentation remains challenging due to the inherent variability in meniscal morphology, partial volume effects, and low contrast between the meniscus and surrounding tissues. To address these challenges, we propose ERANet, an innovative semi-supervised framework for meniscus segmentation that effectively leverages both labeled and unlabeled images through advanced augmentation and learning strategies. ERANet integrates three key components: edge replacement augmentation (ERA), prototype consistency alignment (PCA), and a conditional self-training (CST) strategy within a mean teacher architecture. ERA introduces anatomically relevant perturbations by simulating meniscal variations, ensuring that augmentations align with the structural context. PCA enhances segmentation performance by aligning intra-class features and promoting compact, discriminative feature representations, particularly in scenarios with limited labeled data. CST improves segmentation robustness by iteratively refining pseudo-labels and mitigating the impact of label noise during training. Together, these innovations establish ERANet as a robust and scalable solution for meniscus segmentation, effectively addressing key barriers to practical implementation. We validated ERANet comprehensively on 3D Double Echo Steady State (DESS) and 3D Fast/Turbo Spin Echo (FSE/TSE) MRI sequences. The results demonstrate the superior performance of ERANet compared to state-of-the-art methods. The proposed framework achieves reliable and accurate segmentation of meniscus structures, even when trained on minimal labeled data. Extensive ablation studies further highlight the synergistic contributions of ERA, PCA, and CST, solidifying ERANet as a transformative solution for semi-supervised meniscus segmentation in medical imaging.
zh
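【补充示例】ERANet 的条件自训练(CST)建立在 mean teacher 架构之上:教师由学生参数的指数滑动平均(EMA)得到,其高置信度预测作为伪标签反哺学生。以下 PyTorch 草图仅演示这两个通用步骤,动量 0.99 与置信度阈值 0.9 为假设值,论文中伪标签的具体迭代精炼方式请以原文为准:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.99):
    """Mean Teacher:教师参数是学生参数的指数滑动平均。"""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

@torch.no_grad()
def conditional_pseudo_label(teacher_logits, threshold: float = 0.9):
    """条件自训练的示意:仅保留教师置信度超过阈值的像素伪标签,其余忽略。"""
    probs = torch.softmax(teacher_logits, dim=1)   # (B, C, H, W)
    conf, pseudo = probs.max(dim=1)                # 逐像素置信度与类别
    pseudo[conf < threshold] = -100                # -100 供交叉熵 ignore_index 使用
    return pseudo
```

学生分支的无监督损失可写作 F.cross_entropy(student_logits, pseudo, ignore_index=-100),被过滤掉的低置信像素不参与训练,从而缓解标签噪声的影响。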
[CV-48] Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos
【速读】:该论文旨在探讨在视频检索任务中是否存在对AI生成视频(AI-generated videos)的偏见,并研究这种偏见的根源。论文的关键在于构建了一个包含真实视频和AI生成视频的综合基准数据集,并设计了一系列严格的评估指标来量化这种偏见。通过应用三种现成的视频检索模型在混合数据集上进行检索任务,发现确实存在对AI生成视频的偏好。进一步研究表明,将AI生成视频纳入检索模型的训练集会加剧这种偏见。论文指出视频检索中的偏见源自未见过的视觉和时间信息的复杂交互作用。为缓解这一偏见,论文采用对比学习方法微调检索模型。
链接: https://arxiv.org/abs/2502.07327
作者: Haowen Gao,Liang Pang,Shicheng Xu,Leigang Qu,Tat-Seng Chua,Huawei Shen,Xueqi Cheng
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the rapid development of AI-generated content (AIGC), the creation of high-quality AI-generated videos has become faster and easier, resulting in the Internet being flooded with all kinds of video content. However, the impact of these videos on the content ecosystem remains largely unexplored. Video information retrieval remains a fundamental approach for accessing video content. Building on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks, we investigate whether similar biases emerge in the context of challenging video retrieval, where temporal and visual factors may further influence model behavior. To explore this, we first construct a comprehensive benchmark dataset containing both real and AI-generated videos, along with a set of fair and rigorous metrics to assess bias. This benchmark consists of 13,000 videos generated by two state-of-the-art open-source video generation models. We meticulously design a suite of rigorous metrics to accurately measure this preference, accounting for potential biases arising from the limited frame rate and suboptimal quality of AIGC videos. We then applied three off-the-shelf video retrieval models to perform retrieval tasks on this hybrid dataset. Our findings reveal a clear preference for AI-generated videos in retrieval. Further investigation shows that incorporating AI-generated videos into the training set of retrieval models exacerbates this bias. Unlike the preference observed in image modalities, we find that video retrieval bias arises from both unseen visual and temporal information, making the root causes of video bias a complex interplay of these two factors. To mitigate this bias, we fine-tune the retrieval models using a contrastive learning approach. The results of this study highlight the potential implications of AI-generated videos on retrieval systems.
zh
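【补充示例】论文通过对比学习微调检索模型以缓解对 AI 生成视频的偏好。下面给出通用的 InfoNCE 损失草图;其中“同语义的真实视频作正样本、AI 生成视频作负样本”的配对方式是示意性假设,并非论文给出的具体方案:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, neg_emb, temperature: float = 0.07):
    """InfoNCE 对比损失:拉近查询与正样本,推远负样本。"""
    q = F.normalize(query_emb, dim=-1)    # (B, D) 文本查询嵌入
    pos = F.normalize(pos_emb, dim=-1)    # (B, D) 正样本(如真实视频)嵌入
    neg = F.normalize(neg_emb, dim=-1)    # (K, D) 负样本(如AI生成视频)嵌入
    l_pos = (q * pos).sum(dim=-1, keepdim=True)   # (B, 1) 正对相似度
    l_neg = q @ neg.t()                           # (B, K) 负对相似度
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # 正样本在第 0 列
    return F.cross_entropy(logits, labels)
```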
[CV-49] Semantic to Structure: Learning Structural Representations for Infringement Detection
【速读】:该论文旨在解决AI生成内容中“结构侵权”(Structural Infringement)的检测问题。由于现有方法的不足,论文提出了一种基于扩散模型(Diffusion Models)和大语言模型(Large Language Models, LLM)的新数据合成策略,成功训练了一个结构侵权检测模型。该方案的关键在于开发了定量评估指标,并创建了人工标注的数据集(SIA和SIR)用于评估,从而有效提升了检测性能。实验结果表明,所提出的方法能够成功检测结构侵权,并在标注测试集中取得了显著改进。
链接: https://arxiv.org/abs/2502.07323
作者: Chuanwei Huang,Zexi Jia,Hongyan Fei,Yeshuang Zhu,Zhiqiang Yuan,Jinchao Zhang,Jie Zhou
机构: Peking University (北京大学), Beijing, China; Wechat AI (微信AI), Tencent (腾讯), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Structural information in images is crucial for aesthetic assessment, and it is widely recognized in the artistic field that imitating the structure of other works significantly infringes on creators’ rights. The advancement of diffusion models has led to AI-generated content imitating artists’ structural creations, yet effective detection methods are still lacking. In this paper, we define this phenomenon as “structural infringement” and propose a corresponding detection method. Additionally, we develop quantitative metrics and create manually annotated datasets for evaluation: the SIA dataset of synthesized data, and the SIR dataset of real data. Due to the current lack of datasets for structural infringement detection, we propose a new data synthesis strategy based on diffusion models and LLM, successfully training a structural infringement detection model. Experimental results show that our method can successfully detect structural infringements and achieve notable improvements on annotated test sets.
zh
[CV-50] Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving ICLR2025
【速读】:该论文旨在解决自动驾驶领域中理解世界动态的需求,特别是通过学习3D占用世界模型来预测未来的周围场景。由于生成高质量的3D占用标签成本较高,论文提出了一种半监督的视觉中心3D占用世界模型PreWorld。该方法的关键在于引入了一种新颖的两阶段训练范式:自监督预训练阶段和全监督微调阶段。在预训练阶段,通过属性投影头生成场景的不同属性字段(如RGB、密度、语义),并通过体渲染技术利用2D标签进行时间上的监督。此外,论文还引入了一个简单的状态条件预测模块,以直接递归的方式预测未来占用情况和自车轨迹。
链接: https://arxiv.org/abs/2502.07309
作者: Xiang Li,Pengfei Li,Yupeng Zheng,Wei Sun,Yan Wang,Yilun Chen
机构: Institute for AI Industry Research (AIR)(人工智能产业研究院), Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2025
点击查看摘要
Abstract:Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observation. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels through a novel two-stage training paradigm: the self-supervised pre-training stage and the fully-supervised fine-tuning stage. Specifically, during the pre-training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantic), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state-conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.
zh
[CV-51] CASC-AI: Consensus-aware Self-corrective AI Agents for Noise Cell Segmentation
【速读】:该论文旨在解决多类别细胞分割在高分辨率全片扫描图像(Whole Slide Images, WSI)中的标注难题。传统方法依赖于领域专家进行像素级标注,耗时且劳动密集。尽管近期研究通过引入非医学背景的普通标注员使这一过程民主化,但这些方法在处理标注噪声方面仍显不足,尤其缺乏有效机制来减少假阳性(FP)和假阴性(FN)错误。为应对这一挑战,本文提出了一种共识感知自纠正AI代理,其核心在于利用共识矩阵引导学习过程。共识矩阵优先关注AI与标注员在细胞及非细胞标注上达成一致的区域,同时通过对比学习策略增强对不一致区域的关注,并分离出噪声区域与可靠区域的特征,从而实现标签的迭代精进。此方法在真实世界和模拟数据集上的实验结果表明,其能够显著提升分割性能,有效修正FP和FN错误,展示出在处理噪声标注数据集时训练稳健模型的巨大潜力。
链接: https://arxiv.org/abs/2502.07302
作者: Ruining Deng,Yihe Yang,David J. Pisapia,Benjamin Liechty,Junchao Zhu,Juming Xiong,Junlin Guo,Zhengyi Lu,Jiacheng Wang,Xing Yao,Runxuan Yu,Rendong Zhang,Gaurav Rudravaram,Mengmeng Yin,Pinaki Sarder,Haichun Yang,Yuankai Huo,Mert R. Sabuncu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-class cell segmentation in high-resolution gigapixel whole slide images (WSI) is crucial for various clinical applications. However, training such models typically requires labor-intensive, pixel-wise annotations by domain experts. Recent efforts have democratized this process by involving lay annotators without medical expertise. However, conventional non-agent-based approaches struggle to handle annotation noise adaptively, as they lack mechanisms to mitigate false positives (FP) and false negatives (FN) at both the image-feature and pixel levels. In this paper, we propose a consensus-aware self-corrective AI agent that leverages the Consensus Matrix to guide its learning process. The Consensus Matrix defines regions where both the AI and annotators agree on cell and non-cell annotations, which are prioritized with stronger supervision. Conversely, areas of disagreement are adaptively weighted based on their feature similarity to high-confidence agreement regions, with more similar regions receiving greater attention. Additionally, contrastive learning is employed to separate features of noisy regions from those of reliable agreement regions by maximizing their dissimilarity. This paradigm enables the AI to iteratively refine noisy labels, enhancing its robustness. Validated on one real-world lay-annotated cell dataset and two simulated noisy datasets, our method demonstrates improved segmentation performance, effectively correcting FP and FN errors and showcasing its potential for training robust models on noisy datasets. The official implementation and cell annotations are publicly available at this https URL.
zh
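【补充示例】CASC-AI 的核心是用共识矩阵调节逐像素监督强度:AI 与标注员一致的区域获得全权重监督,不一致区域按其与高置信一致区域的特征相似度降权。以下 PyTorch 草图为该思想的示意实现,feat_sim 的计算方式以及不一致区域的监督目标均为假设,细节以原文为准:

```python
import torch
import torch.nn.functional as F

def consensus_weighted_loss(logits, ai_mask, annotator_mask, feat_sim):
    """共识感知加权损失的示意:一致像素权重为 1,
    不一致像素按特征相似度(0~1)加权。"""
    agree = (ai_mask == annotator_mask).float()   # 共识矩阵:一致=1,不一致=0
    # 不一致区域越接近高置信一致区域的特征,获得的监督权重越大
    weights = agree + (1.0 - agree) * feat_sim.clamp(0.0, 1.0)
    per_pixel = F.binary_cross_entropy_with_logits(
        logits, annotator_mask.float(), reduction="none")
    return (weights * per_pixel).mean()
```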
[CV-52] Learning Inverse Laplacian Pyramid for Progressive Depth Completion
【速读】:该论文旨在解决深度完成任务中通过稀疏深度测量重建密集深度图的问题。现有方法主要依赖于单尺度传播策略,通过像素级消息传递逐步改进初始粗略深度估计,但这些方法常受限于计算效率低下及场景上下文理解有限。论文的关键解决方案是引入LP-Net框架,采用多尺度渐进预测范式,基于拉普拉斯金字塔分解实现从低分辨率全局场景上下文预测到高分辨率细节恢复的过程。该框架通过两个创新模块——多路径特征金字塔模块和选择性深度滤波模块——来强化这一策略,从而不仅在KITTI、NYUv2和TOFDC等基准测试中达到当前最优性能,还展示了卓越的计算效率。
链接: https://arxiv.org/abs/2502.07289
作者: Kun Wang,Zhiqiang Yan,Junkai Fan,Jun Li,Jian Yang
机构: PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology (南京理工大学计算机科学与工程学院; 高维信息智能感知系统教育部重点实验室; 江苏省图像视频理解与社会安全重点实验室; 南京理工大学PCA实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Depth completion endeavors to reconstruct a dense depth map from sparse depth measurements, leveraging the information provided by a corresponding color image. Existing approaches mostly hinge on single-scale propagation strategies that iteratively ameliorate initial coarse depth estimates through pixel-level message passing. Despite their commendable outcomes, these techniques are frequently hampered by computational inefficiencies and a limited grasp of scene context. To circumvent these challenges, we introduce LP-Net, an innovative framework that implements a multi-scale, progressive prediction paradigm based on Laplacian Pyramid decomposition. Diverging from propagation-based approaches, LP-Net initiates with a rudimentary, low-resolution depth prediction to encapsulate the global scene context, subsequently refining this through successive upsampling and the reinstatement of high-frequency details at incremental scales. We have developed two novel modules to bolster this strategy: 1) the Multi-path Feature Pyramid module, which segregates feature maps into discrete pathways, employing multi-scale transformations to amalgamate comprehensive spatial information, and 2) the Selective Depth Filtering module, which dynamically learns to apply both smoothness and sharpness filters to judiciously mitigate noise while accentuating intricate details. By integrating these advancements, LP-Net not only secures state-of-the-art (SOTA) performance across both outdoor and indoor benchmarks such as KITTI, NYUv2, and TOFDC, but also demonstrates superior computational efficiency. At the time of submission, LP-Net ranks 1st among all peer-reviewed methods on the official KITTI leaderboard.
zh
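【补充示例】LP-Net 基于拉普拉斯金字塔分解实现“由粗到细”的渐进式深度预测。下面的 PyTorch 草图演示标准的拉普拉斯金字塔分解与重建,其中平均池化下采样与双线性上采样均为示意性选择;需注意 LP-Net 实际是逐尺度预测高频残差,而非像这里一样直接由输入分解:

```python
import torch
import torch.nn.functional as F

def build_laplacian_pyramid(depth, levels: int = 3):
    """拉普拉斯金字塔分解:每层保存当前分辨率与上采样低分辨率之间的高频残差。"""
    pyramid, current = [], depth
    for _ in range(levels - 1):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:],
                           mode="bilinear", align_corners=False)
        pyramid.append(current - up)   # 高频细节
        current = down
    pyramid.append(current)            # 最低分辨率,对应全局粗略预测
    return pyramid

def reconstruct_from_pyramid(pyramid):
    """从最粗层开始逐级上采样并叠加高频残差,对应由粗到细的渐进式预测。"""
    current = pyramid[-1]
    for high_freq in reversed(pyramid[:-1]):
        current = F.interpolate(current, size=high_freq.shape[-2:],
                                mode="bilinear", align_corners=False)
        current = current + high_freq
    return current
```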
[CV-53] KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level
【速读】:该论文旨在解决慢性肾脏病(CKD)病理分割缺乏全面基准的问题,阻碍了该领域的进展。关键解决方案是组织了肾病理图像分割(KPIs)挑战赛,并引入了一个包含超过10,000个注释肾小球的数据集,涵盖60多张经Periodic Acid Schiff(PAS)染色的全切片图像。该挑战包括斑块级分割和全切片图像分割与检测两项任务,评估指标采用Dice相似性系数(DSC)和F1分数。通过鼓励适应不同CKD模型和组织条件的创新分割方法,KPIs挑战旨在推进肾病理分析,建立新的基准,并实现疾病研究和诊断的精确大规模量化。
链接: https://arxiv.org/abs/2502.07288
作者: Ruining Deng,Tianyuan Yao,Yucheng Tang,Junlin Guo,Siqi Lu,Juming Xiong,Lining Yu,Quan Huu Cap,Pengzhou Cai,Libin Lan,Ze Zhao,Adrian Galdran,Amit Kumar,Gunjan Deotale,Dev Kumar Das,Inyoung Paik,Joonho Lee,Geongyu Lee,Yujia Chen,Wangkai Li,Zhaoyang Li,Xuege Hou,Zeyuan Wu,Shengjin Wang,Maximilian Fischer,Lars Kramer,Anghong Du,Le Zhang,Maria Sanchez Sanchez,Helena Sanchez Ulloa,David Ribalta Heredia,Carlos Perez de Arenaza Garcia,Shuoyu Xu,Bingdou He,Xinping Cheng,Tao Wang,Noemie Moreau,Katarzyna Bozek,Shubham Innani,Ujjwal Baid,Kaura Solomon Kefas,Bennett A. Landman,Yu Wang,Shilin Zhao,Mengmeng Yin,Haichun Yang,Yuankai Huo
机构: Vanderbilt University (范德比尔特大学), Nashville, TN 37215, USA; Weill Cornell Medicine (威尔康奈尔医学), New York, NY 10021, USA; NVIDIA Corporation (英伟达公司), Redmond, WA 98052, USA; Vanderbilt University Medical Center (范德比尔特大学医学中心), Nashville, TN 37232, USA; Aillis, Inc. (埃利里斯公司), Tokyo 1010042, Japan; Chongqing University of Technology (重庆理工大学), Chongqing 400054, China; Chongqing Zhijian Life Technology Co. LTD (重庆至简生命科技有限公司), Chongqing 400039, China; Jinfeng Laboratory (金凤实验室), Chongqing 401329, China; Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所), Beijing 100190, China; Universitat Pompeu Fabra (庞培法布拉大学), Ciutat Vella, Barcelona 08002, Spain; Aira Matrix Private Limited (艾拉矩阵私人有限公司), Thane, Maharashtra 400604, India; Deep Bio Inc. (深度生物公司), Research Team, Seoul, KR 08380, Republic of Korea; University of Science and Technology of China (中国科学技术大学), Hefei, 230026, China; Tsinghua University & Beijing National Research Center for Information Science and Technology (清华大学&北京信息科学与技术国家研究中心), Beijing, 100084, China; German Cancer Research Center (德国癌症研究中心), Heidelberg 69120, Germany; University of Birmingham (伯明翰大学), Birmingham, B15 2TT, UK; Bio-totem Pte Ltd (生物图腾有限公司), Suzhou 215000, China; Nanjing University of Science and Technology (南京理工大学), Nanjing, Jiangsu 210094, China; University of Cologne (科隆大学), Cologne 50931, Germany; Indiana University (印第安纳大学), Indianapolis, IN 46202, USA; Xi’an Jiaotong University (西安交通大学), Xi’an, Shaanxi 710049, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Chronic kidney disease (CKD) is a major global health issue, affecting over 10% of the population and causing significant mortality. While kidney biopsy remains the gold standard for CKD diagnosis and treatment, the lack of comprehensive benchmarks for kidney pathology segmentation hinders progress in the field. To address this, we organized the Kidney Pathology Image Segmentation (KPIs) Challenge, introducing a dataset that incorporates preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes two tasks, patch-level segmentation and whole slide image segmentation and detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. By encouraging innovative segmentation methods that adapt to diverse CKD models and tissue conditions, the KPIs Challenge aims to advance kidney pathology analysis, establish new benchmarks, and enable precise, large-scale quantification for disease research and diagnosis.
zh
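【补充示例】挑战赛采用 Dice 相似性系数(DSC)作为分割评价指标之一,其定义为 DSC = 2|A∩B| / (|A|+|B|),取值 0~1,越高越好。以下为一个基于 NumPy 的二值掩码 DSC 计算草图:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice 相似性系数:DSC = 2|A ∩ B| / (|A| + |B|)。"""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# 示例:对一张二值肾小球分割结果计算 DSC
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_coefficient(pred, gt))  # 2*2 / (3+3) ≈ 0.667
```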
[CV-54] Articulate That Object Part (ATOP): 3D Part Articulation from Text and Motion Personalization
【速读】:该论文旨在解决通过文本提示控制3D物体特定部分运动的问题。关键在于首先通过少量样本微调预训练的多视角图像生成模型,以实现类别特定的运动生成;随后利用多视角渲染图像进行视频个性化处理,并通过可微渲染优化部分运动参数,以转移个性化的视频运动至目标3D物体。这种方法能够更准确且泛化地生成逼真的运动视频及预测3D运动参数。
链接: https://arxiv.org/abs/2502.07278
作者: Aditya Vora,Sauradip Nag,Hao Zhang
机构: Simon Fraser University(西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report, 16 pages
点击查看摘要
Abstract:We present ATOP (Articulate That Object Part), a novel method based on motion personalization to articulate a 3D object with respect to a part and its motion as prescribed in a text prompt. Specifically, the text input allows us to tap into the power of modern-day video diffusion to generate plausible motion samples for the right object category and part. In turn, the input 3D object provides image prompting to personalize the generated video to that very object we wish to articulate. Our method starts with a few-shot finetuning for category-specific motion generation, a key first step to compensate for the lack of articulation awareness by current video diffusion models. For this, we finetune a pre-trained multi-view image generation model for controllable multi-view video generation, using a small collection of video samples obtained for the target object category. This is followed by motion video personalization that is realized by multi-view rendered images of the target 3D object. Finally, we transfer the personalized video motion to the target 3D object via differentiable rendering to optimize part motion parameters by a score distillation sampling loss. We show that our method is capable of generating realistic motion videos and predicting 3D motion parameters in a more accurate and generalizable way, compared to prior works.
zh
[CV-55] Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis
【速读】:该论文旨在探索视频中的时空特征及深度神经网络在视频理解方面的最新进展。论文将回顾视频理解模型的主要趋势及其结构设计,并探讨该领域面临的主要问题及提出的一些解决方案。此外,还将回顾和比较重要的视频理解和动作识别数据集。解决方案的关键在于通过深度神经网络有效提取和分类视频中的时空特征,从而实现对视频内容的准确描述与理解。
链接: https://arxiv.org/abs/2502.07277
作者: Amir Hosein Fadaei,Mohammad-Reza A. Dehaqani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 29 pages, 25 figures
点击查看摘要
Abstract:It’s no secret that video has become the primary way we share information online. That’s why there’s been a surge in demand for algorithms that can analyze and understand video content. It’s a trend that will continue as video continues to dominate the digital landscape. These algorithms will extract and classify related features from the video and will use them to describe the events and objects in the video. Deep neural networks have displayed encouraging outcomes in the realm of feature extraction and video description. This paper will explore the spatiotemporal features found in videos and recent advancements in deep neural networks in video understanding. We will review some of the main trends in video understanding models and their structural design, the main problems, and some proposed solutions in this area. We will also review and compare significant video understanding and action recognition datasets.
zh
[CV-56] Dataset Ownership Verification in Contrastive Pre-trained Models ICLR2025
【速读】:该论文旨在解决数据集所有权验证的问题,特别是在自监督预训练模型中的应用。现有方法主要局限于有监督模型,无法直接应用于日益流行的无监督预训练模型。为了解决这一问题,论文提出了一种新的数据集所有权验证方法,专门针对自监督预训练模型,通过对比学习实现。其关键是利用实证观察到的现象:当模型使用目标数据集进行训练时,嵌入空间内的单样本和双样本实例关系会表现出显著差异,从而可以有效地区分是否使用过特定的数据集。实验结果表明,该方法在SimCLR、BYOL、SimSiam、MOCO v3和DINO等多种对比预训练模型中均表现出色,显著优于现有方法。
链接: https://arxiv.org/abs/2502.07276
作者: Yuechen Xie,Jie Song,Mengqi Xue,Haofei Zhang,Xingen Wang,Bingde Hu,Genlang Chen,Mingli Song
机构: Zhejiang University(浙江大学); Hangzhou City University(杭州城市大学); NingboTech University(宁波技术大学); Bangsheng Technology Co., Ltd.
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR2025
点击查看摘要
Abstract:High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised pre-trained models by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a p-value markedly below 0.05, surpassing all previous methodologies. Our code is available at this https URL.
zh
[CV-57] Exploring Active Data Selection Strategies for Continuous Training in Deepfake Detection
【速读】:该论文旨在解决在深度伪造检测中,随着新型伪造方法不断涌现,如何动态调整检测模型参数以维持高性能的问题。关键解决方案在于提出一种方法,能够自动且主动地从包含大量由新型深度伪造方法生成的图像和真实图像的冗余池集中选择少量新增训练数据,通过使用深度伪造检测模型的置信度分数作为度量标准。实验结果表明,采用这种方法进行持续训练的深度伪造检测模型显著且高效地提升了检测性能,在仅使用池集中15%的数据量的情况下达到了2.5%的等错误率(EER)。
链接: https://arxiv.org/abs/2502.07269
作者: Yoshihiko Furuhashi,Junichi Yamagishi,Xin Wang,Huy H. Nguyen,Isao Echizen
机构: National Institute of Informatics (国立情报学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In deepfake detection, it is essential to maintain high performance by adjusting the parameters of the detector as new deepfake methods emerge. In this paper, we propose a method to automatically and actively select the small amount of additional data required for the continuous training of deepfake detection models in situations where deepfake detection models are regularly updated. The proposed method automatically selects new training data from a redundant pool set containing a large number of images generated by new deepfake methods and real images, using the confidence score of the deepfake detection model as a metric. Experimental results show that the deepfake detection model, continuously trained with a small amount of additional data automatically selected and added to the original training set, significantly and efficiently improved the detection performance, achieving an EER of 2.5% with only 15% of the amount of data in the pool set.
zh
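【补充示例】论文以深度伪造检测模型的置信度分数为度量,从冗余池集中自动选取少量新增训练数据。以下 NumPy 草图演示一种常见做法:挑选置信度最接近决策边界、即模型最不确定的样本;该具体选取准则是示意性假设,论文的实际策略请以原文为准:

```python
import numpy as np

def select_additional_data(confidences: np.ndarray, budget: int) -> np.ndarray:
    """主动数据选择的示意:返回池集中模型最不确定的 budget 个样本索引。"""
    # 置信度距 0.5 越近表示模型越不确定,这类样本对持续训练最有价值(示意性假设)
    uncertainty = -np.abs(confidences - 0.5)
    return np.argsort(uncertainty)[-budget:]

# 示例:从 10 个池样本中选 3 个加入训练集
scores = np.array([0.99, 0.51, 0.03, 0.48, 0.95, 0.60, 0.10, 0.45, 0.88, 0.52])
print(select_additional_data(scores, budget=3))
```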
[CV-58] Robust Indoor Localization in Dynamic Environments: A Multi-source Unsupervised Domain Adaptation Framework
【速读】:该论文旨在解决指纹定位在动态室内环境中的鲁棒性和适应性问题。传统方法在静态数据中表现良好,但在数据分布和特征空间随时间演变的动态环境中往往难以维持性能。为了解决这一挑战,论文提出了一种端到端的动态指纹定位系统DF-Loc,基于多源无监督领域自适应(Multi-Source Unsupervised Domain Adaptation, MUDA)。DF-Loc的关键在于利用多个时间尺度的历史数据进行知识转移,并通过引入质量控制模块和图像处理技术来增强CSI数据预处理和指纹特征重构能力。此外,设计了一个多尺度注意力机制特征融合主干网络以提取可迁移的指纹特征,并采用双阶段对齐模型优化目标域的分布对齐。这些措施显著提升了系统的定位精度和鲁棒性。
链接: https://arxiv.org/abs/2502.07246
作者: Jiyu Jiao,Xiaojun Wang,Chengpei Han
机构: National Mobile Communications Research Laboratory, School of Information science and Engineering, Southeast University(东南大学); Purple Mountain Laboratories(紫金山实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Popular Physics (physics.pop-ph)
备注: 19 pages, 21 figures
点击查看摘要
Abstract:Fingerprint localization has gained significant attention due to its cost-effective deployment, low complexity, and high efficacy. However, traditional methods, while effective for static data, often struggle in dynamic environments where data distributions and feature spaces evolve-a common occurrence in real-world scenarios. To address the challenges of robustness and adaptability in fingerprint localization for dynamic indoor environments, this paper proposes DF-Loc, an end-to-end dynamic fingerprint localization system based on multi-source unsupervised domain adaptation (MUDA). DF-Loc leverages historical data from multiple time scales to facilitate knowledge transfer in specific feature spaces, thereby enhancing generalization capabilities in the target domain and reducing reliance on labeled data. Specifically, the system incorporates a Quality Control (QC) module for CSI data preprocessing and employs image processing techniques for CSI fingerprint feature reconstruction. Additionally, a multi-scale attention-based feature fusion backbone network is designed to extract multi-level transferable fingerprint features. Finally, a dual-stage alignment model aligns the distributions of multiple source-target domain pairs, improving regression characteristics in the target domain. Extensive experiments conducted in office and classroom environments demonstrate that DF-Loc outperforms comparative methods in terms of both localization accuracy and robustness. With 60% of reference points used for training, DF-Loc achieves average localization errors of 0.79m and 3.72m in “same-test” scenarios, and 0.94m and 4.39m in “different-test” scenarios, respectively. This work pioneers an end-to-end multi-source transfer learning approach for fingerprint localization, providing valuable insights for future research in dynamic environments.
zh
[CV-59] Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation
【速读】:该论文旨在解决共发言手势(Co-speech gesture)生成过程中难以准确识别音频中的节奏或语义触发点,以生成上下文相关的手势模式及实现像素级真实感的问题。论文的关键解决方案在于引入Contextual Gesture框架,通过三个创新组件:(1) 时间对齐的语音-手势同步,(2) 通过蒸馏将语音上下文融入运动模式表示的上下文化手势标记化,以及(3) 利用边缘连接结构感知细化模块来链接手势关键点,从而改善共发言手势视频生成。
链接: https://arxiv.org/abs/2502.07239
作者: Pinxin Liu,Pengfei Zhang,Hyeongwoo Kim,Pablo Garrido,Ari Sharpio,Kyle Olszewski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interactions by synchronizing gestures with speech. Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connection to link gesture keypoints to improve video generation. Our extensive experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications, as shown in Fig. 1. Project Page: this https URL.
zh
[CV-60] Diffusion Suction Grasping with Large-Scale Parcel Dataset
【速读】:该论文旨在解决在杂乱和复杂包裹处理场景中,吸盘抓取技术面临的两个主要挑战:缺乏专门针对包裹操作任务的全面吸盘抓取数据集,以及对物体尺寸变化、几何复杂性和纹理多样性适应性不足。为了解决这些问题,论文提出了关键解决方案包括创建Parcel-Suction-Dataset数据集,并引入Diffusion-Suction框架。Parcel-Suction-Dataset是一个包含25000个杂乱场景和4.1亿个精确标注吸盘抓取位置的大规模合成数据集。Diffusion-Suction框架通过去噪扩散概率模型将吸盘抓取预测重新表述为条件生成任务,从而迭代地从点云观测中生成视觉引导的吸盘抓取评分图,有效学习空间逐点可用性。
链接: https://arxiv.org/abs/2502.07238
作者: Ding-Tao Huang,Xinyi He,Debei Hua,Dongfang Yu,En-Te Lin,Long Zeng
机构: Tsinghua University(清华大学); Shenzhen International Graduate School(深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While recent advances in object suction grasping have shown remarkable progress, significant challenges persist particularly in cluttered and complex parcel handling scenarios. Two fundamental limitations hinder current approaches: (1) the lack of a comprehensive suction grasp dataset tailored for parcel manipulation tasks, and (2) insufficient adaptability to diverse object characteristics including size variations, geometric complexity, and textural diversity. To address these challenges, we present Parcel-Suction-Dataset, a large-scale synthetic dataset containing 25 thousand cluttered scenes with 410 million precision-annotated suction grasp poses. This dataset is generated through our novel geometric sampling algorithm that enables efficient generation of optimal suction grasps incorporating both physical constraints and material properties. We further propose Diffusion-Suction, an innovative framework that reformulates suction grasp prediction as a conditional generation task through denoising diffusion probabilistic models. Our method iteratively refines random noise into suction grasp score maps through visual-conditioned guidance from point cloud observations, effectively learning spatial point-wise affordances from our synthetic dataset. Extensive experiments demonstrate that the simple yet efficient Diffusion-Suction achieves new state-of-the-art performance compared to previous models on both Parcel-Suction-Dataset and the public SuctionNet-1Billion benchmark.
zh
[CV-61] CAT: Contrastive Adversarial Training for Evaluating the Robustness of Protective Perturbations in Latent Diffusion Models
【速读】:该论文旨在解决通过未经授权的数据定制潜扩散模型(Latent Diffusion Models)可能严重侵犯数据所有者的隐私和知识产权的问题。论文的关键解决方案是提出对比对抗训练(Contrastive Adversarial Training, CAT),利用适配器(adapters)作为自适应攻击手段,揭示现有保护方法的脆弱性,并显著降低保护扰动在定制配置中的有效性。
链接: https://arxiv.org/abs/2502.07225
作者: Sen Peng,Mingyue Wang,Jianfei He,Jijia Yang,Xiaohua Jia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose the Contrastive Adversarial Training (CAT) utilizing adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization configurations, urging the community to reconsider and enhance the robustness of existing protective perturbation methods. Code is available at this https URL.
zh
[CV-62] MLLM 4PUE: Toward Universal Embeddings in Computational Pathology through Multimodal LLM s
【速读】:该论文旨在解决病理学诊断中多模态数据处理的多样性和复杂性问题,以及现有方法因依赖大量标注数据而面临的可持续性挑战。解决方案的关键在于提出了一种名为MLLM4PUE的新框架,该框架利用多模态大型语言模型(Multimodal Large Language Models, MLLMs)生成病理学通用嵌入(Pathology Universal Embeddings),从而实现图像和文本的稳健集成,并增强跨多种任务的理解与融合能力。此外,论文还引入了病理学多模态嵌入基准(Pathology Multimodal Embedding Benchmark, PMEB),以全面评估病理学多模态嵌入的质量。
链接: https://arxiv.org/abs/2502.07221
作者: Qifeng Zhou,Thao M. Dang,Wenliang Zhong,Yuzhi Guo,Hehuan Ma,Saiyang Na,Junzhou Huang
机构: The University of Texas at Arlington
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pathology plays a critical role in diagnosing a wide range of diseases, yet existing approaches often rely heavily on task-specific models trained on extensive, well-labeled datasets. These methods face sustainability challenges due to the diversity of pathologies and the labor-intensive nature of data collection. To address these limitations, we highlight the need for universal multimodal embeddings that can support multiple downstream tasks. Previous approaches often involve fine-tuning CLIP-based models, which handle images and text separately, limiting their ability to capture complex multimodal relationships. Additionally, these models are evaluated across diverse datasets without a unified benchmark for assessing multimodal embeddings in pathology. To address these challenges, we propose MLLM4PUE, a novel framework that leverages Multimodal Large Language Models (MLLMs) to generate Pathology Universal Embeddings. The MLLM4PUE framework not only facilitates robust integration of images and text but also enhances understanding and fusion capabilities across various tasks. We further introduce the Pathology Multimodal Embedding Benchmark (PMEB), a comprehensive benchmark designed to assess the quality of pathology multimodal embeddings. PMEB comprises 15 original tasks drawn from 14 datasets, organized into three meta-tasks: retrieval, classification, and composed retrieval. Experimental results demonstrate the superiority of MLLM4PUE, illustrating MLLM-based models can effectively support a wide range of downstream tasks and unify the research direction for foundation models in pathology.
zh
[CV-63] SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer ACM-MM2024
【速读】:该论文旨在解决高分辨率宽视野(HRW)图像中的目标检测挑战,特别是由于极高分辨率和宽视场引起的极端稀疏性和巨大尺度变化导致现有近距离检测器在精度和效率上的不足。论文的关键解决方案是提出了一种名为SparseFormer的新型模型不可知稀疏视觉变换器,它通过选择性地使用注意力标记来仔细检查可能包含物体的稀疏分布窗口,从而实现全局和局部注意力的联合探索。此外,SparseFormer还采用了一种新颖的跨切片非极大值抑制(C-NMS)算法和一种简单有效的多尺度策略,以提高检测精度和速度。
链接: https://arxiv.org/abs/2502.07216
作者: Wenxi Li,Yuchen Guo,Jilai Zheng,Haozhe Lin,Chao Ma,Lu Fang,Xiaokang Yang
机构: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University(教育部人工智能重点实验室, 上海交通大学人工智能研究院); Beijing National Research Center for Information Science and Technology, Tsinghua University(北京信息科学与技术国家研究中心, 清华大学); Department of Electronic Engineering, BNRist, Tsinghua University(电子工程系, 北京信息国家研究中心, 清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper is accepted to ACM MM 2024
点击查看摘要
Abstract:Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, rendering existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap of object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
zh
[CV-64] PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
【速读】:该论文旨在解决零样本组合图像检索(ZS-CIR)中的三个关键挑战:静态查询嵌入表示、图像嵌入利用不足以及文本和图像嵌入融合效果不佳。论文的关键解决方案是引入提示方向向量(Prompt Directional Vector, PDV),这是一种无需训练的增强方法,能够捕捉用户提示引起的语义变化。PDV通过实现动态组合文本嵌入、从文本提示到图像特征的语义转移以及优化文本和图像嵌入的加权融合,显著提升了检索性能。
链接: https://arxiv.org/abs/2502.07215
作者: Osman Tursun,Sinan Kalkan,Simon Denman,Clinton Fookes
机构: Queensland University of Technology (昆士兰科技大学); Middle East Technical University (中东技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Zero-shot composed image retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be publicly available.
zh
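【补充示例】PDV 的三个改进点可以用几行向量运算示意:(1) 以缩放因子 alpha 控制提示引起的语义偏移;(2) 将该偏移迁移到图像嵌入;(3) 以权重 w 融合文本流与图像流。以下草图中 PDV 取“组合文本嵌入减参考文本嵌入”仅为示意性假设,具体定义以原文为准:

```python
import torch
import torch.nn.functional as F

def prompt_directional_vector(composed_emb, ref_text_emb):
    """PDV 的示意:提示方向向量 = 组合文本嵌入与参考描述嵌入之差,
    用以刻画用户提示引起的语义变化。"""
    return composed_emb - ref_text_emb

def pdv_query(ref_text_emb, ref_img_emb, pdv, alpha: float = 1.0, w: float = 0.5):
    """动态组合查询:alpha 控制提示修改强度,w 平衡文本流与图像流。"""
    text_stream = F.normalize(ref_text_emb + alpha * pdv, dim=-1)   # 动态组合文本嵌入
    image_stream = F.normalize(ref_img_emb + alpha * pdv, dim=-1)   # 语义迁移到图像嵌入
    return F.normalize(w * text_stream + (1 - w) * image_stream, dim=-1)
```

由于整个过程只涉及嵌入空间中的加法与归一化,该增强无需训练即可即插即用地接入现有 ZS-CIR 方法,这与摘要中“training-free、计算开销极小”的定位一致。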
[CV-65] Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
【速读】:该论文旨在解决现有基于扩散模型的语音驱动面部生成方法中存在的问题,如唇形同步不准确、头部姿态不当以及缺乏精细的表情控制。论文提出了一种名为Playmate的两阶段训练框架来生成更为逼真的面部表情和说话视频。关键解决方案在于第一阶段引入解耦的隐式三维表示和精心设计的运动解耦模块,以实现更精确的属性解耦,并直接从音频线索生成具有表现力的视频。第二阶段则通过引入情感控制模块将情感控制信息编码到潜在空间,从而实现对情感和头部姿态的精细控制。
链接: https://arxiv.org/abs/2502.07203
作者: Xingpei Ma,Jiaran Cai,Yuansheng Guan,Shenneng Huang,Qiang Zhang,Shunsi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in terms of video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at this https URL.
zh
[CV-66] Dense Object Detection Based on De-homogenized Queries
【速读】:该论文旨在解决基于贪婪算法(如非极大值抑制,NMS)的密集目标检测方法在密集场景下产生的重复预测和漏检问题。论文的关键解决方案在于引入可学习的差异化编码(Learnable Differentiated Encoding),以去同质化查询(de-homogenize the queries),同时使查询之间能够通过差异化编码信息进行通信,取代先前查询间的自注意力机制。此外,论文还采用了联合损失函数(joint loss)来优化位置和置信度预测,从而提供更高质量的查询初始化。这些改进使得所提出的端到端检测框架更为简洁,并减少了约8%的参数量,同时在CrowdHuman数据集上取得了93.6%的平均精度(AP)、39.2%的MR-2和84.3%的JI,超越了现有最先进方法。
链接: https://arxiv.org/abs/2502.07194
作者: Yueming Huang,Chenrui Ma,Hao Zhou,Hao Wu,Guowu Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 15 figures
点击查看摘要
Abstract:Dense object detection is widely used in automatic driving, video surveillance, and other fields. This paper focuses on the challenging task of dense object detection. Currently, detection methods based on greedy algorithms, such as non-maximum suppression (NMS), often produce many repetitive predictions or missed detections in dense scenarios, which is a common problem faced by NMS-based algorithms. Studying the end-to-end DETR (DEtection TRansformer), a type of detector that incorporates the post-processing de-duplication capability of NMS into the network, we found that homogeneous queries in query-based detectors reduce the network’s de-duplication capability and the encoder’s learning efficiency, resulting in duplicate predictions and missed detections. To solve this problem, we propose learnable differentiated encoding to de-homogenize the queries; at the same time, queries can communicate with each other via differentiated encoding information, replacing the previous self-attention among the queries. In addition, we apply a joint loss on the output of the encoder that considers both location and confidence prediction to give a higher-quality initialization for queries. Without cumbersome decoder stacking, and while maintaining accuracy, our proposed end-to-end detection framework is more concise and reduces the number of parameters by about 8% compared to deformable DETR. Our method achieves excellent results on the challenging CrowdHuman dataset with 93.6% average precision (AP), 39.2% MR-2, and 84.3% JI, outperforming previous SOTA methods such as Iter-E2EDet (Progressive End-to-End Object Detection) and MIP (One proposal, Multiple predictions). In addition, our method is more robust in various scenarios with different densities.
zh
[CV-67] OscNet: Machine Learning on CMOS Oscillator Networks
【速读】:该论文旨在解决机器学习和人工智能领域中计算资源消耗大及高能耗的问题。为应对这一挑战,论文提出了一种基于互补金属氧化物半导体(CMOS)振荡器网络(OscNet)的新型节能机器学习框架。OscNet的关键在于模仿大脑中的脉冲神经元,并利用CMOS振荡器直接进行计算,从而实现高效能的学习过程。通过采用受生物启发的赫布规则(Hebbian rule)更新权重,OscNet不仅在模拟实验中验证了其架构设计的有效性,而且在标准机器学习任务中展示了与传统算法相当甚至更优的性能表现。
链接: https://arxiv.org/abs/2502.07192
作者: Wenxiao Cai,Thomas H. Lee
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Machine learning and AI have achieved remarkable advancements but at the cost of significant computational resources and energy consumption. This has created an urgent need for a novel, energy-efficient computational fabric to replace the current computing pipeline. Recently, a promising approach has emerged by mimicking spiking neurons in the brain and leveraging oscillators on CMOS for direct computation. In this context, we propose a new and energy-efficient machine learning framework implemented on CMOS Oscillator Networks (OscNet). We model the developmental processes of the prenatal brain’s visual system using OscNet, updating weights based on the biologically inspired Hebbian rule. This same pipeline is then directly applied to standard machine learning tasks. OscNet is specially designed hardware and is inherently energy-efficient. Its reliance on forward propagation alone for training further enhances its energy efficiency while maintaining biological plausibility. Simulation validates our designs of OscNet architectures. Experimental results demonstrate that the Hebbian learning pipeline on OscNet achieves performance comparable to or even surpassing traditional machine learning algorithms, highlighting its potential as an energy-efficient and effective computational paradigm.
zh
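【补充示例】OscNet 按赫布规则更新权重,仅依赖前向传播而无需反向传播。最朴素的赫布更新为 Δw_ij = η·y_i·x_j;下面的 NumPy 草图演示该更新,其中的行归一化是为防止权重发散而加入的示意性稳定化手段,并非论文细节:

```python
import numpy as np

def hebbian_update(W: np.ndarray, x: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """赫布规则的示意:“同时激活的神经元,连接增强”。"""
    y = W @ x                        # 前向传播得到输出激活
    W = W + lr * np.outer(y, x)      # Δw_ij = η · y_i · x_j
    # 按行归一化防止权重无限增长(Oja 式稳定化的简化替代,属示意性假设)
    return W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-9)

# 示例:5 维输入、3 个输出神经元,重复呈现随机输入
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
for _ in range(100):
    W = hebbian_update(W, rng.normal(size=5))
```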
[CV-68] Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired ICRA2025
【速读】:该论文旨在解决现有视觉-语言模型(Vision-Language Models, VLMs)在理解物理环境中的空间关系方面存在的不足,特别是在复杂环境中如街道交叉口的导航。关键解决方案在于引入了空间感知指令调优(Space-Aware Instruction Tuning, SAIT)数据集和空间感知基准(Space-Aware Benchmark, SA-Bench),以及一个自动化数据生成管道,该管道专注于三维空间中的路径及周围环境,从而增强环境理解能力,并使VLMs能够提供更精确的导航指导给视障人士。
链接: https://arxiv.org/abs/2502.07183
作者: ByungOk Han,Woo-han Yun,Beom-Su Seo,Jaehong Kim
机构: ETRI(电子通信研究院), Daejeon, Republic of Korea; The Open AI Dataset Project (AI-Hub, S. Korea)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
点击查看摘要
Abstract:Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at this https URL
zh
[CV-69] ab2Visual: Overcoming Limited Data in Tabular Data Classification Using Deep Learning with Visual Representations
【速读】:该论文旨在解决表格数据分类中数据量有限的挑战,尤其是在医疗等受约束领域。解决方案的关键在于提出了一种名为Tab2Visual的新方法,该方法将异构表格数据转换为视觉表示,从而能够应用强大的深度学习模型。Tab2Visual通过引入新颖的图像增强技术和促进迁移学习,有效应对了数据稀缺的问题。
链接: https://arxiv.org/abs/2502.07181
作者: Ahmed Mamdouh,Moumen El-Melegy,Samia Ali,Ron Kikinis
机构: aun.edu.eg(B威利斯顿·斯奈特学院); bwh.harvard.edu(布莱根妇女医院); bwh.harvard.edu(布莱根妇女医院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This research addresses the challenge of limited data in tabular data classification, particularly prevalent in domains with constraints like healthcare. We propose Tab2Visual, a novel approach that transforms heterogeneous tabular data into visual representations, enabling the application of powerful deep learning models. Tab2Visual effectively addresses data scarcity by incorporating novel image augmentation techniques and facilitating transfer learning. We extensively evaluate the proposed approach on diverse tabular datasets, comparing its performance against a wide range of machine learning algorithms, including classical methods, tree-based ensembles, and state-of-the-art deep learning models specifically designed for tabular data. We also perform an in-depth analysis of factors influencing Tab2Visual’s performance. Our experimental results demonstrate that Tab2Visual outperforms other methods in classification problems with limited tabular data.
zh
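【补充示例】Tab2Visual 的核心思想是把一行表格数据编码成图像,以便复用强大的视觉深度模型与迁移学习。论文的具体编码方式未在摘要中给出,下面的 NumPy 草图仅演示一种直观做法(把每个特征画成一根竖条),属示意性假设:

```python
import numpy as np

def row_to_image(row: np.ndarray, size: int = 32) -> np.ndarray:
    """将一行表格数据转换为图像的示意做法:特征归一化到 [0, 1] 后,
    画成等宽竖条的灰度强度(原论文的实际编码方式可能不同)。"""
    row = np.asarray(row, dtype=np.float64)
    normed = (row - row.min()) / (row.max() - row.min() + 1e-9)
    img = np.zeros((size, size), dtype=np.float32)
    bar_w = size // len(row)
    for i, v in enumerate(normed):
        h = int(v * size)                          # 特征值越大,竖条越高
        img[size - h:, i * bar_w:(i + 1) * bar_w] = v
    return img  # 可作为单通道输入送入 CNN,并配合图像增强与迁移学习

# 示例:6 个数值特征 → 32×32 灰度图
print(row_to_image(np.array([1.2, 3.4, 0.5, 2.2, 4.1, 0.9])).shape)  # (32, 32)
```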
[CV-70] Improved YOLOv7 model for insulator defect detection
【速读】:该论文旨在解决多类型绝缘子缺陷检测中的低检测精度问题,特别是在复杂背景下不同颜色和材料的绝缘子缺陷共存的情况下。现有方法难以满足实际应用需求,尤其是在平均精度均值(mAP0.5)方面存在不足。论文的关键解决方案在于提出了一种改进的YOLOv7模型,具体包括:用Receptive Field Block (RFB)模块替换空间金字塔池化交叉阶段卷积(SPPCSPC)模块以增强特征提取能力;引入通道注意力机制(CA)来提升特征表示能力,从而提高检测精度;采用加权交并比(WIoU)损失函数处理训练过程中低质量样本对模型泛化的影响,进而提升整体性能。
链接: https://arxiv.org/abs/2502.07179
作者: Zhenyue Wang,Guowu Yuan,Hao Zhou,Yi Ma,Yutang Ma,Dong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 13 figures
点击查看摘要
Abstract:Insulators are crucial insulation components and structural supports in power grids, playing a vital role in the transmission lines. Due to temperature fluctuations, internal stress, or damage from hail, insulators are prone to damage. Automatic detection of damaged insulators faces challenges such as diverse types, small defect targets, and complex backgrounds and shapes. Most research for detecting insulator defects has focused on a single defect type or a specific material. However, the insulators in the grid’s transmission lines have different colors and materials. Various insulator defects coexist, and the existing methods have difficulty meeting the practical application requirements. Current methods suffer from low detection accuracy, and their mAP_0.5 cannot meet application requirements. This paper proposes an improved YOLOv7 model for multi-type insulator defect detection. First, our model replaces the SPPCSPC module with the RFB module to enhance the network’s feature extraction capability. Second, a CA mechanism is introduced into the head part to enhance the network’s feature representation ability and to improve detection accuracy. Third, a WIoU loss function is employed to address the low-quality samples hindering model generalization during training, thereby improving the model’s overall performance. The experimental results indicate that the proposed model exhibits enhancements across various performance metrics. Specifically, there is a 1.6% advancement in mAP_0.5, a corresponding 1.6% enhancement in mAP_0.5:0.95, a 1.3% elevation in precision, and a 1% increase in recall. Moreover, the model achieves parameter reduction by 3.2 million, leading to a decrease of 2.5 GFLOPS in computational cost. Notably, there is also an improvement of 2.81 milliseconds in single-image detection speed.
zh
[CV-71] Foreign-Object Detection in High-Voltage Transmission Line Based on Improved YOLOv8m
【速读】:该论文旨在解决高压输电线路上复杂异物检测精度低的问题,这些异物包括气球、风筝及筑巢鸟类等。解决方案的关键在于提出了一种改进的YOLOv8m模型,通过引入全局注意力模块(Global Attention Module, GAM)来聚焦于被遮挡的异物,替换空间金字塔池化后处理模块(SPPF)为跨阶段上下文空间金字塔池化模块(SPPCSPC),以增强多尺度特征提取能力,并引入焦点-EIoU损失函数(Focal-EIoU loss function)来应对高质量与低质量样本不平衡的问题。这些改进加速了模型收敛并提升了检测准确性。
链接: https://arxiv.org/abs/2502.07175
作者: Zhenyue Wang,Guowu Yuan,Hao Zhou,Yi Ma,Yutang Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 24 pages, 16 figures
点击查看摘要
Abstract:The safe operation of high-voltage transmission lines ensures the power grid’s security. Various foreign objects attached to the transmission lines, such as balloons, kites and nesting birds, can significantly affect the safe and stable operation of high-voltage transmission lines. With the advancement of computer vision technology, periodic automatic inspection of foreign objects is efficient and necessary. Existing detection methods have low accuracy because foreign objects attached to the transmission lines are complex, including occlusions, diverse object types, significant scale variations, and complex backgrounds. In response to the practical needs of the Yunnan Branch of China Southern Power Grid Co., Ltd., this paper proposes an improved YOLOv8m-based model for detecting foreign objects on transmission lines. Experiments are conducted on a dataset collected from Yunnan Power Grid. The proposed model enhances the original YOLOv8m by incorporating a Global Attention Module (GAM) into the backbone to focus on occluded foreign objects, replacing the SPPF module with the SPPCSPC module to augment the model’s multiscale feature extraction capability, and introducing the Focal-EIoU loss function to address the issue of high- and low-quality sample imbalances. These improvements accelerate model convergence and enhance detection accuracy. The experimental results demonstrate that our proposed model achieves a 2.7% increase in mAP_0.5, a 4% increase in mAP_0.5:0.95, and a 6% increase in recall.
zh
[CV-72] SemiHMER: Semi-supervised Handwritten Mathematical Expression Recognition using pseudo-labels
【速读】:该论文旨在解决手写数学表达式识别(HMER)在有限标记训练数据条件下性能提升困难的问题。关键解决方案在于引入了一种新颖的双分支半监督学习框架,并提出了全局动态计数模块(GDCM),以增强识别准确性,特别是在长距离公式识别和重复字符出现方面。该框架通过简化一致性正则化为交叉监督学习,利用一个分支的预测作为另一分支的伪标签进行端到端的直接监督,同时采用不同级别的数据增强策略,模拟扩大训练数据集的效果,从而提高网络训练的质量。
链接: https://arxiv.org/abs/2502.07172
作者: Kehua Chen,Haoyang Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages,3 figures
点击查看摘要
Abstract:In recent years, deep learning with Convolutional Neural Networks (CNNs) has achieved remarkable results in the field of HMER (Handwritten Mathematical Expression Recognition). However, it remains challenging to improve performance with limited labeled training data. This paper presents, for the first time, a simple yet effective semi-supervised HMER framework by introducing dual-branch semi-supervised learning. Specifically, we simplify the conventional deep co-training from consistency regularization to cross-supervised learning, where the prediction of one branch is used as a pseudo-label to supervise the other branch directly end-to-end. Considering that the learning of the two branches tends to converge in the later stages of model optimization, we also incorporate a weak-to-strong strategy by applying different levels of augmentation to each branch, which behaves like expanding the training data and improving the quality of network training. Meanwhile, we propose a novel module, the Global Dynamic Counting Module (GDCM), to enhance the performance of the HMER decoder, which alleviates recognition inaccuracies in long-distance formula recognition and the occurrence of repeated characters. We release our code at this https URL.
zh
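【补充示例】SemiHMER 把一致性正则化简化为交叉监督:一个分支的预测(detach 后)作为另一分支的伪标签,直接端到端监督。以下 PyTorch 草图演示该损失的基本形态;实践中两分支还应分别接收弱/强两档增强的输入(weak-to-strong 策略),此处从略:

```python
import torch
import torch.nn.functional as F

def cross_supervised_loss(logits_a, logits_b):
    """双分支交叉监督的示意:伪标签需 detach,不向生成它的分支回传梯度。
    logits 形状可为 (B, C) 或 (B, C, T)(逐符号的序列预测)。"""
    pseudo_a = logits_a.detach().argmax(dim=1)    # 分支 A 的伪标签
    pseudo_b = logits_b.detach().argmax(dim=1)    # 分支 B 的伪标签
    loss_a = F.cross_entropy(logits_a, pseudo_b)  # A 向 B 学习
    loss_b = F.cross_entropy(logits_b, pseudo_a)  # B 向 A 学习
    return loss_a + loss_b
```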
[CV-73] A Survey on Mamba Architecture for Vision Applications
【速读】:该论文旨在解决Transformer在视觉任务中由于注意力机制导致的二次复杂性问题,限制其在大规模数据集上的可扩展性。为了解决这些问题,论文提出了Mamba架构,利用状态空间模型(State-Space Models, SSMs)实现线性可扩展性和高效处理,并增强上下文感知能力。关键解决方案在于Mamba架构及其衍生版本如Vision Mamba (ViM) 和VideoMamba引入的双向扫描、选择性扫描机制以及时空处理技术,这些创新优化了全局和局部特征提取,从而提升了图像和视频理解能力。
链接: https://arxiv.org/abs/2502.07161
作者: Fady Ibrahim,Guangjun Liu,Guanghui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture in computer vision research and applications.
zh
[CV-74] HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates
【速读】:该论文旨在解决超低比特率下图像压缩中的挑战,主要问题是传统学习图像压缩(LIC)方法在高压缩率下产生严重伪影,而生成式向量量化(VQ)建模则由于生成先验与特定输入之间的不匹配导致保真度较低。论文的关键解决方案是提出Hybrid-Diffusion图像压缩(HDCompression),这是一种双流框架,结合了生成式VQ建模、扩散模型以及传统的LIC方法,以实现高保真度和高感知质量。不同于以往直接利用预训练的LIC模型从高压缩的潜码中生成保真度信息的方法,本文使用扩散模型从原始输入中提取高质量的补充保真信息,从而在多个方面提升系统性能:改进索引图预测、增强LIC流的保真度输出,并通过VQ潜码校正优化条件图像重建。此外,所提出的扩散模型基于轻量级的密集代表性向量(DRV),具有非常简单的采样调度器。
链接: https://arxiv.org/abs/2502.07160
作者: Lei Lu,Yize Li,Yanzhi Wang,Wei Wang,Wei Jiang
机构: Department of Electrical and Computer Engineering, Northeastern University (东北大学); Futurewei Technologies Inc. (华为未来技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Under Review
点击查看摘要
Abstract:Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latents, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving indices map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.
zh
[CV-75] Explaining 3D Computed Tomography Classifiers with Counterfactuals
【速读】: This paper aims to generate interpretable counterfactual explanations for high-resolution 3D medical images. To handle the challenges specific to 3D data, such as limited training samples and high memory demands, it proposes a slice-based approach that uses a 2D encoder trained on CT slices and then combines the slices to preserve 3D context. The key is this slice-based processing, which is memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.
链接: https://arxiv.org/abs/2502.07156
作者: Joseph Paul Cohen,Louis Blankemeier,Akshay Chaudhari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code and models: this https URL
点击查看摘要
Abstract:Counterfactual explanations in medical imaging are critical for understanding the predictions made by deep learning models. We extend the Latent Shift counterfactual generation method from 2D applications to 3D computed tomography (CT) scans. We address the challenges associated with 3D data, such as limited training samples and high memory demands, by implementing a slice-based approach. This method leverages a 2D encoder trained on CT slices, which are subsequently combined to maintain 3D context. We demonstrate this technique on two models for clinical phenotype prediction and lung segmentation. Our approach is both memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.
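For intuition, the Latent Shift idea can be sketched in a few lines: shift an autoencoder latent along the classifier's gradient and decode the result. The encoder, decoder, and classifier below are placeholders; in the paper the 2D encoder runs per CT slice and the slices are recombined for 3D context.

```python
import torch

def latent_shift(encoder, decoder, classifier, image, lam=-10.0):
    """Illustrative counterfactual: move the latent code z in the direction that
    changes the classifier's output, then decode the shifted latent."""
    z = encoder(image).detach().requires_grad_(True)
    score = classifier(decoder(z)).sum()      # scalar prediction to differentiate
    grad = torch.autograd.grad(score, z)[0]   # d(score) / d(z)
    return decoder(z + lam * grad).detach()   # lam < 0 suppresses the finding
```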
zh
[CV-76] Mesh2SSM: A Probabilistic Framework for Unsupervised Learning of Statistical Shape Model of Anatomies from Surface Meshes
【速读】: This paper addresses the quality and robustness of statistical shape modeling (SSM) for anatomy evaluation. Existing SSM methods depend on pre-established shape models for training and are limited in handling complex nonlinear shape representations. The key solution is Mesh2SSM++, which learns to estimate correspondences from meshes in an unsupervised manner and quantifies the aleatoric uncertainty inherent in the data through unsupervised, permutation-invariant representation learning. The method operates directly on meshes, and its probabilistic framework improves computational efficiency and interpretability, strengthening the reliability of model predictions and the robustness of decision-making in clinical tasks.
链接: https://arxiv.org/abs/2502.07145
作者: Krithika Iyer,Mokshagna Sai Teja Karanam,Shireen Elhabian
机构: Scientific Computing and Imaging Institute (计算科学与成像研究所); Kahlert School of Computing (Kahlert 计算机学院), University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Anatomy evaluation is crucial for understanding the physiological state, diagnosing abnormalities, and guiding medical interventions. Statistical shape modeling (SSM) is vital in this process. By enabling the extraction of quantitative morphological shape descriptors from MRI and CT scans, SSM provides comprehensive descriptions of anatomical variations within a population. However, the effectiveness of SSM in anatomy evaluation hinges on the quality and robustness of the shape models. While deep learning techniques show promise in addressing these challenges by learning complex nonlinear representations of shapes, existing models still have limitations and often require pre-established shape models for training. To overcome these issues, we propose Mesh2SSM++, a novel approach that learns to estimate correspondences from meshes in an unsupervised manner. This method leverages unsupervised, permutation-invariant representation learning to estimate how to deform a template point cloud into subject-specific meshes, forming a correspondence-based shape model. Additionally, our probabilistic formulation allows learning a population-specific template, reducing potential biases associated with template selection. A key feature of Mesh2SSM++ is its ability to quantify aleatoric uncertainty, which captures inherent data variability and is essential for ensuring reliable model predictions and robust decision-making in clinical tasks, especially under challenging imaging conditions. Through extensive validation across diverse anatomies, evaluation metrics, and downstream tasks, we demonstrate that Mesh2SSM++ outperforms existing methods. Its ability to operate directly on meshes, combined with computational efficiency and interpretability through its probabilistic framework, makes it an attractive alternative to traditional and deep learning-based SSM approaches.
zh
[CV-77] Few-Shot Multi-Human Neural Rendering Using Geometry Constraints
【速读】: This paper addresses recovering the shape and radiance of a scene containing multiple people from only a few images. The key is a neural implicit reconstruction method that imposes geometry constraints using pre-computed meshes from the SMPL human body model, leverages bounding boxes for improved rendering, introduces a ray regularization scheme to minimize rendering inconsistencies, and adds a saturation regularization for robust optimization under varying illumination. Together, these contributions overcome the inherent challenges of estimating multiple humans from sparse views.
链接: https://arxiv.org/abs/2502.07140
作者: Qian li,Victoria Fernàndez Abrevaya,Franck Multon,Adnane Boukhayma
机构: Inria(英立克); University Rennes(雷恩大学); IRISA(英瑞萨); CNRS(法国国家科学研究中心), France(法国); Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所), Germany(德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We present a method for recovering the shape and radiance of a scene consisting of multiple people given solely a few images. Multi-human scenes are complex due to additional occlusion and clutter. For single-human settings, existing approaches using implicit neural representations have achieved impressive results that deliver accurate geometry and appearance. However, it remains challenging to extend these methods for estimating multiple humans from sparse views. We propose a neural implicit reconstruction method that addresses the inherent challenges of this task through the following contributions: First, we propose to use geometry constraints by exploiting pre-computed meshes using a human body model (SMPL). Specifically, we regularize the signed distances using the SMPL mesh and leverage bounding boxes for improved rendering. Second, we propose a ray regularization scheme to minimize rendering inconsistencies, and a saturation regularization for robust optimization in variable illumination. Extensive experiments on both real and synthetic datasets demonstrate the benefits of our approach and show state-of-the-art performance against existing neural reconstruction methods.
zh
[CV-78] Unconstrained Body Recognition at Altitude and Range: Comparing Four Approaches
【速读】: This paper addresses long-term person identification, focusing on persistent body-shape characteristics rather than short-term temporary cues such as clothing. The key is the introduction of a Vision Transformer (ViT)-based Body Identification from Diverse Datasets (BIDDS) model and an improved Swin-ViT variant (Swin-BIDDS), alongside improved training of earlier approaches based on the Linguistic and Non-linguistic Core ResNet Identity Models (LCRIM and NLCRIM). All models are trained on a large, diverse dataset of over 1.9 million images and evaluated on standard re-identification benchmarks as well as challenging real-world conditions.
链接: https://arxiv.org/abs/2502.07130
作者: Blake A Myers,Matthew Q Hill,Veda Nandan Gandi,Thomas M Metz,Alice J O’Toole
机构: School of Behavioral and Brain Sciences, The University of Texas at Dallas (行为与脑科学学院,德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This study presents an investigation of four distinct approaches to long-term person identification using body shape. Unlike short-term re-identification systems that rely on temporary features (e.g., clothing), we focus on learning persistent body shape characteristics that remain stable over time. We introduce a body identification model based on a Vision Transformer (ViT) (Body Identification from Diverse Datasets, BIDDS) and on a Swin-ViT model (Swin-BIDDS). We also expand on previous approaches based on the Linguistic and Non-linguistic Core ResNet Identity Models (LCRIM and NLCRIM), but with improved training. All models are trained on a large and diverse dataset of over 1.9 million images of approximately 5k identities across 9 databases. Performance was evaluated on standard re-identification benchmark datasets (MARS, MSMT17, Outdoor Gait, DeepChange) and on an unconstrained dataset that includes images at a distance (from close-range to 1000m), at altitude (from an unmanned aerial vehicle, UAV), and with clothing change. A comparative analysis across these models provides insights into how different backbone architectures and input image sizes impact long-term body identification performance across real-world conditions.
zh
[CV-79] Is Long Range Sequential Modeling Necessary For Colorectal Tumor Segmentation?
【速读】: This paper evaluates the effectiveness of long-range sequence modeling mechanisms (such as Transformers and Mamba) for 3D medical image segmentation and proposes a new method, MambaOutUNet. The key finding is that robust local token interactions can outperform long-range modeling techniques when the region of interest is small and anatomically complex, suggesting a potential shift in the direction of 3D tumor segmentation research.
链接: https://arxiv.org/abs/2502.07120
作者: Abhishek Srivastava,Koushik Biswas,Gorkem Durak,Gulsah Ozden,Mustafa Adli,Ulas Bagci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 figure
点击查看摘要
Abstract:Segmentation of colorectal cancer (CRC) tumors in 3D medical imaging is both complex and clinically critical, providing vital support for effective radiation therapy planning and survival outcome assessment. Recently, 3D volumetric segmentation architectures incorporating long-range sequence modeling mechanisms, such as Transformers and Mamba, have gained attention for their capacity to achieve high accuracy in 3D medical image segmentation. In this work, we evaluate the effectiveness of these global token modeling techniques by pitting them against our proposed MambaOutUNet within the context of our newly introduced colorectal tumor segmentation dataset (CTS-204). Our findings suggest that robust local token interactions can outperform long-range modeling techniques in cases where the region of interest is small and anatomically complex, proposing a potential shift in 3D tumor segmentation research.
zh
[CV-80] Lotus: Creating Short Videos From Long Videos With Abstractive and Extractive Summarization
【速读】: This paper addresses the difficulties short-form video creators face in planning, extracting, and arranging clips when repurposing long-form videos. The key is the Lotus system, which combines abstractive and extractive approaches: it first generates a short-form script with corresponding speech and matches long-form video clips to the generated narration, preserving the original content while retaining flexibility. Creators can then add extractive clips with an automated method or refine the short-form video further in Lotus's editing interface.
链接: https://arxiv.org/abs/2502.07096
作者: Aadit Barua,Karim Benharrak,Meng Chen,Mina Huh,Amy Pavel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures, ACM IUI 2025
点击查看摘要
Abstract:Short-form videos are popular on platforms like TikTok and Instagram as they quickly capture viewers’ attention. Many creators repurpose their long-form videos to produce short-form videos, but creators report that planning, extracting, and arranging clips from long-form videos is challenging. Currently, creators make extractive short-form videos composed of existing long-form video clips or abstractive short-form videos by adding newly recorded narration to visuals. While extractive videos maintain the original connection between audio and visuals, abstractive videos offer flexibility in selecting content to be included in a shorter time. We present Lotus, a system that combines both approaches to balance preserving the original content with flexibility over the content. Lotus first creates an abstractive short-form video by generating both a short-form script and its corresponding speech, then matching long-form video clips to the generated narration. Creators can then add extractive clips with an automated method or Lotus’s editing interface. Lotus’s interface can be used to further refine the short-form video. We compare short-form videos generated by Lotus with those using an extractive baseline method. In our user study, we compare creating short-form videos using Lotus to participants’ existing practice.
zh
[CV-81] PrismAvatar: Real-time animated 3D neural head avatars on edge devices
【速读】: This paper addresses real-time animation and rendering of high-quality 3D head avatars on resource-constrained edge devices. The key is to integrate a rigged prism lattice with a 3D morphable head model and use a hybrid rendering model to simultaneously reconstruct a mesh-based head and a deformable NeRF model for regions the 3DMM does not cover. The deformable NeRF is then distilled into a rigged mesh and neural textures, enabling efficient animation and rendering within the constraints of the traditional triangle rendering pipeline.
链接: https://arxiv.org/abs/2502.07030
作者: Prashant Raina,Felix Taubner,Mathieu Tuli,Eu Wern Teh,Kevin Ferreira
机构: LG Electronics(乐金电子)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 8 pages, 5 figures
点击查看摘要
Abstract:We present PrismAvatar: a 3D head avatar model which is designed specifically to enable real-time animation and rendering on resource-constrained edge devices, while still enjoying the benefits of neural volumetric rendering at training time. By integrating a rigged prism lattice with a 3D morphable head model, we use a hybrid rendering model to simultaneously reconstruct a mesh-based head and a deformable NeRF model for regions not represented by the 3DMM. We then distill the deformable NeRF into a rigged mesh and neural textures, which can be animated and rendered efficiently within the constraints of the traditional triangle rendering pipeline. In addition to running at 60 fps with low memory usage on mobile devices, we find that our trained models have comparable quality to state-of-the-art 3D avatar models on desktop devices.
zh
[CV-82] Detecting Neurodegenerative Diseases using Frame-Level Handwriting Embeddings
【速读】: This paper explores representing handwriting signals as spectrograms to assess neurodegenerative diseases, including Parkinson's Disease (PD), Alzheimer's Disease (AD), and Parkinson's Disease Mimics (PDM). The key lies in applying CNN and CNN-BLSTM models to binary classification with both multi-channel fixed-size and frame-based spectrograms, and in assessing how handwriting tasks and spectrogram channel combinations affect performance. The study shows that these choices significantly impact classification, with the CNN consistently outperforming the CNN-BLSTM.
链接: https://arxiv.org/abs/2502.07025
作者: Sarah Laouedj,Yuzhe Wang,Jesus Villalba,Thomas Thebaud,Laureano Moro-Velazquez,Najim Dehak
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this study, we explored the use of spectrograms to represent handwriting signals for assessing neurodegenerative diseases, including 42 healthy controls (CTL), 35 subjects with Parkinson’s Disease (PD), 21 with Alzheimer’s Disease (AD), and 15 with Parkinson’s Disease Mimics (PDM). We applied CNN and CNN-BLSTM models for binary classification using both multi-channel fixed-size and frame-based spectrograms. Our results showed that handwriting tasks and spectrogram channel combinations significantly impacted classification performance. The highest F1-score (89.8%) was achieved for AD vs. CTL, while PD vs. CTL reached 74.5%, and PD vs. PDM scored 77.97%. CNN consistently outperformed CNN-BLSTM. Different sliding window lengths were tested for constructing frame-based spectrograms. A 1-second window worked best for AD, longer windows improved PD classification, and window length had little effect on PD vs. PDM.
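As a concrete illustration of the frame-based representation, the snippet below computes a sliding-window spectrogram for one synthetic handwriting channel with SciPy; the 200 Hz sampling rate and the 1-second window (the length reported best for AD) are illustrative choices, not the study's acquisition settings.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 200.0                                   # assumed tablet sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
pen_x = np.sin(2 * np.pi * 4 * t) + 0.1 * np.random.randn(t.size)  # toy x-coordinate channel

# 1 s sliding window with 50% overlap; one such spectrogram per handwriting
# channel (x, y, pressure, ...), stacked as input channels for the CNN.
freqs, frames, Sxx = spectrogram(pen_x, fs=fs, nperseg=int(fs), noverlap=int(fs) // 2)
print(Sxx.shape)  # (freq_bins, n_frames)
```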
zh
[CV-83] Early Operative Difficulty Assessment in Laparoscopic Cholecystectomy via Snapshot-Centric Video Analysis
【速读】: This paper addresses the assessment of laparoscopic cholecystectomy (LC) operative difficulty (LCOD), which varies greatly intraoperatively and influences surgical outcomes. Although surgical workflow has been analyzed extensively, few works exploit intraoperative video data to predict LCOD. The paper proposes a new task, early LCOD assessment, and designs a deep learning model, SurgPrOD, which analyzes features from global and local temporal resolutions (snapshots) of the observed LC video, together with a novel snapshot-centric attention (SCA) module acting across snapshots to strengthen LCOD prediction. The key is that SurgPrOD and its SCA module predict LCOD early and stably, surpassing baselines by at least 0.22 points on a new metric and by at least 9 and 5 percentage points in F1 score and top-1 accuracy, respectively.
链接: https://arxiv.org/abs/2502.07008
作者: Saurav Sharma,Maria Vannucci,Leonardo Pestana Legori,Mario Scaglia,Giovanni Guglielmo Laracca,Didier Mutter,Sergio Alfieri,Pietro Mascagni,Nicolas Padoy
机构: University of Strasbourg(斯特拉斯堡大学), CNRS, INSERM, ICube, UMR7357, France; IHU Strasbourg(斯特拉斯堡IHU), Strasbourg, France; Università degli Studi di Milano(米兰大学); General Surgery Department, University of Torino(都灵大学), Turin, Italy; Department of Medical Surgical Science and Translational Medicine, Sant’Andrea Hospital, Sapienza University of Rome(罗马一大), Rome, Italy; Fondazione Policlinico Universitario A. Gemelli IRCCS(阿格梅利综合医院基金会), Rome, Italy; Università Cattolica del Sacro Cuore(天主教大学), Rome, Italy; University Hospital of Strasbourg(斯特拉斯堡大学医院), France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IPCAI, 2025
点击查看摘要
Abstract:Purpose: Laparoscopic cholecystectomy (LC) operative difficulty (LCOD) is highly variable and influences outcomes. Despite extensive LC studies in surgical workflow analysis, limited efforts explore LCOD using intraoperative video data. Early recognition of LCOD could allow prompt review by expert surgeons, enhance operating room (OR) planning, and improve surgical outcomes. Methods: We propose the clinical task of early LCOD assessment using limited video observations. We design SurgPrOD, a deep learning model to assess LCOD by analyzing features from global and local temporal resolutions (snapshots) of the observed LC video. Also, we propose a novel snapshot-centric attention (SCA) module, acting across snapshots, to enhance LCOD prediction. We introduce the CholeScore dataset, featuring video-level LCOD labels to validate our method. Results: We evaluate SurgPrOD on 3 LCOD assessment scales in the CholeScore dataset. On our new metric assessing early and stable correct predictions, SurgPrOD surpasses baselines by at least 0.22 points. SurgPrOD improves over baselines by at least 9 and 5 percentage points in F1 score and top-1 accuracy, respectively, demonstrating its effectiveness in correct predictions. Conclusion: We propose a new task for early LCOD assessment and a novel model, SurgPrOD, analyzing surgical video from global and local perspectives. Our results on the CholeScore dataset establish a new benchmark to study LCOD using intraoperative video data.
zh
[CV-84] Grounding Creativity in Physics: A Brief Survey of Physical Priors in AIGC
【速读】: This paper addresses artifacts in AI-generated 3D and 4D content, such as unrealistic deformations, unstable dynamics, and implausible object interactions, which arise from neglecting physical principles. The key is to incorporate physics priors into generative models, surveyed across representation types including vision-based, NeRF (Neural Radiance Fields)-based, and Gaussian Splatting-based approaches, to enhance structural integrity and motion realism and thereby improve the physical consistency of generated content.
链接: https://arxiv.org/abs/2502.07007
作者: Siwei Meng,Yawei Luo,Ping Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in AI-generated content have significantly improved the realism of 3D and 4D generation. However, most existing methods prioritize appearance consistency while neglecting underlying physical principles, leading to artifacts such as unrealistic deformations, unstable dynamics, and implausible object interactions. Incorporating physics priors into generative models has become a crucial research direction to enhance structural integrity and motion realism. This survey provides a review of physics-aware generative methods, systematically analyzing how physical constraints are integrated into 3D and 4D generation. First, we examine recent works in incorporating physical priors into static and dynamic 3D generation, categorizing methods based on representation types, including vision-based, NeRF-based, and Gaussian Splatting-based approaches. Second, we explore emerging techniques in 4D generation, focusing on methods that model temporal dynamics with physical simulations. Finally, we conduct a comparative analysis of major methods, highlighting their strengths, limitations, and suitability for different materials and motion dynamics. By presenting an in-depth analysis of physics-grounded AIGC, this survey aims to bridge the gap between generative models and physical realism, providing insights that inspire future research in physically consistent content generation.
zh
[CV-85] AstroLoc: Robust Space to Ground Image Localizer
【速读】: This paper tackles the automatic localization of the large volume of Earth photos taken by astronauts aboard the International Space Station. Existing methods train only on satellite imagery and ignore millions of open-source astronaut photos. The key contribution is AstroLoc, a new Astronaut Photography Localization (APL) pipeline that can be trained with astronaut photos. AstroLoc learns a robust representation of Earth's surface features through two losses: a pairwise loss that pairs astronaut photos with their matching satellite images, and an unsupervised-mining weighted loss over clusters of satellite imagery weighted by their relevance to astronaut photography. This markedly improves localization recall, and without any fine-tuning AstroLoc also performs strongly on related tasks such as the lost-in-space satellite problem and historical space imagery localization.
链接: https://arxiv.org/abs/2502.07003
作者: Gabriele Berton,Alex Stoken,Carlo Masone
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Astronauts take thousands of photos of Earth per day from the International Space Station, which, once localized on Earth’s surface, are used for a multitude of tasks, ranging from climate change research to disaster management. The localization process, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, find its most similar match among a large database of geo-tagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions open-source astronaut photos. In this work we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth’s surface features through two losses: astronaut photos paired with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography via unsupervised mining. We find that AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, pushing the limits of existing datasets with a recall@100 consistently over 99%. Finally, we note that AstroLoc, without any fine-tuning, provides excellent results for related tasks like the lost-in-space satellite problem and historical space imagery localization.
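The pairwise loss can be read as a standard in-batch contrastive objective over matched astronaut/satellite pairs. A minimal sketch, assuming the i-th astronaut embedding matches the i-th satellite embedding and everything else in the batch acts as a negative; the temperature value is an assumption, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(astro_emb, sat_emb, temperature=0.07):
    """In-batch contrastive loss over matched astronaut/satellite embeddings."""
    astro = F.normalize(astro_emb, dim=-1)
    sat = F.normalize(sat_emb, dim=-1)
    logits = astro @ sat.T / temperature                        # (B, B) similarities
    targets = torch.arange(astro.size(0), device=astro.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```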
zh
[CV-86] From Image to Video: An Empirical Study of Diffusion Representations
【速读】: This paper investigates whether video diffusion models outperform image diffusion models on visual understanding tasks. The key is a systematic comparison of the same architecture trained for video versus image generation, analyzing the performance of their latent representations on downstream tasks such as image classification, action recognition, depth estimation, and tracking, which reveals the role of temporal information in representation learning.
链接: https://arxiv.org/abs/2502.07001
作者: Pedro Vélez,Luisa F. Polanía,Yi Yang,Chuhan Zhang,Rishab Kabra,Anurag Arnab,Mehdi S. M. Sajjadi
机构: Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
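Probing a frozen generative backbone usually means attaching a linear classifier to intermediate activations. The sketch below illustrates that recipe with a tiny stand-in network; the hooked layer, feature width, and pooling are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in for a pretrained (image or video) diffusion U-Net."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 64, 3, padding=1)
        self.mid = nn.Conv2d(64, 128, 3, padding=1)   # layer whose features we probe
    def forward(self, x, t):
        return self.mid(F.relu(self.enc(x)))          # timestep t ignored in this toy

backbone = TinyBackbone().eval()
for p in backbone.parameters():
    p.requires_grad_(False)                           # backbone stays frozen

feats = {}
backbone.mid.register_forward_hook(lambda m, i, o: feats.update(h=o.detach()))

probe = nn.Linear(128, 10)                            # linear probe on pooled features
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

x = torch.randn(8, 3, 32, 32)                         # noisy inputs at some noise level
labels = torch.randint(0, 10, (8,))
backbone(x, t=torch.zeros(8))                         # forward pass fills the hook
h = feats["h"].mean(dim=(-2, -1))                     # global average pool
loss = F.cross_entropy(probe(h), labels)
loss.backward(); opt.step()
```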
zh
[CV-87] Indoor Light and Heat Estimation from a Single Panorama
【速读】: This paper addresses automatic estimation of indoor light and heat distributions. The key is to estimate indoor illumination and heat transport directly from captured indoor-outdoor High Dynamic Range (HDR) panoramas: the indoor panorama is used to estimate the 3D room layout, while the corresponding outdoor panorama serves as an environment map to infer spatially-varying lighting and material properties. The paper further links indoor light transport to heat transport and implements a transient heat simulation to produce indoor heat panoramas, enabling automatic indoor light and heat estimation without manual input or cumbersome field measurements.
链接: https://arxiv.org/abs/2502.06973
作者: Guanzhou Ji,Sriram Narayanan,Azadeh Sawyer,Srinivasa Narasimhan
机构: Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a novel application for directly estimating indoor light and heat maps from captured indoor-outdoor High Dynamic Range (HDR) panoramas. In our image-based rendering method, the indoor panorama is used to estimate the 3D room layout, while the corresponding outdoor panorama serves as an environment map to infer spatially-varying light and material properties. We establish a connection between indoor light transport and heat transport and implement transient heat simulation to generate indoor heat panoramas. The sensitivity analysis of various thermal parameters is conducted, and the resulting heat maps are compared with the images captured by the thermal camera in real-world scenarios. This digital application enables automatic indoor light and heat estimation without manual inputs and cumbersome field measurements.
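The transient heat simulation amounts to repeatedly applying a discretized heat-equation update over the room. Below is a toy explicit finite-difference step; the grid, diffusivity, time step, and boundary values are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

# Explicit FTCS step for the 2D transient heat equation dT/dt = alpha * lap(T).
alpha, dx, dt = 1e-4, 0.05, 1.0          # diffusivity (m^2/s), grid step (m), time step (s)
assert alpha * dt / dx**2 <= 0.25        # stability condition for the explicit scheme

T = np.full((64, 64), 20.0)              # room initialised at 20 C
T[:, 0] = 35.0                           # heated boundary, e.g. a sunlit wall

for _ in range(1000):                    # march the simulation forward in time
    lap = (np.roll(T, 1, 0) + np.roll(T, -1, 0) +
           np.roll(T, 1, 1) + np.roll(T, -1, 1) - 4 * T) / dx**2
    T[1:-1, 1:-1] += alpha * dt * lap[1:-1, 1:-1]   # update interior cells only

print(T.mean())                          # heat has diffused in from the warm wall
```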
zh
[CV-88] GAS: Generative Avatar Synthesis from a Single Image
【速读】: This paper addresses generating view-consistent and temporally coherent avatars from a single image. Existing methods condition diffusion models on human templates such as depth or normal maps, but the sparsity of the driving signals and their discrepancy from the actual subject cause multi-view and temporal inconsistencies. The key is to combine the reconstruction power of regression-based 3D human reconstruction with the generative capability of a diffusion model: the dense driving signal from the initial reconstruction provides comprehensive conditioning that keeps the synthesis faithful to the reference appearance and structure. A unified framework further transfers the generalization learned from novel-pose synthesis on in-the-wild videos to novel-view synthesis, achieving high-quality view-consistent renderings and realistic non-rigid deformations.
链接: https://arxiv.org/abs/2502.06957
作者: Yixing Lu,Junting Dong,Youngjoong Kwon,Qin Zhao,Bo Dai,Fernando De la Torre
机构: Carnegie Mellon University (卡内基梅隆大学); Shanghai AI Laboratory (上海人工智能实验室); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image, addressing the challenging problem of single-image avatar generation. While recent methods employ diffusion models conditioned on human templates like depth or normal maps, they often struggle to preserve appearance information due to the discrepancy between sparse driving signals and the actual human subject, resulting in multi-view and temporal inconsistencies. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. The dense driving signal from the initial reconstructed human provides comprehensive conditioning, ensuring high-quality synthesis faithful to the reference appearance and structure. Additionally, we propose a unified framework that enables the generalization learned from novel pose synthesis on in-the-wild videos to naturally transfer to novel view synthesis. Our video-based diffusion model enhances disentangled synthesis with high-quality view-consistent renderings for novel views and realistic non-rigid deformations in novel pose animation. Results demonstrate the superior generalization ability of our method across in-domain and out-of-domain in-the-wild datasets. Project page: this https URL
zh
[CV-89] PyPotteryInk: One-Step Diffusion Model for Sketch to Publication-ready Archaeological Drawings
【速读】: This paper addresses the time-consuming manual conversion of pencil sketches into publication-ready inked drawings in archaeological pottery documentation. The key is PyPotteryInk, an open-source automated pipeline built on a modified img2img-turbo architecture that implements a one-step diffusion model: a sketch is processed in a single forward pass while crucial morphological details are preserved and archaeological documentation standards and analytical value are maintained. The system adopts an efficient patch-based approach with dynamic overlap, ensuring high-resolution output regardless of the input sketch size.
链接: https://arxiv.org/abs/2502.06897
作者: Lorenzo Cardarelli
机构: Sapienza University of Rome (罗马大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Archaeological pottery documentation traditionally requires a time-consuming manual process of converting pencil sketches into publication-ready inked drawings. I present PyPotteryInk, an open-source automated pipeline that transforms archaeological pottery sketches into standardised publication-ready drawings using a one-step diffusion model. Built on a modified img2img-turbo architecture, the system processes drawings in a single forward pass while preserving crucial morphological details and maintaining archaeological documentation standards and analytical value. The model employs an efficient patch-based approach with dynamic overlap, enabling high-resolution output regardless of input drawing size. I demonstrate the effectiveness of the approach on a dataset of Italian protohistoric pottery drawings, where it successfully captures both fine details like decorative patterns and structural elements like vessel profiles or handling elements. Expert evaluation confirms that the generated drawings meet publication standards while significantly reducing processing time from hours to seconds per drawing. The model can be fine-tuned to adapt to different archaeological contexts with minimal training data, making it versatile across various pottery documentation styles. The pre-trained models, the Python library and comprehensive documentation are provided to facilitate adoption within the archaeological research community.
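The patch-based inference with overlap can be sketched as tile, process, blend. Patch size, overlap, and the averaging blend below are illustrative assumptions rather than PyPotteryInk's exact implementation.

```python
import numpy as np

def tile_and_blend(img, process, patch=512, overlap=64):
    """Process a large 2D image in overlapping tiles, averaging overlaps so the
    output resolution is independent of the input size. `process` must map an
    array to an array of the same shape (e.g. one diffusion forward pass)."""
    H, W = img.shape
    out = np.zeros((H, W), dtype=np.float64)
    weight = np.zeros_like(out)
    step = patch - overlap
    for y in range(0, max(H - overlap, 1), step):
        for x in range(0, max(W - overlap, 1), step):
            ys, xs = slice(y, min(y + patch, H)), slice(x, min(x + patch, W))
            out[ys, xs] += process(img[ys, xs])   # run the model on this tile
            weight[ys, xs] += 1.0                 # count contributions per pixel
    return out / np.maximum(weight, 1.0)          # average where tiles overlap

inked = tile_and_blend(np.random.rand(1200, 900), process=lambda p: 1.0 - p)
```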
zh
[CV-90] AI-Driven HSI: Multimodality Fusion Challenges and the Deep Learning Revolution
【速读】: This paper surveys hyperspectral imaging (HSI) technology and its applications, and examines data-fusion challenges and the role of deep learning models in processing HSI data. The key lies in integrating multimodal HSI with artificial intelligence, especially deep learning, to improve classification accuracy and operational efficiency. Deep learning markedly advances HSI analysis in feature extraction, change detection, denoising, unmixing, dimensionality reduction, land-cover mapping, data augmentation, spectral reconstruction, and super-resolution. The survey also highlights the emerging trend of fusing hyperspectral cameras with large language models (LLMs) to enable advanced applications such as low-visibility crash detection and face anti-spoofing.
链接: https://arxiv.org/abs/2502.06894
作者: David S. Bhatti,Yougin Choi,Rahman S M Wahidur,Maleeka Bakhtawar,Sumin Kim,Surin Lee,Yongtae Lee,Heung-No Lee
机构: School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST)(光州科学技术研究院); Artificial Intelligence Graduate School, GIST(光州科学技术研究院); INFONET research lab at GIST (Korea)(韩国光州科学技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 39 Pages, 22 figures, 20 tables
点击查看摘要
Abstract:Hyperspectral imaging (HSI) captures spatial and spectral data, enabling analysis of features invisible to conventional systems. The technology is vital in fields such as weather monitoring, food quality control, counterfeit detection, and healthcare diagnostics, and extends into defense, agriculture, and industrial automation. HSI has advanced with improvements in spectral resolution, miniaturization, and computational methods. This study provides an overview of HSI, its applications, challenges in data fusion, and the role of deep learning models in processing HSI data. We discuss how integration of multimodal HSI with AI, particularly with deep learning, improves classification accuracy and operational efficiency. Deep learning enhances HSI analysis in areas like feature extraction, change detection, denoising, unmixing, dimensionality reduction, land-cover mapping, data augmentation, spectral reconstruction and super-resolution. An emerging focus is the fusion of hyperspectral cameras with large language models (LLMs), referred to as highbrain LLMs, enabling the development of advanced applications such as low-visibility crash detection and face anti-spoofing. We also highlight key players in the HSI industry, its compound annual growth rate and the growing industrial significance. The purpose is to offer insight to both technical and non-technical audiences, covering HSI imaging, trends, and future directions, while providing valuable information on HSI datasets and software libraries.
zh
[CV-91] Secure Visual Data Processing via Federated Learning
【速读】: This paper addresses privacy protection in large-scale visual data processing, particularly the combination of object detection, federated learning, and anonymization. The key is a new approach that fuses all three: object detection, federated learning, and anonymization together provide a more comprehensive privacy-protection strategy, covering the security gaps that remain when any single technique is used on its own.
链接: https://arxiv.org/abs/2502.06889
作者: Pedro Santos,Tânia Carvalho,Filipe Magalhães,Luís Antunes
机构: Faculty of Sciences, University of Porto(科学学院,波尔图大学); TekPrivacy(泰克隐私), Porto(波尔图)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 Pages, 3 figures, 5 tables
点击查看摘要
Abstract:As the demand for privacy in visual data management grows, safeguarding sensitive information has become a critical challenge. This paper addresses the need for privacy-preserving solutions in large-scale visual data processing by leveraging federated learning. Although there have been developments in this field, previous research has mainly focused on integrating object detection with either anonymization or federated learning. However, these pairs often fail to address complex privacy concerns. On the one hand, object detection with anonymization alone can be vulnerable to reverse techniques. On the other hand, federated learning may not provide sufficient privacy guarantees. Therefore, we propose a new approach that combines object detection, federated learning and anonymization. Combining these three components aims to offer a robust privacy protection strategy by addressing different vulnerabilities in visual data. Our solution is evaluated against traditional centralized models, showing that while there is a slight trade-off in accuracy, the privacy benefits are substantial, making it well-suited for privacy-sensitive applications.
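The federated component keeps raw (or anonymized) images on each client; only model weights travel to the server. A minimal federated-averaging sketch, using the classic size-weighted FedAvg rule; variable names are illustrative.

```python
import torch

def fedavg(client_states, client_sizes):
    """Server-side aggregation: weighted average of client state_dicts,
    weighted by each client's dataset size. No images ever leave the clients."""
    total = float(sum(client_sizes))
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(state[key] * (n / total)
                       for state, n in zip(client_states, client_sizes))
    return avg

# Usage: global_model.load_state_dict(fedavg(states, sizes)) after each round.
```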
zh
[CV-92] BF-GAN: Development of an AI-driven Bubbly Flow Image Generation Model Using Generative Adversarial Networks
【速读】: This paper addresses generating realistic and high-quality bubbly flow images with generative AI. The key is the bubbly flow generative adversarial network (BF-GAN) architecture, which generates images conditioned on the physical inputs jg and jf. To further improve generative performance, a multi-scale loss function incorporating a mismatch loss and a pixel loss is developed. Key bubbly flow parameters extracted from the generated images are compared with measurements and empirical correlations, validating that BF-GAN can generate realistic, high-quality bubbly flow images for any given jg and jf within the research scope.
链接: https://arxiv.org/abs/2502.06863
作者: Wen Zhou,Shuichiro Miwa,Yang Liu,Koji Okamoto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:A generative AI architecture called bubbly flow generative adversarial networks (BF-GAN) is developed, designed to generate realistic and high-quality bubbly flow images through physically conditioned inputs, jg and jf. Initially, 52 sets of bubbly flow experiments under varying conditions are conducted to collect 140,000 bubbly flow images with physical labels of jg and jf for training data. A multi-scale loss function is then developed, incorporating mismatch loss and pixel loss to enhance the generative performance of BF-GAN further. Regarding evaluative metrics of generative AI, the BF-GAN has surpassed conventional GANs. Physically, key parameters of bubbly flow generated by BF-GAN are extracted and compared with measurement values and empirical correlations, validating BF-GAN's generative performance. The comparative analysis demonstrates that the BF-GAN can generate realistic and high-quality bubbly flow images with any given jg and jf within the research scope. BF-GAN offers a generative AI solution for two-phase flow research, substantially lowering the time and cost required to obtain high-quality data. In addition, it can function as a benchmark dataset generator for bubbly flow detection and segmentation algorithms, enhancing overall productivity in this research domain. The BF-GAN model is available online (this https URL).
zh
[CV-93] AutoSketch: VLM-assisted Style-Aware Vector Sketch Completion
【速读】: This paper addresses automatically completing a partial hand-drawn sketch of a complex scene (e.g., "a woman chatting with a man in the park") while preserving its original style. Existing methods only generate sketches from scratch and cannot complete a partial sketch in the original style. The key is AutoSketch, a style-aware vector sketch completion method that accommodates diverse sketch styles: it uses a pretrained vision-language model (VLM) to describe the style of the partial sketch in natural language and replicates that style in newly generated strokes. The method first optimizes strokes to match an input prompt augmented with the VLM-extracted style descriptions, and then uses the VLM to generate executable style-adjustment code that conforms the strokes to the desired style.
链接: https://arxiv.org/abs/2502.06860
作者: Hsiao-Yuan Chin,I-Chao Shen,Yi-Ting Chiu,Bing-Yu Chen
机构: National Taiwan University; The University of Tokyo
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 11 pages
点击查看摘要
Abstract:The ability to automatically complete a partial sketch that depicts a complex scene, e.g., "a woman chatting with a man in the park", is very useful. However, existing sketch generation methods create sketches from scratch; they do not complete a partial sketch in the style of the original. To address this challenge, we introduce AutoSketch, a style-aware vector sketch completion method that accommodates diverse sketch styles. Our key observation is that the style descriptions of a sketch in natural language preserve the style during automatic sketch completion. Thus, we use a pretrained vision-language model (VLM) to describe the styles of the partial sketches in natural language and replicate these styles using newly generated strokes. We initially optimize the strokes to match an input prompt augmented by style descriptions extracted from the VLM. Such descriptions allow the method to establish a diffusion prior in close alignment with that of the partial sketch. Next, we utilize the VLM to generate an executable style adjustment code that adjusts the strokes to conform to the desired style. We compare our method with existing methods across various sketch styles and prompts, perform extensive ablation studies and qualitative and quantitative evaluations, and demonstrate that AutoSketch can support various sketch scenarios.
zh
[CV-94] Vision-Integrated LLMs for Autonomous Driving Assistance: Human Performance Comparison and Trust Evaluation
【速读】: This paper addresses the difficulty traditional autonomous driving systems have reasoning in complex, unforeseen scenarios due to limited understanding of spatial relationships. The key is a Large Language Model (LLM)-based autonomous driving assistance system that integrates a vision adapter and an LLM reasoning module to improve visual understanding and decision-making: the vision adapter combines YOLOv4 and a Vision Transformer (ViT) to extract comprehensive visual features, while GPT-4 enables human-like spatial reasoning and response generation.
链接: https://arxiv.org/abs/2502.06843
作者: Namhee Kim,Woojin Park
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Traditional autonomous driving systems often struggle with reasoning in complex, unexpected scenarios due to limited comprehension of spatial relationships. In response, this study introduces a Large Language Model (LLM)-based Autonomous Driving (AD) assistance system that integrates a vision adapter and an LLM reasoning module to enhance visual understanding and decision-making. The vision adapter, combining YOLOv4 and Vision Transformer (ViT), extracts comprehensive visual features, while GPT-4 enables human-like spatial reasoning and response generation. Experimental evaluations with 45 experienced drivers revealed that the system closely mirrors human performance in describing situations and moderately aligns with human decisions in generating appropriate responses.
zh
[CV-95] Emotion Recognition and Generation: A Comprehensive Review of Face, Speech, and Text Modalities
【速读】: This paper provides a comprehensive review of emotion recognition and generation aimed at researchers entering the field. It introduces the fundamental principles of emotion recognition and generation across facial, vocal, and textual modalities, and categorizes recent state-of-the-art methods along with their theoretical foundations and motivations. The key lies in its discussion of evaluation metrics, comparative analyses, and current limitations, which exposes the main challenges researchers face, and in the proposed future research directions toward robust, effective, and ethically responsible emotion recognition and generation systems.
链接: https://arxiv.org/abs/2502.06803
作者: Rebecca Mobbs,Dimitrios Makris,Vasileios Argyriou
机构: Kingston University (金斯顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ACM Computing Surveys
点击查看摘要
Abstract:Emotion recognition and generation have emerged as crucial topics in Artificial Intelligence research, playing a significant role in enhancing human-computer interaction within healthcare, customer service, and other fields. Although several reviews have been conducted on emotion recognition and generation as separate entities, many of these works are either fragmented or limited to specific methodologies, lacking a comprehensive overview of recent developments and trends across different modalities. In this survey, we provide a holistic review aimed at researchers beginning their exploration in emotion recognition and generation. We introduce the fundamental principles underlying emotion recognition and generation across facial, vocal, and textual modalities. This work categorises recent state-of-the-art research into distinct technical approaches and explains the theoretical foundations and motivations behind these methodologies, offering a clearer understanding of their application. Moreover, we discuss evaluation metrics, comparative analyses, and current limitations, shedding light on the challenges faced by researchers in the field. Finally, we propose future research directions to address these challenges and encourage further exploration into developing robust, effective, and ethically responsible emotion recognition and generation systems.
zh
[CV-96] The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation
【速读】: This paper addresses the memorization of the MIMIC-CXR training dataset by generative text-to-image (T2I) diffusion models, which threatens patient privacy. The study finds that prompts containing traces of de-identification procedures are among the most memorized, with de-identification markers contributing the most. The key contribution is a set of actionable strategies to enhance privacy and improve the reliability of generative models in medical imaging, along with the finding that current inference-time mitigation measures are ineffective, underscoring the need to develop and benchmark new memorization-mitigation techniques.
链接: https://arxiv.org/abs/2502.07516
作者: Raman Dutt
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Generative models, particularly text-to-image (T2I) diffusion models, play a crucial role in medical image analysis. However, these models are prone to training data memorization, posing significant risks to patient privacy. Synthetic chest X-ray generation is one of the most common applications in medical image analysis, with the MIMIC-CXR dataset serving as the primary data repository for this task. This study adopts a data-driven approach and presents the first systematic attempt to identify prompts and text tokens in MIMIC-CXR that contribute the most to training data memorization. Our analysis reveals an unexpected finding: prompts containing traces of de-identification procedures are among the most memorized, with de-identification markers contributing the most. Furthermore, we also find existing inference-time memorization mitigation strategies are ineffective and fail to sufficiently reduce the model's reliance on memorized text tokens, highlighting a broader issue in T2I synthesis with MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy and improve the reliability of generative models in medical imaging. Finally, our results provide a foundation for future work on developing and benchmarking memorization mitigation techniques for synthetic chest X-ray generation using the MIMIC-CXR dataset.
zh
[CV-97] Quantitative evaluation of unsupervised clustering algorithms for dynamic total-body PET image analysis
【速读】: This paper evaluates and compares unsupervised clustering methods for classifying time activity curves (TACs) in dynamic total-body positron emission tomography (PET) images. The key is a systematic comparison of 15 unsupervised clustering algorithms, including K-means (alone or after PCA or ICA), Gaussian mixture models (GMM), fuzzy c-means (FCM), and several newer algorithms, on the task of classifying TACs from the brain, right heart ventricle, right kidney, lower right lung lobe, and urinary bladder. GMM, FCM, and ICA combined with mini-batch K-means perform best, with median accuracies of 89%, 83%, and 81%, respectively, at an average processing time of half a second or less per image.
链接: https://arxiv.org/abs/2502.07511
作者: Oona Rainio,Maria K. Jaakkola,Riku Klén
机构: 未知
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures
点击查看摘要
Abstract:Background. Recently, dynamic total-body positron emission tomography (PET) imaging has become possible due to new scanner devices. While clustering algorithms have been proposed for PET analysis already earlier, there is still little research systematically evaluating these algorithms for processing of dynamic total-body PET images. Materials and methods. Here, we compare the performance of 15 unsupervised clustering methods, including K-means either by itself or after principal component analysis (PCA) or independent component analysis (ICA), Gaussian mixture model (GMM), fuzzy c-means (FCM), agglomerative clustering, spectral clustering, and several newer clustering algorithms, for classifying time activity curves (TACs) in dynamic PET images. We use dynamic total-body ¹⁵O-water PET images collected from 30 patients with suspected or confirmed coronary artery disease. To evaluate the clustering algorithms in a quantitative way, we use them to classify 5000 TACs from each image based on whether the curve is taken from brain, right heart ventricle, right kidney, lower right lung lobe, or urinary bladder. Results. According to our results, the best methods are GMM, FCM, and ICA combined with mini-batch K-means, which classified the TACs with median accuracies of 89%, 83%, and 81%, respectively, in a processing time of half a second or less on average for each image. Conclusion. GMM, FCM, and ICA with mini-batch K-means show promise for dynamic total-body PET analysis.
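The best-performing pipelines are easy to reproduce in outline with scikit-learn. A sketch on random stand-in TACs (5000 curves per image, 24 time frames; component counts are illustrative choices, not the study's settings):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture

tacs = np.random.rand(5000, 24)      # stand-in time activity curves for one image

# Gaussian mixture model directly on the curves (top performer in the study).
gmm_labels = GaussianMixture(n_components=5, random_state=0).fit_predict(tacs)

# ICA followed by mini-batch K-means, the other leading combination reported.
reduced = FastICA(n_components=8, random_state=0).fit_transform(tacs)
km_labels = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=0).fit_predict(reduced)
```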
zh
[CV-98] Supervised contrastive learning for cell stage classification of animal embryos
【速读】: This paper addresses automatic classification of embryo cell stages from video microscopy, specifically identifying developmental stages (including cell divisions) in 2D time-lapse microscopy videos. The key solution is CLEmbryo, which combines supervised contrastive learning with a focal loss for training and adopts the lightweight 3D neural network CSN-50 as the encoder, effectively handling low-quality images, ambiguous boundaries between developmental stages, and imbalanced data distributions.
链接: https://arxiv.org/abs/2502.07360
作者: Yasmine Hachani(LACODAM),Patrick Bouthemy(SAIRPICO),Elisa Fromont(IRISA, LACODAM),Sylvie Ruffini(UVSQ, INRAE),Ludivine Laffont(UVSQ, INRAE),Alline de Paula Reis(BREED, ENVA)
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video microscopy, when combined with machine learning, offers a promising approach for studying the early development of in vitro produced (IVP) embryos. However, manually annotating developmental events, and more specifically cell divisions, is time-consuming for a biologist and cannot scale up for practical applications. We aim to automatically classify the cell stages of embryos from 2D time-lapse microscopy videos with a deep learning approach. We focus on the analysis of bovine embryonic development using video microscopy, as we are primarily interested in the application of cattle breeding, and we have created a Bovine Embryos Cell Stages (ECS) dataset. The challenges are three-fold: (1) low-quality images and bovine dark cells that make the identification of cell stages difficult, (2) class ambiguity at the boundaries of developmental stages, and (3) imbalanced data distribution. To address these challenges, we introduce CLEmbryo, a novel method that leverages supervised contrastive learning combined with focal loss for training, and the lightweight 3D neural network CSN-50 as an encoder. We also show that our method generalizes well. CLEmbryo outperforms state-of-the-art methods on both our Bovine ECS dataset and the publicly available NYU Mouse Embryos dataset.
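The focal-loss half of the training objective is standard and compact. A sketch is shown below; gamma=2 is the usual default rather than the paper's reported setting, and in CLEmbryo this term would be combined with the supervised contrastive loss.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t).
    Down-weights easy, well-classified examples, which helps with the
    imbalanced stage distribution described above."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

loss = focal_loss(torch.randn(32, 6), torch.randint(0, 6, (32,)))
```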
zh
[CV-99] Flat U-Net: An Efficient Ultralightweight Model for Solar Filament Segmentation in Full-disk Hα Images
链接: https://arxiv.org/abs/2502.07259
作者: GaoFei Zhu,GangHua Lin,Xiao Yang,Cheng Zeng
机构: School of Computer and Software Engineering, Xihua University (西华大学); National Astronomical Observatories, Chinese Academy of Sciences (中国科学院国家天文台); State Key Laboratory of Solar Activity and Space Weather, National Space Science Center, Chinese Academy of Sciences (中国科学院国家空间科学中心)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 5 figures, 3 tables, accepted for publication in ApJ
[CV-100] Color-Quality Invariance for Robust Medical Image Segmentation
链接: https://arxiv.org/abs/2502.07200
作者: Ravi Shah,Atsushi Fukuda,Quan Huu Cap
机构: AI Development Department, Aillis, Inc.(艾丽莉斯公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-101] Choroidal image analysis for OCT image sequences with applications in systemic health
【速读】: This paper targets the poor reproducibility, weak standardization, and limited clinical utility of manual or semi-automatic analysis of the choroid in optical coherence tomography (OCT) image sequences. The key lies in a series of new methods, each improving on its predecessor, including the deep learning-based region segmentation method DeepGPET and the fully automatic Choroidalyzer pipeline, which together improve execution time, reproducibility, and clinical relevance, advance the standardization of choroidal feature measurement, and highlight the choroid's potential as a biomarker of systemic health.
链接: https://arxiv.org/abs/2502.07117
作者: Jamie Burke
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Mathematical Software (cs.MS)
备注: PhD thesis toward a doctorate degree at the University of Edinburgh. PhD funded by the Medical Research Council (grant MR/N013166/1). Reviewed and examined by Dr. Roly Megaw (internal) and Prof. Pearse Keane (external) in December 2024 and ratified in the same month by the university. Official record found here: this https URL
点击查看摘要
Abstract:The choroid, a highly vascular layer behind the retina, is an extension of the central nervous system and has parallels with the renal cortex, with blood flow far exceeding that of the brain and kidney. Thus, there has been growing interest of choroidal blood flow reflecting physiological status of systemic disease. Optical coherence tomography (OCT) enables high-resolution imaging of the choroid, but conventional analysis methods remain manual or semi-automatic, limiting reproducibility, standardisation and clinical utility. In this thesis, I develop several new methods to analyse the choroid in OCT image sequences, with each successive method improving on its predecessors. I first develop two semi-automatic approaches for choroid region (Gaussian Process Edge Tracing, GPET) and vessel (Multi-scale Median Cut Quantisation, MMCQ) analysis, which improve on manual approaches but remain user-dependent. To address this, I introduce DeepGPET, a deep learning-based region segmentation method which improves on execution time, reproducibility, and end-user accessibility, but lacks choroid vessel analysis and automatic feature measurement. Improving on this, I developed Choroidalyzer, a deep learning-based pipeline to segment the choroidal space and vessels and generate fully automatic, clinically meaningful and reproducible choroidal features. I provide rigorous evaluation of these four approaches and consider their potential clinical value in three applications into systemic health: OCTANE, assessing choroidal changes in renal transplant recipients and donors; PREVENT, exploring choroidal associations with Alzheimer’s risk factors at mid-life; D-RISCii, assessing choroidal variation and feasibility of OCT in critical care. In short, this thesis contributes many open-source tools for standardised choroidal measurement and highlights the choroid’s potential as a biomarker in systemic health.
zh
[CV-102] A Framework for Supervised and Unsupervised Segmentation and Classification of Materials Microstructure Images
链接: https://arxiv.org/abs/2502.07107
作者: Kungang Zhang,Daniel W. Apley,Wei Chen,Wing K. Liu,L. Catherine Brinson
机构: Northwestern University (西北大学); Duke University (杜克大学)
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
[CV-103] On the use of neural networks for the structural characterization of polymeric porous materials
链接: https://arxiv.org/abs/2502.07076
作者: Jorge Torre,Suset Barroso-Solares,M.A. Rodríguez-Pérez,Javier Pinto
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
[CV-104] Conditional diffusion model with spatial attention and latent embedding for medical image segmentation MICCAI2024
链接: https://arxiv.org/abs/2502.06997
作者: Behzad Hejrati,Soumyanil Banerjee,Carri Glide-Hurst,Ming Dong
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures, 3 tables, Accepted in MICCAI 2024
[CV-105] Universal Vessel Segmentation for Multi-Modality Retinal Images
链接: https://arxiv.org/abs/2502.06987
作者: Bo Wen,Anna Heinke,Akshay Agnihotri,Dirk-Uwe Bartsch,William Freeman,Truong Nguyen,Cheolhong An
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-106] Generalizable automated ischaemic stroke lesion segmentation with vision transformers
【速读】: This paper addresses the challenges of automated segmentation of ischaemic stroke lesions in diffusion-weighted imaging (DWI), including susceptibility artefacts, morphological heterogeneity, age-related comorbidities, time-dependent signal dynamics, instrumental variability, and limited labelled data. The key is a high-performance DWI lesion segmentation tool built on optimized vision transformer-based architectures, the integration of 3,563 annotated lesions from multi-site data, and algorithmic enhancements. The paper also proposes a novel evaluation framework assessing model fidelity, equity (across demographics and lesion subtypes), anatomical precision, and robustness to instrumental variability, advancing clinical and research utility.
链接: https://arxiv.org/abs/2502.06939
作者: Chris Foulon,Robert Gray,James K. Ruffle,Jonathan Best,Tianbo Xu,Henry Watkins,Jane Rondina,Guilherme Pombo,Dominic Giles,Paul Wright,Marcela Ovando-Tellez,H. Rolf Jäger,Jorge Cardoso,Sebastien Ourselin,Geraint Rees,Parashkev Nachev
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 29 pages, 7 figures, 2 tables, 1 supplementary table, 2 supplementary figures
点击查看摘要
Abstract:Ischaemic stroke, a leading cause of death and disability, critically relies on neuroimaging for characterising the anatomical pattern of injury. Diffusion-weighted imaging (DWI) provides the highest expressivity in ischemic stroke but poses substantial challenges for automated lesion segmentation: susceptibility artefacts, morphological heterogeneity, age-related comorbidities, time-dependent signal dynamics, instrumental variability, and limited labelled data. Current U-Net-based models therefore underperform, a problem accentuated by inadequate evaluation metrics that focus on mean performance, neglecting anatomical, subpopulation, and acquisition-dependent variability. Here, we present a high-performance DWI lesion segmentation tool addressing these challenges through optimized vision transformer-based architectures, integration of 3563 annotated lesions from multi-site data, and algorithmic enhancements, achieving state-of-the-art results. We further propose a novel evaluative framework assessing model fidelity, equity (across demographics and lesion subtypes), anatomical precision, and robustness to instrumental variability, promoting clinical and research utility. This work advances stroke imaging by reconciling model expressivity with domain-specific challenges and redefining performance benchmarks to prioritize equity and generalizability, critical for personalized medicine and mechanistic research.
zh
[CV-107] A Comprehensive Review of U-Net and Its Variants: Advances and Applications in Medical Image Segmentation
链接: https://arxiv.org/abs/2502.06895
作者: Wang Jiangtao,Nur Intan Raihana Ruhaiyem,Fu Panpan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 26 figures, 7 tables
人工智能
[AI-0] Polynomial-Time Approximability of Constrained Reinforcement Learning
链接: https://arxiv.org/abs/2502.07764
作者: Jeremy McMahan
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial time (0,\epsilon)-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as P \neq NP. The generality of our approach results in answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
[AI-1] Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
链接: https://arxiv.org/abs/2502.07752
作者: Wenbo Gong,Meyer Scetbon,Chao Ma,Edward Meeds
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations of practical efficient optimizers for LLMs, involving the careful selection of structural assumptions to balance generality and efficiency, and enhancing memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves more than 2x faster convergence than Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.
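For a feel of what row-and-column scaling with O(m + n) optimizer state can look like, here is a hedged sketch of such an update on a 2D weight. This reflects a generic reading of the idea, not the paper's exact RACS algorithm; the accumulators, decay rate, and scaling rule are assumptions.

```python
import torch

def row_col_scaled_step(weight, grad, row_acc, col_acc, lr=1e-3, beta=0.99, eps=1e-8):
    """Illustrative SGD step with factored second-moment scaling: keeps one
    accumulator per row and per column instead of one per parameter."""
    row_acc.mul_(beta).add_(grad.pow(2).mean(dim=1), alpha=1 - beta)  # shape (m,)
    col_acc.mul_(beta).add_(grad.pow(2).mean(dim=0), alpha=1 - beta)  # shape (n,)
    # Rank-one reconstruction of a per-parameter scale from the two factors.
    scale = (row_acc.unsqueeze(1) * col_acc.unsqueeze(0)) / (row_acc.mean() + eps)
    weight.add_(-lr * grad / (scale.sqrt() + eps))

W = torch.randn(256, 128)
g = torch.randn_like(W)
r, c = torch.zeros(256), torch.zeros(128)
row_col_scaled_step(W, g, r, c)   # 256 + 128 floats of state vs. 256 * 128 for Adam
```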
[AI-2] PFedDST: Personalized Federated Learning with Decentralized Selection Training
链接: https://arxiv.org/abs/2502.07750
作者: Mengchen Fan,Keren Li,Tianyun Zhang,Qing Tian,Baocheng Geng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Distributed Learning (DL) enables the training of machine learning models across multiple devices, yet it faces challenges like non-IID data distributions and device capability disparities, which can impede training efficiency. Communication bottlenecks further complicate traditional Federated Learning (FL) setups. To mitigate these issues, we introduce the Personalized Federated Learning with Decentralized Selection Training (PFedDST) framework. PFedDST enhances model training by allowing devices to strategically evaluate and select peers based on a comprehensive communication score. This score integrates loss, task similarity, and selection frequency, ensuring optimal peer connections. This selection strategy is tailored to increase local personalization and promote beneficial peer collaborations to strengthen the stability and efficiency of the training process. Our experiments demonstrate that PFedDST not only enhances model accuracy but also accelerates convergence. This approach outperforms state-of-the-art methods in handling data heterogeneity, delivering both faster and more effective training in diverse and decentralized systems.
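The peer-selection step lends itself to a small sketch. The abstract names the three ingredients of the communication score (peer loss, task similarity, selection frequency) but not how they are combined, so the weights and functional form below are assumptions for illustration only.

```python
def communication_score(loss, task_sim, times_selected, w=(0.5, 0.4, 0.1)):
    """Hypothetical peer score over the three signals the abstract names:
    peer loss, task similarity, and past selection frequency. The weights
    w and the exact combination are assumptions."""
    novelty = 1.0 / (1.0 + times_selected)   # prefer rarely selected peers
    return -w[0] * loss + w[1] * task_sim + w[2] * novelty

# each device ranks its peers and keeps the top-k for the next round
peers = {"A": (0.8, 0.90, 3), "B": (0.5, 0.60, 0), "C": (1.2, 0.95, 1)}
ranked = sorted(peers, key=lambda p: communication_score(*peers[p]), reverse=True)
print(ranked)  # ['B', 'A', 'C'] under these made-up numbers
```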
[AI-3] Verifying LLM-Generated Code in the Context of Software Verification with Ada/SPARK
链接: https://arxiv.org/abs/2502.07728
作者: Marcos Cramer,Lucian McIntyre
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable code generation capabilities, but the correctness of the generated code cannot be inherently trusted. This paper explores the feasibility of using formal software verification, specifically the SPARK framework for Ada, to ensure the reliability of LLM-generated code. We present Marmaragan, a tool that leverages an LLM in order to generate SPARK annotations for existing programs, enabling formal verification of the code. The tool is benchmarked on a curated set of SPARK programs, with annotations selectively removed to test specific capabilities. The performance of Marmaragan with GPT-4o on the benchmark is promising, with correct annotations having been generated for 50.7% of the benchmark cases. The results establish a foundation for future work on combining the power of LLMs with the reliability of formal software verification.
[AI-4] TMLC-Net: Transferable Meta Label Correction for Noisy Label Learning
链接: https://arxiv.org/abs/2502.07721
作者: Mengyang Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The prevalence of noisy labels in real-world datasets poses a significant impediment to the effective deployment of deep learning models. While meta-learning strategies have emerged as a promising approach for addressing this challenge, existing methods often suffer from limited transferability and task-specific designs. This paper introduces TMLC-Net, a novel Transferable Meta-Learner for Correcting Noisy Labels, designed to overcome these limitations. TMLC-Net learns a general-purpose label correction strategy that can be readily applied across diverse datasets and model architectures without requiring extensive retraining or fine-tuning. Our approach integrates three core components: (1) Normalized Noise Perception, which captures and normalizes training dynamics to handle distribution shifts; (2) Time-Series Encoding, which models the temporal evolution of sample statistics using a recurrent neural network; and (3) Subclass Decoding, which predicts a corrected label distribution based on the learned representations. We conduct extensive experiments on benchmark datasets with various noise types and levels, demonstrating that TMLC-Net consistently outperforms state-of-the-art methods in terms of both accuracy and robustness to label noise. Furthermore, we analyze the transferability of TMLC-Net, showcasing its adaptability to new datasets and noise conditions, and establishing its potential as a broadly applicable solution for robust deep learning in noisy environments.
[AI-5] MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces
链接: https://arxiv.org/abs/2502.07709
作者: Loris Gaven,Thomas Carta,Clément Romac,Cédric Colas,Sylvain Lamprier,Olivier Sigaud,Pierre-Yves Oudeyer
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one’s own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and LP online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method allowing the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP predictions can effectively scale curriculum learning to open-ended goal spaces.
[AI-6] SoK: A Classification for AI-driven Personalized Privacy Assistants
链接: https://arxiv.org/abs/2502.07693
作者: Victor Morel,Leonardo Iwaya,Simone Fischer-Hübner
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Work in progress
点击查看摘要
Abstract:To help users make privacy-related decisions, personalized privacy assistants based on AI technology have been developed in recent years. These AI-driven Personalized Privacy Assistants (AI-driven PPAs) can reap significant benefits for users, who may otherwise struggle to make decisions regarding their personal data in environments saturated with privacy-related decision requests. However, no study systematically inquired about the features of these AI-driven PPAs, their underlying technologies, or the accuracy of their decisions. To fill this gap, we present a Systematization of Knowledge (SoK) to map the existing solutions found in the scientific literature. We screened 1697 unique research papers over the last decade (2013-2023), constructing a classification from 39 included papers. As a result, this SoK reviews several aspects of existing research on AI-driven PPAs in terms of types of publications, contributions, methodological quality, and other quantitative insights. Furthermore, we provide a comprehensive classification for AI-driven PPAs, delving into their architectural choices, system contexts, types of AI used, data sources, types of decisions, and control over decisions, among other facets. Based on our SoK, we further underline the research gaps and challenges and formulate recommendations for the design and development of AI-driven PPAs as well as avenues for future research.
[AI-7] A Unifying Framework for Causal Imitation Learning with Hidden Confounders
链接: https://arxiv.org/abs/2502.07656
作者: Daqian Shao,Thomas Kleine Buening,Marta Kwiatkowska
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We propose a general and unifying framework for causal Imitation Learning (IL) with hidden confounders that subsumes several existing confounded IL settings from the literature. Our framework accounts for two types of hidden confounders: (a) those observed by the expert, which thus influence the expert’s policy, and (b) confounding noise hidden to both the expert and the IL algorithm. For additional flexibility, we also introduce a confounding noise horizon and time-varying expert-observable hidden variables. We show that causal IL in our framework can be reduced to a set of Conditional Moment Restrictions (CMRs) by leveraging trajectory histories as instruments to learn a history-dependent policy. We propose DML-IL, a novel algorithm that uses instrumental variable regression to solve these CMRs and learn a policy. We provide a bound on the imitation gap for DML-IL, which recovers prior results as special cases. Empirical evaluations on a toy environment with continuous state-action spaces and on multiple Mujoco tasks demonstrate that DML-IL outperforms state-of-the-art causal IL algorithms.
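DML-IL's reduction rests on instrumental variable regression, whose core is easy to demonstrate. The toy two-stage least squares example below (not the paper's full algorithm, which handles conditional moment restrictions more generally) shows why an instrument removes hidden-confounder bias; all data and coefficients are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic confounded data: hidden u confounds action a and outcome y,
# while the instrument z (standing in for trajectory history) only moves a.
rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)                 # hidden confounder
z = rng.normal(size=n)                 # instrument
a = 1.5 * z + u + 0.1 * rng.normal(size=n)
y = 2.0 * a + 3.0 * u + 0.1 * rng.normal(size=n)

# Naive regression of y on a is biased by the hidden confounder u
naive = LinearRegression().fit(a[:, None], y).coef_[0]

# Two-stage least squares: regress a on z, then y on the prediction
a_hat = LinearRegression().fit(z[:, None], a).predict(z[:, None])
iv = LinearRegression().fit(a_hat[:, None], y).coef_[0]

print(f"true effect 2.0 | naive {naive:.2f} | IV {iv:.2f}")  # naive ~2.9, IV ~2.0
```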
[AI-8] SymGPT: Auditing Smart Contracts via Combining Symbolic Execution with Large Language Models
链接: https://arxiv.org/abs/2502.07644
作者: Shihao Xia,Mengting He,Shuai Shao,Tingting Yu,Yiying Zhang,Linhai Song
类目: Artificial Intelligence (cs.AI)
备注: 16 pages. arXiv admin note: text overlap with arXiv:2404.04306
点击查看摘要
Abstract:To govern smart contracts running on Ethereum, multiple Ethereum Request for Comment (ERC) standards have been developed, each having a set of rules to guide the behaviors of smart contracts. Violating the ERC rules could cause serious security issues and financial loss, signifying the importance of verifying smart contracts follow ERCs. Today’s practices of such verification are to manually audit each single contract, use expert-developed program-analysis tools, or use large language models (LLMs), all of which are far from effective in identifying ERC rule violations. This paper introduces SymGPT, a tool that combines the natural language understanding of large language models (LLMs) with the formal guarantees of symbolic execution to automatically verify smart contracts’ compliance with ERC rules. To develop SymGPT, we conduct an empirical study of 132 ERC rules from three widely used ERC standards, examining their content, security implications, and natural language descriptions. Based on this study, we design SymGPT by first instructing an LLM to translate ERC rules into a defined EBNF grammar. We then synthesize constraints from the formalized rules to represent scenarios where violations may occur and use symbolic execution to detect them. Our evaluation shows that SymGPT identifies 5,783 ERC rule violations in 4,000 real-world contracts, including 1,375 violations with clear attack paths for stealing financial assets, demonstrating its effectiveness. Furthermore, SymGPT outperforms six automated techniques and a security-expert auditing service, underscoring its superiority over current smart contract analysis methods.
[AI-9] Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
链接: https://arxiv.org/abs/2502.07640
作者: Yong Lin,Shange Tang,Bohan Lyu,Jiayun Wu,Hongzhou Lin,Kaiyu Yang,Jia Li,Mengzhou Xia,Danqi Chen,Sanjeev Arora,Chi Jin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce Goedel-Prover, an open-source large language model (LLM) that achieves the state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems. The key challenge in this field is the scarcity of formalized math statements and proofs, which we tackle in the following ways. We train statement formalizers to translate the natural language math problems from Numina into formal language (Lean 4), creating a dataset of 1.64 million formal statements. LLMs are used to check that the formal statements accurately preserve the content of the original natural language problems. We then iteratively build a large dataset of formal proofs by training a series of provers. Each prover succeeds in proving many statements that the previous ones could not, and these new proofs are added to the training set for the next prover. The final prover outperforms all existing open-source models in whole-proof generation. On the miniF2F benchmark, it achieves a 57.6% success rate (Pass@32), exceeding the previous best open-source model by 7.6%. On PutnamBench, Goedel-Prover successfully solves 7 problems (Pass@512), ranking first on the leaderboard. Furthermore, it generates 29.7K formal proofs for Lean Workbook problems, nearly doubling the 15.7K produced by earlier works.
[AI-10] Distributed Value Decomposition Networks with Networked Agents AAMAS2025
链接: https://arxiv.org/abs/2502.07635
作者: Guilherme S. Varela,Alberto Sardinha,Francisco S. Melo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 21 pages, 15 figures, to be published in Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), Detroit, Michigan, USA, May 19-23, 2025, IFAAMAS
点击查看摘要
Abstract:We investigate the problem of distributed training under partial observability, whereby cooperative multi-agent reinforcement learning agents (MARL) maximize the expected cumulative joint reward. We propose distributed value decomposition networks (DVDN) that generate a joint Q-function that factorizes into agent-wise Q-functions. Whereas the original value decomposition networks rely on centralized training, our approach is suitable for domains where centralized training is not possible and agents must learn by interacting with the physical environment in a decentralized manner while communicating with their peers. DVDN overcomes the need for centralized training by locally estimating the shared objective. We contribute two innovative algorithms, DVDN and DVDN (GT), for the heterogeneous and homogeneous agent settings, respectively. Empirically, both algorithms approximate the performance of value decomposition networks, in spite of the information loss during communication, as demonstrated in ten MARL tasks in three standard environments.
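The value-decomposition core is easy to sketch: the joint Q-function is the sum of agent-wise Q-functions. In DVDN the sum is never formed centrally (each agent estimates the shared objective locally via peer communication), so the explicit summation below is purely illustrative; network sizes and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class AgentQ(nn.Module):
    """Per-agent utility network over local observations."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )
    def forward(self, obs):
        return self.net(obs)

# Value decomposition: joint Q is the sum of agent-wise Qs. Here the sum
# is formed explicitly only for illustration; DVDN approximates it locally.
agents = [AgentQ(obs_dim=8, n_actions=4) for _ in range(3)]
obs = [torch.randn(1, 8) for _ in range(3)]
actions = [1, 0, 3]
q_joint = sum(a(o)[0, act] for a, o, act in zip(agents, obs, actions))
print(q_joint)
```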
[AI-11] DMWM: Dual-Mind World Model with Long-Term Imagination
链接: https://arxiv.org/abs/2502.07591
作者: Lingyi Wang,Rashed Shelim,Walid Saad,Naren Ramakrishnan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Imagination in world models is crucial for enabling agents to learn long-horizon policy in a sample-efficient manner. Existing recurrent state-space model (RSSM)-based world models depend on single-step statistical inference to capture the environment dynamics, and, hence, they are unable to perform long-term imagination tasks due to the accumulation of prediction errors. Inspired by the dual-process theory of human cognition, we propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM-based System 1 (RSSM-S1) component that handles state transitions in an intuitive manner and a logic-integrated neural network-based System 2 (LINN-S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter-system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long-term imagination over the state-of-the-art world models.
[AI-12] LoRP-TTS: Low-Rank Personalized Text-To-Speech
链接: https://arxiv.org/abs/2502.07562
作者: Łukasz Bondaruk,Jakub Kubiak
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to the development of zero-shot systems that generate realistic speech from a wide range of speakers using their voices as additional prompts. However, they still struggle with imitating non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that utilizing Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30pp while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which is crucial for all speech-related tasks.
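The adaptation mechanism here is standard LoRA, which is compact enough to sketch: freeze the pretrained weight and learn a low-rank additive update. The rank and scaling values below are typical defaults, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(2, 256)).shape)  # torch.Size([2, 256])
```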
[AI-13] HGTUL: A Hypergraph-based Model For Trajectory User Linking
链接: https://arxiv.org/abs/2502.07549
作者: Fengjie Chang,Xinning Zhu,Zheng Hu,Yang Qin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Trajectory User Linking (TUL), which links anonymous trajectories with users who generate them, plays a crucial role in modeling human mobility. Despite significant advancements in this field, existing studies primarily neglect the high-order inter-trajectory relationships, which represent complex associations among multiple trajectories, manifested through multi-location co-occurrence patterns emerging when trajectories intersect at various Points of Interest (POIs). Furthermore, they also overlook the variable influence of POIs on different trajectories, as well as the user class imbalance problem caused by disparities in user activity levels and check-in frequencies. To address these limitations, we propose a novel HyperGraph-based multi-perspective Trajectory User Linking model (HGTUL). Our model learns trajectory representations from both relational and spatio-temporal perspectives: (1) it captures high-order associations among trajectories by constructing a trajectory hypergraph and leverages a hypergraph attention network to learn the variable impact of POIs on trajectories; (2) it models the spatio-temporal characteristics of trajectories by incorporating their temporal and spatial information into a sequential encoder. Moreover, we design a data balancing method to effectively address the user class imbalance problem and experimentally validate its significance in TUL. Extensive experiments on three real-world datasets demonstrate that HGTUL outperforms state-of-the-art baselines, achieving improvements of 2.57%~20.09% and 5.68%~26.00% in ACC@1 and Macro-F1 metrics, respectively.
[AI-14] NatureLM: Deciphering the Language of Nature for Scientific Discovery
链接: https://arxiv.org/abs/2502.07527
作者: Yingce Xia,Peiran Jin,Shufang Xie,Liang He,Chuan Cao,Renqian Luo,Guoqing Liu,Yue Wang,Zequn Liu,Yuan-Jyue Chen,Zekun Guo,Yeqi Bai,Pan Deng,Yaosen Min,Ziheng Lu,Hongxia Hao,Han Yang,Jielan Li,Chang Liu,Jia Zhang,Jianwei Zhu,Kehan Wu,Wei Zhang,Kaiyuan Gao,Qizhi Pei,Qian Wang,Xixian Liu,Yanting Li,Houtian Zhu,Yeqing Lu,Mingqian Ma,Zun Wang,Tian Xie,Krzysztof Maziarz,Marwin Segler,Zhao Yang,Zilong Chen,Yu Shi,Shuxin Zheng,Lijun Wu,Chen Hu,Peggy Dai,Tie-Yan Liu,Haiguang Liu,Tao Qin
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 81 pages
点击查看摘要
Abstract:Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the “language of nature”, we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
[AI-15] Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization
链接: https://arxiv.org/abs/2502.07523
作者: Daniel Palenicek,Florian Vogt,Jan Peters
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ’s scaling behavior with higher UTD ratios. We identify challenges in the training dynamics, which are emphasized by higher UTD ratios. To address these, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, has been shown to prevent potential loss of plasticity and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks on the DeepMind Control Suite and Myosuite benchmarks, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a simple yet robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.
[AI-16] Harnessing Language's Fractal Geometry with Recursive Inference Scaling
链接: https://arxiv.org/abs/2502.07503
作者: Ibrahim Alabdulmohsin,Xiaohua Zhai
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 9 figures
点击查看摘要
Abstract:Recent research in language modeling reveals two scaling effects: the well-known improvement from increased training compute, and a lesser-known boost from applying more sophisticated or computationally intensive inference methods. Inspired by recent findings on the fractal geometry of language, we introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time. For a given fixed model architecture and training compute budget, RINS substantially improves language modeling performance. It also generalizes beyond pure language tasks, delivering gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16. Additionally, by deriving data scaling laws, we show that RINS improves both the asymptotic performance limits and the scaling exponents. These advantages are maintained even when compared to state-of-the-art recursive techniques like the “repeat-all-over” (RAO) strategy in Mobile LLM. Finally, stochastic RINS not only can enhance performance further but also provides the flexibility to optionally forgo increased inference computation at test time with minimal performance degradation.
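The abstract does not spell out the recursion, but the underlying idea — spend more inference compute by reapplying part of the network to its own output, with no new parameters — can be sketched as follows; the block/head split and the depths tried are assumptions for illustration.

```python
import torch
import torch.nn as nn

def recursive_inference(block: nn.Module, head: nn.Module, x, depth: int):
    """Reapply an early portion of the network to its own output before
    decoding. This is only a sketch of the recursive-scaling idea; RINS's
    exact recursion scheme is not given in the abstract."""
    h = x
    for _ in range(depth):
        h = block(h)
    return head(h)

block = nn.Sequential(nn.Linear(32, 32), nn.GELU())
head = nn.Linear(32, 10)
x = torch.randn(4, 32)
for depth in (1, 2, 4):  # more recursion = more inference compute, same parameters
    print(depth, recursive_inference(block, head, x, depth).shape)
```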
[AI-17] URECA: The Chain of Two Minimum Set Cover Problems exists behind Adaptation to Shifts in Semantic Code Search
链接: https://arxiv.org/abs/2502.07494
作者: Seok-Ung Choi,Joonghyuk Hahn,Yo-Sub Han
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Adaptation means making a model learn patterns that have shifted away from the training distribution. This adaptation is generally formulated as a minimum entropy problem. However, the minimum entropy problem has an inherent limitation: the shifted initialization cascade phenomenon. We extend the relationship between the minimum entropy problem and the minimum set cover problem via the Lebesgue integral. This extension reveals that the internal mechanism of the minimum entropy problem ignores the relationships between disentangled representations, which leads to the shifted initialization cascade. From this analysis, we introduce a new clustering algorithm, the Union-find based Recursive Clustering Algorithm (URECA). URECA is an efficient clustering algorithm that leverages the relationships between disentangled representations. Its update rule rests on a Thresholdly-Updatable Stationary Assumption on the dynamics, a relaxed version of the Stationary Assumption. This assumption lets URECA transport disentangled representations without error based on the relationships between them. URECA also employs a simulation trick to cluster disentangled representations efficiently. A wide range of evaluations shows that URECA achieves consistent performance gains for few-shot adaptation to diverse types of shifts, and advances state-of-the-art performance on CoSQA in the query-shift scenario.
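URECA's backbone is union-find clustering over relationships between disentangled representations. Below is a minimal sketch of that core using a fixed similarity threshold; URECA's actual recursive, threshold-updatable rule is more involved, and the similarity matrix and threshold here are made up.

```python
class UnionFind:
    """Standard union-find with path compression; the clustering backbone."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

def threshold_cluster(sims, threshold):
    """Merge items whose pairwise representation similarity clears a threshold."""
    n = len(sims)
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i][j] >= threshold:
                uf.union(i, j)
    return [uf.find(i) for i in range(n)]

sims = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
print(threshold_cluster(sims, 0.8))  # [1, 1, 2]: items 0 and 1 share a cluster
```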
[AI-18] WebChecker: A Versatile EVL Plugin for Validating HTML Pages with Bootstrap Frameworks
链接: https://arxiv.org/abs/2502.07479
作者: Milind Cherukuri
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:WebChecker is a plugin for the Epsilon Validation Language (EVL), designed to validate both static and dynamic HTML pages that use frameworks such as Bootstrap. By employing configurable EVL constraints, WebChecker enforces the implicit rules governing HTML and CSS frameworks. The effectiveness of the plugin is demonstrated through its application to Bootstrap, the widely adopted HTML, CSS, and JavaScript framework. WebChecker comes with a set of EVL constraints to assess Bootstrap-based web pages. To substantiate these claims, we present an illustrative example featuring two solutions that effectively enforce implicit rules.
[AI-19] Crime Forecasting: A Spatio-temporal Analysis with Deep Learning Models
链接: https://arxiv.org/abs/2502.07465
作者: Li Mao,Wei Du,Shuo Wen,Qi Li,Tong Zhang,Wei Zhong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures
点击查看摘要
Abstract:This study uses deep-learning models to predict city partition crime counts on specific days, helping police enhance surveillance, gather intelligence, and proactively prevent crimes. We formulate crime count prediction as a spatiotemporal sequence challenge, where both input data and prediction targets are spatiotemporal sequences. To improve the accuracy of crime forecasting, we introduce a new model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. We conducted a comparative analysis to assess the effects of various data sequences, including raw and binned data, on the prediction errors of four deep learning forecasting models. Directly inputting raw crime data into the forecasting model causes high prediction errors, making the model unsuitable for real-world use. The findings indicate that the proposed CNN-LSTM model achieves optimal performance when crime data is categorized into 10 or 5 groups. Data binning can enhance forecasting model performance, but poorly defined intervals may reduce map granularity. Compared to dividing into 5 bins, binning into 10 intervals strikes an optimal balance, preserving data characteristics and surpassing raw data in predictive modelling efficacy.
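The binning step the findings hinge on is simple to reproduce. Quantile edges are one reasonable choice; the paper does not say how its 10 or 5 intervals were defined, so this construction is an assumption.

```python
import numpy as np

counts = np.array([0, 1, 3, 7, 12, 25, 40, 80, 150, 400])  # raw daily cell counts

def bin_counts(x, n_bins):
    """Map raw counts to ordinal classes using quantile edges, so the
    forecasting model predicts a bin index instead of a raw count."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)

print(bin_counts(counts, 10))  # finer: preserves more ordering information
print(bin_counts(counts, 5))   # coarser: smoother targets, less map granularity
```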
[AI-20] JamendoMaxCaps: A Large Scale Music-caption Dataset with Imputed Metadata
链接: https://arxiv.org/abs/2502.07461
作者: Abhinaba Roy,Renhang Liu,Tongyu Lu,Dorien Herremans
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures
点击查看摘要
Abstract:We introduce JamendoMaxCaps, a large-scale music-caption dataset featuring over 200,000 freely licensed instrumental tracks from the renowned Jamendo platform. The dataset includes captions generated by a state-of-the-art captioning model, enhanced with imputed metadata. We also introduce a retrieval system that leverages both musical features and metadata to identify similar songs, which are then used to fill in missing metadata using a local large language model (LLLM). This approach allows us to provide a more comprehensive and informative dataset for researchers working on music-language understanding tasks. We validate this approach quantitatively with five different measurements. By making the JamendoMaxCaps dataset publicly available, we provide a high-quality resource to advance research in music-language understanding tasks such as music retrieval, multimodal representation learning, and generative music models.
[AI-21] Eliciting Rational Initial Weights in Gradual Argumentation
链接: https://arxiv.org/abs/2502.07452
作者: Nir Oren,Bruno Yun
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Many semantics for weighted argumentation frameworks assume that each argument is associated with an initial weight. However, eliciting these initial weights poses challenges: (1) accurately providing a specific numerical value is often difficult, and (2) individuals frequently confuse initial weights with acceptability degrees in the presence of other arguments. To address these issues, we propose an elicitation pipeline that allows one to specify acceptability degree intervals for each argument. By employing gradual semantics, we can refine these intervals when they are rational, restore rationality when they are not, and ultimately identify possible initial weights for each argument.
[AI-22] Approximating Human Strategic Reasoning with LLM-Enhanced Recursive Reasoners Leveraging Multi-agent Hypergames
链接: https://arxiv.org/abs/2502.07443
作者: Vince Trencsenyi,Agnieszka Mensfelt,Kostas Stathis
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
点击查看摘要
Abstract:LLM-driven multi-agent-based simulations have been gaining traction with applications in game-theoretic and social simulations. While most implementations seek to exploit or evaluate LLM-agentic reasoning, they often do so with a weak notion of agency and simplified architectures. We implement a role-based multi-agent strategic interaction framework tailored to sophisticated recursive reasoners, providing the means for systematic in-depth development and evaluation of strategic reasoning. Our game environment is governed by the umpire responsible for facilitating games, from matchmaking through move validation to environment management. Players incorporate state-of-the-art LLMs in their decision mechanism, relying on a formal hypergame-based model of hierarchical beliefs. We use one-shot, 2-player beauty contests to evaluate the recursive reasoning capabilities of the latest LLMs, providing a comparison to an established baseline model from economics and data from human experiments. Furthermore, we introduce the foundations of an alternative semantic measure of reasoning to the k-level theory. Our experiments show that artificial reasoners can outperform the baseline model in terms of both approximating human behaviour and reaching the optimal solution.
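The economics baseline is easy to make concrete. In a 2-player p-beauty contest, the winner is whoever is closer to p times the mean of both guesses, so the best response to an opponent guessing g solves x = p(x + g)/2, giving x = pg/(2 - p); level-k reasoning iterates this from a level-0 anchor (taken here as the midpoint 50, a common assumption).

```python
def best_response(opponent_guess, p=2/3):
    """2-player p-beauty contest: solving x = p*(x + g)/2 gives x = p*g/(2 - p)."""
    return p * opponent_guess / (2 - p)

def k_level_guess(k, p=2/3, level0=50.0):
    """Level-0 anchors at the midpoint; level-k best-responds to level-(k-1)."""
    guess = level0
    for _ in range(k):
        guess = best_response(guess, p)
    return guess

for k in range(5):
    print(k, round(k_level_guess(k), 2))
# 0 50.0 | 1 25.0 | 2 12.5 | 3 6.25 | 4 3.12 ... converging toward the equilibrium 0
```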
[AI-23] SensPS: Sensing Personal Space Comfortable Distance between Human-Human Using Multimodal Sensors
链接: https://arxiv.org/abs/2502.07441
作者: Ko Watanabe,Nico Förster,Shoya Ishimaru
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Personal space, also known as peripersonal space, is crucial in human social interaction, influencing comfort, communication, and social stress. Estimating and respecting personal space is essential for enhancing human-computer interaction (HCI) and smart environments. Personal space preferences vary due to individual traits, cultural background, and contextual factors. Advanced multimodal sensing technologies, including eye-tracking and wristband sensors, offer opportunities to develop adaptive systems that dynamically adjust to user comfort levels. Integrating physiological and behavioral data enables a deeper understanding of spatial interactions. This study develops a sensor-based model to estimate comfortable personal space and identifies key features influencing spatial preferences. Our findings show that multimodal sensors, particularly eye-tracking and physiological wristband data, can effectively predict personal space preferences, with eye-tracking data playing a more significant role. An experimental study involving controlled human interactions demonstrates that a Transformer-based model achieves the highest predictive accuracy (F1 score: 0.87) for estimating personal space. Eye-tracking features, such as gaze point and pupil diameter, emerge as the most significant predictors, while physiological signals from wristband sensors contribute marginally. These results highlight the potential for AI-driven personalization of social space in adaptive environments, suggesting that multimodal sensing can be leveraged to develop intelligent systems that optimize spatial arrangements in workplaces, educational institutions, and public settings. Future work should explore larger datasets, real-world applications, and additional physiological markers to enhance model robustness.
[AI-24] Towards a Formal Theory of the Need for Competence via Computational Intrinsic Motivation
链接: https://arxiv.org/abs/2502.07423
作者: Erik M. Lintunen,Nadia M. Ady,Sebastian Deterding,Christian Guckelsberger
类目: Artificial Intelligence (cs.AI)
备注: 6 pages excluding references
点击查看摘要
Abstract:Computational models offer powerful tools for formalising psychological theories, making them both testable and applicable in digital contexts. However, they remain little used in the study of motivation within psychology. We focus on the “need for competence”, postulated as a key basic human need within Self-Determination Theory (SDT) – arguably the most influential psychological framework for studying intrinsic motivation (IM). The need for competence is treated as a single construct across SDT texts. Yet, recent research has identified multiple, ambiguously defined facets of competence in SDT. We propose that these inconsistencies may be alleviated by drawing on computational models from the field of artificial intelligence, specifically from the domain of reinforcement learning (RL). By aligning the aforementioned facets of competence – effectance, skill use, task performance, and capacity growth – with existing RL formalisms, we provide a foundation for advancing competence-related theory in SDT and motivational psychology more broadly. The formalisms reveal underlying preconditions that SDT fails to make explicit, demonstrating how computational models can improve our understanding of IM. Additionally, our work can support a cycle of theory development by inspiring new computational models formalising aspects of the theory, which can then be tested empirically to refine the theory. While our research lays a promising foundation, empirical studies of these models in both humans and machines are needed, inviting collaboration across disciplines.
[AI-25] Enhancing Higher Education with Generative AI: A Multimodal Approach for Personalised Learning
链接: https://arxiv.org/abs/2502.07401
作者: Johnny Chan,Yuming Li
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, accepted and presented in the 2025 6th International Conference on Advances in Education and Information Technology (AEIT)
点击查看摘要
Abstract:This research explores the opportunities of Generative AI (GenAI) in the realm of higher education through the design and development of a multimodal chatbot for an undergraduate course. Leveraging the ChatGPT API for nuanced text-based interactions and Google Bard for advanced image analysis and diagram-to-code conversions, we showcase the potential of GenAI in addressing a broad spectrum of educational queries. Additionally, the chatbot presents a file-based analyser designed for educators, offering deep insights into student feedback via sentiment and emotion analysis, and summarising course evaluations with key metrics. These combinations highlight the crucial role of multimodal conversational AI in enhancing teaching and learning processes, promising significant advancements in educational adaptability, engagement, and feedback analysis. By demonstrating a practical web application, this research underlines the imperative for integrating GenAI technologies to foster more dynamic and responsive educational environments, ultimately contributing to improved educational outcomes and pedagogical strategies.
[AI-26] On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o
链接: https://arxiv.org/abs/2502.07399
作者: Rundong Liu,Andre Frade,Amal Vaidya,Maxime Labonne,Marcus Kaiser,Bismayan Chakrabarti,Jonathan Budd,Sean Moran
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator’s feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework’s evaluations were validated against a set of proxy metrics comprising Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: this https URL.
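The Evaluator-Optimizer loop can be sketched in a few lines. The stubs below stand in for the LLM calls; their signatures and the stopping threshold are assumptions, not CodeQUEST's actual interface.

```python
def code_quest(code, evaluate, optimize, max_iters=5, target=9.0):
    """Evaluator-Optimizer loop: score the code across quality dimensions,
    stop when the mean score clears a target, otherwise rewrite from the
    evaluator's feedback. `evaluate` and `optimize` are hypothetical
    stand-ins for the framework's LLM calls."""
    for _ in range(max_iters):
        scores, feedback = evaluate(code)          # per-dimension scores + summary
        if sum(scores.values()) / len(scores) >= target:
            break
        code = optimize(code, feedback)            # rewrite guided by feedback
    return code

# stub evaluator/optimizer so the loop runs end to end
evaluate = lambda c: ({"readability": 6 + len(c) % 3, "security": 8}, "add docstrings")
optimize = lambda c, fb: c + "  # " + fb
print(code_quest("def f(x): return x*2", evaluate, optimize))
```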
[AI-27] LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters!
链接: https://arxiv.org/abs/2502.07374
作者: Dacheng Li,Shiyi Cao,Tyler Griggs,Shu Liu,Xiangxi Mo,Shishir G. Patil,Matei Zaharia,Joseph E. Gonzalez,Ion Stoica
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a Large Language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model’s scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers still achieves only 3.2% lower accuracy compared to training with fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper accompanying our previously released Sky-T1-32B-Preview model. Codes are available at this https URL.
[AI-28] KABB: Knowledge-Aware Bayesian Bandits for Dynamic Expert Coordination in Multi-Agent Systems
链接: https://arxiv.org/abs/2502.07350
作者: Jusheng Zhang,Zimeng Huang,Yijia Fan,Ningyuan Liu,Mingyan Li,Zhuojie Yang,Jiawei Yao,Jian Wang,Keze Wang
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As scaling large language models faces prohibitive costs, multi-agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduce Knowledge-Aware Bayesian Bandits (KABB), a novel framework that enhances multi-agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a three-dimensional knowledge distance model for deep semantic understanding, a dual-adaptation mechanism for continuous expert optimization, and a knowledge-aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates KABB achieves an optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low in multi-agent coordination.
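The selection backbone is Thompson sampling, sketched below with Beta-Bernoulli posteriors. Using a scalar knowledge distance to shape the priors is our illustrative stand-in for KABB's three-dimensional knowledge distance model; the distances, prior strengths, and true success rates are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts = 4
# Knowledge distance (smaller = better match to the query) shapes the priors:
# closer experts start with more optimistic Beta posteriors.
distance = np.array([0.1, 0.4, 0.7, 0.9])
alpha = 1 + 5 * (1 - distance)
beta = np.ones(n_experts)

def pull(expert):                    # unknown true success rates
    return rng.random() < [0.7, 0.6, 0.4, 0.2][expert]

for _ in range(200):                 # Thompson sampling: sample, pick argmax, update
    samples = rng.beta(alpha, beta)
    k = int(np.argmax(samples))
    r = pull(k)
    alpha[k] += r
    beta[k] += 1 - r

print(alpha / (alpha + beta))        # posterior means concentrate on good experts
```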
[AI-29] Coarse Set Theory: A Mathematical Foundation for Coarse Ethics
链接: https://arxiv.org/abs/2502.07347
作者: Takashi Izumo
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Logic (math.LO); Probability (math.PR)
备注: 31 pages, 2 figures
点击查看摘要
Abstract:In ethical decision-making, individuals are often evaluated based on generalized assessments rather than precise individual performance. This concept, known as Coarse Ethics (CE), has primarily been discussed in natural language without a formal mathematical foundation. This paper introduces Coarse Set Theory (CST) to establish a mathematical framework for CE. We define coarse sets using totally ordered sets and propose axioms that characterize the hierarchical relationships between elements and their groupings. Additionally, we introduce coarse-grained sets, which partition an underlying set into equivalence classes based on predefined criteria. We extend this framework by defining coarse mappings, which transform detailed individual data into coarser representations while maintaining essential structural properties. To measure the information loss, we employ Kullback-Leibler (KL) divergence, demonstrating how different coarse partitions affect the preservation of information. We illustrate how CST can be applied to real-world grading systems through theoretical formulations and empirical analysis. This study provides a rigorous foundation for CE, enabling a more systematic exploration of fairness, interpretability, and decision-making trade-offs.
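One concrete way to realize the paper's KL-based information-loss measure: compare the fine-grained distribution with the distribution obtained by spreading each class's total mass uniformly over its members. The exact construction in the paper may differ; this is a sketch of the general idea.

```python
import numpy as np

def coarse_kl(p, classes):
    """KL divergence D(p || q), where q redistributes each class's total
    probability mass uniformly over the class members: a natural measure
    of the information a coarse partition discards."""
    p = np.asarray(p, dtype=float)
    q = np.empty_like(p)
    for cls in classes:
        q[cls] = p[cls].sum() / len(cls)
    return float(np.sum(p * np.log(p / q)))

scores = np.array([0.05, 0.10, 0.15, 0.20, 0.50])   # fine-grained distribution
print(coarse_kl(scores, [[0, 1], [2, 3], [4]]))     # finer partition: ~0.012
print(coarse_kl(scores, [[0, 1, 2, 3], [4]]))       # coarser partition: ~0.053
```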
[AI-30] Integrating Physics and Data-Driven Approaches: An Explainable and Uncertainty-Aware Hybrid Model for Wind Turbine Power Prediction
链接: https://arxiv.org/abs/2502.07344
作者: Alfonso Gijón,Simone Eiraudo,Antonio Manjavacas,Daniele Salvatore Schiera,Miguel Molina-Solana,Juan Gómez-Romero
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:The rapid growth of the wind energy sector underscores the urgent need to optimize turbine operations and ensure effective maintenance through early fault detection systems. While traditional empirical and physics-based models offer approximate predictions of power generation based on wind speed, they often fail to capture the complex, non-linear relationships between other input variables and the resulting power output. Data-driven machine learning methods present a promising avenue for improving wind turbine modeling by leveraging large datasets, enhancing prediction accuracy but often at the cost of interpretability. In this study, we propose a hybrid semi-parametric model that combines the strengths of both approaches, applied to a dataset from a wind farm with four turbines. The model integrates a physics-inspired submodel, providing a reasonable approximation of power generation, with a non-parametric submodel that predicts the residuals. This non-parametric submodel is trained on a broader range of variables to account for phenomena not captured by the physics-based component. The hybrid model achieves a 37% improvement in prediction accuracy over the physics-based model. To enhance interpretability, SHAP values are used to analyze the influence of input features on the residual submodel’s output. Additionally, prediction uncertainties are quantified using a conformalized quantile regression method. The combination of these techniques, alongside the physics grounding of the parametric submodel, provides a flexible, accurate, and reliable framework. Ultimately, this study opens the door for evaluating the impact of unmodeled variables on wind turbine power generation, offering a basis for potential optimization.
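The hybrid structure — a physics submodel plus a data-driven model fit on its residuals, trained on a broader feature set than the physics component uses — is easy to sketch on synthetic data. The cubic power curve, the extra feature, and the choice of gradient boosting are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
wind = rng.uniform(2, 18, n)
temp = rng.uniform(-5, 30, n)          # a variable the physics model ignores

def physics_power(v, rated=2000.0, cut_in=3.0, rated_speed=12.0):
    """Toy cubic power curve standing in for the physics-inspired submodel."""
    return rated * np.clip((v - cut_in) / (rated_speed - cut_in), 0, 1) ** 3

true = physics_power(wind) * (1 + 0.004 * temp) + rng.normal(0, 20, n)

# Hybrid model: physics prediction + ML model trained on the residuals
base = physics_power(wind)
resid_model = GradientBoostingRegressor().fit(np.c_[wind, temp], true - base)
hybrid = base + resid_model.predict(np.c_[wind, temp])

print("physics MAE:", np.abs(true - base).mean())
print("hybrid  MAE:", np.abs(true - hybrid).mean())  # residual model closes the gap
```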
[AI-31] OpenGrok: Enhancing SNS Data Processing with Distilled Knowledge and Mask-like Mechanisms
链接: https://arxiv.org/abs/2502.07312
作者: Lumen AI,Zaozhuang No.28 Middle School,Shihao Ji,Zihui Song,Fucheng Zhong,Jisen Jia,Zhaobo Wu,Zheyi Cao,Tianhao Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages
点击查看摘要
Abstract:This report details Lumen Labs’ novel approach to processing Social Networking Service (SNS) data. We leverage knowledge distillation, specifically a simple distillation method inspired by DeepSeek-R1’s CoT acquisition, combined with prompt hacking, to extract valuable training data from the Grok model. This data is then used to fine-tune a Phi-3-mini model, augmented with a mask-like mechanism specifically designed for handling the nuances of SNS data. Our method demonstrates state-of-the-art (SOTA) performance on several SNS data processing tasks, outperforming existing models like Grok, Phi-3, and GPT-4. We provide a comprehensive analysis of our approach, including mathematical formulations, engineering details, ablation studies, and comparative evaluations.
[AI-32] MIGT: Memory Instance Gated Transformer Framework for Financial Portfolio Management
链接: https://arxiv.org/abs/2502.07280
作者: Fengchen Gu,Angelos Stefanidis,Ángel García-Fernández,Jionglong Su,Huakang Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deep reinforcement learning (DRL) has been applied in financial portfolio management to improve returns in changing market conditions. However, unlike most fields where DRL is widely used, the stock market is more volatile and dynamic as it is affected by several factors such as global events and investor sentiment. Therefore, it remains a challenge to construct a DRL-based portfolio management framework with strong return capability, stable training, and generalization ability. This study introduces a new framework utilizing the Memory Instance Gated Transformer (MIGT) for effective portfolio management. By incorporating a novel Gated Instance Attention module, which combines a transformer variant, instance normalization, and a Lite Gate Unit, our approach aims to maximize investment returns while ensuring the learning process’s stability and reducing outlier impacts. Tested on the Dow Jones Industrial Average 30, our framework’s performance is evaluated against fifteen other strategies using key financial metrics like the cumulative return and risk-return ratios (Sharpe, Sortino, and Omega ratios). The results highlight MIGT’s advantage, showcasing at least a 9.75% improvement in cumulative returns and a minimum 2.36% increase in risk-return ratios over competing strategies, marking a significant advancement in DRL for portfolio management.
[AI-33] Exploratory Diffusion Policy for Unsupervised Reinforcement Learning
链接: https://arxiv.org/abs/2502.07279
作者: Chengyang Ying,Huayu Chen,Xinning Zhou,Zhongkai Hao,Hang Su,Jun Zhu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Unsupervised reinforcement learning (RL) aims to pre-train agents by exploring states or skills in reward-free environments, facilitating the adaptation to downstream tasks. However, existing methods often overlook the fitting ability of pre-trained policies and struggle to handle the heterogeneous pre-training data, which are crucial for achieving efficient exploration and fast fine-tuning. To address this gap, we propose Exploratory Diffusion Policy (EDP), which leverages the strong expressive ability of diffusion models to fit the explored data, both boosting exploration and obtaining an efficient initialization for downstream tasks. Specifically, we estimate the distribution of collected data in the replay buffer with the diffusion policy and propose a score intrinsic reward, encouraging the agent to explore unseen states. For fine-tuning the pre-trained diffusion policy on downstream tasks, we provide both theoretical analyses and practical algorithms, including an alternating method of Q function optimization and diffusion policy distillation. Extensive experiments demonstrate the effectiveness of EDP in efficient exploration during pre-training and fast adaptation during fine-tuning.
[AI-34] Cost-Efficient Continual Learning with Sufficient Exemplar Memory
链接: https://arxiv.org/abs/2502.07274
作者: Dongkyu Cho,Taesup Moon,Rumi Chunara,Kyunghyun Cho,Sungmin Cha
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Continual learning (CL) research typically assumes highly constrained exemplar memory resources. However, in many real-world scenarios-especially in the era of large foundation models-memory is abundant, while GPU computational costs are the primary bottleneck. In this work, we investigate CL in a novel setting where exemplar memory is ample (i.e., sufficient exemplar memory). Unlike prior methods designed for strict exemplar memory constraints, we propose a simple yet effective approach that directly operates in the model’s weight space through a combination of weight resetting and averaging techniques. Our method achieves state-of-the-art performance while reducing the computational cost to a quarter or third of existing methods. These findings challenge conventional CL assumptions and provide a practical baseline for computationally efficient CL applications.
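A minimal sketch of the weight-space operation, assuming the averaging blends current weights with a snapshot taken after the previous task; the mixing coefficient and snapshot schedule are assumptions, and the paper combines this with weight resetting.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def average_weights(model: nn.Module, snapshot: dict, keep: float = 0.5):
    """Pull current weights back toward a stored snapshot after each task:
    w <- keep * w_snapshot + (1 - keep) * w_current."""
    for name, p in model.named_parameters():
        p.mul_(1 - keep).add_(snapshot[name], alpha=keep)

model = nn.Linear(4, 2)
snapshot = {k: v.clone() for k, v in model.named_parameters()}  # after task t
# ... train on task t+1 here, then blend back toward the snapshot:
average_weights(model, snapshot, keep=0.5)
```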
[AI-35] Variational Learning Induces Adaptive Label Smoothing
链接: https://arxiv.org/abs/2502.07273
作者: Sin-Han Yang,Zhedong Liu,Gian Maria Marconi,Mohammad Emtiyaz Khan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We show that variational learning naturally induces an adaptive label smoothing where label noise is specialized for each example. Such label-smoothing is useful to handle examples with labeling errors and distribution shifts, but designing a good adaptivity strategy is not always easy. We propose to skip this step and simply use the natural adaptivity induced during the optimization of a variational objective. We show empirical results where a variational algorithm called IVON outperforms traditional label smoothing and yields adaptivity strategies similar to those of an existing approach. By connecting Bayesian methods to label smoothing, our work provides a new way to handle overconfident predictions.
[AI-36] Fairness in Multi-Agent AI: A Unified Framework for Ethical and Equitable Autonomous Systems
链接: https://arxiv.org/abs/2502.07254
作者: Rajesh Ranjan,Shailja Gupta,Surya Narayan Singh
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Ensuring fairness in decentralized multi-agent systems presents significant challenges due to emergent biases, systemic inefficiencies, and conflicting agent incentives. This paper provides a comprehensive survey of fairness in multi-agent AI, introducing a novel framework where fairness is treated as a dynamic, emergent property of agent interactions. The framework integrates fairness constraints, bias mitigation strategies, and incentive mechanisms to align autonomous agent behaviors with societal values while balancing efficiency and robustness. Through empirical validation, we demonstrate that incorporating fairness constraints results in more equitable decision-making. This work bridges the gap between AI ethics and system design, offering a foundation for accountable, transparent, and socially responsible multi-agent AI systems.
[AI-37] NARCE: A Mamba-Based Neural Algorithmic Reasoner Framework for Online Complex Event Detection
链接: https://arxiv.org/abs/2502.07250
作者: Liying Han,Gaofeng Dong,Xiaomin Ouyang,Lance Kaplan,Federico Cerutti,Mani Srivastava
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Current machine learning models excel in short-span perception tasks but struggle to derive high-level insights from long-term observation, a capability central to understanding complex events (CEs). CEs, defined as sequences of short-term atomic events (AEs) governed by spatiotemporal rules, are challenging to detect online due to the need to extract meaningful patterns from long and noisy sensor data while ignoring irrelevant events. We hypothesize that state-based methods are well-suited for CE detection, as they capture event progression through state transitions without requiring long-term memory. Baseline experiments validate this, demonstrating that the state-space model Mamba outperforms existing architectures. However, Mamba’s reliance on extensive labeled data, which are difficult to obtain, motivates our second hypothesis: decoupling CE rule learning from noisy sensor data can reduce data requirements. To address this, we propose NARCE, a framework that leverages Neural Algorithmic Reasoning (NAR) to split the task into two components: (i) learning CE rules independently of sensor data using synthetic concept traces generated by LLMs and (ii) mapping sensor inputs to these rules via an adapter. Our results show that NARCE outperforms baselines in accuracy, generalization to unseen and longer sensor data, and data efficiency, significantly reducing annotation costs while advancing robust CE detection.
[AI-38] Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting
链接: https://arxiv.org/abs/2502.07244
作者: Jiecheng Lu,Shihao Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency, comparing to SOTA TSF models.
[AI-39] Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement ICLR2025
链接: https://arxiv.org/abs/2502.07243
作者: Xueyao Zhang,Xiaohui Zhang,Kainan Peng,Zhenyu Tang,Vimal Manohar,Yingru Liu,Jeff Hwang,Dangna Li,Yuhao Wang,Julian Chan,Yuan Huang,Zhizheng Wu,Mingbo Ma
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2025
点击查看摘要
Abstract:The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre and style, leading to challenges in achieving controllable generation, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or speech’s content tokens as input, we utilize an autoregressive transformer to generate the content-style tokens, which is prompted by a style reference; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer to produce acoustic representations, which is prompted by a timbre reference. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck, and adjust it carefully to obtain the disentangled speech representations. Solely self-supervised trained on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo’s effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at this https URL.
[AI-40] LUNAR: LLM Unlearning via Neural Activation Redirection
链接: https://arxiv.org/abs/2502.07218
作者: William F. Shen,Xinchi Qiu,Meghdad Kurmanji,Alex Iacob,Lorenzo Sani,Yihong Chen,Nicola Cancedda,Nicholas D. Lane
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but as a result, they increasingly incur the risk of leaking private information. The ability to selectively remove knowledge from LLMs is, therefore, a highly desirable capability. In this paper, we propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis. LUNAR operates by redirecting the representations of unlearned data to regions that trigger the model’s inherent ability to express its inability to answer. LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference. Specifically, LUNAR achieves between 2.9x to 11.7x improvements on combined “unlearning efficacy” and “model utility” score (“Deviation Score”) on the PISTOL dataset across various base models. We also demonstrate, through quantitative analysis and qualitative examples, LUNAR’s superior controllability in generating coherent and contextually aware responses, mitigating undesired side effects of existing methods. Moreover, we demonstrate that LUNAR is robust against white-box adversarial attacks and versatile in handling real-world scenarios, such as processing sequential unlearning requests.
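To make the redirection idea concrete, here is a minimal PyTorch sketch of steering hidden states toward a precomputed "refusal" region with a forward hook. This is not the LUNAR implementation; `refusal_center`, `alpha`, and the convex-interpolation rule are illustrative assumptions.

```python
# Conceptual sketch of activation redirection (not the LUNAR release):
# a forward hook steers hidden states of "unlearn" inputs toward a
# region associated with expressing inability to answer.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
layer = nn.Linear(d_model, d_model)       # stand-in for one transformer block

refusal_center = torch.randn(d_model)     # hypothetical "cannot answer" region
alpha = 0.8                               # redirection strength (assumption)
redirect_mask = {"active": False}         # toggled for unlearn-set inputs

def redirection_hook(module, inputs, output):
    if redirect_mask["active"]:
        # Move activations toward the refusal region (convex interpolation).
        return (1 - alpha) * output + alpha * refusal_center
    return output

layer.register_forward_hook(redirection_hook)

h = torch.randn(2, d_model)
redirect_mask["active"] = True            # pretend these prompts are "forgotten"
print(layer(h).shape)                     # redirected hidden states, (2, 16)
```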
[AI-41] Pareto Optimal Algorithmic Recourse in Multi-cost Function
链接: https://arxiv.org/abs/2502.07214
作者: Wen-Ling Chen,Hong-Chang Huang,Kai-Hung Lin,Shang-Wei Hwang,Hao-Tsung Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
*备注:
点击查看摘要
Abstract:In decision-making systems, algorithmic recourse aims to identify minimal-cost actions to alter an individual's features, thereby obtaining a desired outcome. This empowers individuals to understand, question, or alter decisions that negatively affect them. However, due to the variety and sensitivity of system environments and individual personalities, quantifying cost with a single function is nearly impossible when multiple criteria must be considered. Most current recourse mechanisms use gradient-based methods that assume cost functions are differentiable, which is often not applicable in real-world scenarios, resulting in sub-optimal solutions that compromise various criteria. These solutions are typically intractable and lack rigorous theoretical foundations, raising concerns regarding interpretability, reliability, and transparency from the explainable AI (XAI) perspective. To address these issues, this work proposes an algorithmic recourse framework that handles non-differentiable and discrete multi-cost functions. By formulating recourse as a multi-objective optimization problem and assigning weights to different criteria based on their importance, our method identifies Pareto optimal recourse recommendations. To demonstrate scalability, we incorporate the concept of epsilon-net, proving the ability to find approximated Pareto optimal actions. Experiments show the trade-off between different criteria and the method's scalability in large graphs. Compared to current heuristic practices, our approach provides a stronger theoretical foundation and better aligns recourse suggestions with real-world requirements.
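A worked toy example of the multi-objective core: given candidate recourse actions scored under several (possibly discrete) cost criteria, filter down to the Pareto optimal set. This dominance filter is a generic sketch, not the paper's algorithm, and the cost numbers are invented.

```python
# Illustrative sketch, not the paper's method: keep only the Pareto optimal
# candidate recourse actions under multiple cost criteria (lower is better).
import numpy as np

# rows = candidate actions, columns = cost criteria
costs = np.array([
    [2.0, 5.0, 1.0],
    [3.0, 5.5, 2.0],   # dominated by the first action on every criterion
    [1.0, 6.0, 1.5],
    [2.5, 4.5, 0.5],
])

def pareto_front(c: np.ndarray) -> np.ndarray:
    keep = np.ones(len(c), dtype=bool)
    for i in range(len(c)):
        if not keep[i]:
            continue
        # j dominates i if j is <= on every criterion and < on at least one
        dominated = np.all(c <= c[i], axis=1) & np.any(c < c[i], axis=1)
        if dominated.any():
            keep[i] = False
    return np.flatnonzero(keep)

print(pareto_front(costs))  # [0 2 3]: action 1 is dominated and dropped
```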
[AI-42] Evaluation for Regression Analyses on Evolving Data Streams
链接: https://arxiv.org/abs/2502.07213
作者: Yibin Sun,Heitor Murilo Gomes,Bernhard Pfahringer,Albert Bifet
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 Pages, 9 figures
点击查看摘要
Abstract:The paper explores the challenges of regression analysis in evolving data streams, an area that remains relatively underexplored compared to classification. We propose a standardized evaluation process for regression and prediction interval tasks in streaming contexts. Additionally, we introduce an innovative drift simulation strategy capable of synthesizing various drift types, including the less-studied incremental drift. Comprehensive experiments with state-of-the-art methods, conducted under the proposed process, validate the effectiveness and robustness of our approach.
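A minimal sketch of what simulating incremental drift can look like for a regression stream: two linear concepts blended over a window, so the target function changes gradually rather than abruptly. The paper's simulator is more general; the concepts, window, and noise level below are assumptions.

```python
# Hedged sketch of incremental concept drift in a synthetic regression
# stream: the generating weights glide from w_old to w_new over a window.
import numpy as np

rng = np.random.default_rng(42)
n, d = 2000, 3
X = rng.normal(size=(n, d))

w_old = np.array([1.0, -2.0, 0.5])    # concept before the drift
w_new = np.array([-1.5, 0.5, 2.0])    # concept after the drift
start, width = 800, 600               # incremental drift window (assumed)

# Mixing weight rises linearly from 0 to 1 across the drift window.
t = np.arange(n)
mix = np.clip((t - start) / width, 0.0, 1.0)

w_t = (1 - mix)[:, None] * w_old + mix[:, None] * w_new
y = np.einsum("ij,ij->i", X, w_t) + rng.normal(scale=0.1, size=n)

print(y[:5])  # targets drawn from a concept that drifts incrementally
```

Abrupt drift falls out of the same template by setting `width` close to zero, which is one way such simulators parameterize different drift types.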
[AI-43] A Study on the Importance of Features in Detecting Advanced Persistent Threats Using Machine Learning
链接: https://arxiv.org/abs/2502.07207
作者: Ehsan Hallaji,Roozbeh Razavi-Far,Mehrdad Saif
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for publication in the 2024 International Conference on Computational Science and Computational Intelligence (CSCI’24)
点击查看摘要
Abstract:Advanced Persistent Threats (APTs) pose a significant security risk to organizations and industries. These attacks often lead to severe data breaches and compromise the system for a long time. Mitigating these sophisticated attacks is highly challenging due to the stealthy and persistent nature of APTs. Machine learning models are often employed to tackle this challenge by bringing automation and scalability to APT detection. Nevertheless, these intelligent methods are data-driven, and thus, highly affected by the quality and relevance of input data. This paper aims to analyze measurements considered when recording network traffic and conclude which features contribute more to detecting APT samples. To do this, we study the features associated with various APT cases and determine their importance using a machine learning framework. To ensure the generalization of our findings, several feature selection techniques are employed and paired with different classifiers to evaluate their effectiveness. Our findings provide insights into how APT detection can be enhanced in real-world scenarios.
[AI-44] Monte Carlo Tree Diffusion for System 2 Planning
链接: https://arxiv.org/abs/2502.07202
作者: Jaesik Yoon,Hyeonseo Cho,Doojin Baek,Yoshua Bengio,Sungjin Ahn
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures
点击查看摘要
Abstract:Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS), whose performance naturally improves with additional test-time computation (TTC), standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.
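The tree-structured denoising idea can be conveyed with a toy search: each node holds a partially denoised plan, children are alternative one-step denoisings, and a value function prunes weak branches. The sketch below is a beam-style simplification under stub `denoise_step` and `value` functions, not the MCTD algorithm itself, which additionally revisits suboptimal branches MCTS-style.

```python
# Toy sketch of tree-structured denoising (greatly simplified vs. MCTD).
# `denoise_step` and `value` are stubs; `target` stands in for a good plan.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, 2.0, 3.0])

def denoise_step(plan, seed):                 # stub diffusion step
    r = np.random.default_rng(seed)
    return plan + 0.3 * (target - plan) + 0.05 * r.normal(size=plan.shape)

def value(plan):                              # stub plan evaluator
    return -np.linalg.norm(plan - target)

frontier = [rng.normal(size=3)]               # root: pure noise
for depth in range(8):
    children = [denoise_step(p, seed=depth * 10 + k)
                for p in frontier for k in range(3)]   # expand each node
    children.sort(key=value, reverse=True)
    frontier = children[:3]                   # prune: keep the best branches

best = max(frontier, key=value)
print(best, value(best))                      # denoised plan closest to target
```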
[AI-45] Bag of Tricks for Inference-time Computation of LLM Reasoning
链接: https://arxiv.org/abs/2502.07191
作者: Fan Liu,Wenshuo Chao,Naiqiang Tan,Hao Liu
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the advancement of large language models (LLMs), solving complex reasoning tasks has gained increasing attention. Inference-time computation methods (e.g., Best-of-N, beam search, etc.) are particularly valuable as they can enhance reasoning performance without modifying model parameters or requiring additional training. However, these techniques come with implementation challenges, and most existing methods remain at the proof-of-concept stage with limited practical adoption due to their computational complexity and varying effectiveness across different tasks. In this paper, we investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity. Since most current methods rely on a proposer-verifier pipeline that first generates candidate solutions (e.g., reasoning solutions) and then selects the best one based on reward signals (e.g., RLHF rewards, process rewards), our research focuses on optimizing both candidate solution generation (e.g., instructing prompts, hyperparameters such as temperature and top-p) and reward mechanisms (e.g., self-evaluation, reward types). Through extensive experiments (more than 20,000 A100-80G GPU hours with over 1,000 experiments) across a variety of models (e.g., Llama, Qwen, and Mistral families) of various sizes, our ablation studies reveal that previously overlooked strategies can significantly enhance performance (e.g., tuning temperature can improve reasoning task performance by up to 5%). Furthermore, we establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks. These findings provide a stronger foundation for future research. The code is available at this https URL
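As a reference point for the proposer-verifier pipeline discussed above, here is a hedged Best-of-N sketch where temperature is the knob being tuned. The `propose` and `reward` functions are stubs standing in for LLM sampling and a reward model; the numbers are invented.

```python
# Sketch of a proposer-verifier Best-of-N loop: sample N candidates, keep
# the one the verifier scores highest. Temperature controls diversity here
# in a toy way; real pipelines sample from an LLM instead.
import random

random.seed(0)
TRUE_ANSWER = 42

def propose(temperature: float) -> int:
    # Higher temperature -> noisier candidates (toy stand-in for sampling).
    return TRUE_ANSWER + round(random.gauss(0, 5 * temperature))

def reward(candidate: int) -> float:
    # Stub verifier: closer to the true answer scores higher.
    return -abs(candidate - TRUE_ANSWER)

def best_of_n(n: int, temperature: float) -> int:
    candidates = [propose(temperature) for _ in range(n)]
    return max(candidates, key=reward)

for temp in (0.2, 0.7, 1.2):               # the kind of knob the paper tunes
    print(temp, best_of_n(n=8, temperature=temp))
```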
[AI-46] Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task NAACL2025
链接: https://arxiv.org/abs/2502.07190
作者: Junjie Wu,Mo Yu,Lemao Liu,Dit-Yan Yeung,Jie Zhou
类目: Artificial Intelligence (cs.AI)
*备注: 22 pages, 9 figures, accepted by NAACL 2025 main conference
点击查看摘要
Abstract:While LLMs have exhibited strong performance on various NLP tasks, it is noteworthy that most of these tasks rely on utilizing the vast amount of knowledge encoded in LLMs’ parameters, rather than solving new problems without prior knowledge. In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs’ abilities. In this paper, we analyze the challenges LLMs face in demonstrating fluid intelligence through controlled experiments, using the most representative ARC task as an example. Our study revealed three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding. Our data and code can be found in this https URL.
[AI-47] Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer
链接: https://arxiv.org/abs/2502.07158
作者: Jiaying Lu,Stephanie R. Brown,Songyuan Liu,Shifan Zhao,Kejun Dong,Del Bold,Michael Fundora,Alaa Aljiffry,Alex Fedorov,Jocelyn Grunwell,Xiao Hu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Early prediction of pediatric cardiac arrest (CA) is critical for timely intervention in high-risk intensive care settings. We introduce PedCA-FT, a novel transformer-based framework that fuses tabular view of EHR with the derived textual view of EHR to fully unleash the interactions of high-dimensional risk factors and their dynamics. By employing dedicated transformer modules for each modality view, PedCA-FT captures complex temporal and contextual patterns to produce robust CA risk estimates. Evaluated on a curated pediatric cohort from the CHOA-CICU database, our approach outperforms ten other artificial intelligence models across five key performance metrics and identifies clinically meaningful risk factors. These findings underscore the potential of multimodal fusion techniques to enhance early CA detection and improve patient care.
[AI-48] Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
链接: https://arxiv.org/abs/2502.07154
作者: Feng Chen,Allan Raventos,Nan Cheng,Surya Ganguli,Shaul Druckmann
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in N independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N in that pass@N accuracy decreases with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore, we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions; and (2) proving theorems by searching over proof trees of varying shapes. Overall our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
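The misalignment has a simple quantitative core: with N independent samples, pass@N = 1 - (1 - p)^N, where p is the probability a single sample is correct. The worked example below (made-up numbers) shows how an overconfident policy that concentrates probability mass on one answer caps the benefit of extra samples.

```python
# Worked example of why overconfidence can misalign CE training with pass@N.
# p = chance a single sample is correct; pass@N = 1 - (1 - p)^N.
def pass_at_n(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Policy A: hedged -- 40% mass on the correct answer.
# Policy B: overconfident -- 90% mass on a wrong answer, 10% on the correct one.
for name, p in [("hedged", 0.4), ("overconfident", 0.1)]:
    print(name, [round(pass_at_n(p, n), 3) for n in (1, 4, 16)])
# hedged:        [0.4, 0.87, 1.0]   -> extra samples rescue the answer
# overconfident: [0.1, 0.344, 0.815] -> confidence caps the benefit of N
```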
[AI-49] Feature Importance Depends on Properties of the Data: Towards Choosing the Correct Explanations for Your Data and Decision Trees based Models
链接: https://arxiv.org/abs/2502.07153
作者: Célia Wafa Ayad,Thomas Bonnier,Benjamin Bosch,Sonali Parbhoo,Jesse Read
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In order to ensure the reliability of the explanations of machine learning models, it is crucial to establish their advantages and limits, and in which cases each of these methods outperforms the others. However, the current understanding of when and how each method of explanation can be used is insufficient. To fill this gap, we perform a comprehensive empirical evaluation by synthesizing multiple datasets with the desired properties. Our main objective is to assess the quality of feature importance estimates provided by local explanation methods, which are used to explain predictions made by decision tree-based models. By analyzing the results obtained from synthetic datasets as well as publicly available binary classification datasets, we observe notable disparities in the magnitude and sign of the feature importance estimates generated by these methods. Moreover, we find that these estimates are sensitive to specific properties present in the data. Although some model hyper-parameters do not significantly influence feature importance assignment, it is important to recognize that each method of explanation has limitations in specific contexts. Our assessment highlights these limitations and provides valuable insight into the suitability and reliability of different explanatory methods in various scenarios.
[AI-50] Interactive Data Harmonization with LLM Agents
链接: https://arxiv.org/abs/2502.07132
作者: Aécio Santos,Eduardo H. M. Pena,Roque Lopez,Juliana Freire
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.
[AI-51] Online Scheduling for LLM Inference with KV Cache Constraints
链接: https://arxiv.org/abs/2502.07115
作者: Patrick Jaillet,Jiashuo Jiang,Chara Podimata,Zijie Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose novel batching and scheduling algorithms that minimize inference latency while effectively managing the KV cache’s memory. We analyze both semi-online and fully online scheduling models, and our results are threefold. First, we provide a polynomial-time algorithm that achieves exact optimality in terms of average latency in the semi-online prompt arrival model. Second, in the fully online case with a stochastic prompt arrival, we introduce an efficient online scheduling algorithm with constant regret. Third, we prove that no algorithm (deterministic or randomized) can achieve a constant competitive ratio in fully online adversarial settings. Our empirical evaluations on a public LLM inference dataset, using the Llama-70B model on A100 GPUs, show that our approach significantly outperforms benchmark algorithms used currently in practice, achieving lower latency while reducing energy consumption. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.
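A simplified sketch of the scheduling setting, not the paper's algorithms: each admitted request reserves KV-cache memory proportional to its prompt plus expected output tokens, and the batcher greedily fills the batch under a memory budget. The shortest-footprint-first heuristic and all numbers below are assumptions for illustration.

```python
# Toy batching under a KV-cache memory budget. Footprints are counted in
# tokens whose keys/values must be cached; real systems track bytes per
# layer and head, but the constraint has the same shape.
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_tokens: int
    expected_output_tokens: int

    @property
    def kv_footprint(self) -> int:
        return self.prompt_tokens + self.expected_output_tokens

def schedule_batch(queue: list[Request], kv_budget: int) -> list[Request]:
    batch, used = [], 0
    # Shortest-footprint-first keeps average latency low (a common heuristic).
    for req in sorted(queue, key=lambda r: r.kv_footprint):
        if used + req.kv_footprint <= kv_budget:
            batch.append(req)
            used += req.kv_footprint
    return batch

queue = [Request(0, 900, 300), Request(1, 100, 50), Request(2, 400, 200)]
print([r.rid for r in schedule_batch(queue, kv_budget=1000)])  # -> [1, 2]
```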
[AI-52] Contextual Thompson Sampling via Generation of Missing Data
链接: https://arxiv.org/abs/2502.07064
作者: Kelly W. Zhang,Tiffany Tianhui Cai,Hongseok Namkoong,Daniel Russo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We introduce a framework for Thompson sampling contextual bandit algorithms, in which the algorithm’s ability to quantify uncertainty and make decisions depends on the quality of a generative model that is learned offline. Instead of viewing uncertainty in the environment as arising from unobservable latent parameters, our algorithm treats uncertainty as stemming from missing, but potentially observable, future outcomes. If these future outcomes were all observed, one could simply make decisions using an “oracle” policy fit on the complete dataset. Inspired by this conceptualization, at each decision-time, our algorithm uses a generative model to probabilistically impute missing future outcomes, fits a policy using the imputed complete dataset, and uses that policy to select the next action. We formally show that this algorithm is a generative formulation of Thompson Sampling and prove a state-of-the-art regret bound for it. Notably, our regret bound i) depends on the probabilistic generative model only through the quality of its offline prediction loss, and ii) applies to any method of fitting the “oracle” policy, which easily allows one to adapt Thompson sampling to decision-making settings with fairness and/or resource constraints.
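The "impute missing outcomes, then act with the oracle policy" loop reduces, in the simplest Beta-Bernoulli case, to classical Thompson sampling. The toy sketch below illustrates that reduction; the paper's generative models and policy classes are far more general, and the pseudo-counts here are assumptions.

```python
# Toy 2-arm Bernoulli bandit: "impute" each arm's unknown mean by sampling
# a plausible completion of the missing outcomes (a Beta posterior draw),
# then act greedily on the imputed data. This is Thompson sampling.
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.3, 0.6]
successes, failures = [1, 1], [1, 1]         # pseudo-counts per arm

for t in range(500):
    imputed = [rng.beta(successes[a], failures[a]) for a in range(2)]
    arm = int(np.argmax(imputed))            # oracle policy on imputed data
    reward = rng.random() < true_means[arm]  # environment feedback
    successes[arm] += reward
    failures[arm] += 1 - reward

print(successes, failures)                   # pulls concentrate on arm 1
```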
[AI-53] Federated Continual Learning: Concepts Challenges and Solutions
链接: https://arxiv.org/abs/2502.07059
作者: Parisa Hamedi,Roozbeh Razavi-Far,Ehsan Hallaji
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated Continual Learning (FCL) has emerged as a robust solution for collaborative model training in dynamic environments, where data samples are continuously generated and distributed across multiple devices. This survey provides a comprehensive review of FCL, focusing on key challenges such as heterogeneity, model stability, communication overhead, and privacy preservation. We explore various forms of heterogeneity and their impact on model performance. Solutions to non-IID data, resource-constrained platforms, and personalized learning are reviewed in an effort to show the complexities of handling heterogeneous data distributions. Next, we review techniques for ensuring model stability and avoiding catastrophic forgetting, which are critical in non-stationary environments. Privacy-preserving techniques are another aspect of FCL that have been reviewed in this work. This survey has integrated insights from federated learning and continual learning to present strategies for improving the efficacy and scalability of FCL systems, making it applicable to a wide range of real-world scenarios.
[AI-54] Autonomous Deep Agent
链接: https://arxiv.org/abs/2502.07056
作者: Amy Yu,Erik Lebedev,Lincoln Everett,Xiaoxin Chen,Terry Chen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This technical brief introduces Deep Agent, an advanced autonomous AI system designed to manage complex multi-phase tasks through a novel hierarchical task management architecture. The system’s foundation is built on our Hierarchical Task DAG (HTDAG) framework, which dynamically decomposes high-level objectives into manageable sub-tasks while rigorously maintaining dependencies and execution coherence. Deep Agent advances beyond traditional agent systems through three key innovations: First, it implements a recursive two-stage planner-executor architecture that enables continuous task refinement and adaptation as circumstances change. Second, it features an Autonomous API Tool Creation (AATC) system that automatically generates reusable components from UI interactions, substantially reducing operational costs for similar tasks. Third, it incorporates Prompt Tweaking Engine and Autonomous Prompt Feedback Learning components that optimize Large Language Model prompts for specific scenarios, enhancing both inference accuracy and operational stability. These components are integrated to form a service infrastructure that manages user contexts, handles complex task dependencies, and orchestrates end-to-end agentic workflow execution. Through this sophisticated architecture, Deep Agent establishes a novel paradigm in self-governing AI systems, demonstrating robust capability to independently handle intricate, multi-step tasks while maintaining consistent efficiency and reliability through continuous self-optimization.
[AI-55] Large Language Models in Software Security: A Survey of Vulnerability Detection Techniques and Insights
链接: https://arxiv.org/abs/2502.07049
作者: Ze Sheng,Zhicheng Chen,Shuning Gu,Heqing Huang,Guofei Gu,Jeff Huang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 33 pages, 12 figures
点击查看摘要
Abstract:Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection, addressing critical challenges in the security domain. Traditional methods, such as static and dynamic analysis, often falter due to inefficiencies, high false positive rates, and the growing complexity of modern software systems. By leveraging their ability to analyze code structures, identify patterns, and generate repair suggestions, LLMs, exemplified by models like GPT, BERT, and CodeBERT, present a novel and scalable approach to mitigating vulnerabilities. This paper provides a detailed survey of LLMs in vulnerability detection. It examines key aspects, including model architectures, application methods, target languages, fine-tuning strategies, datasets, and evaluation metrics. We also analyze the scope of current research problems, highlighting the strengths and weaknesses of existing approaches. Further, we address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis. Based on these findings, we propose solutions for issues like dataset scalability, model interpretability, and applications in low-resource scenarios. Our contributions are threefold: (1) a systematic review of how LLMs are applied in vulnerability detection; (2) an analysis of shared patterns and differences across studies, with a unified framework for understanding the field; and (3) a summary of key challenges and future research directions. This work provides valuable insights for advancing LLM-based vulnerability detection. We also maintain and regularly update the latest selected papers on this https URL
[AI-56] SnipGen: A Mining Repository Framework for Evaluating LLMs for Code
链接: https://arxiv.org/abs/2502.07046
作者: Daniel Rodriguez-Cardenas,Alejandro Velasco,Denys Poshyvanyk
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 2 tables
点击查看摘要
Abstract:Language Models (LLMs), such as transformer-based neural networks with billions of parameters, have become increasingly prevalent in software engineering (SE). These models, trained on extensive datasets that include code repositories, exhibit remarkable capabilities for SE tasks. However, evaluating their effectiveness poses significant challenges, primarily due to the potential overlap between the datasets used for training and those employed for evaluation. To address this issue, we introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation. SnipGen aims to mitigate data contamination by generating robust testbeds and crafting tailored data points to assist researchers and practitioners in evaluating LLMs for code-related tasks. In our exploratory study, SnipGen mined approximately 227K data points from 338K recent code changes in GitHub commits, focusing on method-level granularity. SnipGen features a collection of prompt templates that can be combined to create a Chain-of-Thought-like sequence of prompts, enabling a nuanced assessment of LLMs’ code generation quality. By providing the mining tool, the methodology, and the dataset, SnipGen empowers researchers and practitioners to rigorously evaluate and interpret LLMs’ performance in software engineering contexts.
[AI-57] Automated Consistency Analysis of LLMs
链接: https://arxiv.org/abs/2502.07036
作者: Aditya Patwardhan,Vivek Vaidya,Ashish Kundu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 12 figures, 3 tables, 3 algorithms; published in the 2024 IEEE 6th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA), Washington, DC, USA
点击查看摘要
Abstract:Generative AI (Gen AI) with large language models (LLMs) is being widely adopted across industry, academia and government. Cybersecurity is one of the key sectors where LLMs can be and/or are already being used. There are a number of problems that inhibit the adoption of trustworthy Gen AI and LLMs in cybersecurity and other such critical areas. One of the key challenges to the trustworthiness and reliability of LLMs is: how consistent is an LLM in its responses? In this paper, we have analyzed and developed a formal definition of consistency of responses of LLMs. We have formally defined what consistency of responses is and then developed a framework for consistency evaluation. The paper proposes two approaches to validate consistency: self-validation, and validation across multiple LLMs. We have carried out extensive experiments for several LLMs such as GPT4oMini, GPT3.5, Gemini, Cohere, and Llama3, on a security benchmark consisting of several cybersecurity questions: informational and situational. Our experiments corroborate the fact that even though these LLMs are being considered and/or already being used for several cybersecurity tasks today, they are often inconsistent in their responses, and thus are untrustworthy and unreliable for cybersecurity.
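A minimal sketch of self-validation consistency scoring in the spirit of the framework above: query a model k times and report the agreement rate of the modal answer. `ask_llm` is a stub standing in for a real API call, and the paper's formal definition is richer than this simple agreement rate.

```python
# Toy self-validation: repeated queries to the same model, scored by how
# often the most common answer appears. A real harness would also compare
# answers across different LLMs (the paper's second validation approach).
from collections import Counter
import random

random.seed(0)

def ask_llm(question: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return random.choice(["Use MFA", "Use MFA", "Use MFA", "Rotate keys"])

def self_consistency(question: str, k: int = 10) -> float:
    answers = [ask_llm(question) for _ in range(k)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_count / k                     # 1.0 = perfectly consistent

score = self_consistency("How should we secure admin accounts?")
print(f"consistency: {score:.2f}")           # below 1.0 signals inconsistency
```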
[AI-58] Representational Alignment with Chemical Induced Fit for Molecular Relational Learning
链接: https://arxiv.org/abs/2502.07027
作者: Peiliang Zhang,Jingling Yuan,Qing Xie,Yongjun Zhu,Lin Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance on chemical-space (e.g., functional group, scaffold) shifted data. With theoretical justification, we propose the Representational Alignment with Chemical Induced Fit (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during the fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings. Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances model stability in both rule-shifted and scaffold-shifted data distributions.
[AI-59] Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML ALT
链接: https://arxiv.org/abs/2502.07026
作者: Mohammad Amir Salari,Bahareh Rahmani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Focus: Artificial Intelligence, Healthcare analytics, cloud computing, BigQuery ML
点击查看摘要
Abstract:Machine learning (ML) is transforming healthcare by enabling predictive analytics, personalized treatments, and improved patient outcomes. However, traditional ML workflows require specialized skills, infrastructure, and resources, limiting accessibility for many healthcare professionals. This paper explores how Google Cloud’s BigQuery ML simplifies the development and deployment of ML models using SQL, reducing technical barriers. Through a case study on diabetes prediction using the Diabetes Health Indicators Dataset, we evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep Neural Network (DNN). Our results demonstrate that the Boosted Tree model achieves the highest performance, making it highly effective for diabetes prediction. This study highlights BigQuery ML’s role in democratizing machine learning by providing a scalable, efficient, and accessible solution for healthcare analytics.
[AI-60] Who is Helping Whom? Analyzing Inter-dependencies to Evaluate Cooperation in Human-AI Teaming
链接: https://arxiv.org/abs/2502.06976
作者: Upasana Biswas,Siddhant Bhambri,Subbarao Kambhampati
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The long-standing research challenges of Human-AI Teaming (HAT) and Zero-shot Cooperation (ZSC) have been tackled by applying multi-agent reinforcement learning (MARL) to train an agent by optimizing the environment reward function and evaluating their performance through task performance metrics such as task reward. However, such evaluation focuses only on task completion, while being agnostic to 'how' the two agents work with each other. Specifically, we are interested in understanding the cooperation arising within the team when trained agents are paired with humans. To formally address this problem, we propose the concept of interdependence to measure how much agents rely on each other's actions to achieve the shared goal, as a key metric for evaluating cooperation in human-agent teams. Towards this, we ground this concept through a symbolic formalism and define evaluation metrics that allow us to assess the degree of reliance between the agents' actions. We pair state-of-the-art agents trained through MARL for HAT with learned human models for the popular Overcooked domain, and evaluate the team performance for these human-agent teams. Our results demonstrate that trained agents are not able to induce cooperative behavior, reporting very low levels of interdependence across all the teams. We also report that the teaming performance of a team is not necessarily correlated with the task reward.
[AI-61] Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents
链接: https://arxiv.org/abs/2502.06975
作者: Mathis Pink,Qinyuan Wu,Vy Ai Vo,Javier Turek,Jianing Mu,Alexander Huth,Mariya Toneva
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) evolve from text-completion tools into fully fledged agents operating in dynamic environments, they must address the challenge of continually learning and retaining long-term knowledge. Many biological systems solve these challenges with episodic memory, which supports single-shot learning of instance-specific contexts. Inspired by this, we present an episodic memory framework for LLM agents, centered around five key properties of episodic memory that underlie adaptive and context-sensitive behavior. With various research efforts already partially covering these properties, this position paper argues that now is the right time for an explicit, integrated focus on episodic memory to catalyze the development of long-term agents. To this end, we outline a roadmap that unites several research directions under the goal to support all five properties of episodic memory for more efficient long-term LLM agents.
[AI-62] Task Offloading in Vehicular Edge Computing using Deep Reinforcement Learning: A Survey
链接: https://arxiv.org/abs/2502.06963
作者: Ashab Uddin,Ahmed Hamdi Sakr,Ning Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注: 27 Pages, 3 Figures, 3 Tables
点击查看摘要
Abstract:The increasing demand for Intelligent Transportation Systems (ITS) has introduced significant challenges in managing the complex, computation-intensive tasks generated by modern vehicles, while offloading tasks to external computing infrastructures such as edge computing (EC), nearby vehicular nodes, and UAVs has become an influential solution to these challenges. However, traditional computational offloading strategies often struggle to adapt to the dynamic and heterogeneous nature of vehicular environments. In this study, we explored the potential of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) frameworks to optimize computational offloading through adaptive, real-time decision-making, and we have thoroughly investigated the Markov Decision Process (MDP) approaches in the existing literature. The paper focuses on key aspects such as standardized learning models, optimized reward structures, and collaborative multi-agent systems, aiming to advance the understanding and application of DRL in vehicular networks. Our findings offer insights into enhancing the efficiency, scalability, and robustness of ITS, setting the stage for future innovations in this rapidly evolving field.
[AI-63] Occam's model: Selecting simpler representations for better transferability estimation
链接: https://arxiv.org/abs/2502.06925
作者: Prabhant Singh,Sibylle Hess,Joaquin Vanschoren
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Fine-tuning models that have been pre-trained on large datasets has become a cornerstone of modern machine learning workflows. With the widespread availability of online model repositories, such as Hugging Face, it is now easier than ever to fine-tune pre-trained models for specific tasks. This raises a critical question: which pre-trained model is most suitable for a given task? This problem is called transferability estimation. In this work, we introduce two novel and effective metrics for estimating the transferability of pre-trained models. Our approach is grounded in viewing transferability as a measure of how easily a pre-trained model’s representations can be trained to separate target classes, providing a unique perspective on transferability estimation. We rigorously evaluate the proposed metrics against state-of-the-art alternatives across diverse problem settings, demonstrating their robustness and practical utility. Additionally, we present theoretical insights that explain our metrics’ efficacy and adaptability to various scenarios. We experimentally show that our metrics increase Kendall’s Tau by up to 32% compared to the state-of-the-art baselines.
[AI-64] XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units
链接: https://arxiv.org/abs/2502.06924
作者: Arghadip Das,Arnab Raha,Shamik Kundu,Soumendu Kumar Ghosh,Deepak Mathaikutty,Vijay Raghunathan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, making them ideal for long-sequence applications in NLP, vision, and edge AI, including real-time transcription, translation, and contextual search. These applications require lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. Designing specialized accelerators for every emerging neural network is costly and impractical; instead, optimizing models for existing NPUs in AI PCs provides a scalable solution. To this end, we propose XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. XAMBA follows a three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet KPI requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. Additionally, ActiBA enhances performance by approximating expensive activation functions (e.g., Swish, Softplus) using piecewise linear mappings, reducing latency with minimal accuracy loss. Evaluations on an Intel Core Ultra Series 2 AI PC show that XAMBA achieves up to 2.6X speed-up over the baseline. Our implementation is available at this https URL.
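The CumBA/ReduBA trick is easy to verify in isolation: a sequential CumSum equals a dense matmul with a lower-triangular ones matrix, a form that maps much better onto the matrix engines in NPUs. A sketch with illustrative shapes (ReduBA does the analogous substitution for ReduceSum with a ones vector):

```python
# Sequential CumSum rewritten as a matmul, the core of the CumBA idea.
import numpy as np

T, d = 8, 4
x = np.arange(T * d, dtype=np.float32).reshape(T, d)

# Sequential formulation: y[t] = y[t-1] + x[t]
y_seq = np.cumsum(x, axis=0)

# Matrix formulation: y = L @ x with L a lower-triangular ones matrix.
L = np.tril(np.ones((T, T), dtype=np.float32))
y_mat = L @ x

assert np.allclose(y_seq, y_mat)   # identical results, matmul-friendly form
print(y_mat[-1])                   # running totals after the full sequence
```

The trade-off is extra O(T^2) multiplies in exchange for a data-parallel layout, which is exactly the kind of accuracy/latency/hardware balance the framework negotiates.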
[AI-65] Do Attention Heads Compete or Cooperate during Counting?
链接: https://arxiv.org/abs/2502.06923
作者: Pál Zsámboki,Ádám Fraknói,Máté Gedeon,András Kornai,Zsolt Zombori
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 15 figures
点击查看摘要
Abstract:We present an in-depth mechanistic interpretability analysis of training small transformers on an elementary task, counting, which is a crucial deductive step in many algorithms. In particular, we investigate the collaboration/competition among the attention heads: we ask whether the attention heads behave as a pseudo-ensemble, all solving the same subtask, or they perform different subtasks, meaning that they can only solve the original task in conjunction. Our work presents evidence that on the semantics of the counting task, attention heads behave as a pseudo-ensemble, but their outputs need to be aggregated in a non-uniform manner in order to create an encoding that conforms to the syntax. Our source code will be available upon publication.
[AI-66] GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units
链接: https://arxiv.org/abs/2502.06921
作者: Arghadip Das,Shamik Kundu,Arnab Raha,Soumendu Ghosh,Deepak Mathaikutty,Vijay Raghunathan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) are vital for learning from graph-structured data, enabling applications in network analysis, recommendation systems, and speech analytics. Deploying them on edge devices like client PCs and laptops enhances real-time processing, privacy, and cloud independence. GNNs aid Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and enable event-based vision tasks. However, irregular memory access, sparsity, and dynamic structures cause high latency and energy overhead on resource-constrained devices. While modern edge processors integrate CPUs, GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular GNN computations. We introduce GraNNite, the first hardware-aware framework optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN accelerators via a structured three-step methodology: (1) enabling NPU execution, (2) optimizing performance, and (3) trading accuracy for efficiency gains. Step 1 employs GraphSplit for workload distribution and StaGr for static aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts performance using EffOp for control-heavy tasks and GraSp for sparsity exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce redundancy and memory transfers. Step 3 balances quality versus efficiency, where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs, GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to 8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher performance than CPUs and GPUs, respectively, across GNN models.
[AI-67] Select before Act: Spatially Decoupled Action Repetition for Continuous Control ICLR2025
链接: https://arxiv.org/abs/2502.06919
作者: Buqing Nie,Yangqing Fu,Yue Gao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: ICLR 2025
点击查看摘要
Abstract:Reinforcement Learning (RL) has achieved remarkable success in various continuous control tasks, such as robot manipulation and locomotion. Unlike mainstream RL, which makes decisions at individual steps, recent studies have incorporated action repetition into RL, achieving enhanced action persistence with improved sample efficiency and superior performance. However, existing methods treat all action dimensions as a whole during repetition, ignoring variations among them. This constraint leads to inflexibility in decisions, which reduces policy agility and effectiveness. In this work, we propose a novel repetition framework called SDAR, which implements Spatially Decoupled Action Repetition through performing closed-loop act-or-repeat selection for each action dimension individually. SDAR achieves more flexible repetition strategies, leading to an improved balance between action persistence and diversity. Compared to existing repetition frameworks, SDAR is more sample efficient with higher policy performance and reduced action fluctuation. Experiments are conducted on various continuous control scenarios, demonstrating the effectiveness of the spatially decoupled repetition design proposed in this work.
[AI-68] Leveraging GPT-4o Efficiency for Detecting Rework Anomaly in Business Processes
链接: https://arxiv.org/abs/2502.06918
作者: Mohammad Derakhshan,Paolo Ceravolo,Fatemeh Mohammadi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 images, 4 tables
点击查看摘要
Abstract:This paper investigates the effectiveness of GPT-4o-2024-08-06, one of the Large Language Models (LLMs) from OpenAI, in detecting business process anomalies, with a focus on rework anomalies. In our study, we developed a GPT-4o-based tool capable of transforming event logs into a structured format and identifying reworked activities within business event logs. The analysis was performed on a synthetic dataset designed to contain rework anomalies but free of loops. To evaluate the anomaly detection capabilities of GPT-4o-2024-08-06, we used three prompting techniques: zero-shot, one-shot, and few-shot. These techniques were tested on different anomaly distributions, namely normal, uniform, and exponential, to identify the most effective approach for each case. The results demonstrate the strong performance of GPT-4o-2024-08-06. On our dataset, the model achieved 96.14% accuracy with one-shot prompting for the normal distribution, 97.94% accuracy with few-shot prompting for the uniform distribution, and 74.21% accuracy with few-shot prompting for the exponential distribution. These results highlight the model's potential as a reliable tool for detecting rework anomalies in event logs and how anomaly distribution and prompting strategy influence the model's performance.
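For context, the ground truth such a detector is compared against is simple to state: a rework anomaly is an activity occurring more than once within the same case. A baseline sketch (the log layout and names are assumed conventions, not taken from the paper):

```python
# Baseline rework detection in an event log: flag activities that repeat
# within a case. An LLM-based detector would be benchmarked against this.
from collections import Counter

event_log = [  # (case_id, activity) pairs, in order of occurrence
    ("c1", "Create Order"), ("c1", "Check Stock"), ("c1", "Check Stock"),
    ("c2", "Create Order"), ("c2", "Ship"),
]

def rework_activities(log):
    per_case = {}
    for case_id, activity in log:
        per_case.setdefault(case_id, Counter())[activity] += 1
    return {case: [a for a, n in counts.items() if n > 1]
            for case, counts in per_case.items()
            if any(n > 1 for n in counts.values())}

print(rework_activities(event_log))   # {'c1': ['Check Stock']}
```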
[AI-69] Krum Federated Chain (KFC): Using blockchain to defend against adversarial attacks in Federated Learning
链接: https://arxiv.org/abs/2502.06917
作者: Mario García-Márquez,Nuria Rodríguez-Barroso,M.Victoria Luzón,Francisco Herrera
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to Neural Networks
点击查看摘要
Abstract:Federated Learning presents a nascent approach to machine learning, enabling collaborative model training across decentralized devices while safeguarding data privacy. However, its distributed nature renders it susceptible to adversarial attacks. Integrating blockchain technology with Federated Learning offers a promising avenue to enhance security and integrity. In this paper, we explore the potential of blockchain in defending Federated Learning against adversarial attacks. First, we test Proof of Federated Learning, a well-known consensus mechanism designed ad hoc for federated contexts, as a defense mechanism, demonstrating its efficacy against Byzantine and backdoor attacks when at least one miner remains uncompromised. Second, we propose Krum Federated Chain, a novel defense strategy combining Krum and Proof of Federated Learning, valid to defend against any configuration of Byzantine or backdoor attacks, even when all miners are compromised. Our experiments conducted on image classification datasets validate the effectiveness of our proposed approaches.
[AI-70] Hyper Compressed Fine-Tuning of Large Foundation Models with Quantum Inspired Adapters
链接: https://arxiv.org/abs/2502.06916
作者: Snehal Raj,Brian Coyle
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Quantum Physics (quant-ph)
*备注: 16 pages, 9 figures, 6 tables
点击查看摘要
Abstract:Fine-tuning pre-trained large foundation models for specific tasks has become increasingly challenging due to the computational and storage demands associated with full parameter updates. Parameter-Efficient Fine-Tuning (PEFT) methods address this issue by updating only a small subset of model parameters using adapter modules. In this work, we propose \emphQuantum-Inspired Adapters, a PEFT approach inspired by Hamming-weight preserving quantum circuits from quantum machine learning literature. These models can be both expressive and parameter-efficient by operating in a combinatorially large space while simultaneously preserving orthogonality in weight parameters. We test our proposed adapters by adapting large language models and large vision transformers on benchmark datasets. Our method can achieve 99.2% of the performance of existing fine-tuning methods such as LoRA with a 44x parameter compression on language understanding datasets like GLUE and VTAB. Compared to existing orthogonal fine-tuning methods such as OFT or BOFT, we achieve 98% relative performance with 25x fewer parameters. This demonstrates competitive performance paired with a significant reduction in trainable parameters. Through ablation studies, we determine that combining multiple Hamming-weight orders with orthogonality and matrix compounding are essential for performant fine-tuning. Our findings suggest that Quantum-Inspired Adapters offer a promising direction for efficient adaptation of language and vision models in resource-constrained environments.
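One way to see how orthogonality and parameter efficiency can coexist, loosely in the spirit of these adapters though far simpler than Hamming-weight preserving circuits, is to build a weight transform from composed Givens rotations: O(k) trainable angles instead of d^2 dense weights, with orthogonality preserved by construction.

```python
# Illustrative sketch (not the paper's adapter): composing plane rotations
# yields an orthogonal transform parameterized by a handful of angles.
import numpy as np

def givens(d: int, i: int, j: int, theta: float) -> np.ndarray:
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

d = 6
angles = [(0, 1, 0.3), (2, 3, -0.7), (4, 5, 1.1), (1, 2, 0.5)]  # made up
W = np.eye(d)
for i, j, theta in angles:
    W = givens(d, i, j, theta) @ W

print(np.allclose(W @ W.T, np.eye(d)))   # True: orthogonality is preserved
print(len(angles), "trainable angles vs", d * d, "dense weights")
```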
[AI-71] Foundation Models for Anomaly Detection: Vision and Challenges
链接: https://arxiv.org/abs/2502.06911
作者: Jing Ren,Tao Tang,Hong Jia,Haytham Fayek,Xiaodong Li,Suyu Ma,Xiwei Xu,Feng Xia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:As data continues to grow in volume and complexity across domains such as finance, manufacturing, and healthcare, effective anomaly detection is essential for identifying irregular patterns that may signal critical issues. Recently, foundation models (FMs) have emerged as a powerful tool for advancing anomaly detection. They have demonstrated unprecedented capabilities in enhancing anomaly identification, generating detailed data descriptions, and providing visual explanations. This survey presents the first comprehensive review of recent advancements in FM-based anomaly detection. We propose a novel taxonomy that classifies FMs into three categories based on their roles in anomaly detection tasks, i.e., as encoders, detectors, or interpreters. We provide a systematic analysis of state-of-the-art methods and discuss key challenges in leveraging FMs for improved anomaly detection. We also outline future research directions in this rapidly evolving field.
[AI-72] TimeKAN: KAN-based Frequency Decomposition Learning Architecture for Long-term Time Series Forecasting
链接: https://arxiv.org/abs/2502.06910
作者: Songtao Huang,Zhen Zhao,Can Li,Lei Bai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Real-world time series often have multiple frequency components that are intertwined with each other, making accurate time series forecasting challenging. Decomposing the mixed frequency components into multiple single-frequency components is a natural choice. However, the information density of patterns varies across different frequencies, and employing a uniform modeling approach for different frequency components can lead to inaccurate characterization. To address these challenges, inspired by the flexibility of the recent Kolmogorov-Arnold Network (KAN), we propose a KAN-based Frequency Decomposition Learning architecture (TimeKAN) to address the complex forecasting challenges caused by multiple frequency mixtures. Specifically, TimeKAN mainly consists of three components: Cascaded Frequency Decomposition (CFD) blocks, Multi-order KAN Representation Learning (M-KAN) blocks and Frequency Mixing blocks. CFD blocks adopt a bottom-up cascading approach to obtain series representations for each frequency band. Benefiting from the high flexibility of KAN, we design a novel M-KAN block to learn and represent specific temporal patterns within each frequency band. Finally, Frequency Mixing blocks are used to recombine the frequency bands into the original format. Extensive experimental results across multiple real-world time series datasets demonstrate that TimeKAN achieves state-of-the-art performance as an extremely lightweight architecture. Code is available at this https URL.
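The bottom-up band-splitting idea behind the CFD blocks can be mimicked with a fixed FFT low-pass cascade: peel off the lowest band, subtract, and repeat, so the bands sum exactly back to the series. TimeKAN's blocks are learned; the cutoffs and filter below are assumptions made for illustration.

```python
# Rough sketch of cascaded frequency decomposition for a 1-D series.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
x = (np.sin(2 * np.pi * 3 * t)            # slow component
     + 0.5 * np.sin(2 * np.pi * 24 * t)   # fast component
     + 0.1 * rng.normal(size=t.size))     # noise

def lowpass(signal: np.ndarray, cutoff: int) -> np.ndarray:
    spec = np.fft.rfft(signal)
    spec[cutoff:] = 0                      # zero out bins above the cutoff
    return np.fft.irfft(spec, n=signal.size)

cutoffs = [8, 32]                          # band edges in FFT bins, bottom-up
bands, residual = [], x
for c in cutoffs:
    low = lowpass(residual, c)
    bands.append(low)                      # one frequency band
    residual = residual - low
bands.append(residual)                     # remaining high-frequency band

print(len(bands), np.allclose(sum(bands), x))  # 3 bands, exact reconstruction
```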
[AI-73] Satisfaction-Aware Incentive Scheme for Federated Learning in Industrial Metaverse: DRL-Based Stackelberg Game Approach
链接: https://arxiv.org/abs/2502.06909
作者: Xiaohuan Li,Shaowen Qin,Xin Tang,Jiawen Kang,Jin Ye,Zhonghua Zhao,Dusit Niyato
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:Industrial Metaverse leverages the Industrial Internet of Things (IIoT) to integrate data from diverse devices, employing federated learning and meta-computing to train models in a distributed manner while ensuring data privacy. Achieving an immersive experience for the industrial Metaverse necessitates maintaining a balance between model quality and training latency. Consequently, a primary challenge in federated learning tasks is optimizing overall system performance by balancing model quality and training latency. This paper designs a satisfaction function that accounts for data size, Age of Information (AoI), and training latency. Additionally, the satisfaction function is incorporated into the utility functions to incentivize node participation in model training. We model the utility functions of servers and nodes as a two-stage Stackelberg game and employ a deep reinforcement learning approach to learn the Stackelberg equilibrium. This approach ensures balanced rewards and enhances the applicability of the incentive scheme for the industrial Metaverse. Simulation results demonstrate that, under the same budget constraints, the proposed incentive scheme improves utility by at least 23.7% compared to existing schemes without compromising model accuracy.
[AI-74] Can ChatGPT Diagnose Alzheimer's Disease?
链接: https://arxiv.org/abs/2502.06907
作者: Quoc-Toan Nguyen,Linh Le,Xuan-The Tran,Thomas Do,Chin-Teng Lin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures, 5 tables
点击查看摘要
Abstract:Can ChatGPT diagnose Alzheimer’s Disease (AD)? AD is a devastating neurodegenerative condition that affects approximately 1 in 9 individuals aged 65 and older, profoundly impairing memory and cognitive function. This paper utilises 9300 electronic health records (EHRs) with data from Magnetic Resonance Imaging (MRI) and cognitive tests to address an intriguing question: As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? We present an in-depth evaluation of ChatGPT using a black-box approach with zero-shot and multi-shot methods. This study unlocks ChatGPT’s capability to analyse MRI and cognitive test results, as well as its potential as a diagnostic tool for AD. By automating aspects of the diagnostic process, this research opens a transformative approach for the healthcare system, particularly in addressing disparities in resource-limited regions where AD specialists are scarce. Hence, it offers a foundation for a promising method for early detection, supporting individuals with timely interventions, which is paramount for Quality of Life (QoL).
[AI-75] Learning-based estimation of cattle weight gain and its influencing factors
链接: https://arxiv.org/abs/2502.06906
作者: Muhammad Riaz Hasib Hossain,Rafiqul Islam,Shawn R. McGrath,Md Zahidul Islam,David Lamb
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Many cattle farmers still depend on manual methods to measure the live weight gain of cattle at set intervals, which is time consuming, labour intensive, and stressful for both the animals and handlers. A remote and autonomous monitoring system using machine learning (ML) or deep learning (DL) can provide a more efficient and less invasive method, along with predictive capabilities for future cattle weight gain (CWG). Such a system allows continuous monitoring and estimation of individual cattle live weight gain, growth rates and weight fluctuations, considering various factors like environmental conditions, genetic predispositions, feed availability, movement patterns and behaviour. Several researchers have explored the efficiency of estimating CWG using ML and DL algorithms. However, estimating CWG suffers from a lack of consistency in its application. Moreover, ML or DL can provide weight gain estimations based on several features that vary across existing research. Additionally, previous studies have encountered various data-related challenges when estimating CWG. This paper presents a comprehensive investigation into estimating CWG using advanced ML techniques, based on research articles published between 2004 and 2024. This study investigates the current tools, methods, and features used in CWG estimation, as well as their strengths and weaknesses. The findings highlight the significance of using advanced ML approaches in CWG estimation and the critical influence of various factors on its accuracy. Furthermore, this study identifies potential research gaps and provides research directions for CWG prediction, which serve as a reference for future research in this area.
[AI-76] Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty
链接: https://arxiv.org/abs/2502.06905
作者: Yeseul Cho,Baekrok Shin,Changmin Kang,Chulhee Yun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model with a full dataset over a large number of epochs before being able to prune the dataset, which ironically makes the pruning process more expensive than just training the model on the entire dataset. To overcome this limitation, we introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score, which aims to identify important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address the catastrophic accuracy drop at extreme pruning ratios, we further propose ratio-adaptive sampling using the Beta distribution. Experiments on various datasets and learning scenarios such as image classification with label noise and image corruption, and model architecture generalization demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66% compared to previous methods while achieving SOTA results, specifically 60% test accuracy at a 90% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15% while maintaining SOTA performance.
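The abstract does not spell out the DUAL formula, but the idea of scoring examples by early-training difficulty and uncertainty, then sampling stochastically with a Beta distribution at extreme pruning ratios, can be sketched as follows. All statistics and parameter values here are illustrative placeholders, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example statistics gathered during the early training stage.
# "difficulty" might be one minus the mean prediction margin, and
# "uncertainty" the std of predictions across early checkpoints;
# both are simulated here for illustration.
difficulty = rng.uniform(0, 1, size=10_000)
uncertainty = rng.uniform(0, 0.5, size=10_000)

dual_score = difficulty * uncertainty   # important: hard AND uncertain

# Ratio-adaptive sampling: at extreme pruning ratios, draw stochastic
# sampling weights from a Beta distribution instead of hard top-k
# selection, softening the catastrophic accuracy drop.
prune_ratio = 0.9
keep = int(len(dual_score) * (1 - prune_ratio))
weights = rng.beta(1 + 10 * dual_score, 1 + 10 * prune_ratio)
selected = np.argsort(-weights)[:keep]
print(selected.shape)   # (1000,)
```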
[AI-77] A Sociotechnical Approach for Knowledge Management (KM)
链接: https://arxiv.org/abs/2502.06899
作者: Leoncio Jimenez
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: in French language. The author would like to thank Mrs. Christine Deville for her help with the grammatical correction of the text and especially Mr. Germain Lacoste (director of ENI of Tarbes, France) for his friendship, and finally, I thank something as alive as the always happy song of a hummingbird among flowers. arXiv admin note: substantial text overlap with arXiv:2502.01656
点击查看摘要
Abstract:This article presents a sociotechnical framework for KM. This sociotechnical vision of KM makes it possible: (1) to detach KM from purely commercial concerns; (2) to distinguish between the different KM technologies; and (3) to question the paradigms associated with the social and technical components of KM. It is precisely this last point that this article develops to identify the generic mechanisms of KM. More precisely, the social aspect is explained through the organizational approach to KM, the managerial approach to KM, and the biological approach to KM. In contrast, the technical aspect is described through the knowledge and skills engineering approach to KM. These approaches also lead us to provide a comparative table of these organizational, managerial, and biological visions of KM.
[AI-78] Large Language Models for In-File Vulnerability Localization Can Be “Lost in the End”
链接: https://arxiv.org/abs/2502.06898
作者: Francesco Sovrano,Adam Bauer,Alberto Bacchelli
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025. Replication Package: this https URL
点击查看摘要
Abstract:Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files, a pattern we call the 'lost-in-the-end' effect. Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Ultimately, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models.
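One practical takeaway is to control input size so that late-file code does not land in the weak end of the context window. A minimal sketch of such chunking is below; the window sizes are illustrative choices, not the paper's recommended optimum.

```python
def chunk_source(code: str, max_lines: int = 300, overlap: int = 30):
    """Split a large source file into overlapping windows so that code
    near the end of the file appears near the start of some chunk,
    mitigating the 'lost-in-the-end' effect described above."""
    lines = code.splitlines()
    step = max_lines - overlap
    for start in range(0, max(1, len(lines)), step):
        yield "\n".join(lines[start:start + max_lines])
        if start + max_lines >= len(lines):
            break

# Each chunk would then be sent to the LLM as a separate, smaller input.
for i, chunk in enumerate(chunk_source("x = 1\n" * 1000)):
    print(i, chunk.count("\n") + 1)
```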
[AI-79] Certifying Language Model Robustness with Fuzzed Randomized Smoothing: An Efficient Defense Against Backdoor Attacks ICLR2025
链接: https://arxiv.org/abs/2502.06892
作者: Bowei He,Lihao Yin,Hui-Ling Zhen,Jianping Zhang,Lanqing Hong,Mingxuan Yuan,Chen Ma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by ICLR 2025
点击查看摘要
Abstract:The widespread deployment of pre-trained language models (PLMs) has exposed them to textual backdoor attacks, particularly those planted during the pre-training stage. These attacks pose significant risks to high-reliability applications, as they can stealthily affect multiple downstream tasks. While certifying robustness against such threats is crucial, existing defenses struggle with the high-dimensional, interdependent nature of textual data and the lack of access to original poisoned pre-training data. To address these challenges, we introduce Fuzzed Randomized Smoothing (FRS), a novel approach for efficiently certifying language model robustness against backdoor attacks. FRS integrates software robustness certification techniques with biphased model parameter smoothing, employing Monte Carlo tree search for proactive fuzzing to identify vulnerable textual segments within the Damerau-Levenshtein space. This allows for targeted and efficient text randomization, while eliminating the need for access to poisoned training data during model smoothing. Our theoretical analysis demonstrates that FRS achieves a broader certified robustness radius compared to existing methods. Extensive experiments across various datasets, model configurations, and attack strategies validate FRS's superiority in terms of defense efficiency, accuracy, and robustness.
[AI-80] LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison
链接: https://arxiv.org/abs/2502.06890
作者: Gabriele De Vito,Filomena Ferrucci,Athanasios Angelakis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:The increasing volume of drug combinations in modern therapeutic regimens demands reliable methods for predicting drug-drug interactions (DDIs). While Large Language Models (LLMs) have revolutionized various domains, their potential in pharmaceutical research, particularly in DDI prediction, remains largely unexplored. This study thoroughly investigates LLMs' capabilities in predicting DDIs by uniquely processing molecular structures (SMILES), target organisms, and gene interaction data as raw text input from the latest DrugBank dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first assessing their zero-shot capabilities in DDI prediction. We then fine-tuned selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 distilled Qwen 1.5B) to optimize their performance. Our comprehensive evaluation framework included validation across 13 external DDI datasets, comparing against traditional approaches such as L2-regularized logistic regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of 0.919 on balanced datasets (50% positive, 50% negative cases). This result represents an improvement over both zero-shot predictions and state-of-the-art machine-learning methods used for DDI prediction. Our analysis reveals that LLMs can effectively capture complex molecular interaction patterns and cases where drug pairs target common genes, making them valuable tools for practical applications in pharmaceutical research and clinical settings.
[AI-81] Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
链接: https://arxiv.org/abs/2502.06888
作者: Zhiyuan Fang,Yuegui Huang,Zicong Hong,Yufeng Lyu,Wuhui Chen,Yue Yu,Fan Yu,Zibin Zheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Mixture of Experts (MoE), with its distinctive sparse structure, enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. However, the substantial parameter size presents a challenge for inference, as the expansion in GPU memory cannot keep pace with the growth in parameters. Although offloading techniques utilise memory from the CPU and disk and parallelise the I/O and computation for efficiency, the computation for each expert in MoE models is often less than the I/O, resulting in numerous bubbles in the pipeline. Therefore, we propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm. The proposed paradigm uses batch processing to extend the computation time of the current layer to overlap with the loading time of the next layer. Although this idea has been effectively applied to dense models, more batches may activate more experts in the MoE, leading to longer loading times and more bubbles. Thus, unlike traditional approaches, we balance computation and I/O time and minimise bubbles by orchestrating their inference orders based on their heterogeneous computation and I/O requirements and activation patterns under different batch numbers. Moreover, to adapt to different hardware environments and models, we design a constraint-sensitive I/O-compute planner and a correlation-aware expert prefetcher for a schedule that minimises pipeline bubbles. Experimental results demonstrate that Klotski achieves a superior throughput-latency trade-off compared to state-of-the-art techniques, with throughput improvements of up to 85.12x.
[AI-82] Gradient Based Method for the Fusion of Lattice Quantizers
链接: https://arxiv.org/abs/2502.06887
作者: Liyuan Zhang,Hanzhong Cao,Jiaheng Li,Minyang Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In practical applications, lattice quantizers leverage discrete lattice points to approximate arbitrary points in the lattice. An effective lattice quantizer significantly enhances both the accuracy and efficiency of these approximations. In the context of high-dimensional lattice quantization, previous work proposed utilizing low-dimensional optimal lattice quantizers and addressed the challenge of determining the optimal length ratio in orthogonal splicing. Notably, it was demonstrated that fixed length ratios and orthogonality yield suboptimal results when combining low-dimensional lattices. Building on this foundation, another approach employed gradient descent to identify optimal lattices, which inspired us to explore the use of neural networks to discover matrices that outperform those obtained from orthogonal splicing methods. We propose two novel approaches to tackle this problem: the Household Algorithm and the Matrix Exp Algorithm. Our results indicate that both the Household Algorithm and the Matrix Exp Algorithm achieve improvements in lattice quantizers across dimensions 13, 15, 17 to 19, 21, and 22. Moreover, the Matrix Exp Algorithm demonstrates superior efficacy in high-dimensional settings.
[AI-83] Topological derivative approach for deep neural network architecture adaptation
链接: https://arxiv.org/abs/2502.06885
作者: C G Krishnanunni,Tan Bui-Thanh,Clint Dawson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This work presents a novel algorithm for progressively adapting neural network architecture along the depth. In particular, we attempt to address the following questions in a mathematically principled way: i) Where to add a new capacity (layer) during the training process? ii) How to initialize the new capacity? At the heart of our approach are two key ingredients: i) the introduction of a "shape functional" to be minimized, which depends on neural network topology, and ii) the introduction of a topological derivative of the shape functional with respect to the neural network topology. Using an optimal control viewpoint, we show that the network topological derivative exists under certain conditions, and its closed-form expression is derived. In particular, we explore, for the first time, the connection between the topological derivative from a topology optimization framework and the Hamiltonian from optimal control theory. Further, we show that the optimality condition for the shape functional leads to an eigenvalue problem for deep neural architecture adaptation. Our approach thus determines the most sensitive location along the depth where a new layer needs to be inserted during the training phase and the associated parametric initialization for the newly added layer. We also demonstrate that our layer insertion strategy can be derived from an optimal transport viewpoint as a solution to maximizing a topological derivative in p-Wasserstein space, where p = 1. Numerical investigations with fully connected network, convolutional neural network, and vision transformer on various regression and classification problems demonstrate that our proposed approach can outperform an ad-hoc baseline network and other architecture adaptation strategies. Further, we also demonstrate other applications of topological derivative in fields such as transfer learning.
[AI-84] Learning Conformal Abstention Policies for Adaptive Risk Management in Large Language and Vision-Language Models
链接: https://arxiv.org/abs/2502.06884
作者: Sina Tayebati,Divake Kumar,Nastaran Darabi,Dinithi Jayasuriya,Ranganath Krishnan,Amit Ranjan Trivedi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language and Vision-Language Models (LLMs/VLMs) are increasingly used in safety-critical applications, yet their opaque decision-making complicates risk assessment and reliability. Uncertainty quantification (UQ) helps assess prediction confidence and enables abstention when uncertainty is high. Conformal prediction (CP), a leading UQ method, provides statistical guarantees but relies on static thresholds, which fail to adapt to task complexity and evolving data distributions, leading to suboptimal trade-offs in accuracy, coverage, and informativeness. To address this, we propose learnable conformal abstention, integrating reinforcement learning (RL) with CP to optimize abstention thresholds dynamically. By treating CP thresholds as adaptive actions, our approach balances multiple objectives, minimizing prediction set size while maintaining reliable coverage. Extensive evaluations across diverse LLM/VLM benchmarks show our method outperforms Least Ambiguous Classifiers (LAC) and Adaptive Prediction Sets (APS), improving accuracy by up to 3.2%, boosting AUROC for hallucination detection by 22.19%, enhancing uncertainty-guided selective generation (AUARC) by 21.17%, and reducing calibration error by 70%-85%. These improvements hold across multiple models and datasets while consistently meeting the 90% coverage target, establishing our approach as a more effective and flexible solution for reliable decision-making in safety-critical applications. The code is available at: this https URL.
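For readers unfamiliar with the conformal side, the sketch below shows standard split-conformal calibration plus an abstention rule. The paper's contribution, the RL-learned threshold, is abstracted here into a simple additive adjustment; the nonconformity score and variable names are our assumptions.

```python
import numpy as np

def calibrate_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
    """Standard split-conformal calibration: the (1-alpha) empirical
    quantile of nonconformity scores yields ~(1-alpha) coverage. The
    paper replaces this static threshold with one chosen dynamically
    by an RL policy; that part is not reproduced here."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0)))

def predict_or_abstain(probs: np.ndarray, tau: float, rl_adjustment: float = 0.0):
    # Nonconformity of the top class; abstain when it exceeds the
    # (possibly RL-adjusted) threshold.
    score = 1.0 - probs.max()
    if score > tau + rl_adjustment:
        return None  # abstain
    return int(probs.argmax())

rng = np.random.default_rng(0)
tau = calibrate_threshold(rng.uniform(0, 0.4, size=500), alpha=0.1)
print(predict_or_abstain(np.array([0.7, 0.2, 0.1]), tau))
```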
[AI-85] FlavorDiffusion: Predicting Food Pairings and Chemical Interactions Using Diffusion Models
链接: https://arxiv.org/abs/2502.06871
作者: Seo Jun Pyo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages
点击查看摘要
Abstract:The study of food pairing has evolved beyond subjective expertise with the advent of machine learning. This paper presents FlavorDiffusion, a novel framework leveraging diffusion models to predict food-chemical interactions and ingredient pairings without relying on chromatography. By integrating graph-based embeddings, diffusion processes, and chemical property encoding, FlavorDiffusion addresses data imbalances and enhances clustering quality. Using a heterogeneous graph derived from datasets like Recipe1M and FlavorDB, our model demonstrates superior performance in reconstructing ingredient-ingredient relationships. The addition of a Chemical Structure Prediction (CSP) layer further refines the embedding space, achieving state-of-the-art NMI scores and enabling meaningful discovery of novel ingredient combinations. The proposed framework represents a significant step forward in computational gastronomy, offering scalable, interpretable, and chemically informed solutions for food science.
[AI-86] Bridging Traffic State and Trajectory for Dynamic Road Network and Trajectory Representation Learning
链接: https://arxiv.org/abs/2502.06870
作者: Chengkai Han,Jingyuan Wang,Yongyao Wang,Xie Yu,Hao Lin,Chao Li,Junjie Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 6 figures
点击查看摘要
Abstract:Effective urban traffic management is vital for sustainable city development, relying on intelligent systems with machine learning tasks such as traffic flow prediction and travel time estimation. Traditional approaches usually focus on static road network and trajectory representation learning, and overlook the dynamic nature of traffic states and trajectories, which is crucial for downstream tasks. To address this gap, we propose TRACK, a novel framework to bridge traffic state and trajectory data for dynamic road network and trajectory representation learning. TRACK leverages graph attention networks (GAT) to encode static and spatial road segment features, and introduces a transformer-based model for trajectory representation learning. By incorporating transition probabilities from trajectory data into GAT attention weights, TRACK captures dynamic spatial features of road segments. Meanwhile, TRACK designs a traffic transformer encoder to capture the spatial-temporal dynamics of road segments from traffic state data. To further enhance dynamic representations, TRACK proposes a co-attentional transformer encoder and a trajectory-traffic state matching task. Extensive experiments on real-life urban traffic datasets demonstrate the superiority of TRACK over state-of-the-art baselines. Case studies confirm TRACK’s ability to capture spatial-temporal dynamics effectively.
[AI-87] A Survey on Explainable Deep Reinforcement Learning
链接: https://arxiv.org/abs/2502.06869
作者: Zelei Cheng,Jiahao Yu,Xinyu Xing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deep Reinforcement Learning (DRL) has achieved remarkable success in sequential decision-making tasks across diverse domains, yet its reliance on black-box neural architectures hinders interpretability, trust, and deployment in high-stakes applications. Explainable Deep Reinforcement Learning (XRL) addresses these challenges by enhancing transparency through feature-level, state-level, dataset-level, and model-level explanation techniques. This survey provides a comprehensive review of XRL methods, evaluates their qualitative and quantitative assessment frameworks, and explores their role in policy refinement, adversarial robustness, and security. Additionally, we examine the integration of reinforcement learning with Large Language Models (LLMs), particularly through Reinforcement Learning from Human Feedback (RLHF), which optimizes AI alignment with human preferences. We conclude by highlighting open research challenges and future directions to advance the development of interpretable, reliable, and accountable DRL systems.
[AI-88] Global Ease of Living Index: a machine learning framework for longitudinal analysis of major economies
链接: https://arxiv.org/abs/2502.06866
作者: Tanay Panat,Rohitash Chandra
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is important to understand the long-term nature of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework for addressing the problem of missing data for some of the economic indicators for specific countries. We then curate and update the data and use a dimensionality reduction approach (principal component analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts. This transparency and accessibility make our work a valuable resource for ongoing research and policy development in quality-of-life assessment.
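A minimal sketch of the PCA-based composite index construction described above, using a toy indicator matrix; the real indicator set, imputation, and scaling details follow the paper, not this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy indicator matrix: rows = country-years, columns = socio-economic
# indicators (e.g., GDP per capita, life expectancy, public safety).
X = np.random.default_rng(1).normal(size=(200, 6))

Z = StandardScaler().fit_transform(X)          # put indicators on one scale
pc1 = PCA(n_components=1).fit_transform(Z)     # first principal component
index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())  # rescale to [0, 1]
print(index[:5].ravel())
```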
[AI-89] Design Considerations in Offline Preference-based RL
链接: https://arxiv.org/abs/2502.06861
作者: Alekh Agarwal,Christoph Dann,Teodor V. Marinov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
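As background, one member of this family is the DPO objective; a compact PyTorch rendering follows. This is the textbook form of the loss, with our own variable names, and is not a reproduction of the paper's theoretical analysis.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss. logp_* are summed
    log-likelihoods of the chosen (w) and rejected (l) responses under
    the trained policy; ref_logp_* under the frozen reference policy."""
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# Example with dummy per-response log-likelihoods:
lw, ll = torch.tensor([-12.3]), torch.tensor([-15.8])
rw, rl = torch.tensor([-13.0]), torch.tensor([-14.9])
print(dpo_loss(lw, ll, rw, rl))
```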
[AI-90] Gemstones: A Model Suite for Multi-Faceted Scaling Laws
链接: https://arxiv.org/abs/2502.06857
作者: Sean McLeish,John Kirchenbauer,David Yu Miller,Siddharth Singh,Abhinav Bhatele,Micah Goldblum,Ashwinee Panda,Tom Goldstein
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting. Code: this https URL
[AI-91] Native Fortran Implementation of TensorFlow-Trained Deep and Bayesian Neural Networks
链接: https://arxiv.org/abs/2502.06853
作者: Aidan Furlong,Xingang Zhao,Bob Salko,Xu Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted for inclusion in the 2025 American Nuclear Society Annual Conference
点击查看摘要
Abstract:Over the past decade, the investigation of machine learning (ML) within the field of nuclear engineering has grown significantly. With many approaches reaching maturity, the next phase of investigation will determine the feasibility and usefulness of ML model implementation in a production setting. Several of the codes used for reactor design and assessment are primarily written in the Fortran language, which is not immediately compatible with TensorFlow-trained ML models. This study presents a framework for implementing deep neural networks (DNNs) and Bayesian neural networks (BNNs) in Fortran, allowing for native execution without TensorFlow’s C API, Python runtime, or ONNX conversion. Designed for ease of use and computational efficiency, the framework can be implemented in any Fortran code, supporting iterative solvers and UQ via ensembles or BNNs. Verification was performed using a two-input, one-output test case composed of a noisy sinusoid to compare Fortran-based predictions to those from TensorFlow. The DNN predictions showed negligible differences and achieved a 19.6x speedup, whereas the BNN predictions exhibited minor disagreement, plausibly due to differences in random number generation. An 8.0x speedup was noted for BNN inference. The approach was then further verified on a nuclear-relevant problem predicting critical heat flux (CHF), which demonstrated similar behavior along with significant computational gains. Discussion regarding the framework’s successful integration into the CTF thermal-hydraulics code is also included, outlining its practical usefulness. Overall, this framework was shown to be effective at implementing both DNN and BNN model inference within Fortran, allowing for the continued study of ML-based methods in real-world nuclear applications.
[AI-92] EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
链接: https://arxiv.org/abs/2502.06852
作者: Lin Zhang,Wenshuo Dong,Zhuoran Zhang,Shu Yang,Lijie Hu,Ninghao Liu,Pan Zhou,Di Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is either affected by the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP). EAP-GP introduces an integration path, starting from the input and adaptively following the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.
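The integration-path idea is reminiscent of integrated gradients. The sketch below uses the simpler straight-line path from corrupted to clean input for illustration; EAP-GP's adaptive GradPath direction is not reproduced here, and the function name is ours.

```python
import torch

def path_attribution(model, clean, corrupted, steps=32):
    """Accumulate input gradients along a path from the corrupted to the
    clean input, then scale by the input difference (integrated-gradients
    style). EAP-GP instead adapts the path direction using gradient
    differences to avoid saturated regions."""
    total = torch.zeros_like(clean)
    delta = clean - corrupted
    for i in range(1, steps + 1):
        x = (corrupted + delta * i / steps).detach().requires_grad_(True)
        model(x).sum().backward()
        total += x.grad
    return delta * total / steps

# Usage with any differentiable model, e.g.:
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Tanh())
attr = path_attribution(model, torch.randn(1, 8), torch.zeros(1, 8))
print(attr.shape)
```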
[AI-93] Model Fusion via Neuron Transplantation ECML-PKDD2024
链接: https://arxiv.org/abs/2502.06849
作者: Muhammed Öz,Nicholas Kiefer,Charlotte Debus,Jasmin Hörter,Achim Streit,Markus Götz
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 7 figures, conference: ECML-PKDD 2024
点击查看摘要
Abstract:Ensemble learning is a widespread technique to improve the prediction performance of neural networks. However, it comes at the price of increased memory and inference time. In this work we propose a novel model fusion technique called Neuron Transplantation (NT) in which we fuse an ensemble of models by transplanting important neurons from all ensemble members into the vacant space obtained by pruning insignificant neurons. An initial loss in performance post-transplantation can be quickly recovered via fine-tuning, consistently outperforming individual ensemble members of the same model capacity and architecture. Furthermore, NT enables all the ensemble members to be jointly pruned and jointly trained in a combined model. Compared to alignment-based averaging (like Optimal-Transport fusion), it requires less fine-tuning than the corresponding OT-fused model, the fusion itself is faster and requires less memory, while the resulting model performance is comparable or better. The code is available under the following link: this https URL.
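A toy version of transplantation on a single linear layer might look like the following. Ranking neurons by weight norm is our assumed importance measure, and the paper applies this across whole ensembles with subsequent fine-tuning.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def transplant(base: nn.Linear, donor: nn.Linear, k: int):
    """Sketch of Neuron Transplantation on one Linear layer: rank output
    neurons by weight norm, then overwrite the k least important neurons
    of `base` with the k most important ones from `donor`."""
    base_rank = base.weight.norm(dim=1).argsort()               # ascending
    donor_rank = donor.weight.norm(dim=1).argsort(descending=True)
    victims, grafts = base_rank[:k], donor_rank[:k]
    base.weight[victims] = donor.weight[grafts]
    base.bias[victims] = donor.bias[grafts]

a, b = nn.Linear(16, 32), nn.Linear(16, 32)
transplant(a, b, k=4)   # 4 neurons of `a` replaced by `b`'s strongest
```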
[AI-94] Transfer learning in Scalable Graph Neural Network for Improved Physical Simulation
链接: https://arxiv.org/abs/2502.06848
作者: Siqi Shen,Yu Liu,Daniel Biggs,Omar Hafez,Jiandong Yu,Wentao Zhang,Bin Cui,Jiulong Shan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In recent years, Graph Neural Network (GNN) based models have shown promising results in simulating physics of complex systems. However, training dedicated graph network based physics simulators can be costly, as most models are confined to fully supervised training, which requires extensive data generated from traditional physics simulators. To date, how transfer learning could improve the model performance and training efficiency has remained unexplored. In this work, we introduce a pre-training and transfer learning paradigm for graph network simulators. We propose the scalable graph U-net (SGUNET). Incorporating an innovative depth-first search (DFS) pooling, the SGUNET is adaptable to different mesh sizes and resolutions for various simulation tasks. To enable the transfer learning between differently configured SGUNETs, we propose a set of mapping functions to align the parameters between the pre-trained model and the target model. An extra normalization term is also added into the loss to constrain the difference between the pre-trained weights and target model weights for better generalization performance. To pre-train our physics simulator, we created a dataset which includes 20,000 physical simulations of randomly selected 3D shapes from the open-source A Big CAD (ABC) dataset. We show that our proposed transfer learning methods allow the model to perform better when fine-tuned with small amounts of training data than when it is trained from scratch on the full dataset. On the 2D Deformable Plate benchmark dataset, our pre-trained model fine-tuned on 1/16 of the training data achieved an 11.05% improvement in position RMSE compared to the model trained from scratch.
[AI-95] Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure
链接: https://arxiv.org/abs/2502.06846
作者: Zhicong Wang,Zicheng Ma,Ziqiang Cao,Changlong Zhou,Jun Zhang,Yiqin Gao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 9 pages, 2 figures
点击查看摘要
Abstract:Proteins play a pivotal role in living organisms, yet understanding their functions presents significant challenges, including the limited flexibility of classification-based methods, the inability to effectively leverage spatial structural information, and the lack of systematic evaluation metrics for protein QA systems. To address these limitations, we propose Prot2Chat, a novel framework that integrates multimodal protein representations with natural language through a unified module, enabling large language model (LLM)-driven answer generation. Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, a protein-text adapter with cross-attention mechanisms, and a LLaMA3 decoder. To optimize training efficiency, we freeze the encoder and employ LoRA techniques for the decoder. We conducted experiments on two datasets, and both automated metrics and expert evaluations demonstrate the superior performance of our model. Furthermore, zero-shot prediction results highlight its strong generalization capabilities. This framework offers a promising solution for bridging protein domain knowledge with natural language understanding, paving the way for transformative advancements in protein-related research.
[AI-96] Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases
链接: https://arxiv.org/abs/2502.06842
作者: Andrew G. Breithaupt,Alice Tang,Bruce L. Miller,Pedro Pinheiro-Chagas
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 20 pages, 1 figure
点击查看摘要
Abstract:Healthcare systems are struggling to meet the growing demand for neurological care, with challenges particularly acute in Alzheimer’s disease and related dementias (ADRD). While artificial intelligence research has often focused on identifying patterns beyond human perception, implementing such predictive capabilities remains challenging as clinicians cannot readily verify insights they cannot themselves detect. We propose that large language models (LLMs) offer more immediately practical applications by enhancing clinicians’ capabilities in three critical areas: comprehensive data collection, interpretation of complex clinical information, and timely application of relevant medical knowledge. These challenges stem from limited time for proper diagnosis, growing data complexity, and an overwhelming volume of medical literature that exceeds any clinician’s capacity to fully master. We present a framework for responsible AI integration that leverages LLMs’ ability to communicate effectively with both patients and providers while maintaining human oversight. This approach prioritizes standardized, high-quality data collection to enable a system that learns from every patient encounter while incorporating the latest clinical evidence, continuously improving care delivery. We begin to address implementation challenges and initiate important discussions around ethical considerations and governance needs. While developed for ADRD, this roadmap provides principles for responsible AI integration across neurology and other medical specialties, with potential to improve diagnostic accuracy, reduce care disparities, and advance clinical knowledge through a learning healthcare system.
[AI-97] CAST: Cross Attention based multimodal fusion of Structure and Text for materials property prediction
链接: https://arxiv.org/abs/2502.06836
作者: Jaewan Lee,Changyoung Park,Hongjun Yang,Sungbin Lim,Sehui Han
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures
点击查看摘要
Abstract:Recent advancements in AI have revolutionized property prediction in materials science and accelerating material discovery. Graph neural networks (GNNs) stand out due to their ability to represent crystal structures as graphs, effectively capturing local interactions and delivering superior predictions. However, these methods often lose critical global information, such as crystal systems and repetitive unit connectivity. To address this, we propose CAST, a cross-attention-based multimodal fusion model that integrates graph and text modalities to preserve essential material information. CAST combines node- and token-level features using cross-attention mechanisms, surpassing previous approaches reliant on material-level embeddings like graph mean-pooling or [CLS] tokens. A masked node prediction pretraining strategy further enhances atomic-level information integration. Our method achieved up to 22.9% improvement in property prediction across four crystal properties including band gap compared to methods like CrysMMNet and MultiMat. Pretraining was key to aligning node and text embeddings, with attention maps confirming its effectiveness in capturing relationships between nodes and tokens. This study highlights the potential of multimodal learning in materials science, paving the way for more robust predictive models that incorporate both local and global information.
[AI-98] A Unified Knowledge-Distillation and Semi-Supervised Learning Framework to Improve Industrial Ads Delivery Systems
链接: https://arxiv.org/abs/2502.06834
作者: Hamid Eghbalzadeh,Yang Wang,Rui Li,Yuji Mo,Qin Ding,Jiaxiang Fu,Liang Dai,Shuo Gu,Nima Noorshams,Sem Park,Bo Long,Xue Feng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Industrial ads ranking systems conventionally rely on labeled impression data, which leads to challenges such as overfitting, slower incremental gain from model scaling, and biases due to discrepancies between training and serving data. To overcome these issues, we propose a Unified framework for Knowledge-Distillation and Semi-supervised Learning (UKDSL) for ads ranking, empowering the training of models on significantly larger and more diverse datasets, thereby reducing overfitting and mitigating training-serving data discrepancies. We provide detailed formal analysis and numerical simulations on the inherent miscalibration and prediction bias of multi-stage ranking systems, and show empirical evidence of the proposed framework's capability to mitigate those. Compared to prior work, UKDSL can enable models to learn from a much larger set of unlabeled data, hence improving the performance while being computationally efficient. Finally, we report the successful deployment of UKDSL in an industrial setting across various ranking models, serving users at multi-billion scale, across various surfaces, geographic locations, clients, and optimizing for various events, which to the best of our knowledge is the first of its kind in terms of the scale and efficiency at which it operates.
[AI-99] Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach ICML2025
链接: https://arxiv.org/abs/2502.06832
作者: Xu Zhang,Kaidi Xu,Ziqing Hu,Ren Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures, submitted to ICML 2025 (under review)
点击查看摘要
Abstract:Mixture of Experts (MoE) have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy JTDMoE for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods.
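The dual-model combination itself is simple to express; a hedged sketch with smoothing parameter alpha is below. The module and parameter names are ours, and the robust component stands in for any adversarially trained MoE.

```python
import torch.nn as nn

class DualMoE(nn.Module):
    """Linear combination of a standard and a robustified model with a
    smoothing parameter alpha, mirroring the dual-model strategy in the
    abstract: alpha -> 1 favors robustness, alpha -> 0 clean accuracy."""
    def __init__(self, standard: nn.Module, robust: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.standard, self.robust, self.alpha = standard, robust, alpha

    def forward(self, x):
        return (1 - self.alpha) * self.standard(x) + self.alpha * self.robust(x)
```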
[AI-100] No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data
链接: https://arxiv.org/abs/2502.06831
作者: Daniel Cai,Randall Balestriero
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Implicit neural representations (INRs) exhibit growing promise in addressing Earth representation challenges, ranging from emissions monitoring to climate modeling. However, existing methods disproportionately prioritize global average performance, whereas practitioners require fine-grained insights to understand biases and variations in these models. To bridge this gap, we introduce FAIR-Earth: a first-of-its-kind dataset explicitly crafted to examine and challenge inequities in Earth representations. FAIR-Earth comprises various high-resolution Earth signals and uniquely aggregates extensive metadata along stratifications like landmass size and population density to assess the fairness of models. Evaluating state-of-the-art INRs across the various modalities of FAIR-Earth, we uncover striking performance disparities. Certain subgroups, especially those associated with high-frequency signals (e.g., islands, coastlines), are consistently poorly modeled by existing methods. In response, we propose spherical wavelet encodings, building on previous spatial encoding research. Leveraging the multi-resolution capabilities of wavelets, our encodings yield consistent performance over various scales and locations, offering more accurate and robust representations of the biased subgroups. These open-source contributions represent a crucial step towards the equitable assessment and deployment of Earth INRs.
[AI-101] Convolution-Based Converter : A Weak-Prior Approach For Modeling Stochastic Processes Based On Conditional Density Estimation
链接: https://arxiv.org/abs/2502.06829
作者: Chaoran Pang,Shuangrong Liu,Shikun Tian,WenHao Yue,Xingshen Zhang,Lin Wang,Bo Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, a Convolution-Based Converter (CBC) is proposed as a methodology for removing strong or fixed priors when estimating the probability distribution of targets based on observations in a stochastic process. Traditional approaches, e.g., Markov-based and Gaussian process-based methods, typically leverage observations to estimate targets based on strong or fixed priors (such as Markov properties or a Gaussian prior). However, the effectiveness of these methods depends on how well their prior assumptions align with the characteristics of the problem. When the assumed priors are not satisfied, these approaches may perform poorly or even become unusable. To overcome the above limitation, we introduce the Convolution-Based Converter (CBC), which implicitly estimates the conditional probability distribution of targets without strong or fixed priors, and directly outputs the expected trajectory of the stochastic process that satisfies the constraints from observations. This approach reduces the dependence on priors, enhancing flexibility and adaptability in modeling stochastic processes when addressing different problems. Experimental results demonstrate that our method outperforms existing baselines across multiple metrics.
[AI-102] Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study
链接: https://arxiv.org/abs/2502.06828
作者: Martin Wimpff,Bruno Aristimunha,Sylvain Chevallier,Bin Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This study investigates continual fine-tuning strategies for deep learning in online longitudinal electroencephalography (EEG) motor imagery (MI) decoding within a causal setting involving a large user group and multiple sessions per participant. We are the first to explore such strategies across a large user group, as longitudinal adaptation is typically studied in the single-subject setting with a single adaptation strategy, which limits the ability to generalize findings. First, we examine the impact of different fine-tuning approaches on decoder performance and stability. Building on this, we integrate online test-time adaptation (OTTA) to adapt the model during deployment, complementing the effects of prior fine-tuning. Our findings demonstrate that fine-tuning that successively builds on prior subject-specific information improves both performance and stability, while OTTA effectively adapts the model to evolving data distributions across consecutive sessions, enabling calibration-free operation. These results offer valuable insights and recommendations for future research in longitudinal online MI decoding and highlight the importance of combining domain adaptation strategies for improving BCI performance in real-world applications. Clinical Relevance: Our investigation enables more stable and efficient long-term motor imagery decoding, which is critical for neurorehabilitation and assistive technologies.
[AI-103] Learning to Synthesize Compatible Fashion Items Using Semantic Alignment and Collocation Classification: An Outfit Generation Framework
链接: https://arxiv.org/abs/2502.06827
作者: Dongliang Zhou,Haijun Zhang,Kai Yang,Linlin Liu,Han Yan,Xiaofei Xu,Zhao Zhang,Shuicheng Yan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: This paper was accepted by IEEE TNNLS
点击查看摘要
Abstract:The field of fashion compatibility learning has attracted great attention from both the academic and industrial communities in recent years. Many studies have been carried out for fashion compatibility prediction, collocated outfit recommendation, artificial intelligence (AI)-enabled compatible fashion design, and related topics. In particular, AI-enabled compatible fashion design can be used to synthesize compatible fashion items or outfits in order to improve the design experience for designers or the efficacy of recommendations for customers. However, previous generative models for collocated fashion synthesis have generally focused on the image-to-image translation between fashion items of upper and lower clothing. In this paper, we propose a novel outfit generation framework, i.e., OutfitGAN, with the aim of synthesizing a set of complementary items to compose an entire outfit, given one extant fashion item and reference masks of target synthesized items. OutfitGAN includes a semantic alignment module, which is responsible for characterizing the mapping correspondence between the existing fashion items and the synthesized ones, to improve the quality of the synthesized images, and a collocation classification module, which is used to improve the compatibility of a synthesized outfit. In order to evaluate the performance of our proposed models, we built a large-scale dataset consisting of 20,000 fashion outfits. Extensive experimental results on this dataset show that our OutfitGAN can synthesize photo-realistic outfits and outperform state-of-the-art methods in terms of similarity, authenticity and compatibility measurements.
[AI-104] ransferring Graph Neural Networks for Soft Sensor Modeling using Process Topologies
链接: https://arxiv.org/abs/2502.06826
作者: Maximilian F. Theisen,Gabrie M. H. Meesters,Artur M. Schweidtmann
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Data-driven soft sensors help in process operations by providing real-time estimates of otherwise hard-to-measure process quantities, e.g., viscosities or product concentrations. Currently, soft sensors need to be developed individually per plant. Using transfer learning, machine learning-based soft sensors could be reused and fine-tuned across plants and applications. However, transferring data-driven soft sensor models is in practice often not possible, because the fixed input structure of standard soft sensor models prohibits transfer if, e.g., the sensor information is not identical in all plants. We propose a topology-aware graph neural network approach for transfer learning of soft sensor models across multiple plants. In our method, plants are modeled as graphs: Unit operations are nodes, streams are edges, and sensors are embedded as attributes. Our approach brings two advantages for transfer learning: First, we not only include sensor data but also crucial information on the plant topology. Second, the graph neural network algorithm is flexible with respect to its sensor inputs. This allows us to model data from different plants with different sensor networks. We test the transfer learning capabilities of our modeling approach on ammonia synthesis loops with different process topologies. We build a soft sensor predicting the ammonia concentration in the product. After training on data from one process, we successfully transfer our soft sensor model to a previously unseen process with a different topology. Our approach promises to extend data-driven soft sensors to cases that leverage data from multiple plants.
[AI-105] Neural Network-based Vehicular Channel Estimation Performance: Effect of Noise in the Training Set
链接: https://arxiv.org/abs/2502.06824
作者: Simbarashe Aldrin Ngorima,Albert Helberg,Marelie H. Davel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 5 Figures
点击查看摘要
Abstract:Vehicular communication systems face significant challenges due to high mobility and rapidly changing environments, which affect the channel over which the signals travel. To address these challenges, neural network (NN)-based channel estimation methods have been suggested. These methods are primarily trained on high signal-to-noise ratio (SNR) data, with the assumption that training a NN in less noisy conditions can result in good generalisation. This study examines the effectiveness of training NN-based channel estimators on mixed SNR datasets compared to training solely on high SNR datasets, as seen in several related works. Estimators evaluated in this work include an architecture that uses convolutional layers and self-attention mechanisms; a method that employs temporal convolutional networks and data pilot-aided estimation; two methods that combine classical methods with multilayer perceptrons; and the current state-of-the-art model that combines Long-Short-Term Memory networks with data pilot-aided and temporal averaging methods as post processing. Our results indicate that using only high SNR data for training is not always optimal, and the SNR range in the training dataset should be treated as a hyperparameter that can be adjusted for better performance. This is illustrated by the better performance of some models in low SNR conditions when trained on the mixed SNR dataset, as opposed to when trained exclusively on high SNR data.
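Building a mixed-SNR training set amounts to drawing a per-sample SNR before adding noise. A sketch with complex additive white Gaussian noise follows; the signal model and the 0-30 dB range are illustrative assumptions, the latter being exactly the kind of hyperparameter the study suggests tuning.

```python
import numpy as np

def add_awgn(signal: np.ndarray, snr_db: float, rng) -> np.ndarray:
    """Add complex white Gaussian noise at a target SNR in dB."""
    p_signal = np.mean(np.abs(signal) ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = np.sqrt(p_noise / 2) * (rng.standard_normal(signal.shape)
                                    + 1j * rng.standard_normal(signal.shape))
    return signal + noise

rng = np.random.default_rng(0)
# Toy unit-modulus symbols standing in for transmitted OFDM frames.
clean = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(1024, 64)))
snrs = rng.uniform(0, 30, size=len(clean))      # per-sample SNR draw
noisy = np.stack([add_awgn(s, snr, rng) for s, snr in zip(clean, snrs)])
print(noisy.shape)   # (1024, 64)
```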
[AI-106] LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning
链接: https://arxiv.org/abs/2502.06820
作者: Zhekai Du,Yinjie Min,Jingjing Li,Ke Lu,Changliang Zou,Liuhua Peng,Tingjin Chu,Mingming Gong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Low-rank adaptation (LoRA) has become a prevalent method for adapting pre-trained large language models to downstream tasks. However, the simple low-rank decomposition form may constrain the hypothesis space. To address this limitation, we introduce Location-aware Cosine Adaptation (LoCA), a novel frequency-domain parameter-efficient fine-tuning method based on inverse Discrete Cosine Transform (iDCT) with selective locations of learnable components. We begin with a comprehensive theoretical comparison between frequency-domain and low-rank decompositions for fine-tuning pre-trained large models. Our analysis reveals that frequency-domain approximation with carefully selected frequency components can surpass the expressivity of traditional low-rank-based methods. Furthermore, we demonstrate that iDCT offers a more efficient implementation compared to inverse Discrete Fourier Transform (iDFT), allowing for better selection and tuning of frequency components while maintaining equivalent expressivity to the optimal iDFT-based adaptation. By employing finite-difference approximation to estimate gradients for discrete locations of learnable coefficients on the DCT spectrum, LoCA dynamically selects the most informative frequency components during training. Experiments on diverse language and vision fine-tuning tasks demonstrate that LoCA offers enhanced parameter efficiency while maintains computational feasibility comparable to low-rank-based methods.
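To make the iDCT parameterization concrete, the sketch below reconstructs a dense weight update from a handful of DCT coefficients at selected spectrum locations. Here the locations and values are random, whereas LoCA selects and learns them; the shapes and counts are illustrative only.

```python
import numpy as np
from scipy.fft import idctn

rng = np.random.default_rng(0)
shape = (64, 64)

# Sparse learnable coefficients on the 2D DCT spectrum: 32 parameters
# instead of the 64*64 entries of a dense update.
coeffs = np.zeros(shape)
rows = rng.integers(0, shape[0], size=32)
cols = rng.integers(0, shape[1], size=32)
coeffs[rows, cols] = rng.normal(size=32)

delta_w = idctn(coeffs, norm="ortho")   # dense update via inverse DCT
print(delta_w.shape)                     # (64, 64)
```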
[AI-107] DeepCell: Multiview Representation Learning for Post-Mapping Netlists
链接: https://arxiv.org/abs/2502.06816
作者: Zhengyuan Shi,Chengyu Ma,Ziyang Zheng,Lingfeng Zhou,Hongyang Pan,Wentao Jiang,Fan Yang,Xiaoyan Yang,Zhufei Chu,Qiang Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Representation learning for post-mapping (PM) netlists is a critical challenge in Electronic Design Automation (EDA), driven by the diverse and complex nature of modern circuit designs. Existing approaches focus on intermediate representations like And-Inverter Graphs (AIGs), limiting their applicability to post-synthesis stages. We introduce DeepCell, a multiview representation learning framework that integrates structural and functional insights from both PM netlists and AIGs to learn rich, generalizable embeddings. At its core, DeepCell employs the novel Mask Circuit Modeling (MCM) mechanism, which refines PM netlist representations in a self-supervised manner using pretrained AIG encoders. DeepCell sets a new benchmark in PM netlist representation, outperforming existing methods in predictive accuracy and reconstruction fidelity. To validate its efficacy, we apply DeepCell to functional Engineering Change Orders (ECO), achieving significant reductions in patch generation costs and runtime while improving patch quality.
[AI-108] Diffusion Instruction Tuning
链接: https://arxiv.org/abs/2502.06814
作者: Chen Jin,Ryutaro Tanno,Amrutha Saseendran,Tom Diethe,Philip Teare
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project page at this https URL
点击查看摘要
Abstract:We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model’s visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at this https URL.
[AI-109] Policy Guided Tree Search for Enhanced LLM Reasoning
链接: https://arxiv.org/abs/2502.06813
作者: Yang Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-110] On the Benefits of Attribute-Driven Graph Domain Adaptation ICLR2025
链接: https://arxiv.org/abs/2502.06808
作者: Ruiyi Fang,Bingheng Li,Zhao Kang,Qiuhao Zeng,Ruizhi Pu,Nima Hosseini Dashtbayaz,Boyu Wang,Charles Ling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the ICLR 2025
点击查看摘要
Abstract:Graph Domain Adaptation (GDA) addresses a pressing challenge in cross-network learning, particularly pertinent due to the absence of labeled data in real-world graph datasets. Recent studies have attempted to learn domain-invariant representations by eliminating structural shifts between graphs. In this work, we show that existing methodologies have overlooked the significance of graph node attributes, a pivotal factor for graph domain alignment. Specifically, we first reveal the impact of node attributes on GDA by theoretically proving that, in addition to the graph structural divergence between the domains, the node attribute discrepancy also plays a critical role in GDA. Moreover, we empirically show that the attribute shift is more substantial than the topology shift, which further underscores the importance of node attribute alignment in GDA. Inspired by this finding, a novel cross-channel module is developed to fuse and align both views between the source and target graphs for GDA. Experimental results on a variety of benchmarks verify the effectiveness of our method.
[AI-111] Information-theoretic Bayesian Optimization: Survey and Tutorial
链接: https://arxiv.org/abs/2502.06789
作者: Eduardo C. Garrido-Merchán
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
[AI-112] Exoplanet Transit Candidate Identification in TESS Full-Frame Images via a Transformer-Based Algorithm
链接: https://arxiv.org/abs/2502.07542
作者: Helem Salinas,Rafael Brahm,Greg Olmschenk,Richard K. Barry,Karim Pichara,Stela Ishitani Silva,Vladimir Araujo
类目: Earth and Planetary Astrophysics (astro-ph.EP); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
*备注:
[AI-113] 5D Neural Surrogates for Nonlinear Gyrokinetic Simulations of Plasma Turbulence
链接: https://arxiv.org/abs/2502.07469
作者: Gianluca Galletti,Fabian Paischer,Paul Setinek,William Hornsby,Lorenzo Zanisi,Naomi Carey,Stanislas Pamela,Johannes Brandstetter
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 6 pages (+ references and appendix)
[AI-114] Explainable Multimodal Machine Learning for Revealing Structure-Property Relationships in Carbon Nanotube Fibers
链接: https://arxiv.org/abs/2502.07400
作者: Daisuke Kimura,Naoko Tajima,Toshiya Okazaki,Shun Muroga
类目: Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 33 pages, 9 figures
[AI-115] VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification
链接: https://arxiv.org/abs/2502.07205
作者: Pengyu Wang,Ying Fang,Xiaofei Li
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to IEEE/ACM Trans. on TASLP
[AI-116] Generative Distribution Prediction: A Unified Approach to Multimodal Learning
链接: https://arxiv.org/abs/2502.07090
作者: Xinyu Tian,Xiaotong Shen
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 31 pages 4 figures
点击查看摘要
Abstract:Accurate prediction with multimodal data, encompassing tabular, textual, and visual inputs or outputs, is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic data generation, such as conditional diffusion models, to enhance predictive performance across structured and unstructured modalities. GDP is model-agnostic, compatible with any high-fidelity generative model, and supports transfer learning for domain adaptation. We establish a rigorous theoretical foundation for GDP, providing statistical guarantees on its predictive accuracy when using diffusion models as the generative backbone. By estimating the data-generating distribution and adapting to various loss functions for risk minimization, GDP enables accurate point predictions across multimodal settings. We empirically validate GDP on four supervised learning tasks (tabular data prediction, question answering, image captioning, and adaptive quantile regression), demonstrating its versatility and effectiveness across diverse domains.
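GDP 的核心思想是“先估计条件分布、再针对给定损失函数取风险最小化的点预测”,可用如下 NumPy 示意说明(`generate(x)` 为假设的条件生成模型采样接口,样本数与分位点均为假设):
```python
import numpy as np

def gdp_point_prediction(generate, x, loss="squared", n_samples=256, tau=0.9):
    # 示意:从条件生成模型采样 y ~ p_hat(y|x),
    # 再针对损失函数输出估计分布下风险最小的点预测。
    ys = np.asarray([generate(x) for _ in range(n_samples)])
    if loss == "squared":     # 平方损失的风险最小化解是均值
        return ys.mean(axis=0)
    if loss == "absolute":    # 绝对损失对应中位数
        return np.median(ys, axis=0)
    if loss == "quantile":    # 分位数损失对应 tau 分位点(自适应分位回归)
        return np.quantile(ys, tau, axis=0)
    raise ValueError(loss)
```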
[AI-117] TRADES: Generating Realistic Market Simulations with Diffusion Models
链接: https://arxiv.org/abs/2502.07071
作者: Leonardo Berti,Bardh Prenkaj,Paola Velardi
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 14 pages
[AI-118] Direct Estimation of Pediatric Heart Rate Variability from BOLD-fMRI: A Machine Learning Approach Using Dynamic Connectivity
链接: https://arxiv.org/abs/2502.06920
作者: Abdoljalil Addeh,Karen Ardila,Rebecca J Williams,G. Bruce Pike,M. Ethan MacDonald
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 5 figures, ISMRM 2025
[AI-119] UniZyme: A Unified Protein Cleavage Site Predictor Enhanced with Enzyme Active-Site Knowledge
链接: https://arxiv.org/abs/2502.06914
作者: Chenao Li,Shuo Yan,Enyan Dai
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages,8 figures
[AI-120] A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer
链接: https://arxiv.org/abs/2502.06913
作者: Lirong Wu,Yunfan Liu,Haitao Lin,Yufei Huang,Guojiang Zhao,Zhifeng Gao,Stan Z. Li
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The proteins that exist today have been optimized over billions of years of natural evolution, during which nature creates random mutations and selects them. The discovery of functionally promising mutations is challenged by the limited evolutionarily accessible regions, i.e., only a small region on the fitness landscape is beneficial. There have been numerous priors used to constrain protein evolution to regions of landscapes with high-fitness variants, among which the change in binding free energy (DDG) of protein complexes upon mutations is one of the most commonly used priors. However, the huge mutation space poses two challenges: (1) how to improve the efficiency of DDG prediction for fast mutation screening; and (2) how to explain mutation preferences and efficiently explore accessible evolutionary regions. To address these challenges, we propose a lightweight DDG predictor (Light-DDG), which adopts a structure-aware Transformer as the backbone and enhances it with knowledge distilled from existing powerful but computationally heavy DDG predictors. Additionally, we augmented, annotated, and released a large-scale dataset containing millions of mutation data points for pre-training Light-DDG. We find that such a simple yet effective Light-DDG can serve as a good unsupervised antibody optimizer and explainer. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences, which accounts for the marginal benefit of each mutation per residue. To further explore accessible evolutionary regions, we conduct preference-guided antibody optimization and evaluate antibody candidates quickly using Light-DDG to identify desirable mutations.
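下面用一个简短的 PyTorch 片段示意从重型教师模型向 Light-DDG 蒸馏的训练步(损失形式、`alpha` 权重与 batch 字段名均为假设):
```python
import torch
import torch.nn.functional as F

def distill_step(light_ddg, heavy_ddg, batch, alpha=0.5):
    # 示意:同时回归真实 DDG 标签与重型教师模型的预测,
    # 把教师的知识蒸馏进轻量级的 Light-DDG。
    with torch.no_grad():
        teacher = heavy_ddg(batch["complex"], batch["mutation"])
    student = light_ddg(batch["complex"], batch["mutation"])
    return (1 - alpha) * F.mse_loss(student, batch["ddg"]) \
         + alpha * F.mse_loss(student, teacher)
```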
[AI-121] DiffNMR3: Advancing NMR Resolution Beyond Instrumental Limits
链接: https://arxiv.org/abs/2502.06845
作者: Sen Yan,Etienne Goffinet,Fabrizio Gabellieri,Ryan Young,Lydia Gkoura,Laurence Jennings,Filippo Castiglione,Thomas Launey
类目: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures
[AI-122] A Hybrid Model for Weakly-Supervised Speech Dereverberation
链接: https://arxiv.org/abs/2502.06839
作者: Louis Bahrman(S2A, IDS),Mathieu Fontaine(S2A, IDS),Gael Richard(S2A, IDS)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
*备注:
[AI-123] OrderFusion: Encoding Orderbook for Probabilistic Intraday Price Prediction
链接: https://arxiv.org/abs/2502.06830
作者: Runyao Yu,Yuchen Tao,Fabian Leimgruber,Tara Esterl,Jochen L. Cremer
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures, 3 tables
[AI-124] Emergence of Self-Awareness in Artificial Systems: A Minimalist Three-Layer Approach to Artificial Consciousness
链接: https://arxiv.org/abs/2502.06810
作者: Kurando Iida
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注: 46 pages
点击查看摘要
Abstract:This paper proposes a minimalist three-layer model for artificial consciousness, focusing on the emergence of self-awareness. The model comprises a Cognitive Integration Layer, a Pattern Prediction Layer, and an Instinctive Response Layer, interacting with Access-Oriented and Pattern-Integrated Memory systems. Unlike brain-replication approaches, we aim to achieve minimal self-awareness through essential elements only. Self-awareness emerges from layer interactions and dynamic self-modeling, without initial explicit self-programming. We detail each component’s structure, function, and implementation strategies, addressing technical feasibility. This research offers new perspectives on consciousness emergence in artificial systems, with potential implications for human consciousness understanding and adaptable AI development. We conclude by discussing ethical considerations and future research directions.
机器学习
[LG-0] Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
链接: https://arxiv.org/abs/2502.07783
作者: Leyang Hu,Randall Balestriero
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The scaling of model size and data size has reshaped the paradigm of AI. As a result, the common protocol to leverage the latest models is to steer them towards a specific downstream task of interest through fine-tuning. Despite its importance, the main methods for fine-tuning remain limited to full or low-rank adapters, which contain countless hyper-parameters and lack interpretability. In this paper, we take a step back and demonstrate how novel and explainable post-training steering solutions can be derived theoretically from spline operators, a rich mathematical framing of Deep Networks that was recently developed. Our method, coined Curvature Tuning (CT), has a single parameter that provably modulates the curvature of the model's decision boundary, henceforth allowing training-free steering. This makes CT both more efficient and interpretable than conventional fine-tuning methods. We empirically validate its effectiveness in improving generalization and robustness of pretrained models. For example, CT improves out-of-distribution transfer performances of ResNet-18/50 by 2.57%/1.74% across seventeen downstream datasets, and improves RobustBench robust accuracy by 11.76%/348.44%. Additionally, we apply CT to ReLU-based Swin-T/S, improving their generalization on nine downstream datasets by 2.43%/3.33%. Our code is available at this https URL.
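论文的精确算子未在摘要中给出,下面是一个带假设的示意:用单一标量 beta 将 ReLU 平滑化,从而免训练地调节决策边界的曲率(beta 趋于 0 时退化为 ReLU)。这只是 CT 思想的一个可能实例,并非原方法本身。
```python
import torch

class CurvatureAct(torch.nn.Module):
    # 示意(非论文的精确算子):单一标量 beta > 0 平滑 ReLU 的折点,
    # 从而改变决策边界曲率;beta -> 0 时在极限下恢复 ReLU。
    def __init__(self, beta=0.1):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        return torch.nn.functional.softplus(x, beta=1.0 / self.beta)

def steer(model, beta):
    # 免训练地把预训练模型中的所有 ReLU 换成可调曲率单元
    for name, child in model.named_children():
        if isinstance(child, torch.nn.ReLU):
            setattr(model, name, CurvatureAct(beta))
        else:
            steer(child, beta)
    return model
```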
[LG-1] Optimistic Interior Point Methods for Sequential Hypothesis Testing by Betting
链接: https://arxiv.org/abs/2502.07774
作者: Can Chen,Jun-Kun Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The technique of “testing by betting” frames nonparametric sequential hypothesis testing as a multiple-round game, where a player bets on future observations that arrive in a streaming fashion, accumulates wealth that quantifies evidence against the null hypothesis, and rejects the null once the wealth exceeds a specified threshold while controlling the false positive error. Designing an online learning algorithm that achieves a small regret in the game can help rapidly accumulate the bettor’s wealth, which in turn can shorten the time to reject the null hypothesis under the alternative H_1 . However, many of the existing works employ the Online Newton Step (ONS) to update within a halved decision space to avoid a gradient explosion issue, which is potentially conservative for rapid wealth accumulation. In this paper, we introduce a novel strategy utilizing interior-point methods in optimization that allows updates across the entire interior of the decision space without the risk of gradient explosion. Our approach not only maintains strong statistical guarantees but also facilitates faster null hypothesis rejection in critical scenarios, overcoming the limitations of existing approaches.
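下面用玩具代码演示 "testing by betting" 的财富过程。固定下注比例是此处的简化假设;论文的要点正是用在线优化来学习下注策略,内点法相对 ONS 的优势在于可以在整个决策空间内部更新。
```python
import numpy as np

def betting_test(stream, alpha=0.05, null_mean=0.5, bet=0.2):
    # 玩具示意:检验 H0: E[x] = null_mean,其中 x 取值于 [0, 1]。
    # 在 H0 下财富过程是非负鞅;由 Ville 不等式,
    # 财富越过 1/alpha 才拒绝,即可控制第一类错误。
    wealth = 1.0
    for t, x in enumerate(stream, start=1):
        wealth *= 1.0 + bet * (x - null_mean)
        if wealth >= 1.0 / alpha:
            return f"第 {t} 步拒绝 H0"
    return "未能拒绝 H0"

print(betting_test(np.random.default_rng(0).uniform(0.3, 1.0, 1000)))
```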
[LG-2] Scalable Fingerprinting of Large Language Models
链接: https://arxiv.org/abs/2502.07760
作者: Anshul Nasery,Jonathan Hayase,Creston Brooks,Peiyao Sheng,Himanshu Tyagi,Pramod Viswanath,Sewoong Oh
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 23 pages 15 figures
点击查看摘要
Abstract:Model fingerprinting has emerged as a powerful tool for model owners to identify their shared model given API access. However, to lower the false discovery rate, fight fingerprint leakage, and defend against coalitions of model users attempting to bypass detection, we argue that scalability is critical, i.e., scaling up the number of fingerprints one can embed into a model. Hence, we pose scalability as a crucial requirement for fingerprinting schemes. We experiment with fingerprint design at a scale significantly larger than previously considered, and introduce a new method, dubbed Perinucleus sampling, to generate scalable, persistent, and harmless fingerprints. We demonstrate that this scheme can add 24,576 fingerprints to a Llama-3.1-8B model (two orders of magnitude more than existing schemes) without degrading the model's utility. Our inserted fingerprints persist even after supervised fine-tuning on standard post-training data. We further address security risks for fingerprinting, and theoretically and empirically show how a scalable fingerprinting scheme like ours can mitigate these risks.
[LG-3] HiPoNet: A Topology-Preserving Multi-View Neural Network For High Dimensional Point Cloud and Single-Cell Data
链接: https://arxiv.org/abs/2502.07746
作者: Siddharth Viswanath,Hiren Madhu,Dhananjay Bhaskar,Jake Kovalic,Dave Johnson,Rex Ying,Christopher Tape,Ian Adelstein,Michael Perlmutter,Smita Krishnaswamy
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:
点击查看摘要
Abstract:In this paper, we propose HiPoNet, an end-to-end differentiable neural network for regression, classification, and representation learning on high-dimensional point clouds. Single-cell data can have high dimensionality exceeding the capabilities of existing point-cloud methods tailored for 3D data. Moreover, modern single-cell and spatial experiments now yield entire cohorts of datasets (i.e., one on every patient), necessitating models that can process large, high-dimensional point clouds at scale. Most current approaches build a single nearest-neighbor graph, discarding important geometric information. In contrast, HiPoNet forms higher-order simplicial complexes through learnable feature reweighting, generating multiple data views that disentangle distinct biological processes. It then employs simplicial wavelet transforms to extract multi-scale features, capturing both local and global topology. We empirically show that these components preserve topological information in the learned representations, and that HiPoNet significantly outperforms state-of-the-art point-cloud and graph-based models on single-cell data. We also show an application of HiPoNet on spatial transcriptomics datasets using spatial coordinates as one of the views. Overall, HiPoNet offers a robust and scalable solution for high-dimensional data analysis.
[LG-4] Advancing climate model interpretability: Feature attribution for Arctic melt anomalies
链接: https://arxiv.org/abs/2502.07741
作者: Tolulope Ale,Nicole-Jeanne Schlegel,Vandana P. Janeja
类目: Machine Learning (cs.LG)
*备注: 9 pages
点击查看摘要
Abstract:The focus of our work is improving the interpretability of anomalies in climate models and advancing our understanding of Arctic melt dynamics. The Arctic and Antarctic ice sheets are experiencing rapid surface melting and increased freshwater runoff, contributing significantly to global sea level rise. Understanding the mechanisms driving snowmelt in these regions is crucial. ERA5, a widely used reanalysis dataset in polar climate studies, offers extensive climate variables and global data assimilation. However, its snowmelt model employs an energy imbalance approach that may oversimplify the complexity of surface melt. In contrast, the Glacier Energy and Mass Balance (GEMB) model incorporates additional physical processes, such as snow accumulation, firn densification, and meltwater percolation/refreezing, providing a more detailed representation of surface melt dynamics. In this research, we focus on analyzing surface snowmelt dynamics of the Greenland Ice Sheet using feature attribution for anomalous melt events in ERA5 and GEMB models. We present a novel unsupervised attribution method that leverages counterfactual explanations to analyze detected anomalies in ERA5 and GEMB. Our anomaly detection results are validated using MEaSUREs ground-truth data, and the attributions are evaluated against established feature ranking methods, including XGBoost, Shapley values, and Random Forest. Our attribution framework identifies the physics behind each model and the climate features driving melt anomalies. These findings demonstrate the utility of our attribution method in enhancing the interpretability of anomalies in climate models and advancing our understanding of Arctic melt dynamics.
[LG-5] HRP: High-Rank Preheating for Superior LoRA Initialization
链接: https://arxiv.org/abs/2502.07739
作者: Yuzhu Chen,Yingjie Wang,Shi Fu,Li Shen,Yongcheng Jing,Xinmei Tian,Dacheng Tao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper studies the crucial impact of initialization on the convergence properties of Low-Rank Adaptation (LoRA). We theoretically demonstrate that random initialization, a widely used scheme, will likely lead LoRA to random low-rank results, rather than the best low-rank result. While this issue can be mitigated by adjusting initialization towards a well-informed direction, it relies on prior knowledge of the target, which is typically unknown in real-world scenarios. To approximate this well-informed initial direction, we propose High-Rank Preheating (HRP), which fine-tunes high-rank LoRA for a few steps and uses the singular value decomposition of the preheated result as a superior initialization. HRP initialization is theoretically shown to combine the convergence strengths of high-rank LoRA and the generalization strengths of low-rank LoRA. Extensive experiments demonstrate that HRP significantly enhances LoRA's effectiveness across various models and tasks, achieving performance comparable to full-parameter fine-tuning and outperforming other initialization strategies.
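HRP 的流程(高秩预热若干步、对乘积 B @ A 做 SVD、截断为低秩初始化)可以直接写成代码。以下为示意,`get_loss`、步数与学习率均为假设的接口和超参数:
```python
import torch

def hrp_init(W0, get_loss, preheat_rank=64, target_rank=8, steps=100, lr=1e-3):
    # 示意:先用高秩 LoRA 预热,再对预热结果做 SVD 截断,
    # 作为低秩 LoRA 的初始化。get_loss(权重) 为假设的任务损失接口。
    d_out, d_in = W0.shape
    B = torch.zeros(d_out, preheat_rank, requires_grad=True)
    A = (0.01 * torch.randn(preheat_rank, d_in)).requires_grad_()
    opt = torch.optim.Adam([A, B], lr=lr)
    for _ in range(steps):                                  # (1) 高秩预热
        loss = get_loss(W0 + B @ A)
        opt.zero_grad(); loss.backward(); opt.step()
    U, S, Vh = torch.linalg.svd((B @ A).detach(), full_matrices=False)
    r = target_rank                                         # (2) 最优秩-r 近似
    B0 = U[:, :r] * S[:r].sqrt()                            # (3) 奇异值开方分摊
    A0 = S[:r].sqrt().unsqueeze(1) * Vh[:r]
    return A0, B0                                           # 低秩 LoRA 的初始化
```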
[LG-6] Revisiting Non-Acyclic GFlowNets in Discrete Environments
链接: https://arxiv.org/abs/2502.07735
作者: Nikita Morozov,Ian Maksimov,Daniil Tiapkin,Sergey Samsonov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlowNets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlowNets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlowNets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlowNet training, as well as validate our own theoretical findings.
[LG-7] Near-Optimal Sample Complexity in Reward-Free Kernel-Based Reinforcement Learning AISTATS2025
链接: https://arxiv.org/abs/2502.07715
作者: Aya Kayal,Sattar Vakili,Laura Toni,Alberto Bernacchia
类目: Machine Learning (cs.LG)
*备注: Accepted at AISTATS 2025
点击查看摘要
Abstract:Reinforcement Learning (RL) problems are being considered under increasingly more complex structures. While tabular and linear models have been thoroughly explored, the analytical study of RL under nonlinear function approximation, especially kernel-based models, has recently gained traction for their strong representational capacity and theoretical tractability. In this context, we examine the question of statistical efficiency in kernel-based RL within the reward-free RL framework, specifically asking: how many samples are required to design a near-optimal policy? Existing work addresses this question under restrictive assumptions about the class of kernel functions. We first explore this question by assuming a generative model, then relax this assumption at the cost of increasing the sample complexity by a factor of H, the length of the episode. We tackle this fundamental problem using a broad class of kernels and a simpler algorithm compared to prior work. Our approach derives new confidence intervals for kernel ridge regression, specific to our RL setting, which may be of broader applicability. We further validate our theoretical findings through simulations.
[LG-8] Partial-Label Learning with Conformal Candidate Cleaning
链接: https://arxiv.org/abs/2502.07661
作者: Tobias Fuchs,Florian Kalinke
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Real-world data is often ambiguous; for example, human annotation produces instances with multiple conflicting class labels. Partial-label learning (PLL) aims at training a classifier in this challenging setting, where each instance is associated with a set of candidate labels and one correct, but unknown, class label. A multitude of algorithms targeting this setting exists and, to enhance their prediction quality, several extensions that are applicable across a wide range of PLL methods have been introduced. While many of these extensions rely on heuristics, this article proposes a novel enhancing method that incrementally prunes candidate sets using conformal prediction. To work around the missing labeled validation set, which is typically required for conformal prediction, we propose a strategy that alternates between training a PLL classifier to label the validation set, leveraging these predicted class labels for calibration, and pruning candidate labels that are not part of the resulting conformal sets. In this sense, our method alternates between empirical risk minimization and candidate set pruning. We establish that our pruning method preserves the conformal validity with respect to the unknown ground truth. Our extensive experiments on artificial and real-world data show that the proposed approach significantly improves the test set accuracies of several state-of-the-art PLL classifiers.
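下面的 NumPy 示意展示“用共形预测修剪候选标签集合”这一步。分数取 1 减预测概率、alpha 等均为常见的假设设定;论文中校准所用的验证集标签由 PLL 分类器交替生成:
```python
import numpy as np

def prune_candidates(probs, candidate_sets, val_probs, val_labels, alpha=0.1):
    # 示意:在(伪标注的)验证集上校准分数分位数,
    # 然后删除落在各实例共形集合之外的候选标签。
    n = len(val_labels)
    scores = 1.0 - val_probs[np.arange(n), val_labels]      # 校准分数
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pruned = []
    for p, cand in zip(probs, candidate_sets):
        conformal_set = {y for y in range(len(p)) if 1.0 - p[y] <= q}
        kept = cand & conformal_set
        pruned.append(kept if kept else cand)               # 避免清空候选集合
    return pruned
```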
[LG-9] Private Low-Rank Approximation for Covariance Matrices Dyson Brownian Motion and Eigenvalue-Gap Bounds for Gaussian Perturbations
链接: https://arxiv.org/abs/2502.07657
作者: Oren Mangoubi,Nisheeth K. Vishnoi
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注: Published in Journal of the ACM. arXiv admin note: substantial text overlap with arXiv:2306.16648
点击查看摘要
Abstract:We consider the problem of approximating a d \times d covariance matrix M with a rank-k matrix under (\varepsilon,\delta)-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and obtain upper bounds on the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-k approximation to M. Our analysis provides improvements over previous bounds, particularly when the spectrum of M satisfies natural structural assumptions. The novel insight is to view the addition of Gaussian noise to a matrix as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations enable us to upper bound the Frobenius distance between the best rank-k approximation of M and that of a Gaussian perturbation of M as an integral that involves inverse eigenvalue gaps of the stochastically evolving matrix, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems. Subsequently, again using the Dyson Brownian motion viewpoint, we show that the eigenvalues of the matrix M perturbed by Gaussian noise have large gaps with high probability. These results also contribute to the analysis of low-rank approximations under average-case perturbations, and to an understanding of eigenvalue gaps for random matrices, both of which may be of independent interest.
[LG-10] Causal Additive Models with Unobserved Causal Paths and Backdoor Paths
链接: https://arxiv.org/abs/2502.07646
作者: Thong Pham,Takashi Nicholas Maeda,Shohei Shimizu
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 14 pages
点击查看摘要
Abstract:Causal additive models have been employed as tractable yet expressive frameworks for causal discovery involving hidden variables. State-of-the-art methodologies suggest that determining the causal relationship between a pair of variables is infeasible in the presence of an unobserved backdoor or an unobserved causal path. Contrary to this assumption, we theoretically show that resolving the causal direction is feasible in certain scenarios by incorporating two novel components into the theory. The first component introduces a novel characterization of regression sets within independence between regression residuals. The second component leverages conditional independence among the observed variables. We also provide a search algorithm that integrates these innovations and demonstrate its competitive performance against existing methods.
[LG-11] Consistency Training with Physical Constraints
链接: https://arxiv.org/abs/2502.07636
作者: Che-Chia Chang,Chen-Yang Dai,Te-Sheng Lin,Ming-Chih Lai,Chieh-Hsin Lai
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a physics-aware Consistency Training (CT) method that accelerates sampling in Diffusion Models with physical constraints. Our approach leverages a two-stage strategy: (1) learning the noise-to-data mapping via CT, and (2) incorporating physics constraints as a regularizer. Experiments on toy examples show that our method generates samples in a single step while adhering to the imposed constraints. This approach has the potential to efficiently solve partial differential equations (PDEs) using deep generative modeling.
[LG-12] Efficient Distributed Training through Gradient Compression with Sparsification and Quantization Techniques
链接: https://arxiv.org/abs/2502.07634
作者: Shruti Singh,Shantanu Kumar
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:This study investigates the impact of gradient compression on distributed training performance, focusing on sparsification and quantization techniques, including top-k, DGC, and QSGD. In baseline experiments, random-k compression results in severe performance degradation, highlighting its inefficacy. In contrast, using top-k and DGC at 50 times compression yields performance improvements, reducing perplexity by up to 0.06 compared to baseline. Experiments across 1, 2, and 4 workers demonstrate that conservative sparsification can have a regularizing effect, especially for smaller models, while compression ratios above 5000 times impair performance, particularly for DGC. Communication times are reduced across all compression methods, with top-k and DGC decreasing communication to negligible levels at high compression ratios. However, increased computation times offset this efficiency for top-k due to sorting demands, making it less scalable than DGC or QSGD. In convergence tests, sparsification techniques show accelerated convergence, requiring fewer epochs than the baseline, which has implications for computational savings. Although precision trade-offs emerge, floating point errors are mitigated by compression. This study’s findings underscore the need to tune hyperparameters specifically for each compression technique to achieve optimal model performance, especially in distributed training systems.
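其中 top-k 稀疏化本身只需几行代码,示意如下(DGC 额外的动量修正与本地残差累积此处省略):
```python
import torch

def topk_compress(grad, ratio=0.02):
    # top-k 稀疏化:只保留幅值最大的 ratio 比例的梯度分量
    # (50 倍压缩约对应 2%),只需传输 (值, 索引)。
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape   # 保留原始符号

def topk_decompress(values, indices, shape):
    out = torch.zeros(shape).flatten()
    out[indices] = values
    return out.reshape(shape)
```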
[LG-13] Beyond Prompting: Time2Lang – Bridging Time-Series Foundation Models and Large Language Models for Health Sensing
链接: https://arxiv.org/abs/2502.07608
作者: Arvind Pillai,Dimitris Spathis,Subigya Nepal,Amanda C Collins,Daniel M Mackin,Michael V Heinz,Tess Z Griffin,Nicholas C Jacobson,Andrew Campbell
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Under review at CHIL 2025
点击查看摘要
Abstract:Large language models (LLMs) show promise for health applications when combined with behavioral sensing data. Traditional approaches convert sensor data into text prompts, but this process is prone to errors, computationally expensive, and requires domain expertise. These challenges are particularly acute when processing extended time series data. While time series foundation models (TFMs) have recently emerged as powerful tools for learning representations from temporal data, bridging TFMs and LLMs remains challenging. Here, we present Time2Lang, a framework that directly maps TFM outputs to LLM representations without intermediate text conversion. Our approach first trains on synthetic data using periodicity prediction as a pretext task, followed by evaluation on mental health classification tasks. We validate Time2Lang on two longitudinal wearable and mobile sensing datasets: daily depression prediction using step count data (17,251 days from 256 participants) and flourishing classification based on conversation duration (46 participants over 10 weeks). Time2Lang maintains near constant inference times regardless of input length, unlike traditional prompting methods. The generated embeddings preserve essential time-series characteristics such as auto-correlation. Our results demonstrate that TFMs and LLMs can be effectively integrated while minimizing information loss and enabling performance transfer across these distinct modeling paradigms. To our knowledge, we are the first to integrate a TFM and an LLM for health, thus establishing a foundation for future research combining general-purpose large models for complex healthcare tasks.
[LG-14] Algorithmic Aspects of Strategic Trading
链接: https://arxiv.org/abs/2502.07606
作者: Michael Kearns,Mirah Shi
类目: Computer Science and Game Theory (cs.GT); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Algorithmic trading in modern financial markets is widely acknowledged to exhibit strategic, game-theoretic behaviors whose complexity can be difficult to model. A recent series of papers (Chriss, 2024b,c,a, 2025) has made progress in the setting of trading for position building. Here parties wish to buy or sell a fixed number of shares in a fixed time period in the presence of both temporary and permanent market impact, resulting in exponentially large strategy spaces. While these papers primarily consider the existence and structural properties of equilibrium strategies, in this work we focus on the algorithmic aspects of the proposed model. We give an efficient algorithm for computing best responses, and show that while the temporary impact only setting yields a potential game, best response dynamics do not generally converge for the general setting, for which no fast algorithm for (Nash) equilibrium computation is known. This leads us to consider the broader notion of Coarse Correlated Equilibria (CCE), which we show can be computed efficiently via an implementation of Follow the Perturbed Leader (FTPL). We illustrate the model and our results with an experimental investigation, where FTPL exhibits interesting behavior in different regimes of the relative weighting between temporary and permanent market impact.
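文中用于计算 CCE 的 FTPL 可以用如下示意实现。这里取全信息、有限动作集的单玩家视角;`payoff_rows` 的接口与扰动尺度 eta 均为假设:
```python
import numpy as np

def ftpl_play(payoff_rows, eta=10.0, seed=0):
    # Follow the Perturbed Leader:每轮在“累计收益 + 新鲜 Gumbel 扰动”
    # 最大的动作上行动。payoff_rows 为 (T, n_actions) 的收益矩阵(假设)。
    # 各玩家同时运行 FTPL 并记录联合行动,其经验分布近似一个 CCE。
    rng = np.random.default_rng(seed)
    T, n = payoff_rows.shape
    cumulative = np.zeros(n)
    plays = []
    for t in range(T):
        noise = rng.gumbel(scale=eta, size=n)   # 每轮重新采样扰动
        plays.append(int(np.argmax(cumulative + noise)))
        cumulative += payoff_rows[t]            # 观察本轮完整收益向量
    return plays
```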
[LG-15] SEMU: Singular Value Decomposition for Efficient Machine Unlearning
链接: https://arxiv.org/abs/2502.07587
作者: Marcin Sendera,Łukasz Struski,Kamil Książek,Kryspin Musiol,Jacek Tabor,Dawid Rymarczyk
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While the capabilities of generative foundational models have advanced rapidly in recent years, methods to prevent harmful and unsafe behaviors remain underdeveloped. Among the pressing challenges in AI safety, machine unlearning (MU) has become increasingly critical to meet upcoming safety regulations. Most existing MU approaches focus on altering the most significant parameters of the model. However, these methods often require fine-tuning substantial portions of the model, resulting in high computational costs and training instabilities, which are typically mitigated by access to the original training dataset. In this work, we address these limitations by leveraging Singular Value Decomposition (SVD) to create a compact, low-dimensional projection that enables the selective forgetting of specific data points. We propose Singular Value Decomposition for Efficient Machine Unlearning (SEMU), a novel approach designed to optimize MU in two key aspects. First, SEMU minimizes the number of model parameters that need to be modified, effectively removing unwanted knowledge while making only minimal changes to the model's weights. Second, SEMU eliminates the dependency on the original training dataset, preserving the model's previously acquired knowledge without additional data requirements. Extensive experiments demonstrate that SEMU achieves competitive performance while significantly improving efficiency in terms of both data usage and the number of modified parameters.
[LG-16] Generative Modeling with Bayesian Sample Inference
链接: https://arxiv.org/abs/2502.07580
作者: Marten Lienen,Marcel Kollovieh,Stephan Günnemann
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We derive a novel generative model from the simple act of Gaussian posterior inference. Treating the generated sample as an unknown variable to infer lets us formulate the sampling process in the language of Bayesian probability. Our model uses a sequence of prediction and posterior update steps to narrow down the unknown sample from a broad initial belief. In addition to a rigorous theoretical analysis, we establish a connection between our model and diffusion models and show that it includes Bayesian Flow Networks (BFNs) as a special case. In our experiments, we demonstrate improved performance over both BFNs and Variational Diffusion Models, achieving competitive likelihood scores on CIFAR10 and ImageNet.
[LG-17] Single-Step Consistent Diffusion Samplers
链接: https://arxiv.org/abs/2502.07579
作者: Pascal Jutras-Dubé,Patrick Pynadath,Ruqi Zhang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Sampling from unnormalized target distributions is a fundamental yet challenging task in machine learning and statistics. Existing sampling algorithms typically require many iterative steps to produce high-quality samples, leading to high computational costs that limit their practicality in time-sensitive or resource-constrained settings. In this work, we introduce consistent diffusion samplers, a new class of samplers designed to generate high-fidelity samples in a single step. We first develop a distillation algorithm to train a consistent diffusion sampler from a pretrained diffusion model without pre-collecting large datasets of samples. Our algorithm leverages incomplete sampling trajectories and noisy intermediate states directly from the diffusion process. We further propose a method to train a consistent diffusion sampler from scratch, fully amortizing exploration by training a single model that both performs diffusion sampling and skips intermediate steps using a self-consistency loss. Through extensive experiments on a variety of unnormalized distributions, we show that our approach yields high-fidelity samples using less than 1% of the network evaluations required by traditional diffusion samplers.
[LG-18] Attention Learning is Needed to Efficiently Learn Parity Function
链接: https://arxiv.org/abs/2502.07553
作者: Yaomengxi Han,Debarghya Ghoshdastidar
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transformers, with their attention mechanisms, have emerged as the state-of-the-art architectures for sequential modeling and empirically outperform feed-forward neural networks (FFNNs) across many fields, such as natural language processing and computer vision. However, their generalization ability, particularly for low-sensitivity functions, remains less studied. We bridge this gap by analyzing transformers on the k-parity problem. Daniely and Malach (NeurIPS 2020) show that FFNNs with one hidden layer and O(nk^7 \log k) parameters can learn k-parity, where the input length n is typically much larger than k. In this paper, we prove that FFNNs require at least \Omega(n) parameters to learn k-parity, while transformers require only O(k) parameters, surpassing the theoretical lower bound needed by FFNNs. We further prove that this parameter efficiency cannot be achieved with fixed attention heads. Our work establishes transformers as theoretically superior to FFNNs in learning the parity function, showing how their attention mechanisms enable parameter-efficient generalization in functions with low sensitivity.
[LG-19] Early Stopping Against Label Noise Without Validation Data ICLR2024
链接: https://arxiv.org/abs/2502.07551
作者: Suqin Yuan,Lei Feng,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2024
点击查看摘要
Abstract:Early stopping methods in deep learning face the challenge of balancing the volume of training and validation data, especially in the presence of label noise. Concretely, sparing more data for validation from training data would limit the performance of the learned model, yet insufficient validation data could result in a sub-optimal selection of the desired model. In this paper, we propose a novel early stopping method called Label Wave, which does not require validation data for selecting the desired model in the presence of label noise. It works by tracking the changes in the model’s predictions on the training set during the training process, aiming to halt training before the model unduly fits mislabeled data. This method is empirically supported by our observation that minimum fluctuations in predictions typically occur at the training epoch before the model excessively fits mislabeled data. Through extensive experiments, we show both the effectiveness of the Label Wave method across various settings and its capability to enhance the performance of existing methods for learning with noisy labels.
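Label Wave 的停止准则可示意如下:统计相邻 epoch 之间训练集预测标签的翻转次数,在其首次到达极小值后停止(`patience` 为假设的超参数):
```python
import numpy as np

def label_wave_stop(pred_history, patience=3):
    # 示意:pred_history 为每个 epoch 对训练集的预测标签数组组成的列表。
    # 翻转数首次到达极小值后停止,之后模型开始过拟合错标样本。
    flips = [int((a != b).sum()) for a, b in zip(pred_history, pred_history[1:])]
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, f in enumerate(flips, start=1):
        if f < best:
            best, best_epoch, waited = f, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch
    return len(flips)
```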
[LG-20] Instance-dependent Early Stopping ICLR2025
链接: https://arxiv.org/abs/2502.07547
作者: Suqin Yuan,Runqi Lin,Lei Feng,Bo Han,Tongliang Liu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2025 (Spotlight)
点击查看摘要
Abstract:In machine learning practice, early stopping has been widely used to regularize models and can save computational costs by halting the training process when the model’s performance on a validation set stops improving. However, conventional early stopping applies the same stopping criterion to all instances without considering their individual learning statuses, which leads to redundant computations on instances that are already well-learned. To further improve the efficiency, we propose an Instance-dependent Early Stopping (IES) method that adapts the early stopping mechanism from the entire training set to the instance level, based on the core principle that once the model has mastered an instance, the training on it should stop. IES considers an instance as mastered if the second-order differences of its loss value remain within a small range around zero. This offers a more consistent measure of an instance’s learning status compared with directly using the loss value, and thus allows for a unified threshold to determine when an instance can be excluded from further backpropagation. We show that excluding mastered instances from backpropagation can increase the gradient norms, thereby accelerating the decrease of the training loss and speeding up the training process. Extensive experiments on benchmarks demonstrate that IES method can reduce backpropagation instances by 10%-50% while maintaining or even slightly improving the test accuracy and transfer learning performance of a model.
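IES 的“已掌握”判据可以写成对每个实例损失序列的二阶差分检查,示意如下(阈值 eps 与训练循环接口均为假设):
```python
import torch

def mastered_mask(loss_history, eps=1e-3):
    # 示意:损失的二阶差分落在 0 附近的实例视为“已掌握”,
    # 之后不再参与反向传播。loss_history: (epochs, n_instances) 张量。
    l2 = loss_history[2:] - 2 * loss_history[1:-1] + loss_history[:-2]
    return l2[-1].abs() < eps     # True = 该实例可跳过反传

# 训练循环中的用法(接口为假设):
# per_sample = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
# loss = per_sample[~mask].mean()   # 只对未掌握的实例反传
```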
[LG-21] Diffusion-LAM: Probabilistic Limited Area Weather Forecasting with Diffusion
链接: https://arxiv.org/abs/2502.07532
作者: Erik Larsson,Joel Oskarsson,Tomas Landelius,Fredrik Lindsten
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
点击查看摘要
Abstract:Machine learning methods have been shown to be effective for weather forecasting, based on the speed and accuracy compared to traditional numerical models. While early efforts primarily concentrated on deterministic predictions, the field has increasingly shifted toward probabilistic forecasting to better capture the forecast uncertainty. Most machine learning-based models have been designed for global-scale predictions, with only limited work targeting regional or limited area forecasting, which allows more specialized and flexible modeling for specific locations. This work introduces Diffusion-LAM, a probabilistic limited area weather model leveraging conditional diffusion. By conditioning on boundary data from surrounding regions, our approach generates forecasts within a defined area. Experimental results on the MEPS limited area dataset demonstrate the potential of Diffusion-LAM to deliver accurate probabilistic forecasts, highlighting its promise for limited-area weather prediction.
[LG-22] Training Deep Learning Models with Norm-Constrained LMOs
链接: https://arxiv.org/abs/2502.07529
作者: Thomas Pethick,Wanyun Xie,Kimon Antonakopoulos,Zhenyu Zhu,Antonio Silveti-Falls,Volkan Cevher
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
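以谱范数球为例,LMO 及由它构造的无约束更新步可示意如下。这是对该算法族的粗略示意,范数选择、半径与步长均为假设,且只适用于矩阵形状的参数:
```python
import torch

def lmo_spectral(grad, radius=1.0):
    # 谱范数球上的线性最小化:argmin_{||D||_2 <= r} <grad, D> = -r * U @ V^T
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vh)

def lmo_step(param, lr=0.1, radius=1.0):
    # 用 LMO 极值点方向代替原始梯度做无约束更新(示意)。
    with torch.no_grad():
        param.add_(lr * lmo_spectral(param.grad, radius))
```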
[LG-23] A Near-optimal Scalable and Corruption-tolerant Framework for Stochastic Bandits: From Single-Agent to Multi-Agent and Beyond
链接: https://arxiv.org/abs/2502.07514
作者: Zicheng Hu,Cheng Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We investigate various stochastic bandit problems in the presence of adversarial corruption. A seminal contribution to this area is the BARBAR algorithm (Gupta et al., 2019), which is both simple and efficient, tolerating significant levels of corruption with nearly no degradation in performance. However, its regret upper bound exhibits a complexity of O(KC), while the lower bound is \Omega(C). In this paper, we enhance the BARBAR algorithm by proposing a novel framework called BARBAT, which eliminates the factor of K and achieves an optimal regret bound up to a logarithmic factor. We also demonstrate how BARBAT can be extended to various settings, including graph bandits, combinatorial semi-bandits, batched bandits and multi-agent bandits. In comparison to the Follow-The-Regularized-Leader (FTRL) family of methods, which provide a best-of-both-worlds guarantee, our approach is more efficient and parallelizable. Notably, FTRL-based methods face challenges in scaling to batched and multi-agent settings.
[LG-24] Joint Metric Space Embedding by Unbalanced OT with Gromov-Wasserstein Marginal Penalization
链接: https://arxiv.org/abs/2502.07510
作者: Florian Beier,Moritz Piening,Robert Beinert,Gabriele Steidl
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a new approach for unsupervised alignment of heterogeneous datasets, which maps data from two different domains without any known correspondences to a common metric space. Our method is based on an unbalanced optimal transport problem with Gromov-Wasserstein marginal penalization. It can be seen as a counterpart to the recently introduced joint multidimensional scaling method. We prove that there exists a minimizer of our functional and that for penalization parameters going to infinity, the corresponding sequence of minimizers converges to a minimizer of the so-called embedded Wasserstein distance. Our model can be reformulated as a quadratic, multi-marginal, unbalanced optimal transport problem, for which a bi-convex relaxation admits a numerical solver via block-coordinate descent. We provide numerical examples for joint embeddings in Euclidean as well as non-Euclidean spaces.
[LG-25] Unified Graph Networks (UGN): A Deep Neural Framework for Solving Graph Problems
链接: https://arxiv.org/abs/2502.07500
作者: Rudrajit Dawn,Madhusudan Ghosh,Partha Basuchowdhuri,Sudip Kumar Naskar
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep neural networks have enabled researchers to create powerful generalized frameworks, such as transformers, that can be used to solve well-studied problems in various application domains, such as text and image. However, such generalized frameworks are not available for solving graph problems. Graph structures are ubiquitous in many applications around us and many graph problems have been widely studied over the years. In recent times, there has been a surge in deep neural network based approaches to solve graph problems, with growing availability of graph structured datasets across diverse domains. Nevertheless, existing methods are mostly tailored to solve a specific task and lack the capability to create a generalized model leading to solutions for different downstream tasks. In this work, we propose a novel, resource-efficient framework named Unified Graph Network (UGN) by leveraging the feature extraction capability of graph convolutional neural networks (GCN) and 2-dimensional convolutional neural networks (Conv2D). UGN unifies various graph learning tasks, such as link prediction, node classification, community detection, graph-to-graph translation, knowledge graph completion, and more, within a cohesive framework, while exercising minimal task-specific extensions (e.g., formation of supernodes for coarsening massive networks to increase scalability, use of mean target connectivity matrix (MTCM) representation for achieving scalability in the graph translation task, etc.) to enhance the generalization capability of graph learning and analysis. We test the novel UGN framework on six uncorrelated graph problems, using twelve different datasets. Experimental results show that UGN outperforms the state-of-the-art baselines by a significant margin on ten datasets, while producing comparable results on the remaining dataset.
[LG-26] On Training-Conditional Conformal Prediction and Binomial Proportion Confidence Intervals
链接: https://arxiv.org/abs/2502.07497
作者: Rudi Coppola,Manuel Mazo Jr
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Estimating the expectation of a Bernoulli random variable based on N independent trials is a classical problem in statistics, typically addressed using Binomial Proportion Confidence Intervals (BPCI). In the control systems community, many critical tasks, such as certifying the statistical safety of dynamical systems, can be formulated as BPCI problems. Conformal Prediction (CP), a distribution-free technique for uncertainty quantification, has gained significant attention in recent years and has been applied to various control systems problems, particularly to address uncertainties in learned dynamics or controllers. A variant known as training-conditional CP was recently employed to tackle the problem of safety certification. In this note, we highlight that the use of training-conditional CP in this context does not provide valid safety guarantees. We demonstrate why CP is unsuitable for BPCI problems and argue that traditional BPCI methods are better suited for statistical safety certification.
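作为对照,经典 BPCI 的一个标准实现是 Clopper-Pearson 精确区间,用几行 SciPy 即可完成,示意如下:
```python
from scipy.stats import beta

def clopper_pearson(successes, n, confidence=0.95):
    # 经典 BPCI:基于 n 次独立试验的伯努利均值的精确(保守)置信区间,
    # 即本文认为更适合统计安全认证的工具。
    a = (1 - confidence) / 2
    lo = beta.ppf(a, successes, n - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - a, successes + 1, n - successes) if successes < n else 1.0
    return lo, hi

# 例如 1000 次试验中 998 次安全:
print(clopper_pearson(998, 1000))  # 下界即为 95% 置信度下的安全概率证书
```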
[LG-27] LLM -Sketch: Enhancing Network Sketches with LLM
链接: https://arxiv.org/abs/2502.07495
作者: Yuanpeng Li,Zhen Xu,Zongwei Lv,Yannan Hu,Yong Cui,Tong Yang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Network stream mining is fundamental to many network operations. Sketches, as compact data structures that offer low memory overhead with bounded accuracy, have emerged as a promising solution for network stream mining. Recent studies attempt to optimize sketches using machine learning; however, these approaches face the challenges of lacking adaptivity to dynamic networks and incurring high training costs. In this paper, we propose LLM-Sketch, based on the insight that fields beyond the flow IDs in packet headers can also help infer flow sizes. By using a two-tier data structure and separately recording large and small flows, LLM-Sketch improves accuracy while minimizing memory usage. Furthermore, it leverages fine-tuned large language models (LLMs) to reliably estimate flow sizes. We evaluate LLM-Sketch on three representative tasks, and the results demonstrate that LLM-Sketch outperforms state-of-the-art methods by achieving a 7.5\times accuracy improvement.
[LG-28] Exploring Patterns Behind Sports
链接: https://arxiv.org/abs/2502.07491
作者: Chang Liu,Chengcheng Ma,XuanQi Zhou
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:This paper presents a comprehensive framework for time series prediction using a hybrid model that combines ARIMA and LSTM. The model incorporates feature engineering techniques, including embedding and PCA, to transform raw data into a lower-dimensional representation while retaining key information. The embedding technique is used to convert categorical data into continuous vectors, facilitating the capture of complex relationships. PCA is applied to reduce dimensionality and extract principal components, enhancing model performance and computational efficiency. To handle both linear and nonlinear patterns in the data, the ARIMA model captures linear trends, while the LSTM model models complex nonlinear dependencies. The hybrid model is trained on historical data and achieves high accuracy, as demonstrated by low RMSE and MAE scores. Additionally, the paper employs the run test to assess the randomness of sequences, providing insights into the underlying patterns. Ablation studies are conducted to validate the roles of different components in the model, demonstrating the significance of each module. The paper also utilizes the SHAP method to quantify the impact of traditional advantages on the predicted results, offering a detailed understanding of feature importance. The KNN method is used to determine the optimal prediction interval, further enhancing the model's accuracy. The results highlight the effectiveness of combining traditional statistical methods with modern deep learning techniques for robust time series forecasting in sports.
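“ARIMA 捕捉线性趋势 + LSTM 拟合残差中的非线性成分”这一混合结构可示意如下。这里假设 LSTM 已在残差窗口上训练好、并提供 Keras 风格的 predict 接口;ARIMA 阶数与窗口长度均为假设:
```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def hybrid_forecast(series, lstm_model, order=(1, 1, 1), window=10):
    # 示意:最终预测 = ARIMA 的线性预测 + LSTM 对残差的非线性预测
    arima = ARIMA(series, order=order).fit()
    linear_part = float(arima.forecast(steps=1)[0])
    residuals = np.asarray(arima.resid, dtype="float32")
    last_window = residuals[-window:].reshape(1, window, 1)
    nonlinear_part = float(lstm_model.predict(last_window, verbose=0))
    return linear_part + nonlinear_part
```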
[LG-29] Physiome-ODE: A Benchmark for Irregularly Sampled Multivariate Time Series Forecasting Based on Biological ODEs
链接: https://arxiv.org/abs/2502.07489
作者: Christian Klötergens,Vijaya Krishna Yalavarthi,Randolf Scholz,Maximilian Stubbemann,Stefan Born,Lars Schmidt-Thieme
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:State-of-the-art methods for forecasting irregularly sampled time series with missing values predominantly rely on just four datasets and a few small toy examples for evaluation. While ordinary differential equations (ODE) are the prevalent models in science and engineering, a baseline model that forecasts a constant value outperforms ODE-based models from the last five years on three of these existing datasets. This unintuitive finding hampers further research on ODE-based models, a more plausible model family. In this paper, we develop a methodology to generate irregularly sampled multivariate time series (IMTS) datasets from ordinary differential equations and to select challenging instances via rejection sampling. Using this methodology, we create Physiome-ODE, a large and sophisticated benchmark of IMTS datasets consisting of 50 individual datasets, derived from real-world ordinary differential equations from research in biology. Physiome-ODE is the first benchmark for IMTS forecasting that we are aware of and an order of magnitude larger than the current evaluation setting of four datasets. Using our benchmark Physiome-ODE, we show qualitatively completely different results than those derived from the current four datasets: on Physiome-ODE ODE-based models can play to their strength and our benchmark can differentiate in a meaningful way between different IMTS forecasting models. This way, we expect to give a new impulse to research on ODE-based time series modeling.
[LG-30] Improving Adaptive Moment Optimization via Preconditioner Diagonalization
链接: https://arxiv.org/abs/2502.07488
作者: Son Nguyen,Bo Liu,Lizhang Chen,Qiang Liu
类目: Machine Learning (cs.LG)
*备注: 19 pages, 13 figures
点击查看摘要
Abstract:Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based on estimates of gradient statistics. Compared to traditional algorithms like Stochastic Gradient Descent, these adaptive methods are typically more robust to model scale and hyperparameter tuning. However, the gradient statistics employed by these methods often do not leverage sufficient gradient covariance information, leading to suboptimal updates in certain directions of the parameter space and potentially slower convergence. In this work, we keep track of such covariance statistics in the form of a structured preconditioner matrix. Unlike other works, our approach does not apply direct approximations to estimate this matrix. We instead implement an invertible transformation that maps the preconditioner matrix into a new space where it becomes approximately diagonal. This enables a diagonal approximation of the preconditioner matrix in the transformed space, offering several computational advantages. Empirical results show that our approach can substantially enhance the convergence speed of modern adaptive optimizers. Notably, for large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam. Additionally, our method can be integrated with memory-efficient optimizers like Adafactor to manage computational overhead.
[LG-31] Overfitting Regimes of Nadaraya-Watson Interpolators
链接: https://arxiv.org/abs/2502.07480
作者: Daniel Barzilai,Guy Kornowski,Ohad Shamir
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 26 pages
点击查看摘要
Abstract:In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard’s method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. Numerical experiments complement our theory, demonstrating the same phenomena.
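For readers who want to reproduce the basic object of study, the interpolating Nadaraya-Watson (Shepard) estimator is a few lines of NumPy. Here the inverse-distance exponent `gamma` stands in for the bandwidth-like hyperparameter the paper varies; this is a minimal 1-D sketch, not the paper's exact kernel family.

```python
import numpy as np

def nw_interpolator(x_train, y_train, x_query, gamma=2.0, eps=1e-12):
    """Shepard's method: a singular inverse-distance kernel makes the
    estimator interpolate the (possibly noisy) training labels."""
    d = np.abs(x_query[:, None] - x_train[None, :])   # (Q, N) pairwise distances
    w = 1.0 / np.maximum(d, eps) ** gamma             # weight blows up at training points
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)
```

Because the weights diverge at the training points, the fit passes through every noisy label; varying `gamma` is what moves such an estimator between the catastrophic, tempered, and benign regimes studied in the paper.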
[LG-32] Logarithmic Regret for Online KL-Regularized Reinforcement Learning
链接: https://arxiv.org/abs/2502.07460
作者: Heyang Zhao,Chenlu Ye,Wei Xiong,Quanquan Gu,Tong Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objective in decision making [Xiong et al., 2024; Xie et al., 2024; Zhao et al., 2024], these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an \mathcal{O}\big(\eta \log(N_{\mathcal{R}} T) \cdot d_{\mathcal{R}}\big) logarithmic regret bound, where \eta, N_{\mathcal{R}}, T, d_{\mathcal{R}} denote the KL-regularization parameter, the cardinality of the reward function class, number of rounds, and the complexity of the reward function class. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps and also obtain a similar logarithmic regret bound.
[LG-33] CapyMOA: Efficient Machine Learning for Data Streams in Python
链接: https://arxiv.org/abs/2502.07432
作者: Heitor Murilo Gomes,Anton Lee,Nuwan Gunasekara,Yibin Sun,Guilherme Weigert Cassales,Justin Liu,Marco Heyden,Vitor Cerqueira,Maroua Bahri,Yun Sing Koh,Bernhard Pfahringer,Albert Bifet
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:CapyMOA is an open-source library designed for efficient machine learning on streaming data. It provides a structured framework for real-time learning and evaluation, featuring a flexible data representation. CapyMOA includes an extensible architecture that allows integration with external frameworks such as MOA and PyTorch, facilitating hybrid learning approaches that combine traditional online algorithms with deep learning techniques. By emphasizing adaptability, scalability, and usability, CapyMOA allows researchers and practitioners to tackle dynamic learning challenges across various domains.
[LG-34] Towards a Foundation Model for Physics-Informed Neural Networks: Multi-PDE Learning with Active Sampling
链接: https://arxiv.org/abs/2502.07425
作者: Keon Vin Park
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs) by embedding physical laws into neural network training. However, traditional PINN models are typically designed for single PDEs, limiting their generalizability across different physical systems. In this work, we explore the potential of a foundation PINN model capable of solving multiple PDEs within a unified architecture. We investigate the efficacy of a single PINN framework trained on four distinct PDEs-the Simple Harmonic Oscillator (SHO), the 1D Heat Equation, the 1D Wave Equation, and the 2D Laplace Equation, demonstrating its ability to learn diverse physical dynamics. To enhance sample efficiency, we incorporate Active Learning (AL) using Monte Carlo (MC) Dropout-based uncertainty estimation, selecting the most informative training samples iteratively. We evaluate different active learning strategies, comparing models trained on 10%, 20%, 30%, 40%, and 50% of the full dataset, and analyze their impact on solution accuracy. Our results indicate that targeted uncertainty sampling significantly improves performance with fewer training samples, leading to efficient learning across multiple PDEs. This work highlights the feasibility of a generalizable PINN-based foundation model, capable of adapting to different physics-based problems without redesigning network architectures. Our findings suggest that multi-PDE PINNs with active learning can serve as an effective approach for reducing computational costs while maintaining high accuracy in physics-based deep learning applications.
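The active-learning loop described above hinges on one mechanism: keep dropout active at inference, estimate predictive variance over candidate collocation points, and add the most uncertain ones to the training set. A minimal PyTorch sketch of that selection step follows; `model`, `candidates`, and the hyperparameters are placeholders, and this is our reading of MC-Dropout-based selection rather than the paper's released code.

```python
import torch

def mc_dropout_select(model, candidates, n_mc=20, top_k=256):
    """Pick the collocation points with the largest MC-Dropout predictive variance.
    Assumes a scalar-output model with dropout layers (and no BatchNorm, which
    would also change behavior in train mode)."""
    model.train()                               # keep dropout active at inference
    with torch.no_grad():
        preds = torch.stack([model(candidates) for _ in range(n_mc)])  # (n_mc, N, 1)
    var = preds.var(dim=0).squeeze(-1)          # per-point predictive variance, (N,)
    idx = torch.topk(var, top_k).indices
    return candidates[idx]                      # new training samples for the PINN loss
```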
[LG-35] Sample Weight Averaging for Stable Prediction
链接: https://arxiv.org/abs/2502.07414
作者: Han Yu,Yue He,Renzhe Xu,Dongbai Li,Jiayin Zhang,Wenchao Zou,Peng Cui
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The challenge of Out-of-Distribution (OOD) generalization poses a foundational concern for the application of machine learning algorithms to risk-sensitive areas. Inspired by traditional importance weighting and propensity weighting methods, prior approaches employ an independence-based sample reweighting procedure. They aim at decorrelating covariates to counteract the bias introduced by spurious correlations between unstable variables and the outcome, thus enhancing generalization and fulfilling stable prediction under covariate shift. Nonetheless, these methods are prone to experiencing an inflation of variance, primarily attributable to the reduced efficacy in utilizing training samples during the reweighting process. Existing remedies necessitate either environmental labels or substantially higher time costs along with additional assumptions and supervised information. To mitigate this issue, we propose SAmple Weight Averaging (SAWA), a simple yet efficacious strategy that can be universally integrated into various sample reweighting algorithms to decrease the variance and coefficient estimation error, thus boosting the covariate-shift generalization and achieving stable prediction across different environments. We prove its rationality and benefits theoretically. Experiments across synthetic datasets and real-world datasets consistently underscore its superiority against covariate shift.
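SAWA's central move is disarmingly simple: instead of trusting the sample weights from a single reweighting run, average them across iterations to damp the variance they inject. A minimal sketch, assuming the underlying reweighting algorithm exposes its per-sample weights at each iteration:

```python
import numpy as np

def sawa_average(weight_history):
    """Average per-sample weights collected across reweighting iterations,
    then renormalize so the mean weight is 1 (normalization is our choice)."""
    w = np.stack(weight_history).mean(axis=0)   # (n_iters, n) -> (n,)
    return w * (len(w) / w.sum())

# Usage: feed the averaged weights into any weighted ERM step (e.g. a weighted
# least-squares fit) in place of the final iteration's weights.
```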
[LG-36] Interpretable Rules for Online Failure Prediction: A Case Study on the Metro do Porto dataset
链接: https://arxiv.org/abs/2502.07394
作者: Matthias Jakobs,Bruno Veloso,Joao Gama
类目: Machine Learning (cs.LG)
*备注: Under submission at Information Fusion
点击查看摘要
Abstract:Due to their high predictive performance, predictive maintenance applications have increasingly been approached with Deep Learning techniques in recent years. However, as in other real-world application scenarios, the need for explainability is often stated but not sufficiently addressed. This study will focus on predicting failures on Metro trains in Porto, Portugal. While recent works have found high-performing deep neural network architectures that feature a parallel explainability pipeline, the generated explanations are fairly complicated and fall short of explaining why the failures are happening. This work proposes a simple online rule-based explainability approach with interpretable features that leads to straightforward, interpretable rules. We showcase our approach on MetroPT2 and find that three specific sensors on the Metro do Porto trains suffice to predict the failures present in the dataset with simple rules.
[LG-37] Effects of Random Edge-Dropping on Over-Squashing in Graph Neural Networks
链接: https://arxiv.org/abs/2502.07364
作者: Jasraj Singh,Keyue Jiang,Brooks Paige,Laura Toni
类目: Machine Learning (cs.LG)
*备注: 24 pages, 7 figures, 2 tables
点击查看摘要
Abstract:Message Passing Neural Networks (MPNNs) are a class of Graph Neural Networks (GNNs) that leverage the graph topology to propagate messages across increasingly larger neighborhoods. The message-passing scheme leads to two distinct challenges: over-smoothing and over-squashing. While several algorithms, e.g. DropEdge and its variants – DropNode, DropAgg and DropGNN – have successfully addressed the over-smoothing problem, their impact on over-squashing remains largely unexplored. This represents a critical gap in the literature as failure to mitigate over-squashing would make these methods unsuitable for long-range tasks. In this work, we take the first step towards closing this gap by studying the aforementioned algorithms in the context of over-squashing. We present novel theoretical results that characterize the negative effects of DropEdge on sensitivity between distant nodes, suggesting its unsuitability for long-range tasks. Our findings are easily extended to its variants, allowing us to build a comprehensive understanding of how they affect over-squashing. We evaluate these methods using real-world datasets, demonstrating their detrimental effects. Specifically, we show that while DropEdge-variants improve test-time performance in short range tasks, they deteriorate performance in long-range ones. Our theory explains these results as follows: random edge-dropping lowers the effective receptive field of GNNs, which although beneficial for short-range tasks, misaligns the models on long-range ones. This forces the models to overfit to short-range artefacts in the training set, resulting in poor generalization. Our conclusions highlight the need to re-evaluate various methods designed for training deep GNNs, with a renewed focus on modelling long-range interactions.
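DropEdge itself is a two-line operation, which makes the paper's receptive-field argument easy to experiment with: resample the kept edges each epoch and measure sensitivity between distant nodes. A sketch on a COO-style edge index (the array layout is an assumption):

```python
import numpy as np

def drop_edge(edge_index, p=0.2, rng=None):
    """DropEdge: independently drop a fraction p of edges each training epoch.
    edge_index: (2, E) integer array of [source; target] node indices."""
    rng = rng or np.random.default_rng()
    keep = rng.random(edge_index.shape[1]) >= p
    return edge_index[:, keep]
```

Because a fixed path of length L survives with probability (1 - p)^L, independent edge-dropping suppresses exactly the long paths that long-range tasks rely on — consistent with the negative results the paper proves.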
[LG-38] Neural Flow Samplers with Shortcut Models
链接: https://arxiv.org/abs/2502.07337
作者: Wuhao Chen,Zijing Ou,Yingzhen Li
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Sampling from unnormalized densities is a fundamental task across various domains. Flow-based samplers generate samples by learning a velocity field that satisfies the continuity equation, but this requires estimating the intractable time derivative of the partition function. While importance sampling provides an approximation, it suffers from high variance. To mitigate this, we introduce a velocity-driven Sequential Monte Carlo method combined with control variates to reduce variance. Additionally, we incorporate a shortcut model to improve efficiency by minimizing the number of sampling steps. Empirical results on both synthetic datasets and n-body system targets validate the effectiveness of our approach.
[LG-39] Long-term simulation of physical and mechanical behaviors using curriculum-transfer-learning based physics-informed neural networks
链接: https://arxiv.org/abs/2502.07325
作者: Yuan Guo,Zhuojia Fu,Jian Min,Shiyu Lin,Xiaoting Liu,Youssef F. Rashed,Xiaoying Zhuang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 31 pages, 18 figures
点击查看摘要
Abstract:This paper proposes a Curriculum-Transfer-Learning based physics-informed neural network (CTL-PINN) for long-term simulation of physical and mechanical behaviors. The main innovation of CTL-PINN lies in decomposing long-term problems into a sequence of short-term subproblems. Initially, the standard PINN is employed to solve the first sub-problem. As the simulation progresses, subsequent time-domain problems are addressed using a curriculum learning approach that integrates information from previous steps. Furthermore, transfer learning techniques are incorporated, allowing the model to effectively utilize prior training data and solve sequential time domain transfer problems. CTL-PINN combines the strengths of curriculum learning and transfer learning, overcoming the limitations of standard PINNs, such as local optimization issues, and addressing the inaccuracies over extended time domains encountered in CL-PINN and the low computational efficiency of TL-PINN. The efficacy and robustness of CTL-PINN are demonstrated through applications to nonlinear wave propagation, Kirchhoff plate dynamic response, and the hydrodynamic model of the Three Gorges Reservoir Area, showcasing its superior capability in addressing long-term computational challenges.
[LG-40] Learnable Residual-based Latent Denoising in Semantic Communication
链接: https://arxiv.org/abs/2502.07319
作者: Mingkai Xu,Yongpeng Wu,Yuxuan Shi,Xiang-Gen Xia,Wenjun Zhang,Ping Zhang
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: This paper has been accepted by IEEE Wireless Communications Letters
点击查看摘要
Abstract:A latent denoising semantic communication (SemCom) framework is proposed for robust image transmission over noisy channels. By incorporating a learnable latent denoiser into the receiver, the received signals are preprocessed to effectively remove the channel noise and recover the semantic information, thereby enhancing the quality of the decoded images. Specifically, a latent denoising mapping is established by an iterative residual learning approach to improve the denoising efficiency while ensuring stable performance. Moreover, channel signal-to-noise ratio (SNR) is utilized to estimate and predict the latent similarity score (SS) for conditional denoising, where the number of denoising steps is adapted based on the predicted SS sequence, further reducing the communication latency. Finally, simulations demonstrate that the proposed framework can effectively and efficiently remove the channel noise at various levels and reconstruct visually appealing images.
[LG-41] Generation of Drug-Induced Cardiac Reactions towards Virtual Clinical Trials
链接: https://arxiv.org/abs/2502.07297
作者: Qian Shao,Bang Du,Zepeng Li,Qiyuan Chen,Hongxia Xu,Jimeng Sun,Jian Wu,Jintai Chen
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Under review
点击查看摘要
Abstract:Clinical trials are pivotal in cardiac drug development, yet they often fail due to inadequate efficacy and unexpected safety issues, leading to significant financial losses. Using in-silico trials to replace a part of physical clinical trials, e.g., leveraging advanced generative models to generate drug-influenced electrocardiograms (ECGs), seems an effective method to reduce financial risk and potential harm to trial participants. While existing generative models have demonstrated progress in ECG generation, they fall short in modeling drug reactions due to limited fidelity and inability to capture individualized drug response patterns. In this paper, we propose a Drug-Aware Diffusion Model (DADM), which could simulate individualized drug reactions while ensuring fidelity. To ensure fidelity, we construct a set of ordinary differential equations to provide external physical knowledge (EPK) of the realistic ECG morphology. The EPK is used to adaptively constrain the morphology of the generated ECGs through a dynamic cross-attention (DCA) mechanism. Furthermore, we propose an extension of ControlNet to incorporate demographic and drug data, simulating individual drug reactions. We compare DADM with the other eight state-of-the-art ECG generative models on two real-world databases covering 8 types of drug regimens. The results demonstrate that DADM can more accurately simulate drug-induced changes in ECGs, improving the accuracy by at least 5.79% and recall by 8%.
[LG-42] Treatment Effect Estimation for Exponential Family Outcomes using Neural Networks with Targeted Regularization
链接: https://arxiv.org/abs/2502.07295
作者: Jiahong Li,Zeqin Yang,Jiayi Dan,Jixing Xu,Zhichao Zou,Peng Zhen,Jiecheng Guo
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Neural Networks (NNs) have become a natural choice for treatment effect estimation due to their strong approximation capabilities. Nevertheless, how to design NN-based estimators with desirable properties, such as low bias and double robustness, still remains a significant challenge. A common approach to address this is targeted regularization, which modifies the objective function of NNs. However, existing works on targeted regularization are limited to Gaussian-distributed outcomes, significantly restricting their applicability in real-world scenarios. In this work, we aim to bridge this gap by extending this framework to the broader exponential family outcomes. Specifically, we first derive the von Mises expansion of the Average Dose function of Canonical Functions (ADCF), which suggests how to construct a doubly robust estimator with good properties. Based on this, we develop a NN-based estimator for ADCF by generalizing functional targeted regularization to exponential families, and provide the corresponding theoretical convergence rate. Extensive experimental results demonstrate the effectiveness of our proposed model.
[LG-43] Supervised Contrastive Block Disentanglement
链接: https://arxiv.org/abs/2502.07281
作者: Taro Makino,Ji Won Park,Natasa Tagasovska,Takamasa Kudo,Paula Coelho,Jan-Christian Huetter,Heming Yao,Burkhard Hoeckendorf,Ana Carolina Leote,Stephen Ra,David Richmond,Kyunghyun Cho,Aviv Regev,Romain Lopez
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Real-world datasets often combine data collected under different experimental conditions. This yields larger datasets, but also introduces spurious correlations that make it difficult to model the phenomena of interest. We address this by learning two embeddings to independently represent the phenomena of interest and the spurious correlations. The embedding representing the phenomena of interest is correlated with the target variable y , and is invariant to the environment variable e . In contrast, the embedding representing the spurious correlations is correlated with e . The invariance to e is difficult to achieve on real-world datasets. Our primary contribution is an algorithm called Supervised Contrastive Block Disentanglement (SCBD) that effectively enforces this invariance. It is based purely on Supervised Contrastive Learning, and applies to real-world data better than existing approaches. We empirically validate SCBD on two challenging problems. The first problem is domain generalization, where we achieve strong performance on a synthetic dataset, as well as on Camelyon17-WILDS. We introduce a single hyperparameter \alpha to control the degree of invariance to e . When we increase \alpha to strengthen the degree of invariance, out-of-distribution performance improves at the expense of in-distribution performance. The second problem is batch correction, in which we apply SCBD to preserve biological signal and remove inter-well batch effects when modeling single-cell perturbations from 26 million Optical Pooled Screening images.
[LG-44] Beyond Confidence: Adaptive Abstention in Dual-Threshold Conformal Prediction for Autonomous System Perception
链接: https://arxiv.org/abs/2502.07255
作者: Divake Kumar,Nastaran Darabi,Sina Tayebati,Amit Ranjan Trivedi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Safety-critical perception systems require both reliable uncertainty quantification and principled abstention mechanisms to maintain safety under diverse operational conditions. We present a novel dual-threshold conformalization framework that provides statistically-guaranteed uncertainty estimates while enabling selective prediction in high-risk scenarios. Our approach uniquely combines a conformal threshold ensuring valid prediction sets with an abstention threshold optimized through ROC analysis, providing distribution-free coverage guarantees (\ge 1 - \alpha) while identifying unreliable predictions. Through comprehensive evaluation on CIFAR-100, ImageNet1K, and ModelNet40 datasets, we demonstrate superior robustness across camera and LiDAR modalities under varying environmental perturbations. The framework achieves exceptional detection performance (AUC: 0.993\to0.995) under severe conditions while maintaining high coverage (90.0%) and enabling adaptive abstention (13.5%\to63.4%\pm0.5) as environmental severity increases. For LiDAR-based perception, our approach demonstrates particularly strong performance, maintaining robust coverage (84.5%) while appropriately abstaining from unreliable predictions. Notably, the framework shows remarkable stability under heavy perturbations, with detection performance (AUC: 0.995\pm0.001) significantly outperforming existing methods across all modalities. Our unified approach bridges the gap between theoretical guarantees and practical deployment needs, offering a robust solution for safety-critical autonomous systems operating in challenging real-world conditions.
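The dual-threshold recipe can be pinned down concretely: a split-conformal quantile guarantees coverage of the prediction sets, while a separate abstention threshold (tuned via ROC analysis in the paper) decides when to defer. The sketch below assumes softmax outputs and uses 1 - p as the nonconformity score; the threshold values and the deferral rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores (here 1 - p_true).
    Assumes the calibration set is large enough that the level stays <= 1."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method='higher')

def predict_or_abstain(probs, q_hat, abstain_thresh=0.5):
    """probs: softmax vector for one input; returns (label or None, conformal set)."""
    pred_set = np.flatnonzero(1.0 - probs <= q_hat)    # covered w.p. >= 1 - alpha
    if probs.max() < abstain_thresh or len(pred_set) != 1:
        return None, pred_set                          # abstain / defer to a fallback
    return int(pred_set[0]), pred_set
```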
[LG-45] Simplifying Adversarially Robust PAC Learning with Tolerance
链接: https://arxiv.org/abs/2502.07232
作者: Hassan Ashtiani,Vinayak Pathak,Ruth Urner
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Adversarially robust PAC learning has proved to be challenging, with the currently best known learners [Montasser et al., 2021a] relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow up work considered a slightly relaxed version of the problem called adversarially robust learning with tolerance [Ashtiani et al., 2023, Bhattacharjee et al., 2023, Raman et al., 2024] and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class H. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on H. Even though our learner is improper, it is “almost proper” in the sense that it outputs a hypothesis that is “similar” to a hypothesis in H. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves comparable bounds to the previous (non-tolerant) semi-supervised algorithm of Attias et al. [2022a], but avoids the use of intricate subroutines from previous works, and is “almost proper.”
[LG-46] A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
链接: https://arxiv.org/abs/2502.07222
作者: Yiming Chen,Yuan Zhang,Yin Liu,Kun Yuan,Zaiwen Wen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.
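One outer iteration of the randomized-subspace idea is easy to state: draw a random low-dimensional basis, optimize only the coefficients within that subspace, fold the result back, and redraw. The sketch below uses plain gradient steps on a generic `grad_fn`; the Gaussian basis and inner-loop settings are illustrative of the memory argument, not the paper's exact algorithm.

```python
import numpy as np

def random_subspace_step(theta, grad_fn, r=32, inner_steps=10, lr=0.1, rng=None):
    """One outer iteration: optimize r coordinates of a random subspace of R^d."""
    rng = rng or np.random.default_rng()
    P = rng.standard_normal((theta.size, r)) / np.sqrt(r)  # random subspace basis
    z = np.zeros(r)                                        # subspace coordinates
    for _ in range(inner_steps):
        z -= lr * (P.T @ grad_fn(theta + P @ z))           # chain rule: project the gradient
    return theta + P @ z
```

Since all optimizer state lives in R^r rather than R^d, moment buffers shrink by a factor of roughly d/r, which is the source of the claimed savings on optimizer-state memory.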
[LG-47] Improve the Training Efficiency of DRL for Wireless Communication Resource Allocation: The Role of Generative Diffusion Models
链接: https://arxiv.org/abs/2502.07211
作者: Xinren Zhang,Jiadong Yu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Dynamic resource allocation in mobile wireless networks involves complex, time-varying optimization problems, motivating the adoption of deep reinforcement learning (DRL). However, most existing works rely on pre-trained policies, overlooking dynamic environmental changes that rapidly invalidate the policies. Periodic retraining becomes inevitable but incurs prohibitive computational costs and energy consumption, which are critical concerns for resource-constrained wireless systems. We identify three root causes of inefficient retraining: high-dimensional state spaces, suboptimal action space exploration-exploitation trade-offs, and reward design limitations. To overcome these limitations, we propose Diffusion-based Deep Reinforcement Learning (D2RL), which leverages generative diffusion models (GDMs) to holistically enhance all three DRL components. The iterative refinement process and distribution modelling of GDMs enable (1) the generation of diverse state samples to improve environmental understanding, (2) balanced action space exploration to escape local optima, and (3) the design of discriminative reward functions that better evaluate action quality. Our framework operates in two modes: Mode I leverages GDMs to explore reward spaces and design discriminative reward functions that rigorously evaluate action quality, while Mode II synthesizes diverse state samples to enhance environmental understanding and generalization. Extensive experiments demonstrate that D2RL achieves faster convergence and reduced computational costs over conventional DRL methods for resource allocation in wireless communications while maintaining competitive policy performance. This work underscores the transformative potential of GDMs in overcoming fundamental DRL training bottlenecks for wireless networks, paving the way for practical, real-time deployments.
[LG-48] Enhancing Physics-Informed Neural Networks Through Feature Engineering
链接: https://arxiv.org/abs/2502.07209
作者: Shaghayegh Fazliani,Zachary Frangella,Madeleine Udell
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Physics-Informed Neural Networks (PINNs) seek to solve partial differential equations (PDEs) with deep learning. Mainstream approaches that deploy fully-connected multi-layer deep learning architectures require prolonged training to achieve even moderate accuracy, while recent work on feature engineering allows higher accuracy and faster convergence. This paper introduces SAFE-NET, a Single-layered Adaptive Feature Engineering NETwork that achieves orders-of-magnitude lower errors with far fewer parameters than baseline feature engineering methods. SAFE-NET returns to basic ideas in machine learning, using Fourier features, a simplified single hidden layer network architecture, and an effective optimizer that improves the conditioning of the PINN optimization problem. Numerical results show that SAFE-NET converges faster and typically outperforms deeper networks and more complex architectures. It consistently uses fewer parameters – on average, 65% fewer than the competing feature engineering methods – while achieving comparable accuracy in less than 30% of the training epochs. Moreover, each SAFE-NET epoch is 95% faster than those of competing feature engineering approaches. These findings challenge the prevailing belief that modern PINNs effectively learn features in these scientific applications and highlight the efficiency gains possible through feature engineering.
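The abstract's "return to basics" is concrete enough to sketch: fixed random Fourier features feeding a single hidden layer. The module below follows that spirit in PyTorch; the frequency scale `sigma`, the layer widths, and the Tanh nonlinearity are our assumptions rather than SAFE-NET's published configuration.

```python
import torch
import torch.nn as nn

class FourierFeatureNet(nn.Module):
    """Single hidden layer on top of fixed random Fourier features -- a sketch
    in the spirit of feature-engineering-first PINNs (details are assumptions)."""
    def __init__(self, in_dim=1, n_freq=32, hidden=64, out_dim=1, sigma=2.0):
        super().__init__()
        self.register_buffer('B', torch.randn(in_dim, n_freq) * sigma)  # fixed frequencies
        self.net = nn.Sequential(nn.Linear(2 * n_freq, hidden), nn.Tanh(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):
        proj = 2 * torch.pi * (x @ self.B)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.net(feats)
```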
[LG-49] Fixed-Confidence Best Arm Identification with Decreasing Variance
链接: https://arxiv.org/abs/2502.07199
作者: Tamojeet Roychowdhury,Kota Srinivas Reddy,Krishna P Jagannathan,Sharayu Moharir
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 6 pages, 2 figures, accepted in the National Conference on Communications 2025
点击查看摘要
Abstract:We focus on the problem of best-arm identification in a stochastic multi-arm bandit with temporally decreasing variances for the arms’ rewards. We model arm rewards as Gaussian random variables with fixed means and variances that decrease with time. The cost incurred by the learner is modeled as a weighted sum of the time needed by the learner to identify the best arm, and the number of samples of arms collected by the learner before termination. Under this cost function, there is an incentive for the learner to not sample arms in all rounds, especially in the initial rounds. On the other hand, not sampling increases the termination time of the learner, which also increases cost. This trade-off necessitates new sampling strategies. We propose two policies. The first policy has an initial wait period with no sampling followed by continuous sampling. The second policy samples periodically and uses a weighted average of the rewards observed to identify the best arm. We provide analytical guarantees on the performance of both policies and supplement our theoretical results with simulations which show that our policies outperform the state-of-the-art policies for the classical best arm identification problem.
[LG-50] Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits
链接: https://arxiv.org/abs/2502.07193
作者: Long-Fei Li,Yu-Yang Qian,Peng Zhao,Zhi-Hua Zhou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) is a widely used approach for aligning Large Language Models (LLMs) with human preferences. While recent advancements have provided valuable insights into various stages and settings of RLHF, a comprehensive theoretical understanding of the entire RLHF pipeline remains lacking. Towards this end, we propose a unified framework for the RLHF pipeline from the view of contextual bandits and provide provable efficiency guarantees. In particular, we decompose the RLHF process into two distinct stages: (post-)training and deployment, exploring both passive and active data collection strategies during the training phase. By employing the Bradley-Terry preference model with a linearly parameterized reward function, we reformulate RLHF as a contextual preference bandit problem. We then develop novel algorithms for each stage, demonstrating significant improvements over existing approaches in both statistical and computational efficiency. Finally, we apply our method to train and deploy Llama-3-8B-Instruct on the Ultrafeedback-binarized dataset, and empirical results confirm the effectiveness of our approach.
[LG-51] Exploring Neural Network Pruning with Screening Methods
链接: https://arxiv.org/abs/2502.07189
作者: Mingyuan Wang,Yangzi Guo,Sida Liu,Yanwen Xiao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Deep neural networks (DNNs) such as convolutional neural networks (CNNs) for visual tasks, recurrent neural networks (RNNs) for sequence data, and transformer models for rich linguistic or multimodal tasks, achieved unprecedented performance on a wide range of tasks. The impressive performance of modern DNNs is partially attributed to their sheer scale. The latest deep learning models have tens to hundreds of millions of parameters which makes the inference processes resource-intensive. The high computational complexity of these networks prevents their deployment on resource-limited devices such as mobile platforms, IoT devices, and edge computing systems because these devices require energy-efficient and real-time processing capabilities. This paper proposes and evaluates a network pruning framework that eliminates non-essential parameters based on a statistical analysis of network component significance across classification categories. The proposed method uses screening methods coupled with a weighted scheme to assess connection and channel contributions for unstructured and structured pruning which allows for the elimination of unnecessary network elements without significantly degrading model performance. Extensive experimental validation on real-world vision datasets for both fully connected neural networks (FNNs) and CNNs has shown that the proposed framework produces competitive lean networks compared to the original networks. Moreover, the proposed framework outperforms state-of-the-art network pruning methods in two out of three cases.
[LG-52] Local Regularizers Are Not Transductive Learners
链接: https://arxiv.org/abs/2502.07187
作者: Sky Jafar,Julian Asilis,Shaddin Dughmi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages
点击查看摘要
Abstract:We partly resolve an open question raised by Asilis et al. (COLT 2024): whether the algorithmic template of local regularization – an intriguing generalization of explicit regularization, a.k.a. structural risk minimization – suffices to learn all learnable multiclass problems. Specifically, we provide a negative answer to this question in the transductive model of learning. We exhibit a multiclass classification problem which is learnable in both the transductive and PAC models, yet cannot be learned transductively by any local regularizer. The corresponding hypothesis class, and our proof, are based on principles from cryptographic secret sharing. We outline challenges in extending our negative result to the PAC model, leaving open the tantalizing possibility of a PAC/transductive separation with respect to local regularization.
[LG-53] MatrixKAN: Parallelized Kolmogorov-Arnold Network
链接: https://arxiv.org/abs/2502.07176
作者: Cale Coffman,Lizhong Chen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Kolmogorov-Arnold Networks (KAN) are a new class of neural network architecture representing a promising alternative to the Multilayer Perceptron (MLP), demonstrating improved expressiveness and interpretability. However, KANs suffer from slow training and inference speeds relative to MLPs due in part to the recursive nature of the underlying B-spline calculations. This issue is particularly apparent with respect to KANs utilizing high-degree B-splines, as the number of required non-parallelizable recursions is proportional to B-spline degree. We solve this issue by proposing MatrixKAN, a novel optimization that parallelizes B-spline calculations with matrix representation and operations, thus significantly improving effective computation time for models utilizing high-degree B-splines. In this paper, we demonstrate the superior scaling of MatrixKAN’s computation time relative to B-spline degree. Further, our experiments demonstrate speedups of approximately 40x relative to KAN, with significant additional speedup potential for larger datasets or higher spline degrees.
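The bottleneck MatrixKAN removes is the Cox-de Boor recursion, whose depth grows with the B-spline degree. For uniform knots, the same basis values come from one product with a precomputed coefficient matrix, which is the kind of parallel-friendly reformulation the paper exploits. The degree-3 case below uses the standard uniform cubic B-spline matrix; extending to other degrees means precomputing the corresponding matrix (this is an illustration of the principle, not the MatrixKAN implementation).

```python
import numpy as np

# Precomputed basis matrix for uniform cubic B-splines (degree 3); rows
# correspond to powers 1, u, u^2, u^3 of the local coordinate.
M3 = (1 / 6) * np.array([[ 1,  4,  1, 0],
                         [-3,  0,  3, 0],
                         [ 3, -6,  3, 0],
                         [-1,  3, -3, 1]])

def bspline_basis_matrix(u):
    """Evaluate the four active cubic B-spline basis functions at local
    coordinates u in [0, 1) -- one matmul instead of a depth-3 recursion."""
    U = np.stack([np.ones_like(u), u, u**2, u**3], axis=-1)  # (N, 4) power basis
    return U @ M3                                            # (N, 4) basis values
```

As a sanity check, u = 0 returns the familiar (1/6, 4/6, 1/6, 0) weights, and a whole batch of inputs is evaluated with a single matrix product with no sequential recursion.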
[LG-54] Bayesian Optimization for Building Social-Influence-Free Consensus
链接: https://arxiv.org/abs/2502.07166
作者: Masaki Adachi,Siu Lun Chau,Wenjie Xu,Anurag Singh,Michael A. Osborne,Krikamol Muandet
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 50 pages, 8 figures
点击查看摘要
Abstract:We introduce Social Bayesian Optimization (SBO), a vote-efficient algorithm for consensus-building in collective decision-making. In contrast to single-agent scenarios, collective decision-making encompasses group dynamics that may distort agents’ preference feedback, thereby impeding their capacity to achieve a social-influence-free consensus – the most preferable decision based on the aggregated agent utilities. We demonstrate that under mild rationality axioms, reaching social-influence-free consensus using noisy feedback alone is impossible. To address this, SBO employs a dual voting system: cheap but noisy public votes (e.g., show of hands in a meeting), and more accurate, though expensive, private votes (e.g., one-to-one interview). We model social influence using an unknown social graph and leverage the dual voting system to efficiently learn this graph. Our theoretical findings show that social graph estimation converges faster than the black-box estimation of agents’ utilities, allowing us to reduce reliance on costly private votes early in the process. This enables efficient consensus-building primarily through noisy public votes, which are debiased based on the estimated social graph to infer social-influence-free feedback. We validate the efficacy of SBO across multiple real-world applications, including thermal comfort, team building, travel negotiation, and energy trading collaboration.
[LG-55] Conditional Distribution Quantization in Machine Learning
链接: https://arxiv.org/abs/2502.07151
作者: Blaise Delattre,Sylvain Delattre,Alexandre Vérine,Alexandre Allauzen
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Conditional expectation \mathbb{E}(Y \mid X) often fails to capture the complexity of multimodal conditional distributions \mathcal{L}(Y \mid X). To address this, we propose using n-point conditional quantizations–functional mappings of X that are learnable via gradient descent–to approximate \mathcal{L}(Y \mid X). This approach adapts Competitive Learning Vector Quantization (CLVQ), tailored for conditional distributions. It goes beyond single-valued predictions by providing multiple representative points that better reflect multimodal structures. It enables the approximation of the true conditional law in the Wasserstein distance. The resulting framework is theoretically grounded and useful for uncertainty quantification and multimodal data generation tasks. For example, in computer vision inpainting tasks, multiple plausible reconstructions may exist for the same partially observed input image X. We demonstrate the effectiveness of our approach through experiments on synthetic and real-world datasets.
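A minimal version of such an n-point conditional quantizer is a set of n heads trained with a nearest-point (CLVQ-style) distortion loss, so that gradient descent only updates the representative closest to each observed y. The PyTorch sketch below captures that mechanism; the architecture and sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ConditionalQuantizer(nn.Module):
    """n learnable mappings of X; the closest point wins during training."""
    def __init__(self, x_dim, y_dim, n_points=8, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
            for _ in range(n_points)])

    def forward(self, x):                                 # (B, x_dim) -> (B, n, y_dim)
        return torch.stack([h(x) for h in self.heads], dim=1)

def quantization_loss(points, y):
    """Distortion of the nearest representative point, averaged over the batch."""
    d2 = ((points - y.unsqueeze(1)) ** 2).sum(-1)         # (B, n) squared distances
    return d2.min(dim=1).values.mean()
```

Because only the winning head receives gradient for each sample, the heads spread out to cover separate modes of \mathcal{L}(Y \mid X) instead of collapsing to the conditional mean.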
[LG-56] Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates NEURIPS2024
链接: https://arxiv.org/abs/2502.07141
作者: Jincheng Mei,Bo Dai,Alekh Agarwal,Sharan Vaswani,Anant Raj,Csaba Szepesvari,Dale Schuurmans
类目: Machine Learning (cs.LG)
*备注: Updated version for a paper published at NeurIPS 2024
点击查看摘要
Abstract:We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using any constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.
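The algorithm analyzed here is the classic softmax policy-gradient bandit, so the "any constant learning rate" claim is easy to probe numerically. A self-contained simulation follows; the reward noise level and horizon are arbitrary choices of ours.

```python
import numpy as np

def softmax_bandit(means, eta=1.0, T=10_000, rng=None):
    """Stochastic gradient (REINFORCE) bandit with a constant step size eta."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    theta = np.zeros(K)                         # action logits
    for _ in range(T):
        p = np.exp(theta - theta.max()); p /= p.sum()
        a = rng.choice(K, p=p)
        r = means[a] + 0.1 * rng.standard_normal()
        grad = -r * p                           # grad of E[r] w.r.t. logits:
        grad[a] += r                            #   r * (onehot(a) - p)
        theta += eta * grad                     # constant learning rate
    p = np.exp(theta - theta.max())
    return p / p.sum()

print(softmax_bandit(means=[0.2, 0.5, 0.9]))    # mass concentrates on the best arm
```

Rerunning with a large eta (say 10) still concentrates the policy on the best arm over long horizons, which is the behavior the paper proves holds almost surely.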
[LG-57] One-Shot Learning for k-SAT
链接: https://arxiv.org/abs/2502.07135
作者: Andreas Galanis,Leslie Ann Goldberg,Xusheng Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Consider a k-SAT formula \Phi where every variable appears at most d times, and let \sigma be a satisfying assignment of \Phi sampled proportionally to e^{\beta m(\sigma)} where m(\sigma) is the number of variables set to true and \beta is a real parameter. Given \Phi and \sigma, can we learn the value of \beta efficiently? This problem falls into a recent line of works about single-sample (“one-shot”) learning of Markov random fields. The k-SAT setting we consider here was recently studied by Galanis, Kandiros, and Kalavasis (SODA’24) where they showed that single-sample learning is possible when roughly d \leq 2^{k/6.45} and impossible when d \geq (k+1) 2^{k-1}. Crucially, for their impossibility results they used the existence of unsatisfiable instances which, aside from the gap in d, left open the question of whether the feasibility threshold for one-shot learning is dictated by the satisfiability threshold of k-SAT formulas of bounded degree. Our main contribution is to answer this question negatively. We show that one-shot learning for k-SAT is infeasible well below the satisfiability threshold; in fact, we obtain impossibility results for degrees d as low as k^2 when \beta is sufficiently large, and bootstrap this to small values of \beta when d scales exponentially with k, via a probabilistic construction. On the positive side, we simplify the analysis of the learning algorithm and obtain significantly stronger bounds on d in terms of \beta. In particular, for the uniform case \beta \rightarrow 0 that has been studied extensively in the sampling literature, our analysis shows that learning is possible under the condition d \lesssim 2^{k/2}. This is nearly optimal (up to constant factors) in the sense that it is known that sampling a uniformly-distributed satisfying assignment is NP-hard for d \gtrsim 2^{k/2}.
[LG-58] Fourier-enhanced Neural Networks For Systems Biology Applications
链接: https://arxiv.org/abs/2502.07129
作者: Enze Xu,Minghan Chen
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:In the field of systems biology, differential equations are commonly used to model biological systems, but solving them for large-scale and complex systems can be computationally expensive. Recently, the integration of machine learning and mathematical modeling has offered new opportunities for scientific discoveries in biology and health. The emerging physics-informed neural network (PINN) has been proposed as a solution to this problem. However, PINN can be computationally expensive and unreliable for complex biological systems. To address these issues, we propose the Fourier-enhanced Neural Networks for systems biology (SB-FNN). SB-FNN uses an embedded Fourier neural network with an adaptive activation function and a cyclic penalty function to optimize the prediction of biological dynamics, particularly for biological systems that exhibit oscillatory patterns. Experimental results on cellular and population models demonstrate that SB-FNN outperforms PINN in both accuracy and efficiency, making it a promising alternative approach for handling complex biological models. The proposed method achieved better performance on six biological models and is expected to replace PINN as the most advanced method in systems biology.
[LG-59] SAFE: Self-Supervised Anomaly Detection Framework for Intrusion Detection AAAI-25
链接: https://arxiv.org/abs/2502.07119
作者: Elvin Li,Zhengli Shang,Onat Gungor,Tajana Rosing
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by the AAAI-25 Workshop on Artificial Intelligence for Cyber Security (AICS)
点击查看摘要
Abstract:The proliferation of IoT devices has significantly increased network vulnerabilities, creating an urgent need for effective Intrusion Detection Systems (IDS). Machine Learning-based IDS (ML-IDS) offer advanced detection capabilities but rely on labeled attack data, which limits their ability to identify unknown threats. Self-Supervised Learning (SSL) presents a promising solution by using only normal data to detect patterns and anomalies. This paper introduces SAFE, a novel framework that transforms tabular network intrusion data into an image-like format, enabling Masked Autoencoders (MAEs) to learn robust representations of network behavior. The features extracted by the MAEs are then incorporated into a lightweight novelty detector, enhancing the effectiveness of anomaly detection. Experimental results demonstrate that SAFE outperforms the state-of-the-art anomaly detection method, Scale Learning-based Deep Anomaly Detection method (SLAD), by up to 26.2% and surpasses the state-of-the-art SSL-based network intrusion detection approach, Anomal-E, by up to 23.5% in F1-score.
[LG-60] Likelihood-Free Estimation for Spatiotemporal Hawkes processes with missing data and application to predictive policing
链接: https://arxiv.org/abs/2502.07111
作者: Pramit Das,Moulinath Banerjee,Yuekai Sun
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:With the growing use of AI technology, many police departments use forecasting software to predict probable crime hotspots and allocate patrolling resources effectively for crime prevention. The clustered nature of crime data makes self-exciting Hawkes processes a popular modeling choice. However, one significant challenge in fitting such models is the inherent missingness in crime data due to non-reporting, which can bias the estimated parameters of the predictive model, leading to inaccurate downstream hotspot forecasts, often resulting in over or under-policing in various communities, especially the vulnerable ones. Our work introduces a Wasserstein Generative Adversarial Networks (WGAN) driven likelihood-free approach to account for unreported crimes in Spatiotemporal Hawkes models. We demonstrate through empirical analysis how this methodology improves the accuracy of parametric estimation in the presence of data missingness, leading to more reliable and efficient policing strategies.
[LG-61] Game of Coding With an Unknown Adversary
链接: https://arxiv.org/abs/2502.07109
作者: Hanzaleh Akbarinodehi,Parsa Moradi,Mohammad Ali Maddah-Ali
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Motivated by emerging decentralized applications, the \emphgame of coding framework has been recently introduced to address scenarios where the adversary’s control over coded symbols surpasses the fundamental limits of traditional coding theory. Still, the reward mechanism available in decentralized systems, motivates the adversary to act rationally. While the decoder, as the data collector (DC), has an acceptance and rejection mechanism, followed by an estimation module, the adversary aims to maximize its utility, as an increasing function of (1) the chance of acceptance (to increase the reward), and (2) estimation error. On the other hand, the decoder also adjusts its acceptance rule to maximize its own utility, as (1) an increasing function of the chance of acceptance (to keep the system functional), (2) decreasing function of the estimation error. Prior works within this framework rely on the assumption that the game is complete, that is, both the DC and the adversary are fully aware of each other’s utility functions. However, in practice, the decoder is often unaware of the utility of the adversary. To address this limitation, we develop an algorithm enabling the DC to commit to a strategy that achieves within the vicinity of the equilibrium, without knowledge of the adversary’s utility function. Our approach builds on an observation that at the equilibrium, the relationship between the probability of acceptance and the mean squared error (MSE) follows a predetermined curve independent of the specific utility functions of the players. By exploiting this invariant relationship, the DC can iteratively refine its strategy based on observable parameters, converging to a near-optimal solution. We provide theoretical guarantees on sample complexity and accuracy of the proposed scheme.
[LG-62] Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring
链接: https://arxiv.org/abs/2502.07087
作者: Alex Heyman,Joel Zylberberg
类目: Machine Learning (cs.LG)
*备注: 23 pages (8 excluding references and appendices); 8 figures (3 excluding appendices)
点击查看摘要
Abstract:Contemporary large language models are powerful problem-solving tools, but they exhibit weaknesses in their reasoning abilities which ongoing research seeks to mitigate. We investigate graph coloring as a means of evaluating an LLM’s capacities for systematic step-by-step reasoning and possibility space exploration, as well as effects of semantic problem framing. We test Claude 3.5 Sonnet, Llama 3.1 405B, Gemini 1.5 Pro, GPT-4o, o1-mini, and DeepSeek-R1 on a dataset of k-coloring problems with 2 \leq k \leq 4 and vertex count 4 \leq n \leq 8, using partial algorithmic solvers to further categorize problems by difficulty. In addition to substantial but varying framing effects, we find that all models except o1-mini and R1 exhibit error rates above 60% on difficult problem types in all frames (above 15% for o1-mini and above 10% for R1), and no model achieves perfect accuracy even in the simple domain of 2-coloring 4-vertex graphs. Our results highlight both the considerable recent progress in LLM systematic reasoning and the limits of its reliability, especially in relation to increasing computational costs. We expect that more complex graph coloring problems, and procedural generation of arbitrary-complexity reasoning problems more broadly, offer further untapped potential for LLM benchmarking.
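Graph coloring makes a convenient LLM benchmark precisely because instances are cheap to generate and answers are cheap to verify. A sketch of the harness side — random instance generation, verification of a model-proposed coloring, and brute-force ground truth for the small sizes used in the paper (4 <= n <= 8) — with all names being our own:

```python
import itertools
import random

def random_graph(n=6, p=0.4, seed=0):
    """Erdos-Renyi-style edge list over n vertices."""
    rng = random.Random(seed)
    return [(u, v) for u, v in itertools.combinations(range(n), 2) if rng.random() < p]

def valid_coloring(edges, coloring):
    """Check a proposed (e.g., LLM-generated) coloring: no monochromatic edge."""
    return all(coloring[u] != coloring[v] for u, v in edges)

def is_k_colorable(edges, n, k):
    """Brute-force ground truth; fine for n <= 8, k <= 4 (at most 4^8 candidates)."""
    return any(valid_coloring(edges, c) for c in itertools.product(range(k), repeat=n))
```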
[LG-63] Fast Clustering of Categorical Big Data
链接: https://arxiv.org/abs/2502.07081
作者: Bipana Thapaliya,Yu Zhuang
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 12 pages, 3 figures
点击查看摘要
Abstract:The K-Modes algorithm, developed for clustering categorical data, is of high algorithmic simplicity but suffers from unreliable performances in clustering quality and clustering efficiency, both heavily influenced by the choice of initial cluster centers. In this paper, we investigate Bisecting K-Modes (BK-Modes), a successive bisecting process to find clusters, in examining how good the cluster centers out of the bisecting process will be when used as initial centers for the K-Modes. The BK-Modes works by splitting a dataset into multiple clusters iteratively with one cluster being chosen and bisected into two clusters in each iteration. We use the sum of distances of data to their cluster centers as the selection metric to choose a cluster to be bisected in each iteration. This iterative process stops when K clusters are produced. The centers of these K clusters are then used as the initial cluster centers for the K-Modes. Experimental studies of the BK-Modes were carried out and were compared against the K-Modes with multiple sets of initial cluster centers as well as the best of the existing methods we found so far in our survey. Experimental results indicated good performances of BK-Modes both in the clustering quality and efficiency for large datasets.
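The BK-Modes procedure is compact enough to sketch end to end: run K-Modes with k = 2 on whichever cluster currently has the largest sum of distances to its mode, repeat until K clusters exist, and hand the resulting modes to K-Modes as initial centers. The toy NumPy version below assumes integer-coded categories and omits edge cases (empty splits, clusters with fewer than two points); it illustrates the scheme, not the paper's optimized implementation.

```python
import numpy as np

def mode_of(X):
    """Column-wise mode of an integer-coded categorical matrix."""
    return np.array([np.bincount(col).argmax() for col in X.T])

def k_modes(X, centers, n_iter=20):
    """Bare-bones K-Modes: Hamming-distance assignment + mode update."""
    for _ in range(n_iter):
        labels = (X[:, None, :] != centers[None, :, :]).sum(-1).argmin(1)
        centers = np.stack([mode_of(X[labels == j]) for j in range(len(centers))])
    return centers, labels

def bk_modes_init(X, K, rng=None):
    """Bisecting K-Modes: grow K clusters by splitting the costliest one."""
    rng = rng or np.random.default_rng(0)
    clusters = [np.arange(len(X))]
    while len(clusters) < K:
        costs = [(X[idx] != mode_of(X[idx])).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(costs)))          # cluster to bisect
        seeds = X[idx][rng.choice(len(idx), 2, replace=False)]
        _, labels = k_modes(X[idx], seeds)
        clusters += [idx[labels == 0], idx[labels == 1]]
    return np.stack([mode_of(X[idx]) for idx in clusters])  # initial K-Modes centers
```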
[LG-64] Boosting of Classification Models with Human-in-the-Loop Computational Visual Knowledge Discovery
链接: https://arxiv.org/abs/2502.07039
作者: Alice Williams,Boris Kovalerchuk
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Preprint
点击查看摘要
Abstract:High-risk artificial intelligence and machine learning classification tasks, such as healthcare diagnosis, require accurate and interpretable prediction models. However, classifier algorithms typically sacrifice individual case-accuracy for overall model accuracy, limiting analysis of class overlap areas regardless of task significance. The Adaptive Boosting meta-algorithm, which won the 2003 Gödel Prize, analytically assigns higher weights to misclassified cases to reclassify. However, it relies on weaker base classifiers that are iteratively strengthened, limiting improvements from base classifiers. Combining visual and computational approaches enables selecting stronger base classifiers before boosting. This paper proposes moving boosting methodology from focusing on only misclassified cases to all cases in the class overlap areas using Computational and Interactive Visual Learning (CIVL) with a Human-in-the-Loop. It builds classifiers in lossless visualizations integrating human domain expertise and visual insights. A Divide and Classify process splits cases to simple and complex, classifying these individually through computational analysis and data visualization with lossless visualization spaces of Parallel Coordinates or other General Line Coordinates. After finding pure and overlap class areas simple cases in pure areas are classified, generating interpretable sub-models like decision rules in Propositional and First-order Logics. Only multidimensional cases in the overlap areas are losslessly visualized simplifying end-user cognitive tasks to identify difficult case patterns, including engineering features to form new classifiable patterns. Demonstration shows a perfectly accurate and losslessly interpretable model of the Iris dataset, and simulated data shows generalized benefits to accuracy and interpretability of models, increasing end-user confidence in discovered models.
[LG-65] Federated Sinkhorn
链接: https://arxiv.org/abs/2502.07021
作者: Jeremy Kulcsar,Vyacheslav Kungurtsev,Georgios Korpas,Giulio Giaconi,William Shoosmith
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this work we investigate the potential of solving the discrete Optimal Transport (OT) problem with entropy regularization in a federated learning setting. Recall that the celebrated Sinkhorn algorithm transforms the classical OT linear program into strongly convex constrained optimization, facilitating first order methods for otherwise intractably large problems. A common contemporary setting that remains an open problem as far as the application of Sinkhorn is the presence of data spread across clients with distributed inter-communication, either due to clients whose privacy is a concern, or simply by necessity of processing and memory hardware limitations. In this work we investigate various natural procedures, which we refer to as Federated Sinkhorn, that handle distributed environments where data is partitioned across multiple clients. We formulate the problem as minimizing the transport cost with an entropy regularization term, subject to marginal constraints, where block components of the source and target distribution vectors are locally known to clients corresponding to each block. We consider both synchronous and asynchronous variants as well as all-to-all and server-client communication topology protocols. Each procedure allows clients to compute local operations on their data partition while periodically exchanging information with others. We provide theoretical guarantees on convergence for the different variants under different possible conditions. We empirically demonstrate the algorithms’ performance on synthetic datasets and a real-world financial risk assessment application. The investigation highlights the subtle tradeoffs associated with computation and communication time in different settings and how they depend on problem size and sparsity.
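In the row-partitioned setting the classical Sinkhorn iteration federates naturally: each client rescales only its own block of the scaling vector u from its local marginal block, and the column scaling v is formed from aggregated client contributions. A synchronous toy version follows (dense shared cost matrix, two-marginal problem, and all structural choices are assumptions for illustration):

```python
import numpy as np

def federated_sinkhorn(C, a_blocks, b, eps=0.1, n_iter=200):
    """Synchronous sketch: client c owns row block c of the plan and the
    matching block of the source marginal; the target marginal b is shared."""
    K = np.exp(-C / eps)                                   # shared Gibbs kernel
    edges = np.cumsum([0] + [len(a) for a in a_blocks])    # row offsets per client
    u, v = np.ones(C.shape[0]), np.ones(C.shape[1])
    for _ in range(n_iter):
        contribs = []
        for c, a in enumerate(a_blocks):                   # local client updates
            s = slice(edges[c], edges[c + 1])
            u[s] = a / (K[s] @ v)                          # rescale own block of u
            contribs.append(K[s].T @ u[s])                 # message to the aggregator
        v = b / np.sum(contribs, axis=0)                   # aggregated column update
    return u[:, None] * K * v[None, :]                     # dense transport plan
```

Roughly speaking, the paper's synchronous/asynchronous and all-to-all/server-client variants differ in who aggregates these contributions and when, not in the algebra of the update.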
[LG-66] DROP: Poison Dilution via Knowledge Distillation for Federated Learning
链接: https://arxiv.org/abs/2502.07011
作者: Georgios Syros,Anshuman Suri,Farinaz Koushanfar,Cristina Nita-Rotaru,Alina Oprea
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Federated Learning is vulnerable to adversarial manipulation, where malicious clients can inject poisoned updates to influence the global model’s behavior. While existing defense mechanisms have made notable progress, they fail to protect against adversaries that aim to induce targeted backdoors under different learning and attack configurations. To address this limitation, we introduce DROP (Distillation-based Reduction Of Poisoning), a novel defense mechanism that combines clustering and activity-tracking techniques with extraction of benign behavior from clients via knowledge distillation to tackle stealthy adversaries that manipulate low data poisoning rates and diverse malicious client ratios within the federation. Through extensive experimentation, our approach demonstrates superior robustness compared to existing defenses across a wide range of learning configurations. Finally, we evaluate existing defenses and our method under the challenging setting of non-IID client data distribution and highlight the challenges of designing a resilient FL defense in this setting.
[LG-67] Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects
链接: https://arxiv.org/abs/2502.07005
作者: Tai Hoang,Huy Le,Philipp Becker,Vien Anh Ngo,Gerhard Neumann
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 pages main text, 30 pages all included; journal reference: ICLR 2025, The Thirteenth International Conference on Learning Representations
点击查看摘要
Abstract:Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterogeneous graph that comprises smaller sub-graphs, such as actuators and objects, accompanied by different edge types describing their interactions. This graph representation serves as a unified structure for both rigid and deformable objects tasks, and can be extended further to tasks comprising multiple actuators. To evaluate this setup, we present a novel and challenging reinforcement learning benchmark, including rigid insertion of diverse objects, as well as rope and cloth manipulation with multiple end-effectors. These tasks present a large search space, as both the initial and target configurations are uniformly sampled in 3D space. To address this issue, we propose a novel graph-based policy model, dubbed Heterogeneous Equivariant Policy (HEPi), utilizing SE(3) equivariant message passing networks as the main backbone to exploit the geometric symmetry. In addition, by modeling explicit heterogeneity, HEPi can outperform Transformer-based and non-heterogeneous equivariant policies in terms of average returns, sample efficiency, and generalization to unseen objects.
[LG-68] Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models
链接: https://arxiv.org/abs/2502.06999
作者: Siddarth Venkatraman,Mohsin Hasan,Minsu Kim,Luca Scimeca,Marcin Sendera,Yoshua Bengio,Glen Berseth,Nikolay Malkin
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x} = f_\theta(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_\theta(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})\,r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_\theta(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints of interest, the posterior in the noise space is smoother than the posterior in the data space, making it more amenable to such amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably both with current amortized and non-amortized inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
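To make the noise-space posterior concrete, the toy sketch below targets $p(\mathbf{z}) \propto \mathcal{N}(\mathbf{z};0,I)\,r(f_\theta(\mathbf{z}),\mathbf{y})$ with brute-force self-normalized importance sampling. The paper instead trains a diffusion sampler with RL to amortize this; `f_theta` and `r` here are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-ins: a "generator" f mapping Gaussian noise to data,
# and a constraint function r(x, y) favoring x close to an observation y.
def f_theta(z):
    return np.tanh(z) * 2.0                      # deterministic pushforward

def r(x, y, sigma=0.5):
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2 * sigma ** 2))

# Self-normalized importance sampling in noise space:
# the target over z is proportional to N(z; 0, I) * r(f(z), y).
rng = np.random.default_rng(0)
y = np.array([1.0, -0.5])
z = rng.standard_normal((10_000, 2))             # proposals from the prior
w = r(f_theta(z), y)
w /= w.sum()
idx = rng.choice(len(z), size=1000, p=w)         # resample posterior draws
posterior_x = f_theta(z[idx])
print(posterior_x.mean(axis=0))
```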
[LG-69] Machine Learning Fleet Efficiency: Analyzing and Optimizing Large-Scale Google TPU Systems with ML Productivity Goodput
链接: https://arxiv.org/abs/2502.06982
作者: Arissa Wongpanich,Tayo Oguntebi,Jose Baiocchi Paredes,Yu Emma Wang,Phitchaya Mangpo Phothilimthana,Ritwika Mitra,Zongwei Zhou,Naveen Kumar,Vijay Janapa Reddi
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent years have seen the emergence of machine learning (ML) workloads deployed in warehouse-scale computing (WSC) settings, also known as ML fleets. As the computational demands placed on ML fleets have increased due to the rise of large models and growing demand for ML applications, it has become increasingly critical to measure and improve the efficiency of such systems. However, there is not yet an established methodology to characterize ML fleet performance and identify potential performance optimizations accordingly. This paper presents a large-scale analysis of an ML fleet based on Google’s TPUs, introducing a framework to capture fleet-wide efficiency, systematically evaluate performance characteristics, and identify optimization strategies for the fleet. We begin by defining an ML fleet, outlining its components, and analyzing an example Google ML fleet in production comprising thousands of accelerators running diverse workloads. Our study reveals several critical insights: first, ML fleets extend beyond the hardware layer, with model, data, framework, compiler, and scheduling layers significantly impacting performance; second, the heterogeneous nature of ML fleets poses challenges in characterizing individual workload performance; and third, traditional utilization-based metrics prove insufficient for ML fleet characterization. To address these challenges, we present the “ML Productivity Goodput” (MPG) metric to measure ML fleet efficiency. We show how to leverage this metric to characterize the fleet across the ML system stack. We also present methods to identify and optimize performance bottlenecks using MPG, providing strategies for managing warehouse-scale ML systems in general. Lastly, we demonstrate quantitative evaluations from applying these methods to a real ML fleet for internal-facing Google TPU workloads, where we observed tangible improvements.
[LG-70] User-Preference Meets Pareto-Optimality: Multi-Objective Bayesian Optimization with Local Gradient Search
链接: https://arxiv.org/abs/2502.06971
作者: Joshua Hang Sai Ip,Ankush Chakrabarty,Ali Mesbah,Diego Romeres
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Incorporating user preferences into multi-objective Bayesian optimization (MOBO) allows for personalization of the optimization procedure. Preferences are often abstracted in the form of an unknown utility function, estimated through pairwise comparisons of potential outcomes. However, utility-driven MOBO methods can yield solutions that are dominated by nearby solutions, as non-dominance is not enforced. Additionally, classical MOBO commonly relies on estimating the entire Pareto-front to identify the Pareto-optimal solutions, which can be expensive and ignore user preferences. Here, we present a new method, termed preference-utility-balanced MOBO (PUB-MOBO), that allows users to disambiguate between near-Pareto candidate solutions. PUB-MOBO combines utility-based MOBO with local multi-gradient descent to refine user-preferred solutions to be near-Pareto-optimal. To this end, we propose a novel preference-dominated utility function that concurrently preserves user-preferences and dominance amongst candidate solutions. A key advantage of PUB-MOBO is that the local search is restricted to a (small) region of the Pareto-front directed by user preferences, alleviating the need to estimate the entire Pareto-front. PUB-MOBO is tested on three synthetic benchmark problems: DTLZ1, DTLZ2 and DH1, as well as on three real-world problems: Vehicle Safety, Conceptual Marine Design, and Car Side Impact. PUB-MOBO consistently outperforms state-of-the-art competitors in terms of proximity to the Pareto-front and utility regret across all the problems.
[LG-71] Model Diffusion for Certifiable Few-shot Transfer Learning
链接: https://arxiv.org/abs/2502.06970
作者: Fady Rezk,Royson Lee,Henry Gouk,Timothy Hospedales,Minyoung Kim
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In modern large-scale deep learning, a prevalent and effective workflow for solving low-data problems is adapting powerful pre-trained foundation models (FMs) to new tasks via parameter-efficient fine-tuning (PEFT). However, while empirically effective, the resulting solutions lack generalisation guarantees to certify their accuracy - which may be required for ethical or legal reasons prior to deployment in high-importance applications. In this paper we develop a novel transfer learning approach that is designed to facilitate non-vacuous learning theoretic generalisation guarantees for downstream tasks, even in the low-shot regime. Specifically, we first use upstream tasks to train a distribution over PEFT parameters. We then learn the downstream task by a sample-and-evaluate procedure – sampling plausible PEFTs from the trained diffusion model and selecting the one with the highest likelihood on the downstream data. Crucially, this confines our model hypothesis to a finite set of PEFT samples. In contrast to learning in the typical continuous hypothesis spaces of neural network weights, this facilitates tighter risk certificates. We instantiate our bound and show non-trivial generalization guarantees compared to existing learning approaches which lead to vacuous bounds in the low-shot regime.
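A schematic of the sample-and-evaluate step is sketched below; `sample_peft` and `log_likelihood` are hypothetical placeholders for the trained diffusion model over PEFT parameters and the downstream data likelihood, respectively.

```python
import numpy as np

# Placeholders, not the paper's code: a fake adapter sampler and a fake score.
def sample_peft(rng):
    return rng.standard_normal(16)               # a mock adapter vector

def log_likelihood(peft, data):
    return -np.sum((peft - data.mean()) ** 2)    # a mock downstream score

rng = np.random.default_rng(0)
data = rng.standard_normal(20)                   # few-shot downstream data

# Sample-and-evaluate: the hypothesis class is this finite candidate set,
# which is what enables the tighter, non-vacuous risk certificates.
candidates = [sample_peft(rng) for _ in range(100)]
best = max(candidates, key=lambda p: log_likelihood(p, data))
```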
[LG-72] Analytic Personalized Federated Meta-Learning
链接: https://arxiv.org/abs/2502.06915
作者: Shunxian Gu,Chaoqun You,Deke Guo,Zhihao Qu,Bangbang Ren,Zaipeng Xie,Lailong Luo
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Analytic federated learning (AFL), which updates model weights only once by using closed-form least-squares (LS) solutions, can greatly reduce the training time of gradient-free federated learning (FL). The current AFL framework cannot support deep neural network (DNN) training, which hinders its implementation on complex machine learning tasks. Meanwhile, it overlooks the heterogeneous data distribution problem that restricts a single global model from performing well on each client's task. To overcome the first challenge, we propose an AFL framework, namely FedACnnL, in which we resort to a novel local analytic learning method (ACnnL) and model the training of each layer as a distributed LS problem. For the second challenge, we propose an analytic personalized federated meta-learning framework, namely pFedACnnL, which is inherited from FedACnnL. In pFedACnnL, clients with similar data distributions share a common robust global model for fast adaptation to local tasks in an analytic manner. FedACnnL is theoretically proven to require significantly shorter training time than conventional zeroth-order (i.e. gradient-free) FL frameworks on DNN training, with a reduction ratio of 98% in the experiments. Meanwhile, pFedACnnL achieves state-of-the-art (SOTA) model performance in most cases of convex and non-convex settings, compared with the previous SOTA frameworks.
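The local analytic ingredient is a closed-form least-squares solve per layer. A minimal sketch of such a ridge solution is shown below (the distributed aggregation across clients, and ACnnL's specific target construction, are omitted and assumed):

```python
import numpy as np

def analytic_layer_fit(X, Y, lam=1e-3):
    """Closed-form ridge LS fit of one layer: W = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Toy: fit a layer mapping activations to targets in one shot (no gradients)
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32))   # layer inputs (activations)
Y = rng.standard_normal((256, 8))    # layer targets
W = analytic_layer_fit(X, Y)
print(np.linalg.norm(X @ W - Y))     # residual of the one-shot fit
```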
[LG-73] Polynomial Regret Concentration of UCB for Non-Deterministic State Transitions
链接: https://arxiv.org/abs/2502.06900
作者: Can Cömer,Jannis Blüml,Cedric Derstroff,Kristian Kersting
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注: 10 pages, 5 figures
点击查看摘要
Abstract:Monte Carlo Tree Search (MCTS) has proven effective in solving decision-making problems in perfect information settings. However, its application to stochastic and imperfect information domains remains limited. This paper extends the theoretical framework of MCTS to stochastic domains by addressing non-deterministic state transitions, where actions lead to probabilistic outcomes. Specifically, building on the work of Shah et al. (2020), we derive polynomial regret concentration bounds for the Upper Confidence Bound algorithm in multi-armed bandit problems with stochastic transitions, offering improved theoretical guarantees. Our primary contribution is proving that these bounds also apply to non-deterministic environments, ensuring robust performance in stochastic settings. This broadens the applicability of MCTS to real-world decision-making problems with probabilistic outcomes, such as in autonomous systems and financial decision-making.
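For reference, the Upper Confidence Bound index analyzed in the paper follows the standard UCB rule; a minimal multi-armed bandit sketch is given below. In the MCTS setting, the same index rule is applied at each node of the search tree, with stochastic transitions determining which child state is reached.

```python
import numpy as np

def ucb1(pull, n_arms, horizon, c=2.0, seed=0):
    """Standard UCB algorithm for stochastic multi-armed bandits."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                          # play each arm once
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))  # optimism in face of uncertainty
        reward = pull(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts

# Bernoulli arms with success probabilities 0.2, 0.5, 0.7
probs = [0.2, 0.5, 0.7]
pull = lambda a, rng: float(rng.random() < probs[a])
print(ucb1(pull, 3, 5000)[1])  # most pulls concentrate on the best arm
```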
[LG-74] CluStRE: Streaming Graph Clustering with Multi-Stage Refinement
链接: https://arxiv.org/abs/2502.06879
作者: Adil Chhabra,Shai Dorian Peretz,Christian Schulz
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:We present CluStRE, a novel streaming graph clustering algorithm that balances computational efficiency with high-quality clustering using multi-stage refinement. Unlike traditional in-memory clustering approaches, CluStRE processes graphs in a streaming setting, significantly reducing memory overhead while leveraging re-streaming and evolutionary heuristics to improve solution quality. Our method dynamically constructs a quotient graph, enabling modularity-based optimization while efficiently handling large-scale graphs. We introduce multiple configurations of CluStRE to provide trade-offs between speed, memory consumption, and clustering quality. Experimental evaluations demonstrate that CluStRE improves solution quality by 89.8%, operates 2.6 times faster, and uses less than two-thirds of the memory required by the state-of-the-art streaming clustering algorithm on average. Moreover, our strongest mode enhances solution quality by up to 150% on average. With this, CluStRE achieves comparable solution quality to in-memory algorithms, i.e. over 96% of the quality of clustering approaches, including Louvain, effectively bridging the gap between streaming and traditional clustering methods.
[LG-75] Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification
链接: https://arxiv.org/abs/2502.06878
作者: Sukumar Kishanthan,Asela Hevapathige
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Despite extensive research spanning several decades, class imbalance is still considered a profound difficulty for both machine learning and deep learning models. While data oversampling is the foremost technique to address this issue, traditional sampling techniques are often decoupled from the training phase of the predictive model, resulting in suboptimal representations. To address this, we propose a novel learning framework that can generate synthetic data instances in a data-driven manner. The proposed framework formulates the oversampling process as a composition of discrete decision criteria, thereby enhancing the representation power of the model’s learning process. Extensive experiments on the imbalanced classification task demonstrate the superiority of our framework over state-of-the-art algorithms.
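For contrast with the proposed data-driven framework, the classical decoupled baseline synthesizes minority points by interpolating between neighbors; a SMOTE-style sketch follows. The paper's framework instead learns the generation criteria jointly with the predictive model, which this baseline does not do.

```python
import numpy as np

def interpolate_oversample(X_min, n_new, k=5, seed=0):
    """SMOTE-style baseline: synthesize minority points by interpolating
    between a sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]            # skip the point itself
        j = rng.choice(nbrs)
        t = rng.random()
        out.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(1)
X_minority = rng.standard_normal((30, 4))
X_synth = interpolate_oversample(X_minority, n_new=60)
```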
[LG-76] WirelessGPT : A Generative Pre-trained Multi-task Learning Framework for Wireless Communication
链接: https://arxiv.org/abs/2502.06877
作者: Tingting Yang,Ping Zhang,Mengfan Zheng,Yuxuan Shi,Liwen Jing,Jianbo Huang,Nan Li
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures
点击查看摘要
Abstract:This paper introduces WirelessGPT, a pioneering foundation model specifically designed for multi-task learning in wireless communication and sensing. Specifically, WirelessGPT leverages large-scale wireless channel datasets for unsupervised pretraining and extracting universal channel representations, which captures complex spatiotemporal dependencies. In fact, this task-agnostic design adapts WirelessGPT seamlessly to a wide range of downstream tasks, using a unified representation with minimal fine-tuning. By unifying communication and sensing functionalities, WirelessGPT addresses the limitations of task-specific models, offering a scalable and efficient solution for integrated sensing and communication (ISAC). With an initial parameter size of around 80 million, WirelessGPT demonstrates significant improvements over conventional methods and smaller AI models, reducing reliance on large-scale labeled data. As the first foundation model capable of supporting diverse tasks across different domains, WirelessGPT establishes a new benchmark, paving the way for future advancements in multi-task wireless systems.
[LG-77] Deep Ritz method with Fourier feature mapping: A deep learning approach for solving variational models of microstructure
链接: https://arxiv.org/abs/2502.06865
作者: Ensela Mema,Ting Wang,Jaroslaw Knap
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper presents a novel approach that combines the Deep Ritz Method (DRM) with Fourier feature mapping to solve minimization problems comprised of multi-well, non-convex energy potentials. These problems present computational challenges as they lack a global minimum. Through an investigation of three benchmark problems in both 1D and 2D, we observe that DRM suffers from spectral bias pathology, limiting its ability to learn solutions with high frequencies. To overcome this limitation, we modify the method by introducing Fourier feature mapping. This modification involves applying a Fourier mapping to the input layer before it passes through the hidden and output layers. Our results demonstrate that Fourier feature mapping enables DRM to generate high-frequency, multiscale solutions for the benchmark problems in both 1D and 2D, offering a promising advancement in tackling complex non-convex energy minimization problems.
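The Fourier feature mapping applied to the input layer takes the standard form $\gamma(x) = [\cos(2\pi Bx), \sin(2\pi Bx)]$ with a random frequency matrix $B$; a minimal sketch follows, where the frequency scale is an assumed hyperparameter rather than a value from the paper.

```python
import numpy as np

def fourier_features(x, B):
    """Map inputs x (n, d) through gamma(x) = [cos(2*pi*x@B^T), sin(2*pi*x@B^T)]."""
    proj = 2.0 * np.pi * x @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

# Random Gaussian frequency matrix; the scale controls which frequencies the
# downstream network can represent easily, countering spectral bias.
rng = np.random.default_rng(0)
d_in, n_features, scale = 1, 64, 10.0
B = rng.standard_normal((n_features, d_in)) * scale
x = np.linspace(0.0, 1.0, 128)[:, None]
phi = fourier_features(x, B)          # (128, 128): input to the DRM network
```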
[LG-78] Poincaré Inequality for Local Log-Polyak-Lojasiewicz Measures : Non-asymptotic Analysis in Low-temperature Regime
链接: https://arxiv.org/abs/2502.06862
作者: Yun Gong,Zebang Shen,Niao He
类目: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Functional Analysis (math.FA); Probability (math.PR); Machine Learning (stat.ML)
*备注: 29 pages. A short version which only includes the case of $\alpha=2$. Please refer to the first version ( arXiv:2501.00429 ) for the case of $\alpha \neq 2$
点击查看摘要
Abstract:Potential functions in highly pertinent applications, such as deep learning in the over-parameterized regime, are empirically observed to admit non-isolated minima. To understand the convergence behavior of stochastic dynamics in such landscapes, we propose to study the class of log-PL measures $\mu_\epsilon \propto \exp(-V/\epsilon)$, where the potential $V$ satisfies a local Polyak-Łojasiewicz (PŁ) inequality, and its set of local minima is provably connected. Notably, potentials in this class can exhibit local maxima, and we characterize the optimal set $S$ to be a compact $\mathcal{C}^2$ embedded submanifold of $\mathbb{R}^d$ without boundary. The non-contractibility of $S$ distinguishes our function class from the classical convex setting topologically. Moreover, the embedding structure induces a naturally defined Laplace-Beltrami operator on $S$, and we show that its first non-trivial eigenvalue provides an $\epsilon$-independent lower bound for the Poincaré constant in the Poincaré inequality of $\mu_\epsilon$. As a direct consequence, Langevin dynamics with such a non-convex potential $V$ and diffusion coefficient $\epsilon$ converges to its equilibrium $\mu_\epsilon$ at a rate of $\tilde{\mathcal{O}}(1/\epsilon)$, provided $\epsilon$ is sufficiently small. Here $\tilde{\mathcal{O}}$ hides logarithmic terms.
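For reference, the Poincaré inequality whose constant the abstract bounds takes the standard form below (a generic statement, not notation specific to the paper):

```latex
% Poincaré inequality for the measure \mu_\epsilon with constant C_P:
\mathrm{Var}_{\mu_\epsilon}(f)
  \;\le\; C_P \int_{\mathbb{R}^d} \|\nabla f\|^2 \, d\mu_\epsilon
  \qquad \text{for all smooth } f .
```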
[LG-79] A Deep Learning Framework Integrating CNN and BiLSTM for Financial Systemic Risk Analysis and Prediction
链接: https://arxiv.org/abs/2502.06847
作者: Yu Cheng,Zhen Xu,Yuan Chen,Yuhan Wang,Zhenghao Lin,Jinsong Liu
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
点击查看摘要
Abstract:This study proposes a deep learning model based on the combination of convolutional neural network (CNN) and bidirectional long short-term memory network (BiLSTM) for discriminant analysis of financial systemic risk. The model first uses CNN to extract local patterns of multidimensional features of financial markets, and then models the bidirectional dependency of time series through BiLSTM, to comprehensively characterize the changing laws of systemic risk in spatial features and temporal dynamics. The experiment is based on real financial data sets. The results show that the model is significantly superior to traditional single models (such as BiLSTM, CNN, Transformer, and TCN) in terms of accuracy, recall, and F1 score. The F1-score reaches 0.88, showing extremely high discriminant ability. This shows that the joint strategy of combining CNN and BiLSTM can not only fully capture the complex patterns of market data but also effectively deal with the long-term dependency problem in time series data. In addition, this study also explores the robustness of the model in dealing with data noise and processing high-dimensional data, providing strong support for intelligent financial risk management. In the future, the research will further optimize the model structure, introduce methods such as reinforcement learning and multimodal data analysis, and improve the efficiency and generalization ability of the model to cope with a more complex financial environment.
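A generic CNN + BiLSTM architecture of the kind described can be sketched in PyTorch as follows; layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Conv1d layers extract local patterns over the time axis, a bidirectional
    LSTM models two-way temporal dependency, and a linear head emits logits."""
    def __init__(self, n_features, hidden=64, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):               # x: (batch, seq_len, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # back to (B, T, C)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])    # logits from the last time step

logits = CNNBiLSTM(n_features=8)(torch.randn(4, 30, 8))
```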
[LG-80] TorchResist: Open-Source Differentiable Resist Simulator
链接: https://arxiv.org/abs/2502.06838
作者: Zixiao Wang,Jieya Zhou,Su Zheng,Shuo Yin,Kaichao Liang,Shoubo Hu,Xiao Chen,Bei Yu
类目: Machine Learning (cs.LG)
*备注: SPIE Advanced Lithography + Patterning, 2025
点击查看摘要
Abstract:Recent decades have witnessed remarkable advancements in artificial intelligence (AI), including large language models (LLMs), image and video generative models, and embodied AI systems. These advancements have led to an explosive increase in the demand for computational power, challenging the limits of Moore’s Law. Optical lithography, a critical technology in semiconductor manufacturing, faces significant challenges due to its high costs. To address this, various lithography simulators have been developed. However, many of these simulators are limited by their inadequate photoresist modeling capabilities. This paper presents TorchResist, an open-source, differentiable photoresist simulator. TorchResist employs an analytical approach to model the photoresist process, functioning as a white-box system with at most twenty interpretable parameters. Leveraging modern differentiable programming techniques and parallel computing on GPUs, TorchResist enables seamless co-optimization with other tools across multiple related tasks. Our experimental results demonstrate that TorchResist achieves superior accuracy and efficiency compared to existing solutions. The source code is publicly available.
[LG-81] Comparison of CNN-based deep learning architectures for unsteady CFD acceleration on small datasets
链接: https://arxiv.org/abs/2502.06837
作者: Sangam Khanal,Shilaj Baral,Joongoo Jeon
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 9 figures, 3 Tables
[LG-82] Reinforcement Learning on AYA Dyads to Enhance Medication Adherence
链接: https://arxiv.org/abs/2502.06835
作者: Ziping Xu,Hinal Jajal,Sung Won Choi,Inbal Nahum-Shani,Guy Shani,Alexandra M. Psihogios,Pei-Yao Hung,Susan Murphy
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Medication adherence is critical for the recovery of adolescents and young adults (AYAs) who have undergone hematopoietic cell transplantation (HCT). However, maintaining adherence is challenging for AYAs after hospital discharge, who experience both individual (e.g. physical and emotional symptoms) and interpersonal barriers (e.g., relational difficulties with their care partner, who is often involved in medication management). To optimize the effectiveness of a three-component digital intervention targeting both members of the dyad as well as their relationship, we propose a novel Multi-Agent Reinforcement Learning (MARL) approach to personalize the delivery of interventions. By incorporating the domain knowledge, the MARL framework, where each agent is responsible for the delivery of one intervention component, allows for faster learning compared with a flattened agent. Evaluation using a dyadic simulator environment, based on real clinical data, shows a significant improvement in medication adherence (approximately 3%) compared to purely random intervention delivery. The effectiveness of this approach will be further evaluated in an upcoming trial.
[LG-83] RLOMM: An Efficient and Robust Online Map Matching Framework with Reinforcement Learning SIGMOD2025
链接: https://arxiv.org/abs/2502.06825
作者: Minxiao Chen,Haitao Yuan,Nan Jiang,Zhihan Zheng,Sai Wu,Ao Zhou,Shangguang Wang
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Accepted by SIGMOD 2025
点击查看摘要
Abstract:Online map matching is a fundamental problem in location-based services, aiming to incrementally match trajectory data step-by-step onto a road network. However, existing methods fail to meet the needs for efficiency, robustness, and accuracy required by large-scale online applications, making this task still a challenging problem. This paper introduces a novel framework that achieves high accuracy and efficient matching while ensuring robustness in handling diverse scenarios. To improve efficiency, we begin by modeling the online map matching problem as an Online Markov Decision Process (OMDP) based on its inherent characteristics. This approach helps efficiently merge historical and real-time data, reducing unnecessary calculations. Next, to enhance the model’s robustness, we design a reinforcement learning method, enabling robust handling of real-time data from dynamically changing environments. In particular, we propose a novel model learning process and a comprehensive reward function, allowing the model to make reasonable current matches from a future-oriented perspective, and to continuously update and optimize during the decision-making process based on feedback. Lastly, to address the heterogeneity between trajectories and roads, we design distinct graph structures, facilitating efficient representation learning through graph and recurrent neural networks. To further align trajectory and road data, we introduce contrastive learning to decrease their distance in the latent space, thereby promoting effective integration of the two. Extensive evaluations on three real-world datasets confirm that our method significantly outperforms existing state-of-the-art solutions in terms of accuracy, efficiency and robustness.
[LG-84] CTR-Driven Advertising Image Generation with Multimodal Large Language Models WWW2025
链接: https://arxiv.org/abs/2502.06823
作者: Xingye Chen,Wei Feng,Zhenbang Du,Weizhen Wang,Yanyin Chen,Haohan Wang,Linkai Liu,Yaoyu Li,Jinyuan Zhao,Yu Li,Zheng Zhang,Jingjing Lv,Junjie Shen,Zhangang Lin,Jingping Shao,Yuanjie Shao,Xinge You,Changxin Gao,Nong Sang
类目: Machine Learning (cs.LG); Graphics (cs.GR); Information Retrieval (cs.IR)
*备注: Accepted to WWW 2025
点击查看摘要
Abstract:In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods generate background for products primarily focus on the aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. Firstly, we build targeted pre-training tasks, and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which can jointly utilize multimodal features and accurately reflect user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments have demonstrated that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: this https URL.
[LG-85] Functional 3D Scene Synthesis through Human-Scene Optimization
链接: https://arxiv.org/abs/2502.06819
作者: Yao Wei,Matteo Toso,Pietro Morerio,Michael Ying Yang,Alessio Del Bue
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注: 17 pages, 14 figures
点击查看摘要
Abstract:This paper presents a novel generative approach that outputs 3D indoor environments solely from a textual description of the scene. Current methods often treat scene synthesis as a mere layout prediction task, leading to rooms with overlapping objects or overly structured scenes, with limited consideration of the practical usability of the generated environment. Instead, our approach is based on a simple, but effective principle: we condition scene synthesis to generate rooms that are usable by humans. This principle is implemented by synthesizing 3D humans that interact with the objects composing the scene. If this human-centric scene generation is viable, the room layout is functional and it leads to a more coherent 3D structure. To this end, we propose a novel method for functional 3D scene synthesis, which consists of reasoning, 3D assembling and optimization. We regard text guided 3D synthesis as a reasoning process by generating a scene graph via a graph diffusion network. Considering object functional co-occurrence, a new strategy is designed to better accommodate human-object interaction and avoidance, achieving human-aware 3D scene optimization. We conduct both qualitative and quantitative experiments to validate the effectiveness of our method in generating coherent 3D scene synthesis results.
[LG-86] Globality Strikes Back: Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation
链接: https://arxiv.org/abs/2502.06818
作者: Jingyun Wang,Cilin Yan,Guoliang Kang
类目: Machine Learning (cs.LG)
*备注: Under review
点击查看摘要
Abstract:Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In CLIP, patch-wise image representations mainly encode the homogeneous image-level properties and thus are not discriminative enough, hindering its application to the dense prediction task. Previous works make image features more distinct across patches, through making each patch mainly attend to itself or the neighboring patches within a narrow local window. However, with their modifications, the ability of CLIP to aggregate global context information, which is known to be useful for distinguishing confusing categories, is largely weakened. In this paper, we propose a new method named GCLIP, which mines the beneficial global knowledge of CLIP to facilitate the TF-OVSS task. Firstly, we aim to equip the last-block attention with image-level properties while not introducing homogeneous attention patterns across patches. In GCLIP, we merge the attention from the global token emerging blocks with the Query-Query attention to realize this goal. Secondly, we aim to make the Value embeddings of the last-block attention module more distinct and semantically correlated. To realize this, we design a novel channel suppression strategy. As the representation of each patch is finally determined by the attention weights and the Value embeddings, our method can generate more discriminative patch-level image features while absorbing global context information. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-arts.
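The Query-Query attention that GCLIP merges with global-token attention can be sketched as follows; random weights stand in for CLIP's pretrained projections, and the paper's channel-suppression strategy for the Value embeddings is not shown.

```python
import torch
import torch.nn.functional as F

# Query-Query attention over patch tokens (a sketch, not GCLIP's full block):
# patches attend to patches with similar queries, sharpening locality while
# the merged global-token attention (omitted) re-injects image-level context.
d, n_patches = 64, 196
x = torch.randn(1, n_patches, d)                 # patch tokens
W_q = torch.nn.Linear(d, d, bias=False)          # random stand-in projection
q = W_q(x)
attn = F.softmax(q @ q.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ x                                   # re-weighted patch features
```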
[LG-87] Honegumi: An Interface for Accelerating the Adoption of Bayesian Optimization in the Experimental Sciences
链接: https://arxiv.org/abs/2502.06815
作者: Sterling G. Baird,Andrew R. Falkowski,Taylor D. Sparks
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 7 pages, 3 figures, 1 table
点击查看摘要
Abstract:Bayesian optimization (BO) has emerged as a powerful tool for guiding experimental design and decision-making in various scientific fields, including materials science, chemistry, and biology. However, despite its growing popularity, the complexity of existing BO libraries and the steep learning curve associated with them can deter researchers who are not well-versed in machine learning or programming. To address this barrier, we introduce Honegumi, a user-friendly, interactive tool designed to simplify the process of creating advanced Bayesian optimization scripts. Honegumi offers a dynamic selection grid that allows users to configure key parameters of their optimization tasks, generating ready-to-use, unit-tested Python scripts tailored to their specific needs. Accompanying the interface is a comprehensive suite of tutorials that provide both conceptual and practical guidance, bridging the gap between theoretical understanding and practical implementation. Built on top of the Ax platform, Honegumi leverages the power of existing state-of-the-art libraries while restructuring the user experience to make advanced BO techniques more accessible to experimental researchers. By lowering the barrier to entry and providing educational resources, Honegumi aims to accelerate the adoption of advanced Bayesian optimization methods across various domains.
[LG-88] Harness Local Rewards for Global Benefits: Effective Text-to-Video Generation Alignment with Patch-level Reward Models
链接: https://arxiv.org/abs/2502.06812
作者: Shuting Wang,Haihong Tang,Zhicheng Dou,Chenyan Xiong
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:The emergence of diffusion models (DMs) has significantly improved the quality of text-to-video generation models (VGMs). However, current VGM optimization primarily emphasizes the global quality of videos, overlooking localized errors, which leads to suboptimal generation capabilities. To address this issue, we propose a post-training strategy for VGMs, HALO, which explicitly incorporates local feedback from a patch reward model, providing detailed and comprehensive training signals with the video reward model for advanced VGM optimization. To develop an effective patch reward model, we distill GPT-4o to continuously train our video reward model, which enhances training efficiency and ensures consistency between video and patch reward distributions. Furthermore, to harmoniously integrate patch rewards into VGM optimization, we introduce a granular DPO (Gran-DPO) algorithm for DMs, allowing collaborative use of both patch and video rewards during the optimization process. Experimental results indicate that our patch reward model aligns well with human annotations and HALO substantially outperforms the baselines across two evaluation methods. Further experiments quantitatively prove the existence of patch defects, and our proposed method could effectively alleviate this issue.
[LG-89] Efficient Diffusion Models: A Survey
链接: https://arxiv.org/abs/2502.06805
作者: Hui Shen,Jingxuan Zhang,Boning Xiong,Rui Hu,Shoufa Chen,Zhongwei Wan,Xin Wang,Yu Zhang,Zixuan Gong,Guangyin Bao,Chaofan Tao,Yongfeng Huang,Ye Yuan,Mi Zhang
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:Diffusion models have emerged as powerful generative models capable of producing high-quality contents such as images, videos, and audio, demonstrating their potential to revolutionize digital content creation. However, these capabilities come at the cost of their significant computational resources and lengthy generation time, underscoring the critical need to develop efficient techniques for practical deployment. In this survey, we provide a systematic and comprehensive review of research on efficient diffusion models. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient diffusion model topics from algorithm-level, system-level, and framework perspective, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at this https URL. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient diffusion model research and inspire them to contribute to this important and exciting field.
[LG-90] Analyzing Geospatial and Socioeconomic Disparities in Breast Cancer Screening Among Populations in the United States: Machine Learning Approach
链接: https://arxiv.org/abs/2502.06800
作者: Soheil Hashtarkhani,Yiwang Zhou,Fekede Asefa Kumsa,Shelley White-Means,David L Schwartz,Arash Shaban-Nejad
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 11 Pages, 4 Figures, 2 Tables
点击查看摘要
Abstract:Breast cancer screening plays a pivotal role in early detection and subsequent effective management of the disease, impacting patient outcomes and survival rates. This study aims to assess breast cancer screening rates nationwide in the United States and investigate the impact of social determinants of health on these screening rates. Data on mammography screening at the census tract level for 2018 and 2020 were collected from the Behavioral Risk Factor Surveillance System. We developed a large dataset of social determinants of health, comprising 13 variables for 72337 census tracts. Spatial analysis employing Getis-Ord Gi statistics was used to identify clusters of high and low breast cancer screening rates. To evaluate the influence of these social determinants, we implemented a random forest model, with the aim of comparing its performance to linear regression and support vector machine models. The models were evaluated using R2 and root mean squared error metrics. Shapley Additive Explanations values were subsequently used to assess the significance of variables and direction of their influence. Geospatial analysis revealed elevated screening rates in the eastern and northern United States, while central and midwestern regions exhibited lower rates. The random forest model demonstrated superior performance, with an R2=64.53 and root mean squared error of 2.06 compared to linear regression and support vector machine models. Shapley Additive Explanations values indicated that the percentage of the Black population, the number of mammography facilities within a 10-mile radius, and the percentage of the population with at least a bachelor’s degree were the most influential variables, all positively associated with mammography screening rates.
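The modeling pipeline described (random forest regression plus SHAP attributions) can be sketched as follows, with synthetic stand-in data in place of the BRFSS census-tract table; feature effects and coefficients here are fabricated purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import shap  # pip install shap

# Synthetic stand-in: 13 social-determinant features, screening-rate target.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 13))
y = 70 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 1000)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# TreeExplainer gives per-feature, per-tract contributions whose sign
# indicates the direction of each variable's influence.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(np.abs(shap_values).mean(axis=0))   # global importance ranking
```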
[LG-91] Prompt-Aware Scheduling for Efficient Text-to-Image Inferencing System
链接: https://arxiv.org/abs/2502.06798
作者: Shubham Agarwal,Saud Iqbal,Subrata Mitra
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Graphics (cs.GR)
*备注: Poster presented at NSDI 2024
点击查看摘要
Abstract:Traditional ML models utilize controlled approximations during high loads, employing faster, but less accurate models in a process called accuracy scaling. However, this method is less effective for generative text-to-image models due to their sensitivity to input prompts and performance degradation caused by large model loading overheads. This work introduces a novel text-to-image inference system that optimally matches prompts across multiple instances of the same model operating at various approximation levels to deliver high-quality images under high loads and fixed budgets.
[LG-92] Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics
链接: https://arxiv.org/abs/2502.07749
作者: Tamsin James,Ben Williamson,Peter Tino,Nicole Wheeler
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 13 pages
点击查看摘要
Abstract:How can we identify causal genetic mechanisms that govern bacterial traits? Initial efforts entrusting machine learning models to handle the task of predicting phenotype from genotype return high accuracy scores. However, attempts to extract any meaning from the predictive models are found to be corrupted by falsely identified “causal” features. Relying solely on pattern recognition and correlations is unreliable, significantly so in bacterial genomics settings where high-dimensionality and spurious associations are the norm. Though it is not yet clear whether we can overcome this hurdle, significant efforts are being made towards discovering potential high-risk bacterial genetic variants. In view of this, we set up open problems surrounding phenotype prediction from bacterial whole-genome datasets and extending those to learning causal effects, and discuss challenges that impact the reliability of a machine’s decision-making when faced with datasets of this nature.
[LG-93] Guiding Time-Varying Generative Models with Natural Gradients on Exponential Family Manifold
链接: https://arxiv.org/abs/2502.07650
作者: Song Liu,Leyang Wang,Yakun Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to approximate the natural gradient of the KL divergence efficiently without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms.
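The natural gradient step on the exponential-family manifold takes the standard form below, with Fisher information $F(\theta) = \nabla^2 A(\theta)$ for log-partition $A$ (a generic statement of the scheme, not the paper's exact objective):

```latex
% Natural gradient descent on an exponential family with natural parameter
% \theta and log-partition A(\theta); F(\theta) = \nabla^2 A(\theta) is the
% Fisher information and L is the loss minimized on the manifold.
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t)
```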
[LG-94] Rethinking Timing Residuals: Advancing PET Detectors with Explicit TOF Corrections
链接: https://arxiv.org/abs/2502.07630
作者: Stephan Naunheim,Luis Lopes de Paiva,Vanessa Nadig,Yannick Kuhl,Stefan Gundacker,Florian Mueller,Volkmar Schulz
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:PET is a functional imaging method that visualizes metabolic processes. TOF information can be derived from coincident detector signals and incorporated into image reconstruction to enhance the SNR. PET detectors are typically assessed by their CTR, but timing performance is degraded by various factors. Research on timing calibration seeks to mitigate these degradations and restore accurate timing information. While many calibration methods use analytical approaches, machine learning techniques have recently gained attention due to their flexibility. We developed a residual physics-based calibration approach that combines prior domain knowledge with the power of machine learning models. This approach begins with an initial analytical calibration addressing first-order skews. The remaining deviations, regarded as residual effects, are used to train machine learning models to eliminate higher-order skews. The key advantage is that the experimenter guides the learning process through the definition of timing residuals. In earlier studies, we developed models that directly predicted the expected time difference, which offered corrections only implicitly (implicit correction models). In this study, we introduce a new definition for timing residuals, enabling us to train models that directly predict correction values (explicit correction models). The explicit correction approach significantly simplifies data acquisition, improves linearity, and enhances timing performance from $371 \pm 6$ ps to $281 \pm 5$ ps for coincidences from 430 keV to 590 keV. Additionally, the new definition reduces model size, making it suitable for high-throughput applications like PET scanners. Experiments were conducted using two detector stacks composed of $4 \times 4$ LYSO:Ce,Ca crystals ($3.8 \times 3.8 \times 20$ mm$^3$) coupled to $4 \times 4$ Broadcom NUV-MT SiPMs and digitized with the TOFPET2 ASIC.
[LG-95] Understanding the Generalization Error of Markov algorithms through Poissonization
链接: https://arxiv.org/abs/2502.07584
作者: Benjamin Dupuis,Maxime Haddouche,George Deligiannidis,Umut Simsekli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Using continuous-time stochastic differential equation (SDE) proxies to stochastic optimization algorithms has proven fruitful for understanding their generalization abilities. A significant part of these approaches are based on the so-called “entropy flows”, which greatly simplify the generalization analysis. Unfortunately, such well-structured entropy flows cannot be obtained for most discrete-time algorithms, and the existing SDE approaches remain limited to specific noise and algorithmic structures. We aim to alleviate this issue by introducing a generic framework for analyzing the generalization error of Markov algorithms through “Poissonization”, a continuous-time approximation of discrete-time processes with formal approximation guarantees. Through this approach, we first develop a novel entropy flow, which directly leads to PAC-Bayesian generalization bounds. We then draw novel links to modified versions of the celebrated logarithmic Sobolev inequalities (LSI), identify cases where such LSIs are satisfied, and obtain improved bounds. Beyond its generality, our framework allows exploiting specific properties of learning algorithms. In particular, we incorporate the noise structure of different algorithm types - namely, those with additional noise injections (noisy) and those without (non-noisy) - through various technical tools. This illustrates the capacity of our methods to achieve known (yet, Poissonized) and new generalization bounds.
[LG-96] Forecasting the future development in quality and value of professional football players for applications in team management
链接: https://arxiv.org/abs/2502.07528
作者: Koen W. van Arem,Floris Goes-Smit,Jakob Söhl
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: The article itself is on the pages 1-27. The data set used in this article is described in the appendix at the pages 28-35
点击查看摘要
Abstract:Transfers in professional football (soccer) are risky investments because of the large transfer fees and high risks involved. Although data-driven models can be used to improve transfer decisions, existing models focus on describing players’ historical progress, leaving their future performance unknown. Moreover, recent developments have called for the use of explainable models combined with uncertainty quantification of predictions. This paper assesses explainable machine learning models based on predictive accuracy and uncertainty quantification methods for the prediction of the future development in quality and transfer value of professional football players. Using a historical data set of data-driven indicators describing player quality and the transfer value of a football player, the models are trained to forecast player quality and player value one year ahead. These two prediction problems demonstrate the efficacy of tree-based models, particularly random forest and XGBoost, in making accurate predictions. In general, the random forest model is found to be the most suitable model because it provides accurate predictions as well as an uncertainty quantification method that naturally arises from the bagging procedure of the random forest model. Additionally, our research shows that the development of player performance contains nonlinear patterns and interactions between variables, and that time series information can provide useful information for the modeling of player performance metrics. Our research provides models to help football clubs make more informed, data-driven transfer decisions by forecasting player quality and transfer value.
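The uncertainty quantification that "naturally arises from the bagging procedure" can be read off the spread of per-tree predictions; a minimal sketch with synthetic stand-in features (the real indicators describing player quality and value are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))          # stand-in player indicators
y = 2 * X[:, 0] + rng.normal(0, 0.5, 500)   # stand-in "quality one year ahead"

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# The bagging ensemble gives uncertainty for free: the spread of per-tree
# predictions yields an interval around the point forecast.
x_new = rng.standard_normal((1, 10))
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])
lo, hi = np.percentile(per_tree, [5, 95])
print(forest.predict(x_new)[0], (lo, hi))
```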
[LG-97] Quantification of model error for inverse problems in the Weak Neural Variational Inference framework
链接: https://arxiv.org/abs/2502.07415
作者: Vincent C. Scholz,P.S. Koutsourelakis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures
点击查看摘要
Abstract:We present a novel extension of the Weak Neural Variational Inference (WNVI) framework for probabilistic material property estimation that explicitly quantifies model errors in PDE-based inverse problems. Traditional approaches assume the correctness of all governing equations, including potentially unreliable constitutive laws, which can lead to biased estimates and misinterpretations. Our proposed framework addresses this limitation by distinguishing between reliable governing equations, such as conservation laws, and uncertain constitutive relationships. By treating all state variables as latent random variables, we enforce these equations through separate sets of residuals, leveraging a virtual likelihood approach with weighted residuals. This formulation not only identifies regions where constitutive laws break down but also improves robustness against model uncertainties without relying on a fully trustworthy forward model. We demonstrate the effectiveness of our approach in the context of elastography, showing that it provides a structured, interpretable, and computationally efficient alternative to traditional model error correction techniques. Our findings suggest that the proposed framework enhances the accuracy and reliability of material property estimation by offering a principled way to incorporate uncertainty in constitutive modeling.
[LG-98] Bandit Optimal Transport
链接: https://arxiv.org/abs/2502.07397
作者: Lorenzo Croissant(CREST, FAIRPLAY, ENSAE)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
[LG-99] Uniform Kernel Prober
链接: https://arxiv.org/abs/2502.07369
作者: Soumya Mukherjee,Bharath K. Sriperumbudur
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 34 pages, 10 figures
[LG-100] PICTS: A Novel Deep Reinforcement Learning Approach for Dynamic P-I Control in Scanning Probe Microscopy
链接: https://arxiv.org/abs/2502.07326
作者: Ziwei Wei,Shuming Wei,Qibin Zeng,Wanheng Lu,Huajun Liu,Kaiyang Zeng
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 21 pages, 6 figures
点击查看摘要
Abstract:We have developed a Parallel Integrated Control and Training System, leveraging the deep reinforcement learning to dynamically adjust the control strategies in real time for scanning probe microscopy techniques.
[LG-101] Global Universal Scaling and Ultra-Small Parameterization in Machine Learning Interatomic Potentials with Super-Linearity
链接: https://arxiv.org/abs/2502.07293
作者: Yanxiao Hu,Ye Sheng,Jing Huang,Xiaoxin Xu,Yuyan Yang,Mingqiang Zhang,Yabei Wu,Caichao Ye,Jiong Yang,Wenqing Zhang
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Using machine learning (ML) to construct interatomic interactions and thus the potential energy surface (PES) has become a common strategy for materials design and simulations. However, current machine learning interatomic potential (MLIP) models provide no relevant physical constraints, and thus may suffer from an intrinsic out-of-domain difficulty which underlies the challenges of model generalizability and physical scalability. Here, by incorporating a physics-informed universal-scaling law and a nonlinearity-embedded interaction function, we develop a super-linear MLIP with both ultra-small parameterization and greatly expanded expressive capability, named SUS2-MLIP. Due to the global scaling rooted in the universal equation of state (UEOS), SUS2-MLIP not only has significantly reduced parameters by decoupling the element space from the coordinate space, but also naturally overcomes the out-of-domain difficulty and endows the potentials with inherent generalizability and scalability even with a relatively small training dataset. The nonlinearity-embedding transformation for the interaction function expands the expressive capability and makes the potentials super-linear. SUS2-MLIP outperforms state-of-the-art MLIP models with exceptional computational efficiency, especially for multiple-element materials, and physical scalability in property prediction. This work not only presents a highly efficient universal MLIP model but also sheds light on incorporating physical constraints into artificial-intelligence-aided materials simulation.
[LG-102] Negative Dependence as a toolbox for machine learning : review and new developments
链接: https://arxiv.org/abs/2502.07285
作者: Hoang-Son Tran,Vladimir Petrovic,Remi Bardenet,Subhroshekhar Ghosh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: Dedicated to the memory of Prof K.R. Parthasarathy: visionary, guru, and scientist par excellence
点击查看摘要
Abstract:Negative dependence is becoming a key driver in advancing learning capabilities beyond the limits of traditional independence. Recent developments have evidenced support towards negatively dependent systems as a learning paradigm in a broad range of fundamental machine learning challenges including optimization, sampling, dimensionality reduction and sparse signal recovery, often surpassing the performance of current methods based on statistical independence. The most popular negatively dependent model has been that of determinantal point processes (DPPs), which have their origins in quantum theory. However, other models, such as perturbed lattice models, strongly Rayleigh measures, zeros of random functions have gained salience in various learning applications. In this article, we review this burgeoning field of research, as it has developed over the past two decades or so. We also present new results on applications of DPPs to the parsimonious representation of neural networks. In the limited scope of the article, we mostly focus on aspects of this area to which the authors contributed over the recent years, including applications to Monte Carlo methods, coresets and stochastic gradient descent, stochastic networks, signal processing and connections to quantum computation. However, starting from basics of negative dependence for the uninitiated reader, extensive references are provided to a broad swath of related developments which could not be covered within our limited scope. While existing works and reviews generally focus on specific negatively dependent models (e.g. DPPs), a notable feature of this article is that it addresses negative dependence as a machine learning methodology as a whole. In this vein, it covers within its span an array of negatively dependent models and their applications well beyond DPPs, thereby putting forward a very general and rather unique perspective.
[LG-103] Riemannian Proximal Sampler for High-accuracy Sampling on Manifolds
链接: https://arxiv.org/abs/2502.07265
作者: Yunrui Guan,Krishnakumar Balasubramanian,Shiqian Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:We introduce the Riemannian Proximal Sampler, a method for sampling from densities defined on Riemannian manifolds. The performance of this sampler critically depends on two key oracles: the Manifold Brownian Increments (MBI) oracle and the Riemannian Heat-kernel (RHK) oracle. We establish high-accuracy sampling guarantees for the Riemannian Proximal Sampler, showing that generating samples with $\varepsilon$-accuracy requires $O(\log(1/\varepsilon))$ iterations in Kullback-Leibler divergence assuming access to exact oracles and $O(\log^2(1/\varepsilon))$ iterations in the total variation metric assuming access to sufficiently accurate inexact oracles. Furthermore, we present practical implementations of these oracles by leveraging heat-kernel truncation and Varadhan's asymptotics. In the latter case, we interpret the Riemannian Proximal Sampler as a discretization of the entropy-regularized Riemannian Proximal Point Method on the associated Wasserstein space. We provide preliminary numerical results that illustrate the effectiveness of the proposed methodology.
[LG-104] Enhancing Robustness Of Digital Shadow For CO2 Storage Monitoring With Augmented Rock Physics Modeling
链接: https://arxiv.org/abs/2502.07171
作者: Abhinav Prakash Gahlot,Felix J. Herrmann
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:
点击查看摘要
Abstract:To meet climate targets, the IPCC underscores the necessity of technologies capable of removing gigatonnes of CO2 annually, with Geological Carbon Storage (GCS) playing a central role. GCS involves capturing CO2 and injecting it into deep geological formations for long-term storage, requiring precise monitoring to ensure containment and prevent leakage. Time-lapse seismic imaging is essential for tracking CO2 migration but often struggles to capture the complexities of multi-phase subsurface flow. Digital Shadows (DS), leveraging machine learning-driven data assimilation techniques such as nonlinear Bayesian filtering and generative AI, provide a more detailed, uncertainty-aware monitoring approach. By incorporating uncertainties in reservoir properties, DS frameworks improve CO2 migration forecasts, reducing risks in GCS operations. However, data assimilation depends on assumptions regarding reservoir properties, rock physics models, and initial conditions, which, if inaccurate, can compromise prediction reliability. This study demonstrates that augmenting forecast ensembles with diverse rock physics models mitigates the impact of incorrect assumptions and improves predictive accuracy, particularly in differentiating uniform versus patchy saturation models.
[LG-105] Advancing Geological Carbon Storage Monitoring With 3D Digital Shadow Technology
Link: https://arxiv.org/abs/2502.07169
Authors: Abhinav Prakash Gahlot, Rafael Orozco, Felix J. Herrmann
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*Comments:
[LG-106] Online Covariance Matrix Estimation in Sketched Newton Methods
Link: https://arxiv.org/abs/2502.07114
Authors: Wei Kuang, Mihai Anitescu, Sen Na
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computation (stat.CO)
*Comments: 52 pages, 2 figures, 7 tables
Abstract:Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched Newton method that leverages a randomized sketching technique to perform an approximate Newton step in each iteration, thereby eliminating the computational bottleneck of second-order methods. While existing studies have established the asymptotic normality of sketched Newton methods, a consistent estimator of the limiting covariance matrix remains an open problem. We propose a fully online covariance matrix estimator that is constructed entirely from the Newton iterates and requires no matrix factorization. Compared to covariance estimators for first-order online methods, our estimator for second-order methods is batch-free. We establish the consistency and convergence rate of our estimator, and coupled with asymptotic normality results, we can then perform online statistical inference for the model parameters based on sketched Newton methods. We also discuss the extension of our estimator to constrained problems, and demonstrate its superior performance on regression problems as well as benchmark problems in the CUTEst set.
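As a rough illustration of the sketched Newton step itself, the toy below solves a batch least-squares problem with a fresh Gaussian sketch per iteration. It is not the paper's streaming setting, and the paper's actual contribution (the online covariance estimator built from the iterates) is not reproduced; the sketch size m and the problem are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 2000, 10, 200                       # samples, dimension, sketch size
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

x = np.zeros(d)
for _ in range(20):
    g = A.T @ (A @ x - b) / n                 # exact gradient of 0.5/n * ||Ax - b||^2
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch, E[S^T S] = I
    SA = S @ A
    H_hat = SA.T @ SA / n                     # sketched Hessian approximation
    x -= np.linalg.solve(H_hat, g)            # approximate Newton step:
                                              # forming H_hat costs O(m d^2), not O(n d^2)
print(np.linalg.norm(x - x_true))             # close to the noise floor
```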
[LG-107] Confidence Intervals for Evaluation of Data Mining
Link: https://arxiv.org/abs/2502.07016
Authors: Zheng Yuan, Wenxin Jiang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
[LG-108] Epistemic Uncertainty in Conformal Scores: A Unified Approach
Link: https://arxiv.org/abs/2502.06995
Authors: Luben M. C. Cabezas, Vagner S. Santos, Thiago R. Ramos, Rafael Izbicki
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Conformal prediction methods create prediction bands with distribution-free guarantees but do not explicitly capture epistemic uncertainty, which can lead to overconfident predictions in data-sparse regions. Although recent conformal scores have been developed to address this limitation, they are typically designed for specific tasks, such as regression or quantile regression. Moreover, they rely on particular modeling choices for epistemic uncertainty, restricting their applicability. We introduce EPICSCORE, a model-agnostic approach that enhances any conformal score by explicitly integrating epistemic uncertainty. Leveraging Bayesian techniques such as Gaussian Processes, Monte Carlo Dropout, or Bayesian Additive Regression Trees, EPICSCORE adaptively expands predictive intervals in regions with limited data while maintaining compact intervals where data is abundant. As with any conformal method, it preserves finite-sample marginal coverage; additionally, it achieves asymptotic conditional coverage. Experiments demonstrate its good performance compared to existing methods. Designed for compatibility with any Bayesian model, but equipped with distribution-free guarantees, EPICSCORE provides a general-purpose framework for uncertainty quantification in prediction problems.
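A minimal sketch of the flavor of this idea, assuming a GP as the Bayesian component: normalize the split-conformal residual score by the GP's posterior standard deviation, so calibrated intervals widen where epistemic uncertainty is high. EPICSCORE's actual construction is more general than this classic normalized score.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, 300)[:, None]
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(300)
X_fit, y_fit, X_cal, y_cal = X[:150], y[:150], X[150:], y[150:]

gp = GaussianProcessRegressor().fit(X_fit, y_fit)

# Score = |residual| / posterior std: epistemic uncertainty rescales the score,
# so the resulting intervals expand in data-sparse regions.
mu_cal, sd_cal = gp.predict(X_cal, return_std=True)
scores = np.abs(y_cal - mu_cal) / (sd_cal + 1e-6)

alpha, n = 0.1, len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample quantile

X_test = np.linspace(-3, 3, 7)[:, None]        # includes points outside the data range
mu, sd = gp.predict(X_test, return_std=True)
print(np.c_[X_test, mu - q_hat * sd, mu + q_hat * sd])  # bands widen near ±3
```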
[LG-109] Dual Conic Proxy for Semidefinite Relaxation of AC Optimal Power Flow
Link: https://arxiv.org/abs/2502.06978
Authors: Guancheng Qiu, Mathieu Tanneau, Pascal Van Hentenryck
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:
[LG-110] Diffusion-empowered AutoPrompt MedSAM
Link: https://arxiv.org/abs/2502.06817
Authors: Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Hongtu Zhu, Xi Wu, Xin Wang
Subjects: Image and Video Processing (eess.IV); Graphics (cs.GR); Machine Learning (cs.LG)
*Comments:
Abstract:MedSAM, a medical foundation model derived from the SAM architecture, has demonstrated notable success across diverse medical domains. However, its clinical application faces two major challenges: the dependency on labor-intensive manual prompt generation, which imposes a significant burden on clinicians, and the absence of semantic labeling in the generated segmentation masks for organs or lesions, limiting its practicality for non-expert users. To address these limitations, we propose AutoMedSAM, an end-to-end framework derived from SAM, designed to enhance usability and segmentation performance. AutoMedSAM retains MedSAM’s image encoder and mask decoder structure while introducing a novel diffusion-based class prompt encoder. The diffusion-based encoder employs a dual-decoder structure to collaboratively generate prompt embeddings guided by sparse and dense prompt definitions. These embeddings enhance the model’s ability to understand and process clinical imagery autonomously. With this encoder, AutoMedSAM leverages class prompts to embed semantic information into the model’s predictions, transforming MedSAM’s semi-automated pipeline into a fully automated workflow. Furthermore, AutoMedSAM employs an uncertainty-aware joint optimization strategy during training to effectively inherit MedSAM’s pre-trained knowledge while improving generalization by integrating multiple loss functions. Experimental results across diverse datasets demonstrate that AutoMedSAM achieves superior performance while broadening its applicability to both clinical settings and non-expert users. Code is available at this https URL.
Information Retrieval
[IR-0] IU4Rec: Interest Unit-Based Product Organization and Recommendation for E-Commerce Platform KDD25
Link: https://arxiv.org/abs/2502.07658
Authors: Wenhao Wu, Xiaojie Li, Lin Wang, Jialiang Zhou, Di Wu, Qinye Xie, Qingheng Zhang, Yin Zhang, Shuguang Han, Fei Huang, Junfeng Chen
Subjects: Information Retrieval (cs.IR)
*Comments: Under review at KDD25 ADS. This work has already been deployed on the Xianyu platform in Alibaba. arXiv admin note: substantial text overlap with arXiv:2403.06747
Abstract:Most recommendation systems follow a product-based paradigm, using user-product interactions to identify the most engaging items for users. However, this paradigm has notable drawbacks on Xianyu (China's largest online C2C e-commerce platform, where a large portion of products are posted by individual sellers). Items posted by individual sellers typically have limited stock, and once sold they are no longer available for distribution. As a result, most items on Xianyu accumulate relatively few interactions, undermining traditional recommendation methods that depend on accumulated user-item interactions. To address these issues, we introduce IU4Rec, an Interest-Unit-based two-stage recommendation framework. IU4Rec first groups products into clusters based on attributes such as category, image, and semantics, forming Interest Units (IUs), and then redesigns the recommendation process into two stages. The first stage recommends Interest Units, capturing broad-level interests; the second guides users to the best option among similar products within the selected Interest Unit. User-IU interactions are incorporated into the ranking models, with the advantage that IU-level behaviors persist longer than item-specific interactions. Experimental results on the production dataset and online A/B testing demonstrate the effectiveness and superiority of the proposed IU-centric recommendation approach.
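A minimal sketch of the two-stage idea under stated assumptions: random vectors stand in for category/image/text embeddings, and KMeans stands in for the paper's actual IU construction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_items, n_users, d, n_units = 500, 200, 16, 20

item_emb = rng.standard_normal((n_items, d))       # stand-in for attribute embeddings
clicks = rng.random((n_users, n_items)) < 0.01     # sparse user-item interactions

# Form Interest Units by clustering items on their attribute embeddings.
iu_of_item = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit_predict(item_emb)

def recommend(user, k_units=3, k_items=5):
    # Stage 1: rank Interest Units by the user's aggregated interactions,
    # which persist even after individual second-hand items sell out.
    iu_score = np.bincount(iu_of_item[clicks[user]], minlength=n_units)
    top_units = np.argsort(-iu_score)[:k_units]
    # Stage 2: within the chosen units, rank items (here simply by popularity).
    recs = []
    for u in top_units:
        items = np.flatnonzero(iu_of_item == u)
        pop = clicks[:, items].sum(axis=0)
        recs.extend(items[np.argsort(-pop)][:k_items])
    return recs

print(recommend(user=0))
```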
[IR-1] ETimeline: An Extensive Timeline Generation Dataset based on Large Language Model
Link: https://arxiv.org/abs/2502.07474
Authors: Xiaochen Liu, Yanan Zhang
Subjects: Information Retrieval (cs.IR)
*Comments:
Abstract:Timeline generation is of great significance for a comprehensive understanding of the development of events over time. Its goal is to organize news chronologically, which helps to identify patterns and trends that may be obscured when viewing news in isolation, making it easier to track the development of stories and understand the interrelationships between key events. Timelines are now common in various commercial products, but academic research in this area is notably scarce, and current datasets are in need of refinement for enhanced utility and expanded coverage. In this paper, we propose ETimeline, which encompasses over 13,000 news articles spanning 600 bilingual timelines across 28 news domains. Specifically, we gather a candidate pool of more than 120,000 news articles and employ a large language model (LLM) pipeline to improve performance, ultimately yielding ETimeline. The data analysis underscores the appeal of ETimeline. We also provide the news pool data for further research and analysis. This work contributes to the advancement of timeline generation research and supports a wide range of tasks, including topic generation and event relationship modeling. We believe this dataset will serve as a catalyst for innovative research and bridge the gap between academia and industry in understanding the practical application of technology services. The dataset is available at this https URL
[IR-2] Prompt-Based Document Modifications In Ranking Competitions
Link: https://arxiv.org/abs/2502.07315
Authors: Niv Bardas, Tommy Mordo, Oren Kurland, Moshe Tennenholtz, Gal Zur
Subjects: Information Retrieval (cs.IR); Computer Science and Game Theory (cs.GT)
*Comments:
Abstract:We study prompting-based approaches with Large Language Models (LLMs) for modifying documents so as to promote their ranking in a competitive search setting. Our methods are inspired by prior work on leveraging LLMs as rankers. We evaluate our approach by deploying it as a bot in previous ranking competitions and in competitions we organized. Our findings demonstrate that our approach effectively improves document ranking while preserving high levels of faithfulness to the original content and maintaining overall document quality.
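The paper's exact prompts are not reproduced here; the following is only a guessed-at shape of such a prompting approach, with `call_llm` as a hypothetical placeholder for whatever chat-completion client is used.

```python
# Hypothetical prompt shape for rank-promoting document edits; the competition
# constraints (faithfulness, document quality) are stated in the instruction.
PROMPT = """You are competing to rank highly for the query: "{query}".
Rewrite the document to be maximally relevant to the query while preserving
its factual content and overall quality. Return only the revised document.

Document:
{document}"""

def modify_document(query: str, document: str, call_llm) -> str:
    # `call_llm` is an assumed callable: prompt string in, completion string out.
    return call_llm(PROMPT.format(query=query, document=document))
```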
[IR-3] CreAgent: Towards Long-Term Evaluation of Recommender System under Platform-Creator Information Asymmetry
Link: https://arxiv.org/abs/2502.07307
Authors: Xiaopeng Ye, Chen Xu, Zhongxiang Sun, Jun Xu, Gang Wang, Zhenhua Dong, Ji-Rong Wen
Subjects: Information Retrieval (cs.IR)
*Comments:
Abstract:Ensuring the long-term sustainability of recommender systems (RS) emerges as a crucial issue. Traditional offline evaluation methods for RS typically focus on immediate user feedback, such as clicks, but they often neglect the long-term impact of content creators. On real-world content platforms, creators can strategically produce and upload new items based on user feedback and preference trends. While previous studies have attempted to model creator behavior, they often overlook the role of information asymmetry. This asymmetry arises because creators primarily have access to feedback on the items they produce, while platforms possess data on the entire spectrum of user feedback. Current RS simulators, however, fail to account for this asymmetry, leading to inaccurate long-term evaluations. To address this gap, we propose CreAgent, a Large Language Model (LLM)-empowered creator simulation agent. By incorporating game theory's belief mechanism and the fast-and-slow thinking framework, CreAgent effectively simulates creator behavior under conditions of information asymmetry. Additionally, we enhance CreAgent's simulation ability by fine-tuning it with Proximal Policy Optimization (PPO). Our credibility-validation experiments show that CreAgent's simulated behavior aligns well with that of real-world platforms and creators, improving the reliability of long-term RS evaluations. Moreover, through simulations of RS involving CreAgents, we can explore how fairness- and diversity-aware RS algorithms contribute to better long-term performance for various stakeholders. CreAgent and the simulation platform are publicly available at this https URL.
[IR-4] Flow Matching for Collaborative Filtering
Link: https://arxiv.org/abs/2502.07303
Authors: Chengkai Liu, Yangtian Zhang, Jianling Wang, Rex Ying, James Caverlee
Subjects: Information Retrieval (cs.IR)
*Comments:
Abstract:Generative models have shown great promise in collaborative filtering by capturing the underlying distribution of user interests and preferences. However, existing approaches struggle with inaccurate posterior approximations and misalignment with the discrete nature of recommendation data, limiting their expressiveness and real-world performance. To address these limitations, we propose FlowCF, a novel flow-based recommendation system leveraging flow matching for collaborative filtering. We tailor flow matching to the unique challenges in recommendation through two key innovations: (1) a behavior-guided prior that aligns with user behavior patterns to handle the sparse and heterogeneous user-item interactions, and (2) a discrete flow framework to preserve the binary nature of implicit feedback while maintaining the benefits of flow matching, such as stable training and efficient inference. Extensive experiments demonstrate that FlowCF achieves state-of-the-art recommendation accuracy across various datasets with the fastest inference speed, making it a compelling approach for real-world recommender systems.
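A hedged sketch of the general recipe, assuming a continuous relaxation rather than the paper's discrete flow framework: draw the source from an item-popularity (behavior-guided) prior, regress the straight-line velocity, and integrate the learned ODE at inference. All sizes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_users, n_items = 1000, 50
X = (torch.rand(n_users, n_items) < 0.05).float()   # binary implicit feedback
popularity = X.mean(dim=0)                          # behavior-guided prior parameter

# v(x_t, t) regresses the straight-line velocity x1 - x0 (conditional flow matching).
net = nn.Sequential(nn.Linear(n_items + 1, 128), nn.ReLU(), nn.Linear(128, n_items))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    x1 = X[torch.randint(n_users, (64,))]                  # data endpoint
    x0 = (torch.rand(64, n_items) < popularity).float()    # prior endpoint
    t = torch.rand(64, 1)
    xt = (1 - t) * x0 + t * x1                             # linear interpolation path
    loss = ((net(torch.cat([xt, t], dim=1)) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: integrate from a prior draw; the result scores items for ranking.
with torch.no_grad():
    x = (torch.rand(1, n_items) < popularity).float()
    for t in torch.linspace(0, 1, 20):
        x = x + (1 / 20) * net(torch.cat([x, t.reshape(1, 1)], dim=1))
print(x.squeeze().topk(5).indices)                   # top-5 recommended items
```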
[IR-5] DOGR: Leveraging Document-Oriented Contrastive Learning in Generative Retrieval
Link: https://arxiv.org/abs/2502.07219
Authors: Penghao Lu, Xin Dong, Yuansheng Zhou, Lei Cheng, Chuan Yuan, Linjian Mo
Subjects: Information Retrieval (cs.IR)
*Comments:
Abstract:Generative retrieval constitutes an innovative approach in information retrieval, leveraging generative language models (LMs) to generate a ranked list of document identifiers (docids) for a given query. It simplifies the retrieval pipeline by replacing the large external index with model parameters. However, existing works merely learn the relationship between queries and document identifiers, which cannot directly represent the relevance between queries and documents. To address this problem, we propose a novel and general generative retrieval framework, namely Leveraging Document-Oriented Contrastive Learning in Generative Retrieval (DOGR), which leverages contrastive learning to improve generative retrieval tasks. It adopts a two-stage learning strategy that captures the relationship between queries and documents comprehensively through direct interactions. Furthermore, negative sampling methods and corresponding contrastive learning objectives are implemented to enhance the learning of semantic representations, thereby promoting a thorough comprehension of the relationship between queries and documents. Experimental results demonstrate that DOGR achieves state-of-the-art performance compared to existing generative retrieval methods on two public benchmark datasets. Further experiments show that our framework is generally effective for common identifier construction techniques.
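DOGR's full two-stage scheme and its negative-sampling variants are beyond a snippet, but the in-batch contrastive objective that such frameworks build on is compact. A generic PyTorch sketch (not the paper's code):

```python
import torch
import torch.nn.functional as F

def info_nce(q_emb, d_emb, tau=0.05):
    """In-batch contrastive loss: query i's positive is document i;
    the other documents in the batch act as negatives."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / tau                        # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-ins for encoder outputs:
q = torch.randn(8, 64, requires_grad=True)
loss = info_nce(q, torch.randn(8, 64))
loss.backward()
print(loss.item())
```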
[IR-6] Repository-level Code Search with Neural Retrieval Methods
Link: https://arxiv.org/abs/2502.07067
Authors: Siddharth Gandhi, Luyu Gao, Jamie Callan
Subjects: Information Retrieval (cs.IR)
*Comments: 16 pages
Abstract:This paper presents a multi-stage reranking system for repository-level code search, which leverages the vastly available commit histories of large open-source repositories to aid in bug fixing. We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. By learning patterns from diverse repositories and their commit histories, the system can surface relevant files for the task at hand. The system leverages both commit messages and source code for relevance matching, and is evaluated in both normal and oracle settings. Experiments on a new dataset created from 7 popular open-source repositories demonstrate substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline, across a diverse set of queries, demonstrating the effectiveness of this approach. We hope this work aids LLM agents as a tool for better code search and understanding. Our code and results are publicly available.
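A rough sketch of the first stage under stated assumptions: the `rank_bm25` package for retrieval over commit messages, with the CodeBERT reranking stage left as a stub; the toy commits are invented for illustration.

```python
from rank_bm25 import BM25Okapi

# Toy commit-message index; each commit lists the files it touched.
commits = [
    ("fix crash when config file is missing", ["src/config.py"]),
    ("add retry logic to http client", ["src/http/client.py"]),
    ("refactor login flow and session cache", ["src/auth/login.py", "src/auth/session.py"]),
]
bm25 = BM25Okapi([msg.split() for msg, _ in commits])

def search(query, k=2):
    # Stage 1: BM25 over commit messages, then collect the touched files.
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(commits)), key=lambda i: -scores[i])[:k]
    files = [f for i in ranked for f in commits[i][1]]
    # Stage 2 (omitted here): rerank `files` with a neural model such as
    # CodeBERT, scoring (query, file content) pairs.
    return files

print(search("http client retries"))
```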