This post contains the latest paper list retrieved from arXiv.org on 2025-02-27. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2025-02-27)
526 papers were updated today, including:
- Natural Language Processing: 111 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 173 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 78 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 170 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing AAAI2025
Quick Read: This paper targets stability and performance-degradation problems that arise when large language models (LLMs) undergo localized knowledge editing. Its key finding is that the Frobenius norm of the updated matrices keeps growing over successive localized updates, and that this growth shifts the subspaces occupied by the model's internal activation vectors, disturbing the model's overall balance and downstream-task performance. The paper highlights the technical challenges of continuous, localized sequential knowledge editing and their implications for model stability and utility.
Link: https://arxiv.org/abs/2502.19416
Authors: Akshat Gupta, Christine Fang, Atahan Ozdemir, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for Oral Presentation at KnowFM @ AAAI 2025. arXiv admin note: text overlap with arXiv:2502.01636
Abstract:This study investigates the impact of localized updates to large language models (LLMs), specifically in the context of knowledge editing - a task aimed at incorporating or modifying specific facts without altering broader model capabilities. We first show that across different post-training interventions like continuous pre-training, full fine-tuning and LoRA-based fine-tuning, the Frobenius norm of the updated matrices always increases. This increasing norm is especially detrimental for localized knowledge editing, where only a subset of matrices are updated in a model. We reveal a consistent phenomenon across various editing techniques, including fine-tuning, hypernetwork-based approaches, and locate-and-edit methods: the norm of the updated matrix invariably increases with successive updates. Such growth disrupts model balance, particularly when isolated matrices are updated while the rest of the model remains static, leading to potential instability and degradation of downstream performance. Upon deeper investigations of the intermediate activation vectors, we find that the norm of internal activations decreases and is accompanied by shifts in the subspaces occupied by these activations, which shows that these activation vectors now occupy completely different regions in the representation space compared to the unedited model. With our paper, we highlight the technical challenges with continuous and localized sequential knowledge editing and their implications for maintaining model stability and utility.
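To make the paper's central measurement concrete, here is a minimal sketch of tracking Frobenius norm growth across sequential edits of a single weight matrix. `apply_edit` is a hypothetical stand-in for any concrete editing method (fine-tuning, hypernetwork, or locate-and-edit), and the random update is purely illustrative:

```python
import torch

def fro(m: torch.Tensor) -> float:
    """Frobenius norm of a matrix."""
    return torch.linalg.matrix_norm(m, ord="fro").item()

def apply_edit(w: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for one localized knowledge edit."""
    return w + 0.01 * torch.randn_like(w)  # illustrative update only

w0 = torch.randn(1024, 4096)  # the single matrix targeted by localized editing
w = w0.clone()
for step in range(1, 101):
    w = apply_edit(w)
    if step % 25 == 0:
        # The paper's observation: with real editing methods, the norm of the
        # edited matrix keeps growing while the rest of the model stays static.
        print(f"edit {step:3d}: ||W||_F = {fro(w):.1f}, ||W - W0||_F = {fro(w - w0):.1f}")
```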
[NLP-1] Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Quick Read: This paper addresses the restricted dissemination and reuse of scientific knowledge caused by paywalls, licensing agreements, and copyright rules. The key solution is to convert scholarly documents into Knowledge Units, using large language models (LLMs) to extract structured data that captures entities, attributes, and relationships without stylistic content. The paper provides evidence that this Knowledge Unit format is legally defensible and preserves roughly 95% of the factual knowledge in the original text. With this approach, the authors argue that access to scientific knowledge can be democratized while respecting copyright.
Link: https://arxiv.org/abs/2502.19413
Authors: Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci Heidrich, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report
Abstract:Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units: (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.
[NLP-2] The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Quick Read: This paper addresses the under-explored evaluation of model performance and robustness on tabular data. The key contribution is ToRR (Table Reasoning and Robustness), a benchmark whose datasets cover a variety of table reasoning tasks across domains and which comprehensively evaluates models through multiple table formats and prompt configurations. The study finds that even strong models behave brittlely on tabular tasks, and that testing over multiple formats is crucial for reliably estimating model capabilities.
Link: https://arxiv.org/abs/2502.19412
Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, that measures model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
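Since ToRR's robustness testing hinges on serializing the same table into multiple prompt formats, here is a hedged sketch of what such multi-format rendering can look like; the format set and function names are ours, not the benchmark's:

```python
import csv
import io
import json

def to_formats(header: list[str], rows: list[list[str]]) -> dict[str, str]:
    """Render one table as markdown, CSV, and JSON records."""
    md = "| " + " | ".join(header) + " |\n" + "|" + " --- |" * len(header) + "\n"
    md += "\n".join("| " + " | ".join(r) + " |" for r in rows)

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)

    records = [dict(zip(header, r)) for r in rows]
    return {"markdown": md, "csv": buf.getvalue(), "json": json.dumps(records, indent=2)}

renderings = to_formats(["city", "population"], [["Oslo", "709000"], ["Bergen", "286000"]])
for fmt, text in renderings.items():
    prompt = f"Given the table below, which city is larger?\n\n{text}"
    # send `prompt` to the model for each format and compare the answers
```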
[NLP-3] Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs
Quick Read: This survey examines how code serves as a structured medium that enhances reasoning, and how improved reasoning lifts code intelligence from basic completion to advanced capabilities. The key idea is to exploit the verifiable execution paths, logical decomposition, and runtime validation that code provides, enabling models to plan and debug complex software engineering tasks. The ultimate goal is to strengthen this synergy and thereby improve LLMs' capabilities in both areas.
Link: https://arxiv.org/abs/2502.19411
Authors: Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, Julian McAuley
Affiliations: Meta AI; University of California, San Diego; University of Rochester
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Project Repo: this https URL
Abstract:In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLM’s performance in both areas.
[NLP-4] ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Quick Read: This paper tackles the difficulty multimodal large language models (MLLMs) have with reasoning over image sequences; in particular, these models struggle to recognize sequential structure and often treat images independently. The key solution is ImageChain, a framework that models visual sequences as multi-turn conversations to strengthen MLLMs' sequential reasoning. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Optimizing for the next-scene description task yields improvements ranging from 3.7% to 19% in SimRate, along with robust zero-shot performance in out-of-domain applications.
Link: https://arxiv.org/abs/2502.19409
Authors: Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Affiliations: University of Copenhagen, Department of Computer Science
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code, dataset, and checkpoints are publicly available at this https URL
Abstract:Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task – achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
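The data construction behind ImageChain is easy to picture as message assembly. Below is a hedged sketch in the chat-message style used by common multimodal APIs; the exact message schema is an assumption on our part, not the authors' code:

```python
def build_imagechain_messages(scenes, next_image):
    """Interleave (image, description) pairs as a multi-turn dialogue, then
    ask the model to describe the upcoming scene from the story so far."""
    messages = []
    for i, (image, description) in enumerate(scenes):
        messages.append({"role": "user",
                         "content": [{"type": "image", "image": image},
                                     {"type": "text", "text": f"Describe scene {i + 1}."}]})
        messages.append({"role": "assistant", "content": description})
    messages.append({"role": "user",
                     "content": [{"type": "image", "image": next_image},
                                 {"type": "text",
                                  "text": "Given the story so far, describe this next scene."}]})
    return messages
```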
[NLP-5] Learning Code-Edit Embedding to Model Student Debugging Behavior
Quick Read: This paper addresses the insufficient analysis of student debugging behavior when giving feedback on programming assignments. The key solution is an encoder-decoder model that learns meaningful code-edit embeddings between consecutive student code submissions to capture debugging behavior. The model uses information about whether each submission passes each test case to fine-tune large language models (LLMs) to learn code-edit representations, enabling personalized next-step code suggestions that preserve a student's coding style while improving test-case correctness. Clustering over the learned code-edit patterns further reveals common student errors and debugging behaviors.
Link: https://arxiv.org/abs/2502.19407
Authors: Hasnain Heickal, Andrew Lan
Affiliations: University of Massachusetts Amherst
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:
Abstract:Providing effective feedback for programming assignments in computer science education can be challenging: students solve problems by iteratively submitting code, executing it, and using limited feedback from the compiler or the auto-grader to debug. Analyzing student debugging behavior in this process may reveal important insights into their knowledge and inform better personalized support tools. In this work, we propose an encoder-decoder-based model that learns meaningful code-edit embeddings between consecutive student code submissions, to capture their debugging behavior. Our model leverages information on whether a student code submission passes each test case to fine-tune large language models (LLMs) to learn code editing representations. It enables personalized next-step code suggestions that maintain the student’s coding style while improving test case correctness. Our model also enables us to analyze student code-editing patterns to uncover common student errors and debugging behaviors, using clustering techniques. Experimental results on a real-world student code submission dataset demonstrate that our model excels at code reconstruction and personalized code suggestion while revealing interesting patterns in student debugging behavior.
[NLP-6] TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
Quick Read: This paper addresses the limitations of large language models (LLMs) in generating coherent and pedagogically meaningful visual explanations. The key solution is TheoremExplainAgent, an agentic approach that uses Manim animations to generate long-form theorem explanation videos (over 5 minutes). To systematically evaluate multimodal theorem explanations, the authors also propose the TheoremExplainBench benchmark. Results show that agentic planning is essential for generating detailed long-form videos, with the o3-mini agent reaching a 93.8% success rate and an overall score of 0.77; nevertheless, most generated videos still exhibit minor visual element layout issues.
Link: https://arxiv.org/abs/2502.19400
Authors: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
Affiliations: University of Waterloo; Votee AI; Vector Institute
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
[NLP-7] Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis
Quick Read: This paper addresses the difficulty of separating linguistic information from paralinguistic features in speech embeddings produced by self-supervised models such as wav2vec2, HuBERT, WavLM, and Whisper. The key solution is to regress the speech embeddings onto their corresponding text embeddings and use the residuals as a representation of vocal tone, effectively disentangling paralinguistic features from linguistic content. Experiments show that this method significantly improves tone classification and enhances linear separability, so that even simple models such as logistic regression achieve better classification.
Link: https://arxiv.org/abs/2502.19387
Authors: Hamdan Al Ahbabi, Gautier Marti, Saeed AlMarri, Ibrahim Elfadel
Affiliations: Khalifa University; ADIA; Khalifa University; Khalifa University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features. These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
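The core of the method is a single regression and a subtraction. Here is a minimal sketch assuming you have already computed utterance-level speech embeddings, text embeddings of the matching transcripts, and tone labels; the file paths are placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Assumed precomputed inputs (placeholder paths): speech embeddings from e.g.
# wav2vec2/WavLM, text embeddings of the matching transcripts, tone labels.
speech_emb = np.load("speech_embeddings.npy")  # shape (n, d_speech)
text_emb = np.load("text_embeddings.npy")      # shape (n, d_text)
tone_labels = np.load("tone_labels.npy")       # shape (n,)

# Regress speech embeddings onto text embeddings: the predictable part is
# linguistic content, and the residual serves as the vocal-tone representation.
reg = LinearRegression().fit(text_emb, speech_emb)
residual = speech_emb - reg.predict(text_emb)

# The paper reports that even a simple linear classifier separates tones
# noticeably better on residuals than on raw speech embeddings.
clf = LogisticRegression(max_iter=1000).fit(residual, tone_labels)
print("train accuracy on residuals:", clf.score(residual, tone_labels))
```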
[NLP-8] DataMan: Data Manager for Pre-training Large Language Models ICLR2025
Quick Read: This paper addresses the lack of comprehensive, explicit guidelines for selecting pre-training data for large language models (LLMs). The key idea, inspired by "reverse thinking", is to prompt LLMs to self-identify which criteria benefit their performance. Specifically, 14 quality criteria are derived from the causes of text perplexity (PPL) anomalies, and 15 common application domains are introduced to support domain mixing. The authors train a Data Manager (DataMan) to learn quality ratings and domain recognition, and use it to annotate a 447B-token pre-training corpus with 14 quality ratings and domain types. Experiments validate the approach: a 1.3B-parameter language model trained on 30B tokens selected by DataMan significantly outperforms state-of-the-art baselines on in-context learning (ICL), perplexity, and instruction-following ability.
Link: https://arxiv.org/abs/2502.19363
Authors: Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR2025 paper
Abstract:The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking" – prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.
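The selection step itself reduces to ranking documents by a learned quality score and filling a token budget. A hedged sketch follows; `score_fn` stands in for DataMan's 14-criteria rater, and the greedy rule is our simplification:

```python
def select_pretraining_data(documents, score_fn, token_budget=30_000_000_000):
    """Pick the highest-scoring documents until the token budget is filled.

    `documents` are dicts with "text" and "num_tokens"; `score_fn` is a
    hypothetical stand-in for DataMan's quality rater.
    """
    ranked = sorted(documents, key=lambda d: score_fn(d["text"]), reverse=True)
    selected, used = [], 0
    for doc in ranked:
        if used + doc["num_tokens"] > token_budget:
            continue
        selected.append(doc)
        used += doc["num_tokens"]
    return selected
```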
[NLP-9] Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Quick Read: This paper studies the quality of long chain-of-thought (CoT) reasoning steps and evaluates how well existing large language models (LLMs) can detect errors in such long CoTs. To this end, it introduces DeltaBench, which contains long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for tasks such as math, code, and general reasoning. The key contribution is a fine-grained analysis over DeltaBench to uncover the effectiveness and efficiency of different o1-like models, together with extensive evaluations of existing process reward models (PRMs) and critic models, probing the boundaries and limitations of current PRMs and critics.
Link: https://arxiv.org/abs/2502.19361
Authors: Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
Affiliations: Alibaba Group; M-A-P; CASIA
Subjects: Computation and Language (cs.CL)
Comments: The first three authors contributed equally, 27 pages
Abstract:Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.
[NLP-10] Controlled Diversity: Length-optimized Natural Language Generation ISCA
Quick Read: This paper addresses LLMs' inability to adjust output length to strict length requirements, a capability that would improve their usefulness in applications with diverse user and system requirements. The key solution is to augment existing datasets and apply existing fine-tuning techniques to train models to follow length requirements while maintaining or improving overall response quality. The results show that models can be successfully trained to adhere to length requirements, and that training on a dataset containing the model's own responses avoids degrading response quality.
Link: https://arxiv.org/abs/2502.19347
Authors: Diana Marie Schenke, Timo Baumann
Affiliations: Faculty of Informatics and Mathematics, OTH Regensburg
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:LLMs are not generally able to adjust the length of their outputs based on strict length requirements, a capability that would improve their usefulness in applications that require adherence to diverse user and system requirements. We present an approach to train LLMs to acquire this capability by augmenting existing data and applying existing fine-tuning techniques, which we compare based on the trained models’ adherence to the length requirement and overall response quality relative to the baseline model. Our results demonstrate that these techniques can be successfully applied to train LLMs to adhere to length requirements, with the trained models generating texts which better align to the length requirements. Our results indicate that our method may change the response quality when using training data that was not generated by the baseline model. This allows simultaneous alignment to another training objective in certain scenarios, but is undesirable otherwise. Training on a dataset containing the model’s own responses eliminates this issue.
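The data augmentation the paper describes can be pictured as rewriting each training instruction with an explicit length target derived from its own response. A minimal sketch, with a template of our own invention:

```python
def add_length_requirement(example: dict) -> dict:
    """Attach an explicit length requirement derived from the response itself."""
    n_words = len(example["response"].split())
    example["instruction"] += f"\n\nAnswer in roughly {n_words} words."
    return example

dataset = [{"instruction": "Explain photosynthesis.",
            "response": "Plants convert light, water, and CO2 into sugar and oxygen."}]
augmented = [add_length_requirement(dict(ex)) for ex in dataset]
# fine-tune on `augmented` so the model learns to respect stated lengths
```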
[NLP-11] Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets
Quick Read: This paper evaluates the text summarization abilities of four leading pre-trained, open-source large language models, BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B, across five diverse datasets. The key element of the study is the use of widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to comprehensively assess the models' ability to generate coherent and informative summaries.
Link: https://arxiv.org/abs/2502.19339
Authors: Tohida Rehman, Soumabha Ghosh, Kuntal Das, Souvik Bhattacharjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Affiliations: Jadavpur University; Techno India University
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures, 6 tables
Abstract:Text summarization plays a crucial role in natural language processing by condensing large volumes of text into concise and coherent summaries. As digital content continues to grow rapidly and the demand for effective information retrieval increases, text summarization has become a focal point of research in recent years. This study offers a thorough evaluation of four leading pre-trained and open-source large language models: BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B, across five diverse datasets CNN/DM, Gigaword, News Summary, XSum, and BBC News. The evaluation employs widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess the models’ capabilities in generating coherent and informative summaries. The results reveal the comparative strengths and limitations of these models in processing various text types.
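For reference, all five metrics are available off the shelf; a short sketch using the Hugging Face `evaluate` library (it additionally requires the `rouge_score` and `bert_score` packages):

```python
import evaluate

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

rouge = evaluate.load("rouge")          # reports rouge1, rouge2, rougeL
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")  # downloads a BERT model on first use

print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```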
[NLP-12] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Quick Read: This paper addresses the fact that existing reward models (RMs) rely mainly on human preferences while neglecting verifiable correctness signals, which have shown strong potential for training large language models (LLMs). The key solution is agentic reward modeling, which combines reward models with verifiable correctness signals from multiple aspects to provide more reliable rewards. Concretely, the authors implement a reward agent named RewardAgent that combines human preference rewards with two verifiable signals, factuality and instruction following. Extensive experiments show that RewardAgent significantly outperforms vanilla reward models and yields superior performance across NLP benchmarks.
Link: https://arxiv.org/abs/2502.19328
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
Affiliations: Department of Computer Science and Technology, Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures
Abstract:Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (this https URL).
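The combination step can be sketched as a weighted sum of the preference reward and the two verifier scores; all three callables and the weights below are hypothetical stand-ins, not the released RewardAgent implementation:

```python
def agentic_reward(prompt: str, response: str,
                   preference_rm, factuality_check, instruction_check,
                   weights=(1.0, 0.5, 0.5)) -> float:
    """Combine a learned preference reward with two verifiable signals."""
    w_pref, w_fact, w_inst = weights
    r_pref = preference_rm(prompt, response)      # scalar from a reward model
    r_fact = factuality_check(prompt, response)   # e.g. fraction of verified claims
    r_inst = instruction_check(prompt, response)  # e.g. fraction of constraints met
    return w_pref * r_pref + w_fact * r_fact + w_inst * r_inst

# Best-of-n at inference time: generate n candidates and keep the highest-reward one.
# best = max(candidates, key=lambda r: agentic_reward(prompt, r, rm, fact, inst))
```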
[NLP-13] Shh, don't say that! Domain Certification in LLMs ICLR
Quick Read: This paper addresses the risk that LLMs deployed for constrained tasks can, under adversarial attack, produce outputs outside the intended domain. The key solution is domain certification, a guarantee that accurately characterizes a language model's out-of-domain behavior, together with a method called VALID that provides adversarial bounds as a certificate.
Link: https://arxiv.org/abs/2502.19320
Authors: Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H.S. Torr, Adel Bibi
Affiliations: University of Oxford; Vienna University of Technology; King Abdullah University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, includes appendix. Published in International Conference on Learning Representations (ICLR) 2025
Abstract:Large language models (LLMs) are often deployed to perform constrained tasks, with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification; a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.
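One way to write the guarantee down formally; this is our hedged reading of the abstract, and the paper's exact definition may differ:

```latex
% A model M is (\mathcal{D}, \varepsilon)-domain-certified if, for every prompt x
% (including adversarially chosen ones), the probability of producing a response
% outside the intended domain \mathcal{D} is uniformly bounded:
\forall x \in \mathcal{X}:\quad
  \Pr_{y \sim M(\,\cdot\, \mid x)}\bigl[\, y \notin \mathcal{D} \,\bigr] \le \varepsilon
```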
[NLP-14] FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
Quick Read: This paper addresses insufficient personalization of large language models (LLMs), particularly in user-facing applications such as virtual assistants and content curation. The key solution is Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem: an LLM adapts quickly to a user from a few labeled preferences, constructing a personalized reward function for them. Because real preference data is scarce and hard to collect at scale, the authors also propose careful design choices for constructing synthetic preference datasets, generating over 1M synthetic personalized preferences; the key to transferring from synthetic data to real users is ensuring the data exhibits both high diversity and coherent, self-consistent structure. FSPO achieves an 87% winrate on open-ended question answering for synthetic users and a 72% winrate with real users.
Link: https://arxiv.org/abs/2502.19312
Authors: Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
Affiliations: Stanford University; OpenAI; Google DeepMind; Stanford University; Stanford University; Stanford University; Stanford University; Stanford University; Stanford University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Comments: Website: this https URL
Abstract:Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.
[NLP-15] CritiQ: Mining Data Quality Criteria from Human Preferences
Quick Read: This paper addresses the problem that language models depend on high-quality data, while existing selection methods rely on manually designed heuristics, the perplexity of existing models, trained classifiers, or careful prompt engineering, all of which demand significant expert experience and human annotation while introducing bias. The proposed method, CritiQ, automatically mines data-quality criteria from human preferences with only about 30 human-annotated pairs and then performs efficient data selection. The key component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments; a knowledge base that extracts quality criteria from prior work further boosts CritiQ Flow. Compared with perplexity- and classifier-based methods, verbal criteria are more interpretable and reusable. After deriving the criteria, a CritiQ Scorer is trained to assign quality scores for efficient data selection, and the method's effectiveness is validated in the code, math, and logic domains.
Link: https://arxiv.org/abs/2502.19279
Authors: Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Language models heavily depend on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
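The pairwise-judgment loop at the heart of CritiQ Flow can be sketched as scoring each candidate criterion by how well it predicts the human-annotated preferences. The function names and the keep-top-k rule are our assumptions:

```python
def criterion_accuracy(criterion: str, annotated_pairs, judge) -> float:
    """Fraction of human-annotated pairs where judging by `criterion` alone
    agrees with the human preference. `judge(criterion, a, b)` is a worker-agent
    call returning "A" or "B" (hypothetical interface)."""
    hits = sum(judge(criterion, a, b) == human_choice
               for a, b, human_choice in annotated_pairs)
    return hits / len(annotated_pairs)

def evolve_criteria(criteria, annotated_pairs, judge, keep=5):
    """One manager-agent step (sketch): keep the criteria that best predict
    the ~30 human-annotated preference pairs."""
    ranked = sorted(criteria,
                    key=lambda c: criterion_accuracy(c, annotated_pairs, judge),
                    reverse=True)
    return ranked[:keep]
```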
[NLP-16] Disentangled VAD Representations via a Variational Framework for Political Stance Detection
Quick Read: This paper addresses the effective integration of sentiment information into stance detection, with particular attention to fine-grained sentiment annotation. The key solution is PoliStance-VAE, a stance detection framework that uses a variational autoencoder (VAE) to disentangle latent emotional features (value, arousal, and dominance; VAD) from political discourse on social media, thereby improving stance detection. PoliStance-VAE performs well in both in-target and cross-target stance detection scenarios and achieves state-of-the-art performance on benchmark datasets.
Link: https://arxiv.org/abs/2502.19276
Authors: Beiyu Xu, Zhiwei Liu, Sophia Ananiadou
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The stance detection task aims to categorise the stance regarding specified targets. Current methods face challenges in effectively integrating sentiment information for stance detection. Moreover, the role of highly granular sentiment labelling in stance detection has been largely overlooked. This study presents a novel stance detection framework utilizing a variational autoencoder (VAE) to disentangle latent emotional features (value, arousal, and dominance; VAD) from political discourse on social media. This approach addresses limitations in current methods, particularly in in-target and cross-target stance detection scenarios. This research uses an advanced emotional annotation tool to annotate seven-class sentiment labels for P-STANCE. Evaluations on benchmark datasets, including P-STANCE and SemEval-2016, reveal that PoliStance-VAE achieves state-of-the-art performance, surpassing models like BERT, BERTweet, and GPT-4o. PoliStance-VAE offers a robust and interpretable solution for stance detection, demonstrating the effectiveness of integrating nuanced emotional representations. This framework paves the way for advancements in natural language processing tasks, particularly those requiring detailed emotional understanding.
[NLP-17] Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization ICLR2025
Quick Read: This paper addresses the problem that initializing and training a Mixture of Experts (MoE) model from a pre-trained dense model (upcycling) yields initial gains but suboptimal performance over long training. The key solution, Drop-Upcycling, combines exploiting the knowledge of pre-trained dense models with statistically re-initializing parts of the weights, which strategically promotes expert specialization and significantly improves the MoE model's efficiency in knowledge acquisition.
Link: https://arxiv.org/abs/2502.19261
Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
Affiliations: Institute of Science Tokyo; Sakana AI; NII LLMC; Tohoku University; RIKEN
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at the 13th International Conference on Learning Representations (ICLR 2025)
Abstract:The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model’s efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.
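The partial re-initialization is simple to sketch: each expert starts from the dense FFN weights, but a random subset of intermediate units is re-drawn from the weight statistics. The 0.5 drop ratio and the Gaussian re-sampling below are our assumptions, not necessarily the paper's exact recipe:

```python
import torch

def drop_upcycle_expert(dense_up: torch.Tensor, drop_ratio: float = 0.5) -> torch.Tensor:
    """Initialize one MoE expert from a dense FFN up-projection, re-initializing
    a random subset of intermediate units (rows) from the weight statistics."""
    expert = dense_up.clone()                       # shape (d_ff, d_model)
    n_drop = int(drop_ratio * expert.shape[0])
    idx = torch.randperm(expert.shape[0])[:n_drop]  # units to re-initialize
    mu, sigma = dense_up.mean(), dense_up.std()
    expert[idx] = torch.randn(n_drop, expert.shape[1]) * sigma + mu
    return expert

dense_up = torch.randn(4096, 1024)                  # pre-trained dense FFN weight
experts = [drop_upcycle_expert(dense_up) for _ in range(8)]  # one 8-expert MoE layer
```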
[NLP-18] Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Quick Read: This paper investigates whether pretraining language models on formal languages can improve their acquisition of natural language, and which features of a formal language impart a useful inductive bias for transfer. The key finding is that transfer works when the formal language both captures dependency structures found in natural language and stays within the computational limitations of the model architecture, yielding lower natural-language loss and better linguistic generalization. The paper shows that pre-pretraining (training on formal, then natural language) is more efficient than the same amount of natural-language-only training, matching the loss and improving linguistic generalization with a 33% smaller token budget, and it gives mechanistic evidence that attention heads acquired during formal-language pretraining remain crucial for the model's performance on syntactic evaluations.
Link: https://arxiv.org/abs/2502.19249
Authors: Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen
Affiliations: New York University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Pretraining language models on formal languages can improve their acquisition of natural language, but it is unclear which features of the formal language impart an inductive bias that leads to effective transfer. Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when the formal language both captures dependency structures in natural language and remains within the computational limitations of the model architecture. Focusing on transformers, we find that formal languages with both these properties enable language models to achieve lower loss on natural language and better linguistic generalization compared to other languages. In fact, pre-pretraining, or training on formal-then-natural language, reduces loss more efficiently than the same amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. We also give mechanistic evidence of cross-task transfer from formal to natural language: attention heads acquired during formal language pretraining remain crucial for the model’s performance on syntactic evaluations.
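As a concrete example of a formal language with nested, natural-language-like dependencies, here is a generator for a Dyck-style matched-bracket language; whether this matches the paper's exact language choice is an assumption on our part:

```python
import random

def dyck_string(max_len: int, vocab=("()", "[]", "{}")) -> str:
    """Sample a well-nested bracket string: each opener must be closed in
    last-in-first-out order, creating long-range hierarchical dependencies."""
    out, stack = [], []
    while len(out) < max_len:
        must_close = len(out) + len(stack) >= max_len
        if stack and (must_close or random.random() < 0.5):
            out.append(stack.pop())          # close the most recent dependency
        else:
            opener, closer = random.choice(vocab)
            out.append(opener)
            stack.append(closer)
    out.extend(reversed(stack))              # close anything still open
    return " ".join(out)

corpus = [dyck_string(64) for _ in range(1000)]  # toy pre-pretraining corpus
```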
[NLP-19] Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Quick Read: This paper addresses the weak performance of large language models (LLMs) in complex reasoning scenarios and the limited transparency of their reasoning outcomes. The key solution is a contrastive reflection synthesis pipeline that improves the accuracy and depth of LLM-generated reflections, together with a dual-model reasoning framework within a verbal reinforcement learning paradigm that decouples inference-time self-reflection into specialized, trained models for reasoning critique and refinement. Experiments show the framework outperforms traditional preference-optimization methods on all evaluation metrics, and that a collaborative Reasoner-Critic pair achieves superior reasoning performance and transparency compared with single-model approaches.
Link: https://arxiv.org/abs/2502.19230
Authors: Jiazheng Li, Yuxiang Zhou, Junru Lu, Gladys Tyen, Lin Gui, Cesare Aloisi, Yulan He
Affiliations: King's College London; AQA; The Alan Turing Institute; Tencent YouTu Lab; University of Cambridge; Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) often struggle with complex reasoning scenarios. While preference optimization methods enhance reasoning performance through training, they often lack transparency in why one reasoning outcome is preferred over another. Verbal reflection techniques improve explainability but are limited in LLMs’ critique and refinement capacity. To address these challenges, we introduce a contrastive reflection synthesis pipeline that enhances the accuracy and depth of LLM-generated reflections. We further propose a dual-model reasoning framework within a verbal reinforcement learning paradigm, decoupling inference-time self-reflection into specialized, trained models for reasoning critique and refinement. Extensive experiments show that our framework outperforms traditional preference optimization methods across all evaluation metrics. Our findings also show that “two heads are better than one”, demonstrating that a collaborative Reasoner-Critic model achieves superior reasoning performance and transparency, compared to single-model approaches.
[NLP-20] Negation-Induced Forgetting in LLMs ISCA
Quick Read: This paper investigates whether large language models (LLMs) exhibit negation-induced forgetting (NIF), a phenomenon observed in human cognition where negating incorrect attributes weakens memory for the corresponding object or event. The key step is adapting the experimental framework of Zang et al. (2023) to test ChatGPT-3.5, GPT-4o mini, and Llama3-70b-instruct for NIF. The results show a clear NIF effect in ChatGPT-3.5, a marginally significant effect in GPT-4o-mini, and no NIF in Llama3-70B, providing initial evidence that similar cognitive biases may emerge in some LLMs.
Link: https://arxiv.org/abs/2502.19211
Authors: Francesca Capuano, Ellen Boschert, Barbara Kaup
Affiliations: Department of Psychology, University of Tübingen
Subjects: Computation and Language (cs.CL)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:The study explores whether Large Language Models (LLMs) exhibit negation-induced forgetting (NIF), a cognitive phenomenon observed in humans where negating incorrect attributes of an object or event leads to diminished recall of this object or event compared to affirming correct attributes (Mayo et al., 2014; Zang et al., 2023). We adapted Zang et al.'s (2023) experimental framework to test this effect in ChatGPT-3.5, GPT-4o mini and Llama3-70b-instruct. Our results show that ChatGPT-3.5 exhibits NIF, with negated information being less likely to be recalled than affirmed information. GPT-4o-mini showed a marginally significant NIF effect, while LLaMA-3-70B did not exhibit NIF. The findings provide initial evidence of negation-induced forgetting in some LLMs, suggesting that similar cognitive biases may emerge in these models. This work is a preliminary step in understanding how memory-related phenomena manifest in LLMs.
[NLP-21] Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Quick Read: This paper addresses the fact that Retrieval-Augmented Generation (RAG) markedly reduces hallucination in LLMs yet can still produce inconsistent or unsupported content, and that the widely used LLM-as-a-Judge approach to RAG hallucination detection faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. The key solution is a new framework called Bi'an, comprising a bilingual benchmark dataset and lightweight judge models: the bilingual dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Experiments on Bi'anBench show that the authors' 14B model outperforms baselines with over five times more parameters and rivals state-of-the-art closed-source LLMs.
Link: https://arxiv.org/abs/2502.19209
Authors: Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang
Affiliations: Ant Group
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at this https URL.
[NLP-22] MultiConAD: A Unified Multilingual Conversational Dataset for Early Alzheimer's Detection
Quick Read: This paper addresses the lack of fine-grained classification and poor cross-lingual generalization in Alzheimer's disease (AD) detection. The key contributions are threefold: building a multilingual (English, Spanish, Chinese, and Greek) dataset covering both speech and text data; performing finer-grained classification that includes the Mild Cognitive Impairment (MCI) stage, a crucial window for early intervention, and evaluating various classifiers with sparse and dense text representations; and running experiments in monolingual and multilingual settings, finding that some languages benefit from multilingual training while others perform better independently. Together these contributions substantially advance fine-grained AD classification and cross-lingual generalization.
Link: https://arxiv.org/abs/2502.19208
Authors: Arezo Shakeri, Mina Farmanbar, Krisztian Balog
Affiliations: University of Stavanger
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures
Abstract:Dementia is a progressive cognitive syndrome with Alzheimer’s disease (AD) as the leading cause. Conversation-based AD detection offers a cost-effective alternative to clinical methods, as language dysfunction is an early biomarker of AD. However, most prior research has framed AD detection as a binary classification problem, limiting the ability to identify Mild Cognitive Impairment (MCI)-a crucial stage for early intervention. Also, studies primarily rely on single-language datasets, mainly in English, restricting cross-language generalizability. To address this gap, we make three key contributions. First, we introduce a novel, multilingual dataset for AD detection by unifying 16 publicly available dementia-related conversational datasets. This corpus spans English, Spanish, Chinese, and Greek and incorporates both audio and text data derived from a variety of cognitive assessment tasks. Second, we perform finer-grained classification, including MCI, and evaluate various classifiers using sparse and dense text representations. Third, we conduct experiments in monolingual and multilingual settings, finding that some languages benefit from multilingual training while others perform better independently. This study highlights the challenges in multilingual AD detection and enables future research on both language-specific approaches and techniques aimed at improving model generalization and robustness.
[NLP-23] FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
Quick Read: This paper addresses the failure of existing language model unlearning methods to account for the complex, interconnected nature of knowledge: they may fail to erase related knowledge that should be removed, or unintentionally erase knowledge that merely looks related but exists in a completely different context. The paper defines this new phenomenon as superficial unlearning and introduces FaithUn, a benchmark for analyzing and evaluating the faithfulness of unlearning in real-world knowledge QA settings. The key solution is KLUE, a new unlearning method that updates only knowledge-related neurons to achieve faithful unlearning: it identifies knowledge neurons with an explainability method and updates only those neurons using selected unforgotten samples. Experiments show that widely used unlearning methods fail to ensure faithful unlearning, whereas KLUE is markedly effective on real-world QA unlearning.
Link: https://arxiv.org/abs/2502.19207
Authors: Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung
Affiliations: Seoul National University; Adobe Research; LG AI Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages
Abstract:Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed, retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely-used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
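The update rule in KLUE can be pictured as a gradient step masked to the identified knowledge neurons. A minimal sketch; obtaining `knowledge_masks` from an attribution pass is assumed, and the masking granularity is our simplification:

```python
import torch

def masked_unlearning_step(model, loss, knowledge_masks, optimizer):
    """One unlearning step that touches only identified knowledge neurons.

    `knowledge_masks` maps parameter name -> 0/1 tensor of the same shape,
    assumed to come from an explainability/attribution pass.
    """
    optimizer.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        mask = knowledge_masks.get(name)
        # Zero the gradient everywhere except the selected neurons.
        param.grad.mul_(mask if mask is not None else 0.0)
    optimizer.step()
```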
[NLP-24] LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts
Quick Read: This paper addresses the challenges of document visual question answering (document VQA) in Vietnamese, a low-resource language. The key contributions are the ReceiptVQA dataset and the LiGT architecture. ReceiptVQA is a large-scale dataset with 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. LiGT is a layout-aware encoder-decoder architecture that leverages the embedding layers of language models to operate layout embeddings, minimizing the use of additional neural modules. Experiments show the architecture achieves competitive results against strong baselines; the study also finds a clear advantage of generative architectures over encoder-only models and underscores the need to combine multiple modalities for this dataset.
Link: https://arxiv.org/abs/2502.19202
Authors: Thanh-Phong Le, Trung Le Chi Phan, Nghia Hieu Nguyen, Kiet Van Nguyen
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted at IJDAR
Abstract:Purpose: Document Visual Question Answering (document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing amount of documents and the high demand for digitization. Nonetheless, most document VQA datasets are developed in high-resource languages such as English. Methods: In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the initial large-scale document VQA dataset in Vietnamese dedicated to receipts, a document kind with high commercial potential. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition to our study, we introduce LiGT (Layout-infused Generative Transformer), a layout-aware encoder-decoder architecture designed to leverage embedding layers of language models to operate layout embeddings, minimizing the use of additional neural modules. Results: Experiments on ReceiptVQA show that our architecture yielded promising performance, achieving competitive results compared with outstanding baselines. Furthermore, through analyzing experimental results, we found evident patterns that employing encoder-only model architectures has considerable disadvantages in comparison to architectures that can generate answers. We also observed that it is necessary to combine multiple modalities to tackle our dataset, despite the critical role of semantic understanding from language models. Conclusion: We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.
[NLP-25] BIG-Bench Extra Hard
Quick Read: This paper addresses the limitations of existing benchmarks for evaluating the reasoning abilities of large language models (LLMs). Although benchmarks such as BIG-Bench and its harder version BIG-Bench Hard (BBH) assess LLM reasoning to some extent, improving models have saturated them: top models score near-perfectly on many BBH tasks, which erodes the benchmark's utility. The key solution is BIG-Bench Extra Hard (BBEH), a new benchmark that replaces each BBH task with a novel task probing a similar reasoning capability at significantly increased difficulty, continuing to push the boundaries of LLM reasoning evaluation.
Link: https://arxiv.org/abs/2502.19187
Authors: Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
Affiliations: Google DeepMind; Google Research; UCLA; Google DeepMind; Google DeepMind; Google DeepMind; Google DeepMind; Google Research; Google DeepMind; Google DeepMind; Google DeepMind; Google Research; Google DeepMind; Google DeepMind; Google DeepMind; Google DeepMind; Google DeepMind; Google DeepMind; Google DeepMind; Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: this https URL.
[NLP-26] MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis
Quick Read: This paper addresses the limitation of performing differential diagnosis (DDx) in clinical settings from a single attempt with incomplete patient information. The key solution is the Modular Explainable DDx Agent (MEDDxAgent) framework, which refines diagnostic reasoning iteratively rather than assuming a complete patient profile is available. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history-taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. The study also introduces a comprehensive DDx benchmark covering respiratory, skin, and rare diseases to ensure robust evaluation.
Link: https://arxiv.org/abs/2502.19175
Authors: Daniel Rose, Chia-Chien Hung, Marco Lepri, Israa Alqassem, Kiril Gashteovski, Carolin Lawrence
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Differential Diagnosis (DDx) is a fundamental yet complex aspect of clinical decision-making, in which physicians iteratively refine a ranked list of possible diseases based on symptoms, antecedents, and medical knowledge. While recent advances in large language models have shown promise in supporting DDx, existing approaches face key limitations, including single-dataset evaluations, isolated optimization of components, unrealistic assumptions about complete patient profiles, and single-attempt diagnosis. We introduce a Modular Explainable DDx Agent (MEDDxAgent) framework designed for interactive DDx, where diagnostic reasoning evolves through iterative learning, rather than assuming a complete patient profile is accessible. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, we introduce a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. We analyze single-turn diagnostic approaches and demonstrate the importance of iterative refinement when patient profiles are not available at the outset. Our broad evaluation demonstrates that MEDDxAgent achieves over 10% accuracy improvements in interactive DDx across both large and small LLMs, while offering critical explainability into its diagnostic reasoning process.
[NLP-27] TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency
Quick Read: This paper addresses how to improve large language model performance at inference time by spending additional computation. The key contribution is TestNUC, a new method that improves test-time predictions by exploiting the local consistency of an input instance and its neighboring unlabeled data. The approach scales linearly and consistently outperforms baselines such as standard prompting and self-consistency across diverse datasets.
Link: https://arxiv.org/abs/2502.19163
Authors: Henry Peng Zou, Zhengyao Gu, Yue Zhou, Yankai Chen, Weizhi Zhang, Liancheng Fang, Yibo Wang, Yangning Li, Kay Liu, Philip S. Yu
Affiliations: University of Illinois Chicago; Cornell University; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model's prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at this https URL.
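The aggregation itself is compact. A sketch under our own assumptions (cosine similarity for neighbor retrieval, simple majority vote, equal weight for the instance's own prediction):

```python
import numpy as np
from collections import Counter

def testnuc_predict(x_emb, pred_x, unlabeled_embs, unlabeled_preds, k=8):
    """Combine the model's prediction on the input with its predictions on the
    k nearest unlabeled neighbors, decided by majority vote."""
    sims = unlabeled_embs @ x_emb / (
        np.linalg.norm(unlabeled_embs, axis=1) * np.linalg.norm(x_emb) + 1e-9)
    neighbors = np.argsort(-sims)[:k]
    votes = Counter([pred_x] + [unlabeled_preds[i] for i in neighbors])
    return votes.most_common(1)[0][0]
```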
[NLP-28] Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models
Quick Read: This paper addresses data bias in large language models (LLMs) stemming from social categories and stereotypes embedded in language. The key solution is a new approach that detects and quantifies the linguistic indicators of social categories and stereotypes in a sentence. The indicators are derived from the Social Category and Stereotype Communication (SCSC) framework, and LLMs are instructed via in-context learning to apply the approach automatically. Based on an empirical evaluation of the importance of the different linguistic indicators, a scoring function is learned to measure the linguistic indicators of a stereotype. The results show that the models generally detect and classify category labels well but sometimes struggle to evaluate the associated behaviors and characteristics; using more few-shot examples significantly improves performance.
Link: https://arxiv.org/abs/2502.19160
Authors: Rebekka Görge, Michael Mock, Héctor Allende-Cid
Affiliations: Fraunhofer IAIS; Lamarr; Fraunhofer IAIS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Social categories and stereotypes are embedded in language and can introduce data bias into Large Language Models (LLMs). Despite safeguards, these biases often persist in model behavior, potentially leading to representational harm in outputs. While sociolinguistic research provides valuable insights into the formation of stereotypes, NLP approaches for stereotype detection rarely draw on this foundation and often lack objectivity, precision, and interpretability. To fill this gap, in this work we propose a new approach that detects and quantifies the linguistic indicators of stereotypes in a sentence. We derive linguistic indicators from the Social Category and Stereotype Communication (SCSC) framework which indicate strong social category formulation and stereotyping in language, and use them to build a categorization scheme. To automate this approach, we instruct different LLMs using in-context learning to apply the approach to a sentence, where the LLM examines the linguistic properties and provides a basis for a fine-grained assessment. Based on an empirical evaluation of the importance of different linguistic indicators, we learn a scoring function that measures the linguistic indicators of a stereotype. Our annotations of stereotyped sentences show that these indicators are present in these sentences and explain the strength of a stereotype. In terms of model performance, our results show that the models generally perform well in detecting and classifying linguistic indicators of category labels used to denote a category, but sometimes struggle to correctly evaluate the associated behaviors and characteristics. Using more few-shot examples within the prompts significantly improves performance. Model performance increases with size, as Llama-3.3-70B-Instruct and GPT-4 achieve comparable results that surpass those of Mixtral-8x7B-Instruct, GPT-4-mini and Llama-3.1-8B-Instruct.
zh
[NLP-29] When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
【速读】: 该论文旨在解决个性化偏好学习在强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)中的有效性评估缺乏标准化方法的问题,并特别关注多样化的用户偏好和少数群体观点。论文的关键解决方案在于提出一个多维度的评估框架,该框架不仅衡量性能,还评估公平性、非预期影响以及在不同偏好分歧水平下的适应性。通过广泛的实验,论文展示了当用户存在强烈分歧时,不同个性化方法之间的性能差异可能高达36%,并且个性化可能会导致高达20%的安全性错配。这些发现强调了采用全面评估方法的重要性,以推进更有效和包容性的偏好学习系统的开发。
链接: https://arxiv.org/abs/2502.19158
作者: Yijiang River Dong,Tiancheng Hu,Yinhong Liu,Ahmet Üstün,Nigel Collier
机构: University of Cambridge(Cambridge 大学); Cohere For AI(Cohere For AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
zh
[NLP-30] Isolating Language-Coding from Problem-Solving: Benchmarking LLM s with PseudoEval
【速读】: 该论文旨在解决现有代码生成基准测试(如HumanEval和MBPP)在评估大型语言模型(LLMs)时无法明确区分其瓶颈在于问题解决能力还是编程语言能力的问题。为了解决这一问题,论文提出的关键方案是构建了一个名为PseudoEval的新基准测试,它提供伪代码作为输入,从而能够隔离并识别不同编程语言中的代码生成瓶颈。
链接: https://arxiv.org/abs/2502.19149
作者: Jiarong Wu,Songqiang Chen,Jialun Cao,Hau Ching Lo,Shing-Chi Cheung
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs’ end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores revealed in this way provide little hint as to the bottleneck of code generation – whether LLMs are struggling with their problem-solving capability or language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages could be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while Rust is struggling relatively more in language-coding. Also, our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline of constructing PseudoEval to facilitate the extension to existing benchmarks. PseudoEval is available at: this https URL.
zh
[NLP-31] Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLM s ICLR2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中无法充分适应个性化、多变且多样化的用户偏好问题。论文的关键解决方案是引入了一个名为Amulet的无需训练的框架,该框架将每个解码令牌的过程表述为一个独立的在线学习问题,并通过简单的用户提供的提示进行引导,从而实现实时优化以满足用户的个性化偏好。为了减少每次解码优化过程带来的计算成本,Amulet还提供了优化过程每一步的闭式解,从而将计算时间成本降低到可忽略不计的水平。
链接: https://arxiv.org/abs/2502.19148
作者: Zhaowei Zhang,Fengshuo Bai,Qizhi Chen,Chengdong Ma,Mingzhi Wang,Haoran Sun,Zilong Zheng,Yaodong Yang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by ICLR 2025, Project page: this https URL
点击查看摘要
Abstract:How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users’ personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
zh
[NLP-32] Self-Memory Alignment: Mitigating Factual Hallucinations with Generalized Improvement
【速读】: 该论文旨在解决大型语言模型(LLMs)在回应中与客观事实不一致的问题,即事实性幻觉(factual hallucinations),这一问题难以检测且容易误导缺乏相关知识的用户。为了解决这个问题,论文提出通过直接增强LLMs利用其现有记忆(即从预训练数据中获得的知识)的能力来精确获取信息。关键解决方案是引入自我记忆对齐(Self-Memory Alignment, SMA),通过对模型进行微调,使其能够针对自动生成的关于精确简单事实性问题的回答进行偏好优化。此外,构建了一个包含181k条中文数据、覆盖21个领域的全面且精确的事实性问答数据集FactualBench,以促进评估和训练。实验结果表明,SMA显著提升了LLMs在多个基准测试中的整体表现,特别是在事实性、实用性和综合性技能方面表现出一致的提升。
链接: https://arxiv.org/abs/2502.19127
作者: Siyuan Zhang,Yichi Zhang,Yinpeng Dong,Hang Su
机构: Tsinghua University(清华大学); Dept. of Comp. Sci. and Tech.(计算机科学与技术系); Institute for AI(人工智能研究所); THBI Lab(清华伯克利深圳学院); BNRist Center(清华信息科学与技术国家实验室); Tsinghua-Bosch Joint ML Center(清华大学-博世联合机器学习中心)
类目: Computation and Language (cs.CL)
备注: 29 pages, 17 figures
点击查看摘要
Abstract:Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in the issue of factual hallucinations, which can be difficult to detect and mislead users without relevant knowledge. While post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs in different capabilities. In this paper, we propose to address it by directly augmenting LLM’s fundamental ability to precisely leverage its existing memory–the knowledge acquired from pre-training data. We introduce self-memory alignment (SMA), which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese data spanning 21 domains, to facilitate both evaluation and training. Extensive experiments show that SMA significantly improves LLMs’ overall performance, with consistent enhancement across various benchmarks concerning factuality, as well as helpfulness and comprehensive skills.
zh
[NLP-33] Improving customer service with automatic topic detection in user emails
【速读】: 该论文旨在解决电信服务公司Telekom Srbija在客户邮件处理方面的效率问题。解决方案的关键在于引入了一个基于BERTopic的自然语言处理(NLP)管道,该管道能够实现自动化的电子邮件主题检测与分类。通过一系列预处理和后处理步骤,系统能够对收到的每封邮件分配一个由12个主题之一及其附加标签组成的分类,从而使得客服团队可以通过定制的应用程序进行邮件的筛选和访问。该模型通过评估测试数据集中的100封客户邮件的自动分配主题的速度和准确性来衡量其性能。
链接: https://arxiv.org/abs/2502.19115
作者: Bojana Bašaragin,Darija Medvecki,Gorana Gojić,Milena Oparnica,Dragiša Mišković
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Paper submitted to the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9-12 March 2025
点击查看摘要
Abstract:This study introduces a novel Natural Language Processing pipeline that enhances customer service efficiency at Telekom Srbija, a leading Serbian telecommunications company, through automated email topic detection and labelling. Central to the pipeline is BERTopic, a modular architecture that allows unsupervised topic modelling. After a series of preprocessing and post-processing steps, we assign one of 12 topics and several additional labels to incoming emails, allowing customer service to filter and access them through a custom-made application. The model’s performance was evaluated by assessing the speed and correctness of the automatically assigned topics across a test dataset of 100 customer emails. The pipeline shows broad applicability across languages, particularly for those that are low-resourced and morphologically rich. The system now operates in the company’s production environment, streamlining customer service operations through automated email classification.
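管道核心的 BERTopic 属于公开库,下面用其真实 API 给出一个最小可运行示意(示例数据用公开的 20 Newsgroups 代替真实的塞尔维亚语客户邮件,与该公司生产系统无关):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# 示意:用公开的 20 Newsgroups 文本代替真实客户邮件
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(language="multilingual")  # 多语言嵌入,适合低资源、形态丰富的语言
topics, probs = topic_model.fit_transform(docs)  # 无监督主题建模

# 生产流程中:对新来的邮件分配主题(对应“12 个主题之一 + 附加标签”中的主题部分)
new_topics, _ = topic_model.transform(["my internet connection stopped working"])
print(topic_model.get_topic_info().head())
print(new_topics)
```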
zh
[NLP-34] Conformal Linguistic Calibration: Trading-off between Factuality and Specificity
【速读】: 该论文旨在解决语言模型输出可靠性不足的问题,并提出了一种新的方法来调整模型响应以适应不确定性。论文的关键在于提出了一种统合拒绝策略(abstention)和语言校准(linguistic calibration)的统一框架——“一致性语言校准(Conformal Linguistic Calibration, CLC)”。这种方法通过将语言校准重新解释为答案集预测,实现了对模型响应不精确程度的控制,从而在保持事实准确性的同时提供可控的确定性感知自适应声明重写能力。
链接: https://arxiv.org/abs/2502.19110
作者: Zhengping Jiang,Anqi Liu,Benjamin Van Durme
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Language model outputs are not always reliable; this prompts research into methods for adapting model responses based on uncertainty. Common approaches include: abstention, where models refrain from generating responses when uncertain; and linguistic calibration, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unifying view of both approaches, Conformal Linguistic Calibration (CLC), reinterpreting linguistic calibration as answer set prediction. We begin by presenting a unified framework that connects abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation that allows for controlling the level of imprecision in model responses. Experimental results show that our method produces calibrated outputs with conformal guarantees on factual accuracy. Furthermore, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.
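其“带共形保证的答案集预测”可以用标准的分裂共形预测来理解:在校准集上取分数的低分位作为阈值,推理时输出所有得分不低于阈值的候选。下面是一个按此理解的极简示意(打分方式与具体实现细节为假设,以原论文为准):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """分裂共形校准(示意):cal_scores 为校准集上“真实答案”的模型分数。
    返回阈值 tau,使新样本的真实答案以约 1-alpha 的概率得分不低于 tau。"""
    s = np.sort(np.asarray(cal_scores))
    n = len(s)
    k = max(int(np.floor(alpha * (n + 1))) - 1, 0)  # 保守的低分位位置
    return s[k]

def answer_set(candidates, scores, tau):
    """答案集预测:保留所有得分 >= tau 的候选;集合越大,回答越“模糊/保守”。"""
    return [c for c, s in zip(candidates, scores) if s >= tau]

# 用法示意(分数为假设的模型置信度)
tau = conformal_threshold([0.9, 0.85, 0.7, 0.95, 0.6, 0.8], alpha=0.2)
print(answer_set(["Paris", "Lyon", "Marseille"], [0.92, 0.65, 0.3], tau))
```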
zh
[NLP-35] Evaluating Gender Bias in German Machine Translation ISCA
【速读】: 该论文旨在解决德语机器翻译(MT)系统中的性别偏见问题,特别是职业刻板印象和代表性不足的问题。解决方案的关键在于开发了一个新的评估数据集WinoMTDE,该数据集包含288个平衡性别和刻板印象的德语句子,并使用德国劳动力统计数据进行标注。通过大规模评估五个广泛使用的MT系统和一个大型语言模型,研究揭示了大多数模型中持续存在的偏见,而大型语言模型(LLM)的表现优于传统系统。
链接: https://arxiv.org/abs/2502.19104
作者: Michelle Kappl
机构: Technische Universität Berlin(柏林工业大学)
类目: Computation and Language (cs.CL)
备注: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
点击查看摘要
Abstract:We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1 [cs.CL], we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced in regard to gender, as well as stereotype, which was annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available under this https URL.
zh
[NLP-36] LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm
【速读】: 该论文旨在解决大型语言模型(LLMs)在长文本生成任务中的性能退化问题。通过提出LongEval基准测试,论文量化分析了模型在长文本生成过程中随着文本长度增加而表现出的性能下降现象,并提供了基于认知和语言写作模型的直接与计划生成范式的评估方法。关键解决方案在于开发了一种新的基准测试工具,以更全面地评估模型在长文本生成任务中的表现,从而为进一步的模型改进提供指导。
链接: https://arxiv.org/abs/2502.19103
作者: Siwei Wu,Yizhi Li,Xingwei Qu,Rishi Ravikumar,Yucheng Li,Tyler Loakman,Shanghaoran Quan,Xiaoyong Wei,Riza Batista-Navarro,Chenghua Lin
机构: University of Manchester (曼彻斯特大学); University of Surrey (萨里大学); University of Sheffield (谢菲尔德大学); Peking University (北京大学); Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注: Under review
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitatively locate such a performance degradation and provide further insights on model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. The comprehensive experiments in this work reveal interesting findings such as that while model size correlates with generation ability, the small-scale model (e.g., LongWriter), well-trained on long texts, has comparable performance. All code and datasets are released in this https URL.
zh
[NLP-37] Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLM s
【速读】: 该论文旨在解决密集大型语言模型(LLMs)在处理输入复杂度时存在的效率瓶颈问题。现有稀疏化方法(如静态剪枝或动态激活)虽然部分缓解了这一问题,但要么缺乏适应上下文或模型结构需求的能力,要么带来了不可忽视的计算开销。为了解决这个问题,论文提出了一种名为CLADA的认知负荷感知动态激活框架,该框架将统计稀疏性和语义适应性相结合。其关键是发现了LLMs激活的两种互补模式:一是由序列级前缀信息驱动的全局统计稀疏性,二是由认知负荷指标(如意外性和熵)调节的局部语义适应性。通过分层阈值策略,CLADA实现了平均20%的速度提升,同时仅降低了2%的准确性,优于其他方法,并且无需重新训练或架构更改。
链接: https://arxiv.org/abs/2502.19078
作者: Yiheng Yang,Yujie Wang,Chi Ma,Lei Yu,Emmanuele Chersoni,Chu-Ren Huang
机构: The Hong Kong Polytechnic University; Meituan
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Dense large language models (LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) address this partially, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by the human brain’s dual-process mechanisms - predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex context - we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1) Global statistical sparsity driven by sequence-level prefix information, and 2) Local semantic adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline from offline error-controlled optimization ensures 40%+ sparsity, dynamically adjusted by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves ~20% average speedup with 2% accuracy drop, outperforming Griffin (5%+ degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis (R^2 = 0.17 for sparsity-adaptation synergy). Requiring no retraining or architectural changes, CLADA offers a deployable solution for resource-aware LLM inference while advancing biologically-inspired AI design. Our code is available at this https URL.
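文中的认知负荷信号(意外度与熵)可直接由下一词分布算出,再据此在离线基线阈值上做动态调整。下面是一个示意性的 PyTorch 片段(调节函数与系数均为假设,仅体现“负荷高则放宽稀疏化、激活更多神经元”的思路):

```python
import torch
import torch.nn.functional as F

def cognitive_load_threshold(logits, base_tau, gamma=0.1):
    """根据当前步的意外度/熵动态调整激活稀疏化阈值(示意)。

    logits: [vocab] 当前步的下一词 logits
    base_tau: 离线误差控制优化得到的基线阈值(论文中保证 ~40% 稀疏度)
    gamma: 假设的调节强度
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum()   # 分布熵
    surprisal = -log_probs.max()           # 以最优词的 -log p 近似意外度(简化)
    load = entropy + surprisal             # 简化的认知负荷信号
    # 负荷高 -> 语境复杂 -> 降低阈值、保留更多激活
    return base_tau * torch.exp(-gamma * load)

tau = cognitive_load_threshold(torch.randn(32000), base_tau=0.5)
```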
zh
[NLP-38] Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
【速读】: 该论文旨在解决平行数据整理(PDC)过程中因选择不同的预训练多语言语言模型(multiPLMs)而引起的排序质量差异问题。论文的关键解决方案在于通过采用一系列启发式方法来减少由不同multiPLMs导致的偏差,从而过滤掉噪声句子,提高基于网络挖掘语料训练的神经机器翻译(NMT)系统的性能,并减小multiPLMs之间的性能差距。
链接: https://arxiv.org/abs/2502.19074
作者: Aloka Fernando,Surangika Ranathunga,Nisansa de Silva
机构: Dept. of Computer Science and Engineering, University of Moratuwa(University of Moratuwa计算机科学与工程系), Sri Lanka; Massey University(梅西大学), Palmerston North, New Zealand
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from the web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) and training the NMT systems with the top-ranked samples, produces superior NMT performance than when trained using the full dataset. However, previous research has shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En→Si, En→Ta and Si→Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This results in improving the results of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
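句对按嵌入相似度排序的基本做法可用公开的 LaBSE 模型演示;去偏启发式此处仅给出两条假设性的示例规则(长度比、词面重叠),论文中的具体规则以原文为准:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

def rank_pairs(pairs):
    """pairs: [(src, tgt), ...];返回按嵌入余弦相似度降序排列的句对。"""
    src = model.encode([s for s, _ in pairs], normalize_embeddings=True)
    tgt = model.encode([t for _, t in pairs], normalize_embeddings=True)
    sims = (src * tgt).sum(axis=1)          # 归一化后点积即余弦相似度
    order = np.argsort(-sims)
    return [(pairs[i], float(sims[i])) for i in order]

def keep(src, tgt):
    """示例性的去偏启发式(假设规则):过滤长度比异常或词面大量重叠的句对。"""
    ratio = len(src.split()) / max(len(tgt.split()), 1)
    overlap = len(set(src.split()) & set(tgt.split()))
    return 0.5 <= ratio <= 2.0 and overlap < 0.8 * len(set(src.split()))
```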
zh
[NLP-39] IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages
【速读】: 该论文旨在解决当前代码生成评估基准主要以英语为中心的问题,限制了其在全球开发者社区中的适用性。解决方案的关键在于提出了IndicEval-XL,这是一个涵盖6种主要印度语言(Indic languages)的全面基准,这些语言合计由全球约14%的人口使用。通过将这六种语言与12种编程语言相连接,IndicEval-XL构建了一个强大的评估框架,从而显著提升了代码生成系统的语言多样性。
链接: https://arxiv.org/abs/2502.19067
作者: Ujjwal Singh,Aditi Sharma,Nikhil Gupta,Deepakshi,Vivek Kumar Jha
机构: Deutsche Telekom Digital Labs(德电数字实验室)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation from natural language prompts, revolutionizing software development workflows. As we advance towards agent-based development paradigms, these models form the cornerstone of next-generation software development lifecycles. However, current benchmarks for evaluating multilingual code generation capabilities are predominantly English-centric, limiting their applicability across the global developer community. To address this limitation, we present IndicEval-XL, a comprehensive benchmark for code generation that incorporates 6 major Indic languages, collectively spoken by approximately 14% of the world’s population. Our benchmark bridges these languages with 12 programming languages, creating a robust evaluation framework. This work is particularly significant given India’s representation of one-eighth of the global population and the crucial role Indic languages play in Indian society. IndicEval-XL represents a significant step toward expanding the linguistic diversity in code generation systems and evaluation frameworks. By developing resources that support multiple languages, we aim to make AI-powered development tools more inclusive and accessible to developers of various linguistic backgrounds. To facilitate further research and development in this direction, we make our dataset and evaluation benchmark publicly available at this https URL
zh
[NLP-40] Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
【速读】: 该论文旨在探索利用先进的大语言模型(Large Language Models, LLMs)评估诗歌的可行性,并通过一种受共识评估技术(Consensual Assessment Technique, CAT)启发的方法进行验证。论文的关键解决方案在于使用Claude-3-Opus和GPT-4o两个LLMs对90首诗进行评估,结果表明这些模型在根据发表场所匹配真实情况方面超越了非专家人类评委的表现,尤其是在评估较小的诗歌子集时。研究显示,LLMs作为准确评估诗歌的工具具有潜力,从而可能将其应用扩展到其他创意领域。
链接: https://arxiv.org/abs/2502.19064
作者: Piotr Sawicki,Marek Grześ,Dan Brown,Fabrício Góes
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The Consensual Assessment Technique (CAT) evaluates creativity through holistic expert judgments. We investigate the use of two advanced Large Language Models (LLMs), Claude-3-Opus and GPT-4o, to evaluate poetry by a methodology inspired by the CAT. Using a dataset of 90 poems, we found that these LLMs can surpass the results achieved by non-expert human judges at matching a ground truth based on publication venue, particularly when assessing smaller subsets of poems. Claude-3-Opus exhibited slightly superior performance than GPT-4o. We show that LLMs are viable tools for accurately assessing poetry, paving the way for their broader application into other creative domains.
zh
[NLP-41] MathClean: A Benchmark for Synthetic Mathematical Data Cleaning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中因合成数学问题及答案引入的不准确性而导致的数据质量问题。解决方案的关键在于提出MathClean基准,用于评估数学数据清洗模型的有效性。MathClean基准包含来自GSM8K和MATH数据集扩充的共计4,000道正确与错误的问题及其对应的答案,并且对每道题目或答案标注了错误类型,以便未来改进。实验结果表明,即使是当前最先进的模型如GPT-o1和DeepSeek-R1,在该基准上的表现也较差,从而突显出MathClean的重要性。
链接: https://arxiv.org/abs/2502.19058
作者: Hao Liang,Meiyi Qiang,Yuying Li,Zefeng He,Yongzhen Guo,Zhengzhou Zhu,Wentao Zhang,Bin Cui
机构: Peking University(北京大学); Beijing Institute of Technology(北京理工大学); Nanjing University(南京大学); Ant Group(蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:With the rapid development of large language models (LLMs), the quality of training data has become crucial. Among the various types of training data, mathematical data plays a key role in enabling LLMs to acquire strong reasoning abilities. While high-quality open-source data is important, it is often insufficient for pre-training, necessitating the addition of synthetic math problems. However, synthetic math questions and answers can introduce inaccuracies, which may degrade both the training data and web data. Therefore, an effective method for cleaning synthetic math data is essential. In this paper, we propose the MathClean benchmark to evaluate the effectiveness of math data cleaning models. The MathClean benchmark consists of 2,000 correct questions and 2,000 erroneous questions with an additional 2,000 correct and erroneous answers sourced from augmented data based on GSM8K and MATH. Moreover, we also annotate error types for each question or answer, since it can assess whether models can correctly identify the error categories for future improvements. Finally, we present comprehensive evaluations using state-of-the-art (SOTA) models. Our results demonstrate that even strong models like GPT-o1 and DeepSeek-R1 perform poorly on this benchmark, highlighting the utility of MathClean. Our code and data are available at this https URL.
zh
[NLP-42] Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments ICRA2025
【速读】: 该论文旨在解决视觉语言导航(Vision-and-Language Navigation, VLN)在不同高度视角下的人机指令匹配问题,特别是在四足机器人低视角场景中的泛化挑战。论文的关键解决方案是提出了一种地面视角导航(Ground-level Viewpoint Navigation, GVNav)方法,通过利用加权历史观测来增强时空上下文,从而有效管理特征碰撞,并通过分配适当权重处理不同视角下的相同特征。此外,论文还引入了HM3D和Gibson数据集的连接图作为额外资源,以增强空间先验和现实世界场景的全面表示,进而提升真实环境中路标预测器的性能和泛化能力。
链接: https://arxiv.org/abs/2502.19024
作者: Zerui Li,Gengze Zhou,Haodong Hong,Yanyan Shao,Wenqi Lyu,Yanyuan Qiao,Qi Wu
机构: Australian Institute for Machine Learning, The University of Adelaide (澳大利亚阿德莱德大学机器学习研究所); The University of Queensland (昆士兰大学); Zhejiang University of Technology (浙江工业大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA 2025
点击查看摘要
Abstract:Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, generalization remains a persistent challenge, particularly when dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Ground-level Viewpoint Navigation (GVNav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
zh
[NLP-43] Binary Neural Networks for Large Language Model: A Survey
【速读】: 该论文旨在解决大型语言模型(LLMs)在应用过程中因参数规模指数增长带来的资源开销问题。论文的关键解决方案是引入一种从训练之初即采用低精度二值权重的量化方法,这与传统的后训练量化(PTQ)和量化感知训练(QAT)不同。这种方法通过在整个训练过程中使用二值权重来减少内存使用和计算需求,从而实现更高效的模型量化。
链接: https://arxiv.org/abs/2502.19008
作者: Liangdong Liu,Zhitong Zheng,Cong Wang,Tianhuang Su,Zhenyu Yang
机构: OPPO AI Center; School of Electrical and Automation Engineering, Nanjing Normal University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures
点击查看摘要
Abstract:Large language models (LLMs) have wide applications in the field of natural language processing(NLP), such as GPT-4 and Llama. However, with the exponential growth of model parameter sizes, LLMs bring significant resource overheads. Low-bit quantization, as a key technique, reduces memory usage and computational demands by decreasing the bit-width of model parameters, activations, and gradients. Previous quantization methods for LLMs have largely employed Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ does not require any retraining of the original model, while QAT involves optimizing precision during training to achieve the best quantization parameters. The BitNet team proposed a radically different approach, where quantization is performed from the start of model training, utilizing low-precision binary weights during the training process. This approach has led to the emergence of many binary quantization techniques for large language models. This paper provides a comprehensive review of these binary quantization techniques. Specifically, we will introduce binary quantization techniques in deep neural networks and further explore their application to LLMs, reviewing their various contributions, implementations, and applications.
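以综述中提到的 BitNet 一类“训练期二值化”为例,前向用 sign 加缩放因子,反向用直通估计(STE)让梯度穿透到全精度影子权重。下面是一个极简的 PyTorch 示意(与 BitNet 原始实现的细节可能不同):

```python
import torch
import torch.nn as nn

class BinaryLinear(nn.Linear):
    """训练期权重二值化的线性层示意(BitNet 风格,仅作概念演示)。"""
    def forward(self, x):
        w = self.weight
        alpha = w.abs().mean()                 # 逐层缩放因子,补偿二值化的幅度损失
        w_bin = torch.sign(w) * alpha          # 权重取值 {-alpha, +alpha}
        # 直通估计:前向用二值权重,反向梯度直接传给全精度权重
        w_ste = w + (w_bin - w).detach()
        return nn.functional.linear(x, w_ste, self.bias)

layer = BinaryLinear(16, 8)
out = layer(torch.randn(4, 16))   # 可直接参与训练,优化器更新全精度影子权重
```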
zh
[NLP-44] MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
【速读】: 该论文旨在解决多实体问答(MEQA)任务中大型语言模型(LLM)和检索增强生成(RAG)系统在跨文档整合信息方面的挑战。现有的方法虽在单文档理解方面表现出色,但在处理需要整合来自异构源的实体中心见解的问题时,如“ACM院士在各研究领域的分布情况”,往往难以应对。论文的关键解决方案是引入了一个名为MEBench的新基准,该基准包含4,780个问题,系统性地评估LLMs在检索、整合及推理碎片化信息方面的能力。实验结果显示,即使是最先进的模型在MEBench上的准确率也仅达到59%,这突显了现有模型在信息完整性和事实准确性方面的局限性。
链接: https://arxiv.org/abs/2502.18993
作者: Teng Lin
机构: HKUST(GZ)(香港科技大学(广州))
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:
点击查看摘要
Abstract:Multi-entity question answering (MEQA) represents significant challenges for large language models (LLM) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often struggle with cross-document aggregation, particularly when resolving entity-dense questions like “What is the distribution of ACM Fellows among various fields of study?”, which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs’ capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions which are systematically categorized into three primary categories, further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
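文中的 Entity-Attributed F1 (EA-F1) 按其描述大致可理解为在 (实体, 属性) 对层面同时检验实体级正确性与归属有效性。下面是一个按此理解写的假设性实现,具体定义以原论文为准:

```python
def ea_f1(pred, gold):
    """pred/gold: {(entity, attribute), ...} 形式的集合(假设的度量形式)。"""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)                      # 实体与属性都对才算命中
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 例:问“ACM Fellow 在各领域的分布”,以 (人名, 领域) 对计算 EA-F1
gold = {("Alice", "NLP"), ("Bob", "Systems")}
pred = {("Alice", "NLP"), ("Bob", "Theory")}
print(ea_f1(pred, gold))   # 0.5
```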
zh
[NLP-45] GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
【速读】: 该论文旨在解决大型语言模型(LLMs)在利用外部工具时的泛化能力不足问题,特别是针对零样本(Zero-to-One Generalization)和弱到强泛化(Weak-to-Strong Generalization)场景。论文的关键解决方案是提出了一种名为GenTool的新训练框架,通过合成训练数据模拟这两种泛化维度,并采用两阶段微调方法:首先优化工具排名,然后改进工具选择。实验结果表明,这种方法显著提升了参数规模从10亿到80亿的LLMs的工具使用能力,性能超越了GPT-4o。
链接: https://arxiv.org/abs/2502.18990
作者: Jie He,Jennifer Neville,Mengting Wan,Longqi Yang,Hui Liu,Xiaofeng Xu,Xia Song,Jeff Z. Pan,Pei Zhou
机构: Microsoft Corporation (微软公司); School of Informatics, University of Edinburgh (爱丁堡大学信息学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and can effectively generalize to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis also provides valuable insights into the challenges LLMs encounter in tool generalization.
zh
[NLP-46] PEToolLLM : Towards Personalized Tool Learning in Large Language Models
【速读】: 该论文旨在解决现有工具学习研究主要关注通用工具使用能力而忽视个性化工具使用能力的问题,从而无法有效处理隐含的用户偏好。为了解决这一局限,论文提出了个性化工具学习任务的定义,并构建了PEToolBench基准数据集来反映多样化的用户偏好。关键解决方案在于提出的PEToolLLaMA框架,通过监督微调和直接偏好优化来适应个性化工具学习任务。
链接: https://arxiv.org/abs/2502.18980
作者: Qiancheng Xu,Yongqi Li,Heming Xia,Fan Liu,Min Yang,Wenjie Li
机构: The Hong Kong Polytechnic University (香港理工大学); National University of Singapore (新加坡国立大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Tool learning has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit user requirements in instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to handle implicit user preferences. To address the limitation, we first formulate the task of personalized tool learning, which integrates user’s interaction history towards personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings, and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework PEToolLLaMA to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs.
zh
[NLP-47] Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
【速读】: 该论文旨在解决大型语言模型指令微调过程中因训练数据质量和效率限制而导致的效果瓶颈问题。解决方案的关键在于引入了低置信度黄金样本(Low-Confidence Gold, LCG)过滤框架,通过基于质心的聚类和置信度引导选择来识别有价值的指令对。LCG采用半监督方法,利用在代表性样本上训练的轻量级分类器,从而精炼出高质量的数据子集,同时保持数据多样性。实验评估表明,使用LCG过滤的6000个样本子集进行微调的模型,在MT-bench上表现出显著改进,并且在全面评估指标上持续提升。这一框架的有效性证明了其在高效指令微调方面的潜力。
链接: https://arxiv.org/abs/2502.18978
作者: Hongyi Cai,Jie Li,Wenzhen Dong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages
点击查看摘要
Abstract:The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework’s efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.
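“质心聚类 + 置信度引导选择”的流程可以用 scikit-learn 简要示意如下(轻量分类器、阈值与保簇策略均为假设,仅演示半监督筛选的骨架):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def lcg_filter(embeddings, seed_idx, seed_labels, n_clusters=50, tau=0.9):
    """LCG 式过滤的示意:保留轻量分类器“低置信度”的样本作为高价值数据。

    embeddings: [N, d] 指令对的嵌入
    seed_idx/seed_labels: 少量人工标注的代表样本(半监督)
    """
    # 1) 质心聚类保证数据多样性
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)

    # 2) 在代表样本上训练轻量质量分类器
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings[seed_idx], seed_labels)
    conf = clf.predict_proba(embeddings).max(axis=1)

    # 3) 置信度引导选择:置信度低于 tau 的样本视为“低置信度黄金样本”
    selected = set(np.where(conf < tau)[0])
    for c in range(n_clusters):                # 每簇至少保留一个样本,维持簇覆盖
        members = np.where(km.labels_ == c)[0]
        selected.add(int(members[np.argmin(conf[members])]))
    return sorted(selected)
```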
zh
[NLP-48] (Mis)Fitting: A Survey of Scaling Laws ICLR
【速读】: 该论文旨在解决现代基础模型在利用缩放律指导关键训练决策过程中存在的不一致性问题。论文指出,从较小的训练运行中推断最优架构和超参数设置时,不同研究间的结论存在差异,这归因于拟合过程中的多个因素,如所使用的具体方程、训练设置和优化方法等。论文的关键在于提出一个检查清单,以帮助作者在进行缩放律研究时考虑所有必要的细节,从而提高研究结果的可复现性和一致性。
链接: https://arxiv.org/abs/2502.18969
作者: Margaret Li,Sneha Kudugunta,Luke Zettlemoyer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
备注: 41 pages, 3 figure, first two authors contributed equally. ICLR, 2025
点击查看摘要
Abstract:Modern foundation models rely heavily on using scaling laws to guide crucial training decisions. Researchers often extrapolate the optimal architecture and hyperparameter settings from smaller training runs by describing the relationship between loss or task performance and scale. All components of this process vary, from the specific equation being fit, to the training setup, to the optimization method. Each of these factors may affect the fitted law, and therefore, the conclusions of a given study. We discuss discrepancies in the conclusions that several prior works reach, on questions such as the optimal token to parameter ratio. We augment this discussion with our own analysis of the critical impact that changes in specific details may have on a scaling study, and the resulting altered conclusions. Additionally, we survey over 50 papers that study scaling trends: while 45 of these papers quantify these trends using a power law, most under-report crucial details needed to reproduce their findings. To mitigate this, we propose a checklist for authors to consider while contributing to scaling law research.
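综述中统计的幂律拟合,最常见的形式是 L(N) = a·N^(−α) + c。下面用 scipy 给出一个极简拟合示意(数据为人造),也说明为何拟合细节(初值、参数化等)会影响结论:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

# 示意数据:不同参数规模 n 下的损失(人造数据,非任何论文的真实结果)
n = np.array([1e7, 1e8, 1e9, 1e10])
loss = power_law(n, a=400.0, alpha=0.3, c=1.7) + np.random.normal(0, 0.01, 4)

popt, _ = curve_fit(power_law, n, loss, p0=[100.0, 0.5, 1.0], maxfev=10000)
a, alpha, c = popt
print(f"fitted: L(N) = {a:.1f} * N^(-{alpha:.3f}) + {c:.2f}")
# 同一数据换一种参数化(如在对数空间拟合)或换初值,可能得到不同的最优解,
# 这正是论文强调需要完整报告拟合细节的原因。
```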
zh
[NLP-49] Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
【速读】: 该论文旨在解决现有用户模拟器仅依赖文本语句而忽略了隐含用户特征(如个性、说话风格和目标)的问题,以及基于人格的方法缺乏通用性的问题。论文的关键解决方案是提出了一种名为USP(User Simulator with implicit Profiles)的框架,通过从人机对话中推断隐含用户档案,并利用这些档案生成更个性化和真实的对话。该框架首先开发了一个基于大型语言模型(LLM)的提取器,并采用全面的档案模式;随后通过条件监督微调和循环一致性强化学习进行仿真优化;最后使用多样化的档案采样器来捕捉现实世界用户档案的分布。
链接: https://arxiv.org/abs/2502.18968
作者: Kuang Wang,Xianfei Li,Shenghao Yang,Li Zhou,Feng Jiang,Haizhou Li
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学,深圳); Shenzhen Research Institute of Big Data (深圳大数据研究院); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computation and Language (cs.CL)
备注: 9 pages
点击查看摘要
Abstract:User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, existing simulators often rely solely on text utterances, missing implicit user traits such as personality, speaking style, and goals. In contrast, persona-based methods lack generalizability, as they depend on predefined profiles of famous individuals or archetypes. To address these challenges, we propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations and uses them to generate more personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema. Then, we refine the simulation through conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing it at both the utterance and conversation levels. Finally, we adopt a diverse profile sampler to capture the distribution of real-world user profiles. Experimental results demonstrate that USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency. Furthermore, dynamic multi-turn evaluations based on USP strongly align with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
zh
[NLP-50] owards Label-Only Membership Inference Attack against Pre-trained Large Language Models USENIX-SECURITY2025
【速读】: 该论文旨在解决在仅能访问模型输出标签(即生成的文本)的情况下,如何有效进行成员推断攻击 (Membership Inference Attacks, MIA) 的问题。现有方法在此设置下效果有限,主要由于大型语言模型 (Large Language Models, LLMs) 在广泛预训练语料库上的训练导致其对成员与非成员样本的细微差异表现出了更好的泛化能力,并且基于标签的扰动过于粗糙而无法捕捉这些差异。论文的关键解决方案是提出了一种名为 PETAL 的新方法,它通过利用每个token级别的语义相似性来近似输出概率,并计算困惑度 (perplexity),从而推断出成员身份,认为成员样本因更好的记忆而具有更小的困惑度。实验结果表明,PETAL 方法在多个基准测试中优于现有的基于标签的方法,并且在某些指标上甚至可以媲美基于完整输出logits的先进攻击方法。
链接: https://arxiv.org/abs/2502.18943
作者: Yu He,Boheng Li,Liu Liu,Zhongjie Ba,Wei Dong,Yiming Li,Zhan Qin,Kui Ren,Chun Chen
机构: The State Key Laboratory of Blockchain and Data Security, Zhejiang University (浙江大学); College of Computing and Data Science, Nanyang Technological University (南洋理工大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Accepted by USENIX Security 2025
点击查看摘要
Abstract:Membership Inference Attacks (MIAs) aim to predict whether a data sample belongs to the model’s training set or not. Although prior research has extensively explored MIAs in Large Language Models (LLMs), they typically require access to complete output logits (i.e., logits-based attacks), which are usually not available in practice. In this paper, we study the vulnerability of pre-trained LLMs to MIAs in the label-only setting, where the adversary can only access generated tokens (text). We first reveal that existing label-only MIAs have minor effects in attacking pre-trained LLMs, although they are highly effective in inferring fine-tuning datasets used for personalized LLMs. We find that their failure stems from two main reasons, including better generalization and overly coarse perturbation. Specifically, due to the extensive pre-training corpora and exposing each sample only a few times, LLMs exhibit minimal robustness differences between members and non-members. This makes token-level perturbations too coarse to capture such differences. To alleviate these problems, we propose PETAL: a label-only membership inference attack based on PEr-Token semAntic simiLarity. Specifically, PETAL leverages token-level semantic similarity to approximate output probabilities and subsequently calculate the perplexity. It finally exposes membership based on the common assumption that members are ‘better’ memorized and have smaller perplexity. We conduct extensive experiments on the WikiMIA benchmark and the more challenging MIMIR benchmark. Empirically, our PETAL performs better than the extensions of existing label-only attacks against personalized LLMs and even on par with other advanced logit-based attacks across all metrics on five prevalent open-source LLMs.
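PETAL 的思路是在仅有生成文本的条件下,用词元级语义相似度近似输出概率,再计算困惑度。下面是一个按此思路写的假设性示意(相似度到概率的映射方式为简化假设,embed 为占位接口,具体做法以原论文为准):

```python
import numpy as np

def petal_perplexity(target_tokens, generated_tokens, embed):
    """PETAL 思想的假设性示意:
    对每个前缀让目标模型续写一步,取其生成词元与目标词元的语义相似度
    作为 p(目标词元 | 前缀) 的近似,再据此计算困惑度。

    target_tokens: 待测样本的词元序列
    generated_tokens: generated_tokens[t] 为模型在前缀 target_tokens[:t]
                      下实际生成的词元(需逐步查询模型获得)
    embed: 词元 -> 向量 的占位接口(假设由任意开源嵌入模型提供)
    """
    log_probs = []
    for tgt, gen in zip(target_tokens, generated_tokens):
        a, b = embed(tgt), embed(gen)
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        p = max((sim + 1) / 2, 1e-6)   # 把相似度 [-1,1] 映射到 (0,1](简化假设)
        log_probs.append(np.log(p))
    # 成员样本被“记得更牢”,近似困惑度更小
    return float(np.exp(-np.mean(log_probs)))
```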
zh
[NLP-51] MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
【速读】: 该论文旨在解决AI辅导模型教学能力评估缺乏可靠、易用且简便的基准工具的问题。关键解决方案在于提出了MathTutorBench,一个涵盖广泛教学能力的开源基准测试,包含数据集和度量标准,并通过训练奖励模型来评分开放式教师回应的质量,从而实现对辅导模型教学能力的全面评估。
链接: https://arxiv.org/abs/2502.18940
作者: Jakub Macina,Nico Daheim,Ido Hakimi,Manu Kapur,Iryna Gurevych,Mrinmaya Sachan
机构: Department of Computer Science, ETH Zurich(苏黎世联邦理工学院); ETH AI Center(苏黎世联邦理工学院AI中心); Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), TU Darmstadt(达姆施塔特工业大学; 普适知识处理实验室); Professorship for Learning Sciences and Higher Education, ETH Zurich(苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL
点击查看摘要
Abstract:Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.
zh
[NLP-52] JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models PAKDD2025
【速读】: 该论文旨在解决大型语言模型(LLMs)在安全性评估方面的不足,特别是针对中文表达的独特性和复杂性。现有的中文安全基准通常无法有效地揭示LLMs的安全漏洞。为了解决这一问题,论文提出了一套名为JailBench的综合中文基准,其关键在于引入了一个新颖的自动越狱提示工程师(AJPE)框架,用于构建基准测试集。AJPE框架结合越狱技术以增强评估效果,并利用LLMs通过上下文学习自动扩展数据集,从而更有效地识别LLMs中的潜在漏洞。
链接: https://arxiv.org/abs/2502.18935
作者: Shuyi Liu,Simiao Cui,Haoran Bu,Yuming Shang,Xi Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, accepted at PAKDD 2025
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessing effectiveness and leverages LLMs to automatically scale up the dataset through context-learning. The proposed JailBench is extensively evaluated over 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT compared to existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs, as well as illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at this https URL.
zh
[NLP-53] Kanana: Compute-efficient Bilingual Language Models
【速读】: 该论文旨在开发Kanana系列双语语言模型,以在韩语任务中实现卓越性能,在英语任务中达到竞争性水平。为了解决计算成本高的问题,论文提出的关键方案包括高质量数据过滤、分阶段预训练、深度扩展以及剪枝和蒸馏技术。这些方法共同实现了计算效率高且性能优异的语言模型。
链接: https://arxiv.org/abs/2502.18934
作者: Kanana LLM Team:Yunju Bak,Hojin Lee,Minho Ryu,Jiyeon Ham,Seungjae Jung,Daniel Wontae Nam,Taegyeong Eo,Donghun Lee,Doohae Jung,Boseop Kim,Nayeon Kim,Jaesun Park,Hyunho Kim,Hyunwoong Ko,Changmin Lee,Kyoung-Woon On,Seulye Baeg,Junrae Cho,Sunghee Jung,Jieun Kang,EungGyun Kim,Eunhwa Kim,Byeongil Ko,Daniel Lee,Minchul Lee,Miok Lee,Shinbok Lee,Gaeun Seo
机构: Kakao Corp. (卡卡奥公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 40 pages, 15 figures
点击查看摘要
Abstract:We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
zh
[NLP-54] END: Early Noise Dropping for Efficient and Effective Context Denoising
【速读】: 该论文旨在解决大型语言模型(LLMs)在处理输入序列时因无关或噪声信息导致输出质量下降的问题。这一问题在长、短上下文场景中均存在,如检索增强生成、表格问答及上下文学习等任务。论文的关键解决方案是Early Noise Dropping (END),它通过将输入序列分割成片段,并在LLMs的早期层使用线性探测器来区分信息片段与噪声片段,从而在处理过程中提前丢弃噪声片段,保留关键信息,减少干扰并降低计算开销。
链接: https://arxiv.org/abs/2502.18915
作者: Hongye Jin,Pei Chen,Jingfeng Yang,Zhengyang Wang,Meng Jiang,Yifan Gao,Binxuan Huang,Xinyang Zhang,Zheng Li,Tianyi Liu,Huasheng Li,Bing Yin
机构: Amazon(亚马逊); Texas A&M Univeristy(德克萨斯A&M大学); University of Notre Dame(圣母大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs’ implicit understanding of the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.
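其关键组件“早期层线性探测器”可以用 PyTorch 简要示意如下(探测层位置、池化方式与阈值均为假设值,仅演示结构):

```python
import torch
import torch.nn as nn

class ChunkProber(nn.Module):
    """在 LLM 早期层隐藏状态上区分“信息块/噪声块”的线性探测器(示意)。"""
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, chunk_hidden):          # [n_chunks, seq, hidden]
        pooled = chunk_hidden.mean(dim=1)     # 对块内词元做平均池化(假设做法)
        return torch.sigmoid(self.linear(pooled)).squeeze(-1)

def drop_noisy_chunks(chunks, early_hidden, prober, tau=0.5):
    """early_hidden: 某一早期层(如第 4 层,假设值)对各块的隐藏状态。"""
    scores = prober(early_hidden)
    return [c for c, s in zip(chunks, scores.tolist()) if s >= tau]

prober = ChunkProber(hidden_size=4096)
kept = drop_noisy_chunks(["chunk A", "chunk B"], torch.randn(2, 128, 4096), prober)
```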
zh
[NLP-55] CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
【速读】: 该论文旨在解决自动语音识别(Automatic Speech Recognition, ASR)系统在处理汉英代码转换(Code-switching, CS)对话中的挑战。现有汉英代码转换数据集通常存在规模小、缺乏自发性以及缺少完整对话录音及其转录的问题,这阻碍了实际会话场景下稳健ASR模型的发展。论文的关键解决方案是引入了一个名为CS-Dialogue的新大规模汉英代码转换语音数据集,包含来自200名发言者的104小时自发对话,并提供完整的对话录音及转录,以捕捉连续语音中的自然代码转换模式。
链接: https://arxiv.org/abs/2502.18913
作者: Jiaming Zhou,Yujie Guo,Shiwan Zhao,Haoqin Sun,Hui Wang,Jiabei He,Aobo Kong,Shiyao Wang,Xi Yang,Yequan Wang,Yonghua Lin,Yong Qin
机构: College of Computer Science, Nankai University (南开大学计算机学院); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and establish benchmark ASR performance using state-of-the-art models. Our experiments, using Transformer, Conformer, and Branchformer, demonstrate the challenges of code-switching ASR, and show that existing pre-trained models such as Whisper still have the space to improve. The CS-Dialogue dataset will be made freely available for all academic purposes.
zh
[NLP-56] From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
【速读】: 该论文旨在解决使用大型语言模型(Large Language Models, LLMs)生成超长序列(高达100K tokens)时效率低下的问题。主要挑战包括频繁的模型重载、动态键值管理以及重复生成。论文的关键解决方案是引入TOKENSWIFT框架,该框架通过有效应对上述挑战,显著加速超长序列的生成过程,同时保持目标模型的质量。实验结果表明,TOKENSWIFT在不同规模(1.5B, 7B, 8B, 14B参数)和架构(多头注意力机制MHA, 分组查询注意力机制GQA)的模型上实现了超过3倍的速度提升。
链接: https://arxiv.org/abs/2502.18890
作者: Tong Wu,Junzhe Shen,Zixia Jia,Yuxuan Wang,Zilong Zheng
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model’s inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at this https URL.
zh
[NLP-57] Clip-TTS: Contrastive Text-content and Mel-spectrogram A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
【速读】: 该论文旨在解决传统文本到语音(Text-to-Speech, TTS)方法在音素编码阶段缺乏真实梅尔频谱图辅助信息的问题,这导致编码过程缺乏真正的语义理解。同时,传统TTS系统难以平衡模型的推理速度与合成语音的质量。为解决这些问题,论文提出了一种基于Clip架构的TTS方法(Clip-TTS)。该方法在文本编码阶段通过Clip框架连接文本内容与真实的梅尔频谱图,使文本编码器能够直接学习全局上下文的真实语义,从而确保合成语音的质量。关键在于采用Transformer的基本结构,以实现快速的推理速度。实验结果显示,Clip-TTS在LJSpeech和Baker数据集上达到了最先进的MOS评分,并且在多情感样本上表现优异。
链接: https://arxiv.org/abs/2502.18889
作者: Tianyun Liu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Samples are available at: this https URL.
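Clip 式的文本-梅尔频谱图对齐通常采用对称的 InfoNCE 对比损失。下面给出该损失的标准实现示意(编码器部分省略;是否与 Clip-TTS 的具体损失完全一致以原文为准):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, mel_emb, temperature=0.07):
    """text_emb/mel_emb: [B, d],同一下标为匹配的 文本-梅尔谱 对。"""
    t = F.normalize(text_emb, dim=-1)
    m = F.normalize(mel_emb, dim=-1)
    logits = t @ m.T / temperature            # [B, B] 相似度矩阵
    labels = torch.arange(t.size(0), device=t.device)  # 对角线为正样本
    # 对称交叉熵:文本->梅尔谱 与 梅尔谱->文本 两个方向
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

loss = clip_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```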
zh
[NLP-58] On Pruning State-Space LLM s
【速读】: 该论文旨在探究状态空间模型(State-Space Models, SSMs)在经过剪枝处理后是否能进一步降低其计算成本。关键在于适应并应用多种剪枝方法于四个基于SSM的语言模型(LLMs)上,从而评估这些模型对不同剪枝技术的鲁棒性。研究发现,某些剪枝方法如WANDA,可以使模型保持较好的性能,而其他方法则会导致性能快速下降。
链接: https://arxiv.org/abs/2502.18886
作者: Tamer Ghattas,Michael Hassid,Roy Schwartz
机构: The Hebrew University of Jerusalem (希伯来大学耶路撒冷)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent work proposed state-space models (SSMs) as an efficient alternative to transformer-based LLMs. Can these models be pruned to further reduce their computation costs? We adapt several pruning methods to the SSM structure, and apply them to four SSM-based LLMs across multiple tasks. We find that such models are quite robust to some pruning methods (e.g. WANDA), while using other methods lead to fast performance degradation.
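文中对 SSM 较为鲁棒的 WANDA 剪枝,其重要性分数为 |权重| × 对应输入通道激活的 L2 范数,并在每个输出行内比较、剪去低分权重。下面是该度量的极简 PyTorch 示意:

```python
import torch

def wanda_prune(weight, act_norm, sparsity=0.5):
    """WANDA 式剪枝示意:score = |W| * ||X||(按输入通道),逐行置零低分权重。

    weight: [out, in] 线性层权重
    act_norm: [in] 校准数据上每个输入通道激活的 L2 范数
    """
    score = weight.abs() * act_norm.unsqueeze(0)       # [out, in]
    k = int(weight.size(1) * sparsity)
    # 每个输出行内找出得分最低的 k 个位置并置零
    idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask

pruned = wanda_prune(torch.randn(8, 16), torch.rand(16), sparsity=0.5)
```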
zh
[NLP-59] Learning to Generate Structured Output with Schema Reinforcement Learning
【速读】: 该论文旨在解决大型语言模型(LLMs)在根据给定模式生成有效JSON输出方面的结构化生成能力不足的问题。论文的关键解决方案在于提出SchemaBench,包含约40,000种不同的JSON模式,用于评估和提升模型生成有效JSON的能力。此外,通过结合强化学习与细粒度模式验证器(Fine-grained Schema Validator),进一步增强了模型对JSON模式的理解,从而显著提升了模型生成JSON输出及下游任务的表现。
链接: https://arxiv.org/abs/2502.18878
作者: Yaxi Lu,Haolun Li,Xin Cong,Zhong Zhang,Yesai Wu,Yankai Lin,Zhiyuan Liu,Fangming Liu,Maosong Sun
机构: Department of Computer Science and Technology, Tsinghua University (清华大学); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures
点击查看摘要
Abstract:This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas to obtain and assess models’ abilities in generating valid JSON. We find that the latest LLMs are still struggling to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models’ understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
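围绕“细粒度模式验证器”的思想,下面给出一个把模式校验结果折算为强化学习奖励的最小示意(基于 jsonschema 库;奖励折算方式为假设,论文验证器的粒度可能更细):

```python
import json
from jsonschema import Draft7Validator

def schema_reward(output_text, schema):
    """细粒度奖励示意:解析失败给最低奖励,否则按模式违例条数递减(折算方式为假设)。"""
    try:
        instance = json.loads(output_text)
    except json.JSONDecodeError:
        return -1.0                                   # 连合法 JSON 都不是
    errors = list(Draft7Validator(schema).iter_errors(instance))
    return 1.0 if not errors else max(0.0, 1.0 - 0.2 * len(errors))
```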
zh
[NLP-60] Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
【速读】: 该论文旨在解决现有方法在自动化评估大型语言模型(Large Language Models, LLMs)时存在的局限性,这些方法主要依赖于预定义的一般标准进行文本分析,导致其在处理未见过的指令时适应性较差,并且在评估量化和结构约束方面的稳定性不足。为了解决这些问题,论文提出了一种名为ARJudge的新颖评估框架,关键在于其自适应地制定评估标准,并结合文本分析和基于代码的分析来评估LLM响应。ARJudge包含两个组件:一个经过微调的分析器(Analyzer),用于生成多维度的评估分析;以及一个无需调整参数的精炼器(Refiner),用于整合和优化所有分析以做出最终判断。通过构建一个综合分析语料库(Composite Analysis Corpus),该框架能够训练分析器以生成评价标准、文本分析和基于代码的分析。研究结果表明,ARJudge在有效性和鲁棒性方面优于现有的微调评估器,并强调了多维度评估和基于代码的分析对于提升评估能力的重要性。
链接: https://arxiv.org/abs/2502.18874
作者: Kaishuai Xu,Tiezheng Yu,Wenjun Hou,Yi Cheng,Liangyou Li,Xin Jiang,Lifeng Shang,Qun Liu,Wenjie Li
机构: The Hong Kong Polytechnic University; Huawei Noah’s Ark Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
zh
[NLP-61] Multi-LLM Collaborative Search for Complex Problem Solving
【速读】: 该论文旨在解决大型语言模型(LLMs)在复杂推理任务中的局限性,这些局限性主要源于其处理广泛的推理空间和自然语言固有的模糊性的困难。论文的关键解决方案是提出了Mixture-of-Search-Agents (MoSA)范式,通过结合多个LLMs的集体专业知识来增强基于搜索的推理。MoSA通过将独立探索与LLMs之间的迭代精炼相结合,整合多样的推理路径,从而克服单一模型方法的局限性。使用蒙特卡洛树搜索(MCTS)作为基础,MoSA使多个代理能够提出和聚合推理步骤,从而提高准确性。
链接: https://arxiv.org/abs/2502.18873
作者: Sen Yang,Yafu Li,Wai Lam,Yu Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) often struggle with complex reasoning tasks due to their limitations in addressing the vast reasoning space and inherent ambiguities of natural language. We propose the Mixture-of-Search-Agents (MoSA) paradigm, a novel approach leveraging the collective expertise of multiple LLMs to enhance search-based reasoning. MoSA integrates diverse reasoning pathways by combining independent exploration with iterative refinement among LLMs, mitigating the limitations of single-model approaches. Using Monte Carlo Tree Search (MCTS) as a backbone, MoSA enables multiple agents to propose and aggregate reasoning steps, resulting in improved accuracy. Our comprehensive evaluation across four reasoning benchmarks demonstrates MoSA’s consistent performance improvements over single-agent and other multi-agent baselines, particularly in complex mathematical and commonsense reasoning tasks.
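下面是“多LLM提议 + MCTS聚合”思路的一个精简示意(proposers、evaluate 等接口均为假设的占位函数,UCT常数等超参数亦为示意),仅用于说明多个代理如何在同一棵搜索树上提议并汇聚推理步:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    if node.visits == 0:
        return float("inf")  # 未访问的子节点优先探索
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mosa_search(question, proposers, evaluate, iters=50):
    """多LLM提议 + MCTS 聚合的精简示意。
    proposers: 若干 LLM 调用(state -> 下一步推理文本)的列表(假设接口);
    evaluate: 对推理路径打分(0~1)的函数(假设接口)。"""
    root = Node(question)
    root.visits = 1  # 便于首次计算 UCT
    for _ in range(iters):
        node = root
        while node.children:                 # 选择:沿 UCT 下行到叶子
            node = max(node.children, key=uct)
        for propose in proposers:            # 扩展:每个代理各提议一个候选推理步
            node.children.append(Node(node.state + "\n" + propose(node.state), parent=node))
        leaf = random.choice(node.children)  # 模拟:随机评估一个新子节点
        reward = evaluate(leaf.state)
        while leaf is not None:              # 回传
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state
```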
zh
[NLP-62] Towards an AI co-scientist
【速读】: 该论文旨在通过引入AI共科学家系统,增强科学假设生成与验证的过程。论文的关键解决方案在于设计了一个基于Gemini 2.0的多智能体系统,采用生成、辩论和进化的方法来产生新的研究假设。系统的关键创新包括:(1) 多智能体架构与异步任务执行框架,以实现灵活的计算扩展;(2) 比赛式的进化过程,用于自我改进的假设生成。这些方法显著提升了假设的质量,并在药物重定位、新靶点发现以及细菌进化和抗菌素耐药性机制解释等三个生物医学领域展示了其潜力。
链接: https://arxiv.org/abs/2502.18864
作者: Juraj Gottweis,Wei-Hung Weng,Alexander Daryin,Tao Tu,Anil Palepu,Petar Sirkovic,Artiom Myaskovsky,Felix Weissenberger,Keran Rong,Ryutaro Tanno,Khaled Saab,Dan Popovici,Jacob Blum,Fan Zhang,Katherine Chou,Avinatan Hassidim,Burak Gokturk,Amin Vahdat,Pushmeet Kohli,Yossi Matias,Andrew Carroll,Kavita Kulkarni,Nenad Tomasev,Yuan Guan,Vikram Dhillon,Eeshit Dhaval Vaishnav,Byron Lee,Tiago R D Costa,José R Penadés,Gary Peltz,Yunhan Xu,Annalisa Pawlosky,Alan Karthikesalingam,Vivek Natarajan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Other Quantitative Biology (q-bio.OT)
备注: 81 pages in total (main 38 pages, appendix 43 pages), 13 main figures, 40 appendix figures, 1 main table, 2 appendix tables, 143 main references, 7 appendix references
点击查看摘要
Abstract:Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system’s design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.
zh
[NLP-63] Exploring Rewriting Approaches for Different Conversational Tasks
【速读】: 该论文旨在解决如何通过重写或融合用户提问来提高对话助理在不同生成任务中的表现。关键在于确定最适合特定应用场景的重写(rewriting)或融合(fusion)方法。对于基于文本的问答助手,查询重写方法更为有效;而对于基于数据生成可视化和表格的数据分析助手,查询融合方法则更佳。
链接: https://arxiv.org/abs/2502.18860
作者: Md Mehrab Tanjim,Ryan A. Rossi,Mike Rimer,Xiang Chen,Sungchul Kim,Vaishnavi Muppala,Tong Yu,Zhengmian Hu,Ritwik Sinha,Wei Zhang,Iftikhar Ahamath Burhanuddin,Franck Dernoncourt
机构: Adobe Research (Adobe 研究)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Conversational assistants often require a question rewriting algorithm that leverages a subset of past interactions to provide a more meaningful (accurate) answer to the user’s question or request. However, the exact rewriting approach may often depend on the use case and application-specific tasks supported by the conversational assistant, among other constraints. In this paper, we systematically investigate two different approaches, denoted as rewriting and fusion, on two fundamentally different generation tasks, including a text-to-text generation task and a multimodal generative task that takes as input text and generates a visualization or data table that answers the user’s question. Our results indicate that the specific rewriting or fusion approach highly depends on the underlying use case and generative task. In particular, we find that for a conversational question-answering assistant, the query rewriting approach performs best, whereas for a data analysis assistant that generates visualizations and data tables based on the user’s conversation with the assistant, the fusion approach works best. Notably, we explore two datasets for the data analysis assistant use case, for short and long conversations, and we find that query fusion always performs better, whereas for the conversational text-based question-answering, the query rewrite approach performs best.
zh
[NLP-64] A Causal Lens for Evaluating Faithfulness Metrics
【速读】: 该论文旨在解决大型语言模型(LLMs)自然语言解释在模型可解释性中的可信度评估问题。现有方法虽然提供了看似合理的解释,但可能未能忠实反映模型内部的真实推理过程。为了解决这一问题,论文提出了因果诊断性(Causal Diagnosticity)框架,通过模型编辑方法生成忠实与不忠实解释对,以评估多种可信性度量方法的有效性。关键在于引入因果诊断性概念,并构建统一的评估框架来衡量自然语言解释的忠实度。
链接: https://arxiv.org/abs/2502.18848
作者: Kerem Zaman,Shashank Srivastava
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注: 18 pages, 18 figures, 6 tables
点击查看摘要
Abstract:Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model’s internal reasoning faithfully, which is crucial for understanding the model’s true decision-making processes. Although several faithfulness metrics have been proposed, a unified evaluation framework remains absent. To address this gap, we present Causal Diagnosticity, a framework to evaluate faithfulness metrics for natural language explanations. Our framework employs the concept of causal diagnosticity, and uses model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate a variety of faithfulness metrics, including post-hoc explanation and chain-of-thought-based methods. We find that all tested faithfulness metrics often fail to surpass a random baseline. Our work underscores the need for improved metrics and more reliable interpretability methods in LLMs.
zh
[NLP-65] Sliding Window Attention Training for Efficient Large Language Models
【速读】: 该论文旨在解决基于Transformer的大规模语言模型(LLMs)在处理长文档时,由于其与序列长度呈二次增长的计算复杂性导致的效率瓶颈问题。为了解决这一问题,论文提出了一种名为SWAT的新方法,关键在于通过滑动窗口注意力训练(Sliding Window Attention Training)实现高效长上下文处理。具体而言,SWAT通过将softmax函数替换为sigmoid函数,并结合平衡的ALiBi和Rotary位置嵌入来实现有效的信息压缩和保留。实验结果表明,SWAT在八个基准测试中达到了最先进的线性递归架构的表现。
链接: https://arxiv.org/abs/2502.18845
作者: Zichuan Fu,Wentao Song,Yejing Wang,Xian Wu,Yefeng Zheng,Yingying Zhang,Derong Xu,Xuetao Wei,Tong Xu,Xiangyu Zhao
机构: City University of Hong Kong; Xi’an Jiaotong University; Jarvis Research Center, Tencent YouTu Lab; University of Science and Technology of China; Southern University of Science and Technology; Westlake University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 5 figures
点击查看摘要
Abstract:Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks. Code is available at this https URL.
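下面用一个非官方的最小示意说明SWAT的核心改动:以sigmoid取代softmax,并叠加ALiBi式线性距离偏置与滑动窗口因果掩码(RoPE等细节从略;window、alibi_slope 均为示意参数):

```python
import torch

def swat_attention(q, k, v, window=256, alibi_slope=0.05):
    """SWAT 核心改动的非官方示意:sigmoid 权重 + ALiBi 距离偏置 + 滑动窗口因果掩码。
    q, k, v: (batch, seq, dim)。"""
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (b, n, n) 相似度
    pos = torch.arange(n, device=q.device)
    dist = pos[:, None] - pos[None, :]                   # 查询位置 - 键位置
    scores = scores - alibi_slope * dist.clamp(min=0)    # ALiBi:距离越远惩罚越大
    mask = (dist >= 0) & (dist < window)                 # 因果掩码 + 窗口内
    attn = torch.sigmoid(scores) * mask                  # 用 sigmoid 取代 softmax,窗口外置零
    return attn @ v
```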
zh
[NLP-66] Sentiment Analysis of Movie Reviews Using BERT
【速读】: 该论文旨在通过微调Bidirectional Encoder Representations from Transformers (BERT)与双向长短时记忆网络(BiLSTM)来提升电影评论情感分析的准确性,超越当前最先进方法。其关键在于结合BERT与BiLSTM模型,并应用启发式算法计算情感的整体极性,以预测观众对电影的主要反应。
链接: https://arxiv.org/abs/2502.18841
作者: Gibson Nkhata,Usman Anjum,Justin Zhan
机构: University of Arkansas(阿肯色大学); University of Cincinnati(辛辛那提大学)
类目: Computation and Language (cs.CL)
备注: 7 pages, 3 figures, published in the proceedings The Fifteenth International Conference on Information, Process, and Knowledge Management (eKNOW 2023)
点击查看摘要
Abstract:Sentiment Analysis (SA), or opinion mining, is the analysis of emotions and opinions from any kind of text. SA helps in tracking people's viewpoints and is an important factor in social media monitoring, product and brand recognition, customer satisfaction, customer loyalty, advertising and promotion success, and product acceptance. That is why SA is one of the active research areas in Natural Language Processing (NLP). SA is applied to data sourced from various media platforms to mine sentiment knowledge from them. Various approaches have been deployed in the literature to solve the problem; most techniques devise complex and sophisticated frameworks in order to attain optimal accuracy. This work aims to fine-tune Bidirectional Encoder Representations from Transformers (BERT) with Bidirectional Long Short-Term Memory (BiLSTM) for movie review sentiment analysis while still providing better accuracy than the State-of-The-Art (SOTA) methods. The paper also shows how sentiment analysis can be applied if someone wants to recommend a certain movie, for example, by computing the overall polarity of its sentiments as predicted by the model. That is, our proposed method serves as an upper-bound baseline in predicting the predominant reaction to a movie. To compute the overall polarity, a heuristic algorithm is applied to the BERT-BiLSTM output vector. Our model can be extended to three-class, four-class, or any fine-grained classification, with the overall polarity computation applied again. This is intended to be exploited in future work.
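关于摘要提到的整体极性启发式,下面给出一个假设性的最小示意(假设模型对每条影评输出正面概率;论文的具体启发式算法以原文为准):

```python
def overall_polarity(review_probs, threshold=0.5):
    """假设性的整体极性启发式:对一部电影全部影评的正面概率取均值,再按阈值判定倾向。"""
    avg = sum(review_probs) / len(review_probs)
    return ("positive" if avg >= threshold else "negative"), avg

# 用法示意(数值为虚构)
label, score = overall_polarity([0.92, 0.81, 0.35, 0.77])  # -> ("positive", 0.7125)
```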
zh
[NLP-67] Evidence-Driven Marker Extraction for Social Media Suicide Risk Detection
【速读】: 该论文旨在解决通过社交媒体文本早期检测自杀风险时面临的挑战,特别是在提高大型语言模型(Large Language Models, LLMs)的可解释性和计算效率方面。论文提出的关键解决方案是Evidence-Driven LLM (ED-LLM),这是一种基于多任务学习框架的方法,采用Mistral-7B模型同时进行临床标记提取和自杀风险分类。这种证据驱动策略通过明确突出支持风险评估的文本证据,增强了模型的可解释性。
链接: https://arxiv.org/abs/2502.18823
作者: Carter Adams,Caleb Carter,Jackson Simmons
机构: Federal University of Bahia
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Early detection of suicide risk from social media text is crucial for timely intervention. While Large Language Models (LLMs) offer promising capabilities in this domain, challenges remain in terms of interpretability and computational efficiency. This paper introduces Evidence-Driven LLM (ED-LLM), a novel approach for clinical marker extraction and suicide risk classification. ED-LLM employs a multi-task learning framework, jointly training a Mistral-7B based model to identify clinical marker spans and classify suicide risk levels. This evidence-driven strategy enhances interpretability by explicitly highlighting textual evidence supporting risk assessments. Evaluated on the CLPsych datasets, ED-LLM demonstrates competitive performance in risk classification and superior capability in clinical marker span identification compared to baselines including fine-tuned LLMs, traditional machine learning, and prompt-based methods. The results highlight the effectiveness of multi-task learning for interpretable and efficient LLM-based suicide risk assessment, paving the way for clinically relevant applications.
zh
[NLP-68] Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
【速读】: 该论文旨在解决现有自动化评估指标无法公平评价 Retrieval-Augmented Generation (RAG) 模型在训练和评估过程中生成输出的问题。论文的关键解决方案是提出Judge-Consistency (ConsJudge) 方法,通过引导大型语言模型 (LLMs) 根据不同的判断维度生成多样化的评价,并利用这些评价的一致性来选择接受或拒绝的评价,从而用于DPO训练,以提高对RAG模型优化的准确性。
链接: https://arxiv.org/abs/2502.18817
作者: Shuliang Liu,Xinze Li,Zhenghao Liu,Yukun Yan,Cheng Yang,Zheni Zeng,Zhiyuan Liu,Maosong Sun,Ge Yu
机构: Department of Computer Science and Technology, Northeastern University, China (东北大学计算机科学与技术系,中国); Department of Computer Science and Technology, Institute for AI, Tsinghua University, China (清华大学计算机科学与技术系人工智能研究所,中国); Beijing National Research Center for Information Science and Technology, China (北京信息科学技术国家研究中心,中国); School of Computer Science, Beijing University of Posts and Telecommunications (北京邮电大学计算机学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilize the judge-consistency to evaluate these judgments and select the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All codes are available at this https URL.
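下面给出ConsJudge“多维度评判 + 一致性筛选”流程的最小示意(judge 接口与维度组合方式均为假设),说明如何用多数一致性挑选可用于DPO训练的被接受/被拒绝评判:

```python
from collections import Counter
from itertools import combinations

def consjudge_select(judge, output, dimensions, k=2):
    """多维度评判 + 一致性筛选示意(judge 为假设接口,返回离散评判结果):
    在不同评价维度组合下分别评判,以多数一致性划分被接受与被拒绝的评判。"""
    judgments = [judge(output, dims) for dims in combinations(dimensions, k)]
    majority, _ = Counter(judgments).most_common(1)[0]
    accepted = [j for j in judgments if j == majority]   # 与多数一致:被接受
    rejected = [j for j in judgments if j != majority]   # 与多数相悖:被拒绝
    return accepted, rejected
```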
zh
[NLP-69] Holistic Audit Dataset Generation for LLM Unlearning via Knowledge Graph Traversal and Redundancy Removal
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在删除敏感信息、保护隐私及遵守版权法规方面的有效性评估问题。现有基准测试规模有限且不够全面,通常仅包含数百个测试案例。论文的关键解决方案是提出了HANKER框架,这是一个利用知识图谱自动生成全面审计数据集的自动化框架,以实现细粒度覆盖并消除冗余知识,从而更准确地评估模型的无学习效果。
链接: https://arxiv.org/abs/2502.18810
作者: Weipeng Jiang,Juan Zhai,Shiqing Ma,Ziyan Lei,Xiaofei Xie,Yige Wang,Chao Shen
机构: Xi’an Jiaotong University(西安交通大学); University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Singapore Management University(新加坡管理大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 4 figures
点击查看摘要
Abstract:In recent years, Large Language Models (LLMs) have faced increasing demands to selectively remove sensitive information, protect privacy, and comply with copyright regulations through Machine Unlearning. While evaluating unlearning effectiveness is crucial, existing benchmarks are limited in scale and comprehensiveness, typically containing only a few hundred test cases. We identify two critical challenges in generating holistic audit datasets: ensuring audit adequacy and handling knowledge redundancy between the forget and retain datasets. To address these challenges, we propose HANKER, an automated framework for holistic audit dataset generation leveraging knowledge graphs to achieve fine-grained coverage and eliminate redundant knowledge. Applying HANKER to the popular MUSE benchmark, we successfully generated over 69,000 and 111,000 audit cases for the News and Books datasets respectively, identifying thousands of knowledge memorization instances that the previous benchmark failed to detect. Our empirical analysis uncovers how knowledge redundancy significantly skews unlearning effectiveness metrics, with redundant instances artificially inflating the observed memorization measurements (ROUGE from 19.7% to 26.1%; Entailment Scores from 32.4% to 35.2%), highlighting the necessity of systematic deduplication for accurate assessment.
zh
[NLP-70] Language Models Grow Less Humanlike beyond Phase Transition
【速读】: 该论文旨在探究生成式语言模型(Language Models, LMs)在预训练过程中与人类阅读行为(即心理计量预测能力,Psychometric Predictive Power, PPP)之间的关联性提升至某一临界点后停滞或退化的原因。研究的关键在于假设存在一个预训练相变阶段,这一阶段特征为专门化注意力头的迅速涌现。通过一系列的相关性和因果性实验,论文证明了这种相变是导致PPP临界点现象的关键因素,并进一步表明这一相变改变了模型后续的学习动态,使得进一步训练反而会损害PPP。
链接: https://arxiv.org/abs/2502.18802
作者: Tatsuya Aoyama,Ethan Wilcox
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:LMs’ alignment with human reading behavior (i.e. psychometric predictive power; PPP) is known to improve during pretraining up to a tipping point, beyond which it either plateaus or degrades. Various factors, such as word frequency, recency bias in attention, and context size, have been theorized to affect PPP, yet there is no current account that explains why such a tipping point exists, and how it interacts with LMs’ pretraining dynamics more generally. We hypothesize that the underlying factor is a pretraining phase transition, characterized by the rapid emergence of specialized attention heads. We conduct a series of correlational and causal experiments to show that such a phase transition is responsible for the tipping point in PPP. We then show that, rather than producing attention patterns that contribute to the degradation in PPP, phase transitions alter the subsequent learning dynamics of the model, such that further training keeps damaging PPP.
zh
[NLP-71] ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions
【速读】: 该论文旨在解决多选基准测试中存在的问题,即通过计算 P(Choice|Prompt) 来评估语言模型理解能力时,模型的表现不仅反映了其对提示的理解,还反映了其对某些选项的固有偏好。为了解决这一局限性,论文提出了一种新的度量标准 ANPMI (Adjusted Normalized Pointwise Mutual Information),通过以 -log P(Choice) 对点互信息 (Pointwise Mutual Information, PMI) 进行归一化来克服上述偏差,从而更准确地评估模型的语言理解能力。关键在于 ANPMI 能确保模型在不充分理解提示的情况下难以正确作答。
链接: https://arxiv.org/abs/2502.18798
作者: Gyeongje Cho,Yeonkyoung So,Jaejin Lee
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multiple-choice benchmarks, consisting of various prompts and choices, are among the most widely used methods to assess a language model’s natural language understanding capability. Given a specific prompt, we typically compute P(Choice|Prompt) to evaluate how likely a language model is to generate the correct choice compared to incorrect ones. However, we observe that performance measured using this approach reflects not only the model’s comprehension of the prompt but also its inherent biases for certain choices regardless of the prompt. This issue makes it challenging to accurately measure a model’s natural language understanding, as models may select the answer without fully understanding the prompt. To address this limitation, we propose a novel metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by -\log P(Choice) . ANPMI provides a more accurate assessment of the model’s natural language understanding by ensuring that it is challenging to answer a question without properly understanding the prompt.
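按摘要给出的定义,ANPMI 可以直接由两个对数概率计算得到;下面是一个最小实现(对数概率可由语言模型的逐词 log-prob 求和得到,示例数值为虚构):

```python
import math

def anpmi(log_p_choice_given_prompt, log_p_choice):
    """ANPMI = PMI / (-log P(Choice)),其中
    PMI = log P(Choice|Prompt) - log P(Choice);
    两个入参均为自然对数概率。"""
    pmi = log_p_choice_given_prompt - log_p_choice
    return pmi / (-log_p_choice)

# 示例(数值为虚构):P(Choice|Prompt)=0.4,P(Choice)=0.01
score = anpmi(math.log(0.4), math.log(0.01))  # ≈ 0.80
```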
zh
[NLP-72] Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
【速读】: 该论文旨在探究大型语言模型(LLMs)是否能够提供有关人类语言学习的见解。研究的关键在于通过训练模型来区分实际存在的语言与不可能的语言,以及验证这些模型是否展现出类似人类的归纳偏置。实验结果显示,虽然GPT-2小型模型能够在一定程度上区分实际存在的语言与其不可能的对应物,但未能实现完全的分离。进一步测试表明,当未被语言类型学证实的词序保持句法结构时,模型的困惑度分数无法区分实际存在的词序与未被证实的词序。这些发现表明,语言模型确实表现出一些类似人类的归纳偏置,但这些偏置较人类学习者更为薄弱。
链接: https://arxiv.org/abs/2502.18795
作者: Xiulin Yang,Tatsuya Aoyama,Yuekun Yao,Ethan Wilcox
机构: Georgetown University (乔治城大学); Saarland University (萨尔兰大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Do LLMs offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LLMs can learn arbitrary inputs as easily as natural languages. In this paper, we test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 natural languages from 4 language families. Our results show that while GPT-2 small can primarily distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg’s Universal 20. We find that the model’s perplexity scores do not distinguish attested vs. unattested word orders, as long as the unattested variants maintain constituency structure. These findings suggest that language models exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.
zh
[NLP-73] Seeing the Forest for the Trees: A Large Scale Continuously Updating Meta-Analysis of Frontier LLMs
【速读】: 该论文旨在解决大规模语言模型(LLMs)研究结果综合分析的挑战,提出的关键解决方案是一种半自动化的元分析方法。这种方法利用大型语言模型(LLMs)加速数据提取过程,通过自动识别相关论文、提取实验结果及相关属性,并将其组织成结构化数据集,从而显著减少了文献调研和数据提取的工作量,相比传统手动方法降低了超过93%的努力。
链接: https://arxiv.org/abs/2502.18791
作者: Jungsoo Park,Junmo Kang,Gabriel Stanovsky,Alan Ritter
机构: Georgia Institute of Technology(乔治亚理工学院); The Hebrew University of Jerusalem(耶路撒冷希伯来大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 9 figures
点击查看摘要
Abstract:The surge of LLM studies makes synthesizing their findings challenging. Meta-analysis can uncover important trends across studies, but its use is limited by the time-consuming nature of manual data extraction. Our study presents a semi-automated approach for meta-analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset. We conduct a comprehensive meta-analysis of frontier LLMs using an automatically extracted dataset, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches. We validate our dataset by showing that it reproduces key findings from a recent manual meta-analysis about Chain-of-Thought (CoT), and also uncovers new insights that go beyond it, showing for example that in-context examples benefit multimodal tasks but offer limited gains in mathematical tasks compared to CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through our scientific artifacts and empirical analysis, we provide novel insights into LLMs while facilitating ongoing meta-analyses of their behavior.
zh
[NLP-74] Active Few-Shot Learning for Text Classification NAACL2025
【速读】: 该论文旨在解决Few-Shot Learning (FSL) 方法在使用有限标注样本时性能下降的问题,特别是在选择不当的支持样本时。论文的关键解决方案是一种基于主动学习的实例选择机制,该机制能够从未标注的数据池中识别出有效支持实例,并且可以与不同的大型语言模型 (LLMs) 配合使用。实验结果表明,该方法能够显著提升FSL的性能。
链接: https://arxiv.org/abs/2502.18782
作者: Saeed Ahmadnia,Arash Yousefi Jordehi,Mahsa Hosseini Khasheh Heyran,Seyed Abolghasem Mirroshandel,Owen Rambow,Cornelia Caragea
机构: University of Illinois Chicago(芝加哥伊利诺伊大学); University of Guilan(吉兰大学); Stony Brook University(石溪大学)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025 Main Conference; 18 pages, 8 figures, 13 tables including Appendix
点击查看摘要
Abstract:The rise of Large Language Models (LLMs) has boosted the use of Few-Shot Learning (FSL) methods in natural language processing, achieving acceptable performance even when working with limited training data. The goal of FSL is to effectively utilize a small number of annotated samples in the learning process. However, the performance of FSL suffers when unsuitable support samples are chosen. This problem arises due to the heavy reliance on a limited number of support samples, which hampers consistent performance improvement even when more support samples are added. To address this challenge, we propose an active learning-based instance selection mechanism that identifies effective support instances from the unlabeled pool and can work with different LLMs. Our experiments on five tasks show that our method frequently improves the performance of FSL. We make our implementation available on GitHub.
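论文摘要并未给出具体的选择准则,下面仅以常见的“预测熵不确定性采样”为例,给出从未标注池中挑选支持样本的一个假设性示意:

```python
import math

def select_support(unlabeled_pool, predict_probs, budget=8):
    """假设性示意:以预测分布熵作为不确定性分数,优先挑选模型最不确定的未标注样本。
    predict_probs: 样本 -> 各类别概率列表(假设接口)。"""
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)
    scored = sorted(unlabeled_pool, key=lambda x: entropy(predict_probs(x)), reverse=True)
    return scored[:budget]   # 取熵最高的 budget 个样本作为支持实例
```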
zh
[NLP-75] Towards Optimal Multi-draft Speculative Decoding
【速读】: 该论文旨在解决多草稿推测解码(Multi-Draft Speculative Decoding, MDSD)方法中的最优接受率计算难题以及现有验证算法与理论上限之间的差距。关键在于通过讨论最优传输问题的对偶问题,提供了一种有效计算最优接受率的方法,并首次量化了MDSD效率在大规模词汇表情况下的理论上限。此外,研究发现无放回抽样方法相较于有放回抽样方法能够显著提升最优接受率,而现有的验证算法尚未达到这一理论上限。
链接: https://arxiv.org/abs/2502.18779
作者: Zhengmian Hu,Tong Zheng,Vignesh Viswanathan,Ziyi Chen,Ryan A. Rossi,Yihan Wu,Dinesh Manocha,Heng Huang
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate. For the first time, we measure the theoretical upper bound of MDSD efficiency for vocabulary sizes in the thousands and quantify the gap between existing verification algorithms and this bound. We also compare different draft sampling methods based on their optimal acceptance rates. Our results show that the draft sampling method strongly influences the optimal acceptance rate, with sampling without replacement outperforming sampling with replacement. Additionally, existing verification algorithms do not reach the theoretical upper bound for both without replacement and with replacement sampling. Our findings suggest that carefully designed draft sampling methods can potentially improve the optimal acceptance rate and enable the development of verification algorithms that closely match the theoretical upper bound.
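作为背景,单草稿推测解码的标准接受规则可示意如下;MDSD 是它的多草稿推广,论文研究的最优传输对偶求解并不在此示意范围内:

```python
import random

def accept_draft_token(p_target, q_draft):
    """单草稿推测解码的标准接受规则:草稿模型按 q(x) 采样出词元 x,
    目标模型以概率 min(1, p(x)/q(x)) 接受;被拒绝时从残差分布重采样,
    从而保证最终输出严格服从目标模型分布。"""
    return random.random() < min(1.0, p_target / q_draft)
```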
zh
[NLP-76] M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
【速读】: 该论文旨在解决跨模态数据在训练过程中因数据量和收敛速率显著差异导致的挑战,提出了一种统一的多模态序列建模框架M2-omni。关键解决方案包括:在预训练阶段采用步长平衡策略处理模态特定数据的数量差异;在指令调优阶段引入动态自适应平衡策略以同步各模态的训练进程,确保最优收敛。通过这些方法,M2-omni实现了与GPT-4o相当的性能,并保持了纯文本任务上的强表现力。
链接: https://arxiv.org/abs/2502.18778
作者: Qingpei Guo,Kaiyou Song,Zipeng Feng,Ziping Ma,Qinglong Zhang,Sirui Gao,Xuzheng Yu,Yunxiao Sun,Tai-Wei Chang,Jingdong Chen,Ming Yang,Jun Zhou
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves competitive performance to GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models(LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaving with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni’s language understanding capability throughout the training process. To our best knowledge, M2-omni is currently a very competitive open-source model to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.
zh
[NLP-77] Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
【速读】: 该论文旨在解决希腊语在金融自然语言处理(Financial NLP)领域中的不足,特别是在大型语言模型(LLMs)的应用方面。由于希腊语的复杂性和特定领域数据集的稀缺性,希腊语金融领域的LLMs尚未得到充分探索。论文的关键解决方案在于引入Plutus-ben,首个希腊金融评估基准,以及Plutus-8B,首个专门针对希腊金融领域的LLM。Plutus-8B通过希腊特定领域的数据进行微调,并且Plutus-ben涵盖了五个核心的金融NLP任务:数值与文本命名实体识别、问答、抽象概括及主题分类。这些措施旨在促进系统化和可重复的LLM评估,从而克服希腊语金融NLP中存在的挑战,包括语言复杂性、特定领域的术语以及金融推理差距。
链接: https://arxiv.org/abs/2502.18772
作者: Xueqing Peng,Triantafillos Papadopoulos,Efstathia Soufleri,Polydoros Giannouris,Ruoyu Xiang,Yan Wang,Lingfei Qian,Jimin Huang,Qianqian Xie,Sophia Ananiadou
机构: The Fin AI (金融AI); Athens University of Economics and Business, Archimedes/Athena RC (雅典经济与商业大学, 阿基米德/雅典RC); The University of Manchester (曼彻斯特大学); Archimedes/Athena RC (阿基米德/雅典RC)
类目: Computation and Language (cs.CL)
备注: 18 pages, 6 figures
点击查看摘要
Abstract:Despite Greece’s pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.
zh
[NLP-78] Reward Shaping to Mitigate Reward Hacking in RLHF
【速读】: 该论文旨在解决在使用强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)训练大型语言模型(Large Language Models, LLMs)过程中,因奖励黑客行为(reward hacking)导致的对齐(alignment)降解问题。论文的关键解决方案是提出了一种名为偏好作为奖励(Preference As Reward, PAR)的新方法,该方法利用奖励模型内部嵌入的潜在偏好作为强化学习的信号。通过这种方法,论文展示了PAR在两个基准模型和数据集上的优越性能,并且在对抗奖励黑客行为方面表现出更高的鲁棒性。
链接: https://arxiv.org/abs/2502.18770
作者: Jiayi Fu,Xuandong Zhao,Chengyuan Yao,Heng Wang,Qi Han,Yanghua Xiao
机构: Fudan University (复旦大学); StepFun; UC Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. While reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests three key design principles: (1) RL reward is ideally bounded, (2) RL benefits from rapid initial growth followed by gradual convergence, and (3) RL reward is best formulated as a function of centered reward. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model itself as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR’s superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. Code is available at this https URL.
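摘要未给出PAR的精确公式;下面是一个与文中三条设计原则(有界、初期增长快随后缓慢收敛、以中心化奖励为自变量)一致的假设性塑形示意,仅作说明:

```python
import math

def shaped_reward(r, r_ref):
    """假设性塑形公式(非论文原式):对中心化奖励 r - r_ref 取 sigmoid,
    得到有界 (0,1)、先快后慢的 RL 奖励信号;r_ref 对应文中的单个参考奖励。"""
    return 1.0 / (1.0 + math.exp(-(r - r_ref)))
```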
zh
[NLP-79] Automatic Prompt Optimization via Heuristic Search: A Survey
【速读】: 该论文旨在解决自动提示优化(Automatic Prompt Optimization)的问题,特别是在大型语言模型(Large Language Models, LLMs)应用中,如何通过系统化方法提升提示工程的有效性。论文的关键在于提出了一种全面的分类法,将这些方法按照优化发生的位置、优化目标、驱动优化的标准、生成新提示的操作符以及应用的迭代搜索算法进行分类。此外,论文还强调了支持和加速自动化提示优化的专门数据集和工具。
链接: https://arxiv.org/abs/2502.18746
作者: Wendi Cui,Jiaxin Zhang,Zhuohang Li,Hao Sun,Damien Lopez,Kamalika Das,Bradley A. Malin,Sricharan Kumar
机构: Intuit; Intuit AI Research; Vanderbilt University; University of Cambridge; Vanderbilt University Medical Center
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advances in Large Language Models have led to remarkable achievements across a variety of Natural Language Processing tasks, making prompt engineering increasingly central to guiding model outputs. While manual methods can be effective, they typically rely on intuition and do not automatically refine prompts over time. In contrast, automatic prompt optimization employing heuristic-based search algorithms can systematically explore and improve prompts with minimal human oversight. This survey proposes a comprehensive taxonomy of these methods, categorizing them by where optimization occurs, what is optimized, what criteria drive the optimization, which operators generate new prompts, and which iterative search algorithms are applied. We further highlight specialized datasets and tools that support and accelerate automated prompt refinement. We conclude by discussing key open challenges pointing toward future opportunities for more robust and versatile LLM applications.
zh
[NLP-80] Like Father Like Son: Kinship-Aware Preference Mapping (KARMA) for Automatic Alignment in Large Language Models
【速读】: 该论文旨在解决现有大型语言模型(Large Language Model, LLM)对齐方法因比较能力差异显著的模型响应而产生的表面化区分问题,这些问题无法提供有意义的指导以确定更优的响应。论文的关键解决方案是提出了一种名为亲缘感知偏好映射(Kinship-Aware pReference MApping, KARMA)的新框架,该框架系统性地配对具有可比能力的模型的响应。通过限制偏好比较于相似复杂度和质量的输出,KARMA增强了偏好数据的信息量,并提高了对齐信号的精细度。实验评估表明,这种亲缘感知的方法能够带来更一致且可解释的对齐结果,从而为实现LLM行为与人类偏好的对齐提供了更为合理和可靠的途径。
链接: https://arxiv.org/abs/2502.18744
作者: Jeesu Jung,Chanjun Park,Sangkeun Jung
机构: Chungnam National University (忠南国立大学); Korea University (韩国大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 14 pages,5 figures,3 tables,4 graphs
点击查看摘要
Abstract:Recent advancements in Large Language Model (LLM) alignment have sought to mitigate the cost of human annotations by leveraging pretrained models to generate preference data. However, existing methods often compare responses from models with substantially different capabilities, yielding superficial distinctions that fail to provide meaningful guidance on what constitutes a superior response. To address this limitation, we propose Kinship-Aware pReference MApping (KARMA), a novel framework that systematically pairs responses from models with comparable competencies. By constraining preference comparisons to outputs of similar complexity and quality, KARMA enhances the informativeness of preference data and improves the granularity of alignment signals. Empirical evaluations demonstrate that our kinship-aware approach leads to more consistent and interpretable alignment outcomes, ultimately facilitating a more principled and reliable pathway for aligning LLM behavior with human preferences.
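KARMA的配对约束可以用很短的代码说明;下面是一个最小示意(capability 分数的来源为假设,tolerance 为示意阈值):

```python
from itertools import combinations

def karma_pairs(responses, capability, tolerance=0.1):
    """仅在能力相当的模型之间构造偏好对的最小示意。
    responses: {模型名: 回复};capability: {模型名: 能力分数}(来源为假设)。"""
    pairs = []
    for a, b in combinations(responses, 2):
        if abs(capability[a] - capability[b]) <= tolerance:  # 限制能力差距
            pairs.append((responses[a], responses[b]))
    return pairs
```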
zh
[NLP-81] Beyond RNNs: Benchmarking Attention-Based Image Captioning Models
【速读】: 该论文旨在解决图像描述生成任务中的准确性与语义丰富性问题,通过比较基于注意力机制的图像描述模型与传统的循环神经网络(RNN)方法。关键解决方案在于采用Bahdanau注意力机制,以增强图像特征与生成描述之间的对齐效果。研究表明,基于注意力的模型在生成更准确且语义更丰富的描述方面优于RNN方法,并在自然语言处理指标如BLEU、METEOR、GLEU和WER评估中表现更佳。
链接: https://arxiv.org/abs/2502.18734
作者: Hemanth Teja Yanambakkam,Rahul Chinthala
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 6 figures. Code and additional results are available on GitHub under the handle HemanthTejaY
点击查看摘要
Abstract:Image captioning is a challenging task at the intersection of computer vision and natural language processing, requiring models to generate meaningful textual descriptions of images. Traditional approaches rely on recurrent neural networks (RNNs), but recent advancements in attention mechanisms have demonstrated significant improvements. This study benchmarks the performance of attention-based image captioning models against RNN-based approaches using the MS-COCO dataset. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. The models are assessed using natural language processing metrics such as BLEU, METEOR, GLEU, and WER. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions, with better alignment to human evaluation. This work provides insights into the impact of attention mechanisms in image captioning and highlights areas for future improvements.
zh
[NLP-82] Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science
【速读】: 该论文旨在解决社会调查分析过程中因问卷选项高度依赖受访者的先前回答而带来的复杂性及所需时间和专业知识的问题。论文的关键解决方案是提出了一种新颖的大语言模型提示方法——随机森林思维(Random Forest of Thoughts, RFoT),通过生成多样化的思维空间并随机选择子思维来构建思维森林,从而实现不确定性推理,增强大型语言模型在复杂推理中的能力,特别是在需要探索和不确定性搜索的问题上。这种方法扩展了整体性能的探索和预测,适用于计算社会科学领域。
链接: https://arxiv.org/abs/2502.18729
作者: Xiaohua Wu,Xiaohui Tao,Wenjie Wu,Yuefeng Li,Lin Li
机构: Queensland University of Technology(昆士兰科技大学), Australia; Wuhan University of Technology(武汉理工大学), China; University of Southern Queensland(南昆士兰大学), Australia
类目: Computation and Language (cs.CL)
备注: 11 pages
点击查看摘要
Abstract:Social surveys in computational social science are well-designed by elaborate domain theories that can effectively reflect the interviewee’s deep thoughts without concealing their true feelings. The candidate questionnaire options highly depend on the interviewee’s previous answer, which results in the complexity of social survey analysis, the time, and the expertise required. The ability of large language models (LLMs) to perform complex reasoning is well-enhanced by prompting learning such as Chain-of-thought (CoT) but still confined to left-to-right decision-making processes or limited paths during inference. This means they can fall short in problems that require exploration and uncertainty searching. In response, a novel large language model prompting method, called Random Forest of Thoughts (RFoT), is proposed for generating uncertainty reasoning to fit the area of computational social science. The RFoT allows LLMs to perform deliberate decision-making by generating diverse thought space and randomly selecting the sub-thoughts to build the forest of thoughts. It can extend the exploration and prediction of overall performance, benefiting from the extensive research space of response. The method is applied to optimize computational social science analysis on two datasets covering a spectrum of social survey analysis problems. Our experiments show that RFoT significantly enhances language models’ abilities on two novel social survey analysis problems requiring non-trivial reasoning.
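下面给出RFoT“多样化思维池 + 随机子思维建树 + 多数表决”流程的最小示意(gen_thought、answer_with 为假设接口,树数与子思维数为示意超参数):

```python
import random
from collections import Counter

def rfot_answer(question, gen_thought, answer_with, n_thoughts=12, n_trees=5, k=4):
    """RFoT 流程的最小示意:先生成多样化思维池,
    每棵树随机抽取 k 条子思维得出一个答案,最后对各树答案做多数表决。"""
    pool = [gen_thought(question) for _ in range(n_thoughts)]  # 多样化思维空间
    votes = []
    for _ in range(n_trees):
        subset = random.sample(pool, k)              # 随机选取子思维
        votes.append(answer_with(question, subset))  # 基于子思维作答
    return Counter(votes).most_common(1)[0][0]       # 多数表决
```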
zh
[NLP-83] Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation
【速读】: 该论文旨在解决传统心理实验在使用自然主义刺激时面临的手动注释繁琐及生态效度不足的问题。关键解决方案在于引入了一种利用多模态大型语言模型(LLMs)作为代理,通过视觉问答(VQA)策略从自然主义图像中提取丰富的语义信息的新范式,从而分析人类视觉语义表示。这一方法通过LLM衍生的表征成功预测了功能性磁共振成像(fMRI)测量到的已知神经活动模式(如人脸、建筑物),验证了其可行性,并揭示了大脑皮层区域间的层级语义组织。
链接: https://arxiv.org/abs/2502.18725
作者: Xin Liu,Ziyue Zhang,Jingxin Nie
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: 20 pages, 6 figures
点击查看摘要
Abstract:Traditional psychological experiments utilizing naturalistic stimuli face challenges in manual annotation and ecological validity. To address this, we introduce a novel paradigm leveraging multimodal large language models (LLMs) as proxies to extract rich semantic information from naturalistic images through a Visual Question Answering (VQA) strategy for analyzing human visual semantic representation. LLM-derived representations successfully predict established neural activity patterns measured by fMRI (e.g., faces, buildings), validating its feasibility and revealing hierarchical semantic organization across cortical regions. A brain semantic network constructed from LLM-derived representations identifies meaningful clusters reflecting functional and contextual associations. This innovative methodology offers a powerful solution for investigating brain semantic organization with naturalistic stimuli, overcoming limitations of traditional annotation methods and paving the way for more ecologically valid explorations of human cognition.
zh
[NLP-84] A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition WWW2025
【速读】: 该论文旨在解决零样本命名实体识别(Zero-shot Named Entity Recognition, NER)中的两个主要挑战:一是忽略实体周围上下文之间的关联性,导致类型预测错误或实体遗漏;二是盲目使用通过浅层相似性策略检索的任务演示,严重误导大型语言模型(LLMs)在推理过程中的表现。为了解决这些问题,论文提出了一种名为合作多智能体系统(Cooperative Multi-Agent System, CMAS)的新框架。CMAS的关键在于其四个主要组成部分:自注释器、与类型相关的特征(Type-Related Feature, TRF)提取器、演示鉴别器以及总体预测器。特别是,CMAS通过将NER任务分解为识别命名实体和识别目标句子中的类型相关特征这两个子任务来显式捕捉实体周围上下文之间的关联性,并通过建立演示鉴别器实现可控的演示利用,自动评估目标句子的有用性分数。实验结果表明,CMAS显著提升了六个基准测试集上的零样本NER性能,并且在少量样本设置下同样有效。
链接: https://arxiv.org/abs/2502.18702
作者: Zihan Wang,Ziqi Zhao,Yougang Lyu,Zhumin Chen,Maarten de Rijke,Zhaochun Ren
机构: University of Amsterdam(阿姆斯特丹大学); Shandong University(山东大学); Leiden University(莱顿大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted at WWW 2025
点击查看摘要
Abstract:Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. This task presents substantial challenges due to minimal human intervention. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. It advances model self-learning abilities by incorporating self-annotated demonstrations. However, two important challenges persist: (i) Correlations between contexts surrounding entities are overlooked, leading to wrong type predictions or entity omissions. (ii) The indiscriminate use of task demonstrations, retrieved through shallow similarity-based strategies, severely misleads LLMs during inference. In this paper, we introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER that uses the collective intelligence of multiple agents to address the challenges outlined above. CMAS has four main agents: (i) a self-annotator, (ii) a type-related feature (TRF) extractor, (iii) a demonstration discriminator, and (iv) an overall predictor. To explicitly capture correlations between contexts surrounding entities, CMAS reformulates NER into two subtasks: recognizing named entities and identifying entity type-related features within the target sentence. To enable controllable utilization of demonstrations, a demonstration discriminator is established to incorporate the self-reflection mechanism, automatically evaluating helpfulness scores for the target sentence. Experimental results show that CMAS significantly improves zero-shot NER performance across six benchmarks, including both domain-specific and general-domain scenarios. Furthermore, CMAS demonstrates its effectiveness in few-shot settings and with various LLM backbones.
zh
[NLP-85] MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment
【速读】: 该论文旨在解决在使用Reinforcement Learning from Human Feedback (RLHF)对大型语言模型 (LLMs) 进行优化时,单一奖励模型难以捕捉人类偏好的多样性的问题。为应对这一挑战,论文提出了一种名为Mixing Preference Optimization (MPO) 的后处理框架,其关键是通过批量随机镜像下降方法计算各单目标策略的权重,并将其对数线性组合成一个统一的策略,从而实现对多样化偏好的平衡优化,同时显著降低了计算成本。
链接: https://arxiv.org/abs/2502.18699
作者: Tianze Wang,Dongnan Gui,Yifan Hu,Shuhang Lin,Linjun Zhang
机构: rutgers.edu(罗格斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
备注:
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
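MPO的两个核心操作,对数线性组合与镜像下降权重更新,可示意如下(假设 policy_logps 的每一行是对应单目标策略的对数概率,学习率等为示意参数;熵正则下的镜像下降即指数梯度更新):

```python
import numpy as np

def mix_policies(policy_logps, w):
    """对数线性组合:组合策略满足 log π ∝ Σ_i w_i · log π_i。
    policy_logps: (K, V),第 i 行为第 i 个单目标策略的对数概率;w: (K,) 单纯形上的权重。"""
    mixed = w @ policy_logps                          # (V,) 未归一化的对数概率
    m = mixed.max()
    mixed -= m + np.log(np.exp(mixed - m).sum())      # 数值稳定的 log-sum-exp 归一化
    return mixed

def mirror_descent_step(w, grad, lr=0.1):
    """批随机镜像下降的一步更新(熵正则下即指数梯度),权重始终保持在概率单纯形上。"""
    w = w * np.exp(-lr * grad)
    return w / w.sum()
```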
zh
[NLP-86] Speaking the Right Language: The Impact of Expertise Alignment in User-AI Interactions
【速读】: 该论文旨在研究Bing Copilot在不同专业水平用户交互中的响应策略及其对用户体验的影响。关键在于确保生成式AI (Generative AI) 的响应水平与用户的实际专业水平相匹配,以提升整体交互体验,尤其是在复杂任务中。研究表明,当AI的响应水平低于用户时,会显著降低用户体验,而用户参与度则在AI响应水平与用户相当时显著提高。因此,设计以用户为中心的人工智能系统时,确保用户与AI之间的响应水平对齐至关重要。
链接: https://arxiv.org/abs/2502.18685
作者: Shramay Palta,Nirupama Chandrasekaran,Rachel Rudinger,Scott Counts
机构: University of Maryland, College Park(马里兰大学学院公园分校); Microsoft Research, Redmond(微软研究,雷德蒙德)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: arXiv Version
点击查看摘要
Abstract:Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77% of conversations) which correlates with positive user experience regardless of the user’s level of expertise. Misalignment, such that the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more profound for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between user and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions.
zh
[NLP-87] Discriminative Finetuning of Generative Large Language Models without Reward Models and Preference Data
【速读】: 该论文旨在解决在不使用偏好数据或奖励模型的情况下,能否微调大型语言模型(Large Language Models, LLMs)以实现与监督微调后接偏好优化(SFT (\rightarrow) PO)相当甚至更优的性能。解决方案的关键在于引入了一种名为判别式微调(Discriminative Fine-Tuning, DFT)的新方法。DFT采用了一种判别式的范式,通过显式建模输入条件下所有可能输出中正确答案的判别似然性,从而提高正确答案的概率并抑制潜在错误答案,从令牌预测转变为数据预测。此方法无需依赖偏好数据或奖励模型,并通过有效的算法优化判别似然性,最终实现了优于单纯监督微调(SFT)且与SFT (\rightarrow) PO相当甚至更好的性能。
链接: https://arxiv.org/abs/2502.18679
作者: Siqi Guo,Ilgee Hong,Vicente Balmaseda,Tuo Zhao,Tianbao Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 6 figures
点击查看摘要
Abstract:Supervised fine-tuning (SFT) followed by preference optimization (PO) denoted by SFT \rightarrow PO has become the standard for improving pretrained large language models (LLMs), with PO demonstrating significant performance gains. However, PO methods rely on either human-labeled preference data or a strong reward model to generate preference data. Can we fine-tune LLMs without preference data or reward models while achieving competitive performance to SFT \rightarrow PO? We address this question by introducing Discriminative Fine-Tuning (DFT), a novel approach that eliminates the need for preference data. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, shifting from token prediction to data prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance better than SFT and comparable to if not better than SFT \rightarrow PO. The code can be found at this https URL.
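判别式似然的一个常见实现方式是对候选答案打分后做softmax交叉熵;下面按这一思路给出示意(并非论文的精确目标函数,scores 的来源为假设):

```python
import torch
import torch.nn.functional as F

def dft_loss(scores, pos_index):
    """判别式似然的常见实现示意(非论文精确目标):
    scores: (C,),模型对同一输入下 C 个候选答案的打分(假设来源,
    例如长度归一化的 log 似然);交叉熵提升正确答案、压低其余候选。"""
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([pos_index]))
```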
zh
[NLP-88] Scaffolding Empathy: Training Counselors with Simulated Patients and Utterance-level Performance Visualizations
【速读】: 该论文旨在解决临床辅导员在角色扮演训练中因现有培训方法仅提供间歇性反馈而导致的学习效率低下的问题。解决方案的关键在于开发了一套基于大型语言模型(LLM)的培训系统,该系统通过模拟患者互动及提供逐轮性能反馈可视化,为辅导员学习动机访谈技能(Motivational Interviewing)提供了频繁且详细的反馈。
链接: https://arxiv.org/abs/2502.18673
作者: Ian Steenstra,Farnaz Nouraei,Timothy W. Bickmore
机构: Northeastern University (东北大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: This is a preprint version of the paper conditionally accepted to CHI’25
点击查看摘要
Abstract:Learning therapeutic counseling involves significant role-play experience with mock patients, with current manual training methods providing only intermittent granular feedback. We seek to accelerate and optimize counselor training by providing frequent, detailed feedback to trainees as they interact with a simulated patient. Our first application domain involves training motivational interviewing skills for counselors. Motivational interviewing is a collaborative counseling style in which patients are guided to talk about changing their behavior, with empathetic counseling an essential ingredient. We developed and evaluated an LLM-powered training system that features a simulated patient and visualizations of turn-by-turn performance feedback tailored to the needs of counselors learning motivational interviewing. We conducted an evaluation study with professional and student counselors, demonstrating high usability and satisfaction with the system. We present design implications for the development of automated systems that train users in counseling skills and their generalizability to other types of social skills training.
[NLP-89] Enhancing Text Classification with a Novel Multi-Agent Collaboration Framework Leveraging BERT
【Quick Read】: This paper aims to improve the accuracy and robustness of text classification models. The key to the solution is a novel multi-agent collaboration framework that uses BERT as the base classifier and dynamically escalates low-confidence predictions to a specialized Multi-Agent System of five agents (Lexical, Contextual, Logic, Consensus, and Explainability), enabling comprehensive analysis and consensus-driven decision-making that significantly improves classification performance across diverse text classification tasks.
Link: https://arxiv.org/abs/2502.18653
Authors: Hediyeh Baban, Sai A Pidapar, Aashutosh Nema, Sichen Lu
Affiliations: Dell Technologies; NYU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce a novel multi-agent collaboration framework designed to enhance the accuracy and robustness of text classification models. Leveraging BERT as the primary classifier, our framework dynamically escalates low-confidence predictions to a specialized multi-agent system comprising Lexical, Contextual, Logic, Consensus, and Explainability agents. This collaborative approach allows for comprehensive analysis and consensus-driven decision-making, significantly improving classification performance across diverse text classification tasks. Empirical evaluations on benchmark datasets demonstrate that our framework achieves a 5.5% increase in accuracy compared to standard BERT-based classifiers, underscoring its effectiveness and academic novelty in advancing multi-agent systems within natural language processing.
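The escalation mechanism can be pictured with a short sketch. The confidence threshold, the agent implementations, and the majority-vote aggregation below are placeholder assumptions of ours (the abstract only names the five agents), and a Hugging Face-style sequence classification interface is assumed for BERT.

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value; the paper does not publish its threshold

def classify_with_escalation(text, bert_model, tokenizer, agents):
    """Route low-confidence BERT predictions to a multi-agent system.
    `agents` is a list of callables (Lexical, Contextual, Logic, ...);
    their internals are placeholders, not the paper's implementations."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(bert_model(**inputs).logits, dim=-1).squeeze(0)
    confidence, label = probs.max(dim=-1)
    if confidence >= CONFIDENCE_THRESHOLD:
        return int(label)  # high confidence: keep BERT's prediction
    # Low confidence: collect votes from the specialized agents, take the majority.
    votes = [agent(text) for agent in agents]
    return max(set(votes), key=votes.count)
```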
[NLP-90] Single- vs. Dual-Prompt Dialogue Generation with LLM s for Job Interviews in Human Resources
【Quick Read】: This paper investigates how to generate high-quality HR job-interview dialogues that are hard to distinguish from genuine human conversation. The key finding is that a dual-prompt method, in which two agents converse with each other to generate the interview, achieves a win rate up to ten times higher in quality judgments than a single-prompt method, despite a sixfold increase in token cost, and the advantage holds regardless of which model (GPT-4o or Llama 3.3 70B) is used for generation or judging.
Link: https://arxiv.org/abs/2502.18650
Authors: Joachim De Baer, A. Seza Doğruöz, Thomas Demeester, Chris Develder
Affiliations: IDLab, Universiteit Gent – imec, Belgium; LT3, Universiteit Gent, Belgium
Subjects: Computation and Language (cs.CL)
Comments: 11 pages
Abstract:Optimizing language models for use in conversational agents requires large quantities of example dialogues. Increasingly, these dialogues are synthetically generated by using powerful large language models (LLMs), especially in domains with challenges to obtain authentic human data. One such domain is human resources (HR). In this context, we compare two LLM-based dialogue generation methods for the use case of generating HR job interviews, and assess whether one method generates higher-quality dialogues that are more challenging to distinguish from genuine human discourse. The first method uses a single prompt to generate the complete interview dialog. The second method uses two agents that converse with each other. To evaluate dialogue quality under each method, we ask a judge LLM to determine whether AI was used for interview generation, using pairwise interview comparisons. We demonstrate that despite a sixfold increase in token cost, interviews generated with the dual-prompt method achieve a win rate up to ten times higher than those generated with the single-prompt method. This difference remains consistent regardless of whether GPT-4o or Llama 3.3 70B is used for either interview generation or judging quality.
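A minimal sketch of the dual-prompt idea follows: two separately prompted agents take turns, each seeing the dialogue so far. The `llm(system, history)` completion function, the prompt wording, and the fixed turn count are assumptions for illustration, not the paper's setup.

```python
def generate_interview(llm, n_turns=10):
    """Dual-prompt dialogue generation sketch: an interviewer agent and a
    candidate agent alternate. `llm(system_prompt, dialogue)` is a
    hypothetical completion function, not a specific vendor API."""
    interviewer_sys = ("You are an HR interviewer for a software role. "
                       "Ask one question at a time.")
    candidate_sys = ("You are a job candidate. "
                     "Answer the interviewer's last question.")
    dialogue = []
    for _ in range(n_turns):
        question = llm(interviewer_sys, dialogue)
        dialogue.append(("Interviewer", question))
        answer = llm(candidate_sys, dialogue)
        dialogue.append(("Candidate", answer))
    return dialogue
```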
[NLP-91] Steered Generation via Gradient Descent on Sparse Features
【Quick Read】: This paper studies how to precisely control the cognitive complexity of large language model (LLM) outputs by adjusting their internal representations. The key is training sparse autoencoders to learn a sparse representation of the query embedding, allowing precise control over the model's attention distribution. By steering the sparse embedding toward the representation of samples at the desired cognitive-complexity level, using gradient-based optimization in the latent space, the cognitive complexity of LLM-generated feedback can be systematically adjusted.
Link: https://arxiv.org/abs/2502.18644
Authors: Sumanta Bhattacharyya, Pedram Rooshenas
Affiliations: University of Illinois Chicago
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) encode a diverse range of linguistic features within their latent representations, which can be harnessed to steer their output toward specific target characteristics. In this paper, we modify the internal structure of LLMs by training sparse autoencoders to learn a sparse representation of the query embedding, allowing precise control over the model’s attention distribution. We demonstrate that manipulating this sparse representation effectively transforms the output toward different stylistic and cognitive targets. Specifically, in an educational setting, we show that the cognitive complexity of LLM-generated feedback can be systematically adjusted by modifying the encoded query representation at a specific layer. To achieve this, we guide the learned sparse embedding toward the representation of samples from the desired cognitive complexity level, using gradient-based optimization in the latent space.
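The steering step can be sketched as plain gradient descent in the sparse latent space. The `encoder`/`decoder` pair (the trained sparse autoencoder), the mean target code, and the mean-squared-error objective are illustrative assumptions; the paper's exact objective and the layer it intervenes on are not reproduced here.

```python
import torch

def steer_sparse_code(encoder, decoder, query_emb, target_code,
                      steps=50, lr=0.1):
    """Gradient-based steering in a sparse autoencoder's latent space
    (a sketch under our own assumptions). `target_code` would be, e.g.,
    the mean sparse code of samples at the desired complexity level."""
    z = encoder(query_emb).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(z, target_code)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Decode the steered code back into an embedding for the LLM.
    return decoder(z)
```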
[NLP-92] Contextual effects of sentiment deployment in human and machine translation ISCA
【Quick Read】: This paper examines how the overall sentiment of a text can shift in translation and what this implies for automated sentiment analyses, especially those that rely on machine translation and evaluate results with semantic-similarity metrics. The key observation is that although both human and machine translation produce lemmas that better match the expected sentiment frequencies of the target language, only machine translation also narrows the overall semantic field of the text, particularly for words with epistemic content.
Link: https://arxiv.org/abs/2502.18642
Authors: Lindy Comstock, Priyanshu Sharma, Mikhail Belov
Affiliations: University of California, Los Angeles
Subjects: Computation and Language (cs.CL)
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:This paper illustrates how the overall sentiment of a text may be shifted in translation and the implications for automated sentiment analyses, particularly those that utilize machine translation and assess findings via semantic similarity metrics. While human and machine translation will produce more lemmas that fit the expected frequency of sentiment in the target language, only machine translation will also reduce the overall semantic field of the text, particularly in regard to words with epistemic content.
[NLP-93] Faster Cheaper Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems
【Quick Read】: This paper addresses multi-objective parameter optimization of Retrieval-Augmented Generation (RAG) configurations in large language model (LLM) systems, which face intractably large solution spaces, noisy evaluations, and high evaluation costs. The proposed approach jointly optimizes cost, latency, safety, and alignment; the key finding is that Bayesian optimization substantially improves multi-objective optimization, obtaining a superior Pareto front compared to baseline methods. A Pareto-front sketch follows the abstract below.
Link: https://arxiv.org/abs/2502.18635
Authors: Matthew Barker, Andrew Bell, Evan Thomas, James Carr, Thomas Andrews, Umang Bhatt
Affiliations: Trustwise AI; New York University; The Alan Turing Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:While Retrieval Augmented Generation (RAG) has emerged as a popular technique for improving Large Language Model (LLM) systems, it introduces a large number of choices, parameters and hyperparameters that must be made or tuned. This includes the LLM, embedding, and ranker models themselves, as well as hyperparameters governing individual RAG components. Yet, collectively optimizing the entire configuration in a RAG or LLM system remains under-explored - especially in multi-objective settings - due to intractably large solution spaces, noisy objective evaluations, and the high cost of evaluations. In this work, we introduce the first approach for multi-objective parameter optimization of cost, latency, safety and alignment over entire LLM and RAG systems. We find that Bayesian optimization methods significantly outperform baseline approaches, obtaining a superior Pareto front on two new RAG benchmark tasks. We conclude our work with important considerations for practitioners who are designing multi-objective RAG systems, highlighting nuances such as how optimal configurations may not generalize across tasks and objectives.
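While the paper's Bayesian optimizer is not reproduced here, the multi-objective selection criterion it reports on, the Pareto front, is easy to illustrate. The sketch below marks the non-dominated configurations among evaluated (cost, latency, quality) triples, with quality negated so that every column is minimized; the example numbers are made up.

```python
import numpy as np

def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """Return a boolean mask of Pareto-optimal rows, assuming every
    column is to be minimized (negate quality-style objectives first)."""
    n = objectives.shape[0]
    optimal = np.ones(n, dtype=bool)
    for i in range(n):
        # Row i is dominated if some row is <= in all objectives and
        # strictly < in at least one.
        dominates = (np.all(objectives <= objectives[i], axis=1) &
                     np.any(objectives < objectives[i], axis=1))
        if np.any(dominates):
            optimal[i] = False
    return optimal

# Example: columns = (cost, latency, -quality) for 5 RAG configurations.
configs = np.array([[0.2, 1.1, -0.81],
                    [0.5, 0.9, -0.84],
                    [0.4, 1.5, -0.80],
                    [0.9, 0.7, -0.90],
                    [0.3, 1.2, -0.79]])
print(pareto_front(configs))  # -> [ True  True False  True False]
```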
[NLP-94] Automated Knowledge Component Generation and Knowledge Tracing for Coding Problems
【Quick Read】: This paper addresses the labor-intensive process of manually crafting and tagging knowledge components (KCs) for problems. The key is a fully automated, LLM-based pipeline for KC generation and tagging (KCGen) for open-ended programming problems, together with a knowledge tracing (KT) framework that leverages the generated KCs (KCGen-KT). On a real-world dataset of student code submissions, KCGen-KT outperforms existing KT methods, and the generated KCs show a level-of-fit comparable to human-written KCs under the performance factor analysis (PFA) model.
Link: https://arxiv.org/abs/2502.18632
Authors: Zhangqi Duan, Nigel Fernandez, Sri Kanakadandi, Bita Akram, Andrew Lan
Affiliations: University of Massachusetts Amherst; North Carolina State University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor-intensive. We present a fully automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations validating the effectiveness of KCGen-KT. On a real-world dataset of student code submissions to open-ended programming problems, KCGen-KT outperforms existing KT methods. We investigate the learning curves of generated KCs and show that LLM-generated KCs have a comparable level-of-fit to human-written KCs under the performance factor analysis (PFA) model. We also conduct a human evaluation to show that the KC tagging accuracy of our pipeline is reasonably accurate when compared to that by human domain experts.
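For reference, the performance factor analysis (PFA) model used for the level-of-fit comparison is conventionally written as a logistic model over per-KC practice counts (this is the standard formulation from the PFA literature, not an equation quoted from this paper):

```latex
\operatorname{logit} P(Y_{ij} = 1) \;=\; \sum_{k \in \mathrm{KCs}(j)} \left( \beta_k + \gamma_k\, s_{ik} + \rho_k\, f_{ik} \right)
```

where $s_{ik}$ and $f_{ik}$ count student $i$'s prior successes and failures on KC $k$, $\beta_k$ is a KC difficulty parameter, and $\gamma_k$, $\rho_k$ weight the practice counts.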
[NLP-95] Chain of Draft: Thinking Faster by Writing Less
【Quick Read】: This paper addresses the high cost and latency caused by the verbosity of Chain-of-Thought (CoT) prompting when large language models (LLMs) solve complex reasoning tasks. The key is a new paradigm, Chain of Draft (CoD), in which the model generates minimalistic yet informative intermediate reasoning outputs, matching or surpassing CoT accuracy while using as little as 7.6% of the tokens and significantly reducing cost and latency.
Link: https://arxiv.org/abs/2502.18600
Authors: Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He
Affiliations: Zoom Communications
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as only 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.
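The contrast between the two prompting styles can be shown with a toy example. The prompt wording below paraphrases the general idea (verbose step-by-step reasoning versus a tight per-step budget) and is not the paper's verbatim prompt.

```python
# Illustrative prompts only; the wording is ours, not the paper's.
COT_PROMPT = (
    "Think step by step to answer the question. Explain each step "
    "in full sentences, then state the final answer."
)
COD_PROMPT = (
    "Think step by step, but write only a minimal draft for each step, "
    "at most a few words per step. Return the final answer after '####'."
)
question = "A pencil costs $0.5 and a pen costs $2. What do 3 pencils and 2 pens cost?"
# A CoD-style completion might look like:
#   3*0.5=1.5; 2*2=4; 1.5+4=5.5 #### $5.5
# versus several explanatory sentences under COT_PROMPT.
```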
[NLP-96] Neurobiber: Fast and Interpretable Stylistic Feature Extraction
【Quick Read】: This paper addresses the challenge of extracting detailed stylistic features at scale. The key is Neurobiber, a transformer-based system for fast, interpretable style profiling built on Biber's Multidimensional Analysis (MDA). Neurobiber predicts 96 Biber-style features from the open-source BiberPlus library and, while remaining highly efficient, replicates classic MDA insights and achieves competitive performance on the PAN 2020 authorship verification task without extensive retraining.
Link: https://arxiv.org/abs/2502.18590
Authors: Kenan Alkiek, Anna Wegmann, Jian Zhu, David Jurgens
Affiliations: University of Michigan; Utrecht University; University of British Columbia
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Linguistic style is pivotal for understanding how texts convey meaning and fulfill communicative purposes, yet extracting detailed stylistic features at scale remains challenging. We present Neurobiber, a transformer-based system for fast, interpretable style profiling built on Biber’s Multidimensional Analysis (MDA). Neurobiber predicts 96 Biber-style features from our open-source BiberPlus library (a Python toolkit that computes stylistic features and provides integrated analytics, e.g., PCA and factor analysis). Despite being up to 56 times faster than existing open source systems, Neurobiber replicates classic MDA insights on the CORE corpus and achieves competitive performance on the PAN 2020 authorship verification task without extensive retraining. Its efficient and interpretable representations readily integrate into downstream NLP pipelines, facilitating large-scale stylometric research, forensic analysis, and real-time text monitoring. All components are made publicly available.
[NLP-97] What are Foundation Models Cooking in the Post-Soviet World?
【Quick Read】: This paper investigates the limited ability of multimodal models to identify the origins of dishes from Post-Soviet countries. The authors build BORSch, a dataset of 1147 Russian and 823 Ukrainian dishes, and show that leading models fail to identify dish origins in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language of the question. The key is an analysis of misleading dish-origin co-occurrences in pretraining data and of linguistic phenomena such as Russian-Ukrainian code mixing to explain the failures, plus a further test of the models' ability to produce accurate visual descriptions, suggesting that QA alone may be insufficient for evaluating cultural understanding.
Link: https://arxiv.org/abs/2502.18583
Authors: Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
Affiliations: Georgia Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available at this https URL.
[NLP-98] Scalable Best-of-N Selection for Large Language Models via Self-Certainty
【Quick Read】: This paper addresses the challenge of improving LLM reasoning through Best-of-N selection, where existing methods depend on computationally intensive reward models for response evaluation and selection. The key is self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without external reward models. Experiments show that self-certainty scales effectively with the number of samples, complements chain-of-thought strategies to improve reasoning, and generalizes to open-ended tasks, making it particularly practical for LLM reasoning.
Link: https://arxiv.org/abs/2502.18581
Authors: Zhewei Kang, Xuandong Zhao, Dawn Song
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at this https URL
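One plausible instantiation of a distribution-based certainty score is sketched below: treat each token distribution's KL divergence from uniform as a per-token confidence, average it over the response, and pick the most certain of the N samples. The abstract does not give the paper's exact formula, so this aggregation is an assumption for illustration.

```python
import numpy as np

def most_self_certain(token_logprob_matrices):
    """Pick the most 'self-certain' of N sampled responses (a sketch;
    the paper's exact metric may differ).

    token_logprob_matrices: list of (T_i, V) arrays of per-token
    log-probabilities over the vocabulary, one array per response.
    """
    scores = []
    for logp in token_logprob_matrices:
        p = np.exp(logp)
        V = logp.shape[1]
        # KL(p || uniform) = log V - H(p): larger means more peaked.
        kl = np.log(V) + np.sum(p * logp, axis=1)
        scores.append(kl.mean())  # average certainty over the response
    return int(np.argmax(scores))
```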
[NLP-99] FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
【Quick Read】: This paper addresses the difficulty large language models (LLMs) have in guaranteeing the factual correctness of generated content, which makes them unreliable in applications that require factually accurate responses. The key is FactReasoner, a probabilistic-reasoning factuality assessor that decomposes a generated response into atomic units, retrieves relevant contexts from an external knowledge source, and builds a joint probability distribution over atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the corresponding textual utterances. FactReasoner then computes the posterior probability that each atomic unit is supported by the retrieved contexts, improving considerably over existing prompt-based approaches in both factual precision and recall.
Link: https://arxiv.org/abs/2502.18573
Authors: Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale
Affiliations: IBM Research; IT:U - Interdisciplinary Transformation University Austria
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
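As a toy illustration of turning per-context entailment/contradiction signals into a support probability, the sketch below applies an independent likelihood-ratio update per retrieved context. The independence assumption is ours; FactReasoner builds a proper joint distribution over atoms and contexts rather than naively multiplying odds.

```python
import numpy as np

def support_posterior(entail_probs, contra_probs, prior=0.5):
    """Toy aggregation of NLI signals into a posterior that an atomic
    claim is supported (illustrative only, not the paper's model).

    entail_probs / contra_probs: per-context P(context entails atom)
    and P(context contradicts atom).
    """
    odds = prior / (1 - prior)
    for e, c in zip(entail_probs, contra_probs):
        # Treat each context as an independent likelihood-ratio update.
        odds *= (e + 1e-9) / (c + 1e-9)
    return odds / (1 + odds)

print(support_posterior(np.array([0.9, 0.7]), np.array([0.05, 0.2])))  # ~0.98
```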
[NLP-100] PII-Bench: Evaluating Query-Aware Privacy Protection Systems
【Quick Read】: This paper addresses the privacy risk of exposing personally identifiable information (PII) in user prompts to large language models (LLMs). The key is a query-unrelated PII masking strategy together with PII-Bench, a comprehensive evaluation framework for privacy protection systems. PII-Bench contains 2,842 test samples across 55 fine-grained PII categories, covering scenarios from single-subject descriptions to complex multi-party interactions. The evaluation shows that while current models perform adequately at basic PII detection, they have significant limitations in determining PII query relevance, especially in complex multi-subject scenarios.
Link: https://arxiv.org/abs/2502.18545
Authors: Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han
Affiliations: Institute of Fintech, Fudan University; Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; Laboratory of Data Analytics and Security, Fudan University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:The widespread adoption of Large Language Models (LLMs) has raised significant privacy concerns regarding the exposure of personally identifiable information (PII) in user prompts. To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions. Each sample is carefully crafted with a user query, context description, and standard answer indicating query-relevant PII. Our empirical evaluation reveals that while current models perform adequately in basic PII detection, they show significant limitations in determining PII query relevance. Even state-of-the-art LLMs struggle with this task, particularly in handling complex multi-subject scenarios, indicating substantial room for improvement in achieving intelligent PII masking.
[NLP-101] FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
【Quick Read】: This paper addresses hallucinations in Visual Question Answering (VQA) models, which produce convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution (OOD) scenarios. The key is FilterRAG, a framework that combines BLIP-VQA with Retrieval-Augmented Generation (RAG) to ground answers in external knowledge sources such as Wikipedia and DBpedia. By grounding answers in external knowledge, FilterRAG reduces hallucinations and improves robustness in both in-domain and OOD settings, reaching 36.5% accuracy on the OK-VQA dataset.
Link: https://arxiv.org/abs/2502.18536
Authors: S M Sarwar
Affiliations: University of Maryland, Baltimore County
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures and 2 tables
Abstract:Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
[NLP-102] Enhancing Hepatopathy Clinical Trial Efficiency: A Secure Large Language Model-Powered Pre-Screening Pipeline
【Quick Read】: This paper addresses the time-consuming and error-prone manual screening involved in recruiting patients with complex liver diseases such as hepatocellular carcinoma and cirrhosis. The key is a novel patient pre-screening pipeline guided by clinical expertise that breaks complex eligibility criteria into a series of composite questions and applies two strategies for semantic question answering over electronic health records: (1) Pathway A, an Anthropomorphized Experts' Chain of Thought strategy, and (2) Pathway B, Preset Stances within an Agent Collaboration strategy, which excels in complex clinical reasoning scenarios. Both strategies perform well, with Pathway A strongest at precise data extraction and Pathway B at complex reasoning; the pipeline shows high precision and efficiency on hepatocellular carcinoma and cirrhosis trials.
Link: https://arxiv.org/abs/2502.18531
Authors: Xiongbin Gui, Hanlin Lv, Xiao Wang, Longting Lv, Yi Xiao, Lei Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 30 pages, 5 figures
Abstract:Background: Recruitment for cohorts involving complex liver diseases, such as hepatocellular carcinoma and liver cirrhosis, often requires interpreting semantically complex criteria. Traditional manual screening methods are time-consuming and prone to errors. While AI-powered pre-screening offers potential solutions, challenges remain regarding accuracy, efficiency, and data privacy. Methods: We developed a novel patient pre-screening pipeline that leverages clinical expertise to guide the precise, safe, and efficient application of large language models. The pipeline breaks down complex criteria into a series of composite questions and then employs two strategies to perform semantic question-answering through electronic health records - (1) Pathway A, Anthropomorphized Experts’ Chain of Thought strategy, and (2) Pathway B, Preset Stances within an Agent Collaboration strategy, particularly in managing complex clinical reasoning scenarios. The pipeline is evaluated on three key metrics-precision, time consumption, and counterfactual inference - at both the question and criterion levels. Results: Our pipeline achieved high precision (0.921, in criteria level) and efficiency (0.44s per task). Pathway B excelled in complex reasoning, while Pathway A was effective in precise data extraction with faster processing times. Both pathways achieved comparable precision. The pipeline showed promising results in hepatocellular carcinoma (0.878) and cirrhosis trials (0.843). Conclusions: This data-secure and time-efficient pipeline shows high precision in hepatopathy trials, providing promising solutions for streamlining clinical trial workflows. Its efficiency and adaptability make it suitable for improving patient recruitment. And its capability to function in resource-constrained environments further enhances its utility in clinical settings.
[NLP-103] Analyzing User Perceptions of Large Language Models (LLM s) on Reddit: Sentiment and Topic Modeling of ChatGPT and DeepSeek Discussions
【Quick Read】: This paper addresses the lack of understanding of how users of online platforms such as Reddit perceive large language models (LLMs) like ChatGPT and DeepSeek. The key is dissecting user attitudes with sentiment analysis and topic modeling, surfacing important topics such as trust in AI, user expectations, potential applications of the tools, concerns about AI bias, and the ethical implications of their use. A word-frequency approach and Latent Dirichlet Allocation (LDA) topic modeling identify the main topics and sentiment trends in user discussions, offering insight into how public sentiment may shape the direction of AI development.
Link: https://arxiv.org/abs/2502.18513
Authors: Krishnaveni Katta
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 13 pages, 8 figures
Abstract:While there is an increased discourse on large language models (LLMs) like ChatGPT and DeepSeek, there is no comprehensive understanding of how users of online platforms, like Reddit, perceive these models. This is an important omission because public opinion can influence AI development, trust, and future policy. This study aims at analyzing Reddit discussions about ChatGPT and DeepSeek using sentiment and topic modeling to advance the understanding of user attitudes. Some of the significant topics such as trust in AI, user expectations, potential uses of the tools, reservations about AI biases, and ethical implications of their use are explored in this study. By examining these concerns, the study provides a sense of how public sentiment might shape the direction of AI development going forward. The report also mentions whether users have faith in the technology and what they see as its future. A word frequency approach is used to identify broad topics and sentiment trends. Also, topic modeling through the Latent Dirichlet Allocation (LDA) method identifies top topics in users’ language, for example, potential benefits of LLMs, their technological applications, and their overall social ramifications. The study aims to inform developers and policymakers by making it easier to see how users comprehend and experience these game-changing technologies.
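LDA topic modeling of the kind described is straightforward to reproduce at small scale, e.g. with scikit-learn. The toy posts below stand in for scraped Reddit data; the study's corpus and preprocessing are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for scraped Reddit posts.
posts = [
    "ChatGPT helped me draft emails but sometimes hallucinates facts",
    "DeepSeek feels fast and open but I worry about data privacy",
    "Can we trust AI models that are biased by their training data?",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")  # top words per discovered topic
```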
[NLP-104] Comprehensive Analysis of Transparency and Accessibility of ChatGPT DeepSeek And other SoTA Large Language Models
【Quick Read】: This paper addresses shortcomings in the transparency and accessibility of large language models (LLMs). Although the Open Source Initiative (OSI) has released a formal definition of open-source software, existing research lacks an in-depth discussion of the transparency and accessibility of state-of-the-art models. The key is a critical analysis of representative LLMs from the last five years (including ChatGPT, DeepSeek, and LLaMA), assessing their adherence to transparency standards and the implications of partial openness from two perspectives: open-source versus open-weight models. The findings reveal that many models labeled as open source do not fully disclose training data, code, or key metrics, and the paper calls for responsible and sustainable AI practices to ensure greater transparency, accountability, and ethical deployment.
Link: https://arxiv.org/abs/2502.18505
Authors: Ranjan Sapkota, Shaina Raza, Manoj Karkee
Affiliations: Cornell University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software. This definition, when combined with standard dictionary definitions and the sparse published literature, provide an initial framework to support broader accessibility to AI models such as LLMs, but more work is essential to capture the unique dynamics of openness in AI. In addition, concerns about open-washing, where models claim openness but lack full transparency, has been raised, which limits the reproducibility, bias mitigation, and domain adaptation of these models. In this context, our study critically analyzes SoTA LLMs from the last five years, including ChatGPT, DeepSeek, LLaMA, and others, to assess their adherence to transparency standards and the implications of partial openness. Specifically, we examine transparency and accessibility from two perspectives: open-source vs. open-weight models. Our findings reveal that while some models are labeled as open-source, this does not necessarily mean they are fully open-sourced. Even in the best cases, open-source models often do not report model training data, and code as well as key metrics, such as weight accessibility, and carbon emissions. To the best of our knowledge, this is the first study that systematically examines the transparency and accessibility of over 100 different SoTA LLMs through the dual lens of open-source and open-weight models. The findings open avenues for further research and call for responsible and sustainable AI practices to ensure greater transparency, accountability, and ethical deployment of these models.(DeepSeek transparency, ChatGPT accessibility, open source, DeepSeek open source)
[NLP-105] TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice NAACL2025
【Quick Read】: This paper addresses testing the robustness of large language models (LLMs) against adversarial prompts, i.e., their ability to withstand jailbreaking. The key is TurboFuzzLLM, a mutation-based fuzzing technique that efficiently finds a collection of effective jailbreaking templates which, combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. TurboFuzzLLM adds functional and efficiency-focused upgrades over existing template-based attack techniques to generate effective templates automatically, achieves ≥95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o / GPT-4 Turbo), generalizes well to unseen harmful questions, and helps improve model defenses against prompt attacks.
Link: https://arxiv.org/abs/2502.18504
Authors: Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi
Affiliations: Amazon Web Services
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at NAACL 2025 industry track, 12 pages, 5 figures
Abstract:Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves ≥95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o & GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps in improving model defenses to prompt attacks.
[NLP-106] Mechanistic Understanding of Language Models in Syntactic Code Completion AAAI2025
【Quick Read】: This paper probes the internal decision-making of code language models (Code LMs) on a syntactic completion task, focusing on how they use their syntactic or semantic knowledge. Taking CodeLlama-7b as a case study, it analyzes the roles of individual layers, multi-head attention (MHA), and feed-forward (FF) sub-layers on the closing-parenthesis task. The study finds that the model requires middle-to-later layers before it can confidently predict the correct closing-parenthesis label, and that while both MHA and FF sub-layers play essential roles, MHA is particularly crucial. It also identifies attention heads that precisely track the number of already-closed parentheses but may or may not promote the correct number of still-missing closing parentheses, impacting model performance positively or negatively. In short, the key is dissecting how different components contribute when the model performs this specific code-completion task.
Link: https://arxiv.org/abs/2502.18499
Authors: Samuel Miller, Daking Rai, Ziyu Yao
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, accepted to the AAAI 2025 Workshop on Towards Knowledgeable Foundation Models
Abstract:Recently, language models (LMs) have shown impressive proficiency in code generation tasks, especially when fine-tuned on code-specific datasets, commonly known as Code LMs. However, our understanding of the internal decision-making processes of Code LMs, such as how they use their (syntactic or semantic) knowledge, remains limited, which could lead to unintended harm as they are increasingly used in real life. This motivates us to conduct one of the first Mechanistic Interpretability works to understand how Code LMs perform a syntactic completion task, specifically the closing parenthesis task, on the CodeLlama-7b model (Roziere et al. 2023). Our findings reveal that the model requires middle-later layers until it can confidently predict the correct label for the closing parenthesis task. Additionally, we identify that while both multi-head attention (MHA) and feed-forward (FF) sub-layers play essential roles, MHA is particularly crucial. Furthermore, we also discover attention heads that keep track of the number of already closed parentheses precisely but may or may not promote a correct number of closing parentheses that are still missing, leading to a positive or negative impact on the model’s performance.
[NLP-107] AuPair: Golden Example Pairs for Code Repair
【Quick Read】: This paper addresses improving large language model (LLM) performance by scaling inference-time compute. The key is a method that synthesizes and selects an ordered set of golden example pairs (AuPairs), each pairing an initial flawed guess with its corresponding fix, and exploits the LLM's in-context learning ability for self-repair. For each problem, N AuPairs are used at inference time to generate N repaired solutions, and the highest-scoring one is selected as the final answer. Because the model sees a different repair example each time, it produces a diverse set of repaired solutions, yielding a significant performance boost over best-of-N and self-repair, strong generalization across datasets and models, and strong scaling with the inference-time compute budget.
Link: https://arxiv.org/abs/2502.18487
Authors: Aditi Mavalankar, Hassan Mansoor, Zita Marinho, Masha Samsikova, Tom Schaul
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response, or guess, the LLM corrects its own mistake and produces an improved response, or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of N LLM calls per problem, N AuPairs are used to generate N repaired solutions, out of which the highest-scoring solution is selected as the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-N and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows significantly stronger scaling with inference-time compute budget compared to baselines.
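The inference-time loop is simple to sketch: one AuPair per call as the in-context example, then pick the highest-scoring repair. The `llm` and `score` callables and the prompt format below are hypothetical; the paper's AuPair synthesis and complementarity-based selection are not implemented here.

```python
def aupair_repair(llm, score, problem, guess, aupairs):
    """Best-of-N self-repair with golden example pairs (a sketch).
    llm(prompt) -> str and score(solution) -> float are placeholder
    callables, not the paper's components."""
    candidates = []
    for bad, fixed in aupairs:  # one in-context example per LLM call
        prompt = (
            f"Example buggy solution:\n{bad}\nFixed solution:\n{fixed}\n\n"
            f"Problem:\n{problem}\nBuggy solution:\n{guess}\nFixed solution:\n"
        )
        candidates.append(llm(prompt))
    return max(candidates, key=score)  # highest-scoring repair wins
```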
[NLP-108] MixLLM : Dynamic Routing in Mixed Large Language Models NAACL2025
【Quick Read】: This paper addresses the high cost and response latency of using large language models (LLMs) in practice. Given a mix of models with different strengths and weaknesses, it proposes MixLLM, a dynamic contextual-bandit-based routing system that matches each query to the best model. The key is enhancing query embeddings with query tags, designing lightweight prediction models to estimate response quality and cost for each LLM, and using a meta-decision maker to choose the query-LLM assignment that best trades off quality, cost, and latency; continual training lets the system adapt to evolving queries and user feedback over time.
Link: https://arxiv.org/abs/2502.18482
Authors: Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen
Affiliations: Arizona State University; NEC Labs America
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 11 pages, 7 figures, accepted by NAACL 2025 main conference
Abstract:Large Language Models (LLMs) have recently shown potential toward artificial general intelligence; however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments to best tradeoff response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4’s quality at 24.18% of the cost under the time constraint).
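As a sketch of contextual-bandit routing in this spirit, the snippet below maintains one LinUCB arm per candidate LLM over query feature vectors; the reward could be, say, quality minus a cost penalty. LinUCB is a standard bandit algorithm; MixLLM's tag-enhanced embeddings, lightweight predictors, and meta-decision maker are not reproduced here.

```python
import numpy as np

class LinUCBRouter:
    """One LinUCB arm per LLM over query features (a sketch, not
    MixLLM's exact algorithm)."""
    def __init__(self, n_llms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_llms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_llms)]  # per-arm reward vectors

    def route(self, x):
        """Pick the LLM with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold observed feedback (e.g., quality - lambda * cost) back in."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```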
[NLP-109] QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration
【Quick Read】: This paper addresses the challenge of automatically extracting effective queries in information retrieval, particularly for exploring toxic content, which is often disguised. The key is QExplorer, an LLM-based query-extraction approach for toxic content exploration with a two-stage training process: instruction Supervised Fine-Tuning (SFT) followed by preference alignment with Direct Preference Optimization (DPO), with datasets constructed using feedback from the search system.
Link: https://arxiv.org/abs/2502.18480
Authors: Shaola Ren, Li Ke, Longtao Huang, Dehong Gao, Hui Xue
Affiliations: Alibaba Group; Northwestern Polytechnical University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Automatically extracting effective queries is challenging in information retrieval, especially in toxic content exploration, as such content is likely to be disguised. With the recent achievements in generative Large Language Model (LLM), we are able to leverage the capabilities of LLMs to extract effective queries for similar content exploration directly. This study proposes QExplorer, an approach of large language model based Query Extraction for toxic content Exploration. The QExplorer approach involves a 2-stage training process: instruction Supervised FineTuning (SFT) and preference alignment using Direct Preference Optimization (DPO), as well as the datasets construction with feedback of search system. To verify the effectiveness of QExplorer, a series of offline and online experiments are conducted on our real-world system. The offline empirical results demonstrate that the performance of our automatic query extraction outperforms that of several LLMs and humans. The online deployment shows a significant increase in the detection of toxic items.
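For reference, the DPO objective used in the second training stage is conventionally written as follows (this is the standard formulation from Rafailov et al., 2023, not an equation quoted from this paper), where $y_w$ and $y_l$ are the preferred and dispreferred outputs and $\pi_{\mathrm{ref}}$ is the frozen reference policy:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
```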
[NLP-110] FinBloom: Knowledge Grounding Large Language Model with Real-time Financial Data
【Quick Read】: This paper addresses the weakness of large language models (LLMs) on interactive tasks that require access to real-time information, especially in finance, where models must use up-to-date information such as recent news or price movements to support decision-making. The key is Financial Agent, a knowledge-grounding approach for handling financial queries with real-time textual and tabular data. Concretely, the authors build a Financial Context Dataset of over 50,000 financial queries paired with the required context, train FinBloom 7B, a custom 7-billion-parameter LLM, and fine-tune it on the dataset to serve as the Financial Agent, which generates relevant financial context to enable efficient real-time data retrieval for answering user queries. This significantly improves the ability of LLMs to handle dynamic financial tasks.
Link: https://arxiv.org/abs/2502.18471
Authors: Ankur Sinha, Chaitanya Agarwal, Pekka Malo
Affiliations: Indian Institute of Management Ahmedabad; Aalto University School of Business
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments: 27 pages, 9 tables
Abstract:Large language models (LLMs) excel at generating human-like responses but often struggle with interactive tasks that require access to real-time information. This limitation poses challenges in finance, where models must access up-to-date information, such as recent news or price movements, to support decision-making. To address this, we introduce Financial Agent, a knowledge-grounding approach for LLMs to handle financial queries using real-time text and tabular data. Our contributions are threefold: First, we develop a Financial Context Dataset of over 50,000 financial queries paired with the required context. Second, we train FinBloom 7B, a custom 7 billion parameter LLM, on 14 million financial news articles from Reuters and Deutsche Presse-Agentur, alongside 12 million Securities and Exchange Commission (SEC) filings. Third, we fine-tune FinBloom 7B using the Financial Context Dataset to serve as a Financial Agent. This agent generates relevant financial context, enabling efficient real-time data retrieval to answer user queries. By reducing latency and eliminating the need for users to manually provide accurate data, our approach significantly enhances the capability of LLMs to handle dynamic financial tasks. Our proposed approach makes real-time financial decisions, algorithmic trading and other related tasks streamlined, and is valuable in contexts with high-velocity data flows.
Computer Vision
[CV-0] Consistent Amortized Clustering via Generative Flow Networks AISTATS2025
【Quick Read】: This paper addresses the sensitivity of neural amortized clustering models to data order and their inability to provide assignment probabilities. The key is GFNCP (a Generative Flow Network clustering process) with a shared energy-based parametrization of policy and reward; the flow matching conditions are shown to be equivalent to consistency of the clustering posterior under marginalization, which in turn implies order invariance. GFNCP improves clustering performance on both synthetic and real-world data while avoiding the limitations of existing methods.
Link: https://arxiv.org/abs/2502.19337
Authors: Irit Chelly, Roy Uziel, Oren Freifeld, Ari Pakman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AISTATS 2025 on January 21, 2025
Abstract:Neural models for amortized probabilistic clustering yield samples of cluster labels given a set-structured input, while avoiding lengthy Markov chain runs and the need for explicit data likelihoods. Existing methods which label each data point sequentially, like the Neural Clustering Process, often lead to cluster assignments highly dependent on the data order. Alternatively, methods that sequentially create full clusters, do not provide assignment probabilities. In this paper, we introduce GFNCP, a novel framework for amortized clustering. GFNCP is formulated as a Generative Flow Network with a shared energy-based parametrization of policy and reward. We show that the flow matching conditions are equivalent to consistency of the clustering posterior under marginalization, which in turn implies order invariance. GFNCP also outperforms existing methods in clustering performance on both synthetic and real-world data.
[CV-1] Does 3D Gaussian Splatting Need Accurate Volumetric Rendering?
【Quick Read】: This paper asks whether more accurate volumetric rendering can improve the quality of 3D Gaussian Splatting (3DGS). The key is an in-depth analysis of the approximations and assumptions made by the original 3DGS solution, showing that although more accurate volumetric rendering can help for low numbers of primitives, the power of efficient optimization and the large number of Gaussians allows 3DGS to outperform volumetric rendering despite its approximations.
Link: https://arxiv.org/abs/2502.19318
Authors: Adam Celarek, George Kopanas, George Drettakis, Michael Wimmer, Bernhard Kerbl
Affiliations: TU Wien; Google; Inria; Université Côte d'Azur
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in Eurographics 2025, code: this https URL
Abstract:Since its introduction, 3D Gaussian Splatting (3DGS) has become an important reference method for learning 3D representations of a captured scene, allowing real-time novel-view synthesis with high visual quality and fast training times. Neural Radiance Fields (NeRFs), which preceded 3DGS, are based on a principled ray-marching approach for volumetric rendering. In contrast, while sharing a similar image formation model with NeRF, 3DGS uses a hybrid rendering solution that builds on the strengths of volume rendering and primitive rasterization. A crucial benefit of 3DGS is its performance, achieved through a set of approximations, in many cases with respect to volumetric rendering theory. A naturally arising question is whether replacing these approximations with more principled volumetric rendering solutions can improve the quality of 3DGS. In this paper, we present an in-depth analysis of the various approximations and assumptions used by the original 3DGS solution. We demonstrate that, while more accurate volumetric rendering can help for low numbers of primitives, the power of efficient optimization and the large number of Gaussians allows 3DGS to outperform volumetric rendering despite its approximations.
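For context, the image formation model that NeRF-style ray marching and 3DGS share is front-to-back alpha compositing of N ordered samples or primitives along a ray (the standard formulation, not an equation quoted from this paper):

```latex
C \;=\; \sum_{i=1}^{N} T_i\, \alpha_i\, c_i, \qquad T_i \;=\; \prod_{j=1}^{i-1} \left( 1 - \alpha_j \right)
```

where $c_i$ and $\alpha_i$ are the color and opacity of the $i$-th sample and $T_i$ is the accumulated transmittance. The approximations the paper analyzes concern how 3DGS evaluates and sorts the terms of this sum.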
[CV-2] Model Adaptation: Unsupervised Domain Adaptation without Source Data CVPR2020
【Quick Read】: This paper studies unsupervised model adaptation: improving an existing source prediction model on the target domain using only unlabeled target data, since labeled source data may be unavailable due to privacy concerns. The key is a collaborative class conditional generative adversarial net that bypasses dependence on source data by improving the prediction model through generated target-style data, together with a weight constraint that encourages similarity to the source model and a clustering-based regularization that produces more discriminative features in the target domain.
Link: https://arxiv.org/abs/2502.19316
Authors: Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, Si Wu
Affiliations: Department of Computer Science, City University of Hong Kong; School of Computer Science and Engineering, South China University of Technology; Department of Statistics and Actuarial Science, The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by CVPR 2020
Abstract:In this paper, we investigate a challenging unsupervised domain adaptation setting – unsupervised model adaptation. We aim to explore how to rely only on unlabeled target data to improve performance of an existing source prediction model on the target domain, since labeled source data may not be available in some real-world scenarios due to data privacy issues. For this purpose, we propose a new framework, which is referred to as collaborative class conditional generative adversarial net to bypass the dependence on the source data. Specifically, the prediction model is to be improved through generated target-style data, which provides more accurate guidance for the generator. As a result, the generator and the prediction model can collaborate with each other without source data. Furthermore, due to the lack of supervision from source data, we propose a weight constraint that encourages similarity to the source model. A clustering-based regularization is also introduced to produce more discriminative features in the target domain. Compared to conventional domain adaptation methods, our model achieves superior performance on multiple adaptation tasks with only unlabeled target data, which verifies its effectiveness in this challenging setting.
[CV-3] CoopDETR: A Unified Cooperative Perception Framework for 3D Detection via Object Query ICRA2025
【Quick Read】: This paper addresses balancing perception performance and transmission cost when enhancing the perception capabilities of autonomous vehicles (AVs). The key is CoopDETR, a novel cooperative perception framework that introduces object-level feature cooperation via object queries. It consists of two core modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving the information essential for detection; and cross-agent query fusion, comprising Spatial Query Matching (SQM) and Object Query Aggregation (OQA), which enables effective interaction between queries. Experiments show that CoopDETR achieves state-of-the-art performance while reducing transmission cost to 1/782 of previous methods.
Link: https://arxiv.org/abs/2502.19313
Authors: Zhe Wang, Shaocong Xu, Xucai Zhuang, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang
Affiliations: Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, ICRA 2025
Abstract:Cooperative perception enhances the individual perception capabilities of autonomous vehicles (AVs) by providing a comprehensive view of the environment. However, balancing perception performance and transmission costs remains a significant challenge. Current approaches that transmit region-level features across agents are limited in interpretability and demand substantial bandwidth, making them unsuitable for practical applications. In this work, we propose CoopDETR, a novel cooperative perception framework that introduces object-level feature cooperation via object query. Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving essential information for detection; and cross-agent query fusion, which includes Spatial Query Matching (SQM) and Object Query Aggregation (OQA) to enable effective interaction between queries. Our experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.
[CV-4] Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions
【Quick Read】: This paper addresses the time pathologists spend writing reports for melanocytic skin lesions: common nevi can be diagnosed in seconds, but writing the corresponding pathology report takes much longer. To relieve the growing workload, the authors develop a vision-language model specialized for the pathology domain of cutaneous melanocytic lesions, following the Contrastive Captioner framework and trained and evaluated on a dataset of 42,512 H&E-stained whole slide images and 19,645 corresponding pathology reports. For common nevi, model-generated reports received quality scores on par with pathologist-written reports in an expert reader study, and although report generation proved harder for rare melanocytic lesion subtypes, cross-modal retrieval performance for those cases was considerably better. The key is the development of this vision-language model for automated report generation.
Link: https://arxiv.org/abs/2502.19293
Authors: Ruben T. Lucassen, Sander P.J. Moonemans, Tijn van de Luijtgaarden, Gerben E. Breimer, Willeke A.M. Blokx, Mitko Veta
Affiliations: Dept. of Pathology, University Medical Center Utrecht, the Netherlands; Dept. of Biomedical Engineering, Eindhoven University of Technology, the Netherlands; Dept. of Mathematics and Computer Science, Eindhoven University of Technology, the Netherlands
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures
Abstract:Millions of melanocytic skin lesions are examined by pathologists each year, the majority of which concern common nevi (i.e., ordinary moles). While most of these lesions can be diagnosed in seconds, writing the corresponding pathology report is much more time-consuming. Automating part of the report writing could, therefore, alleviate the increasing workload of pathologists. In this work, we develop a vision-language model specifically for the pathology domain of cutaneous melanocytic lesions. The model follows the Contrastive Captioner framework and was trained and evaluated using a melanocytic lesion dataset of 42,512 H&E-stained whole slide images and 19,645 corresponding pathology reports. Our results show that the quality scores of model-generated reports were on par with pathologist-written reports for common nevi, assessed by an expert pathologist in a reader study. While report generation proved to be more difficult for rare melanocytic lesion subtypes, the cross-modal retrieval performance for these cases was considerably better.
[CV-5] On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
【Quick Read】: This paper addresses hallucinated sentences in generated pathology reports, caused by training vision-language models on reports containing information that cannot be inferred from the paired slide images. The key is preprocessing the report text to keep only the sentences describing cell and tissue appearance in the H&E-stained slides, which prevents hallucination during report generation. The results also show that despite the improved quality of generated reports, the vision-language model trained on full reports achieved better cross-modal retrieval performance.
Link: https://arxiv.org/abs/2502.19285
Authors: Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P.J. Moonemans, Gerben E. Breimer, Willeke A.M. Blokx, Mitko Veta
Affiliations: University Medical Center Utrecht; Eindhoven University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 1 figure
Abstract:Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improvement in the quality of the generated reports, training the vision-language model on full reports showed better cross-modal retrieval performance.
[CV-6] Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models
【Quick Read】: This paper addresses the vulnerability of pre-trained vision-language models (VLMs) such as CLIP to backdoor attacks. Existing defenses fine-tune the entire suspicious model, yet offer only marginal resistance and often reduce clean accuracy, especially in data-limited settings, likely because of the mismatch between scarce fine-tuning data and the massive number of parameters in VLMs. The key is Class-wise Backdoor Prompt Tuning (CBPT), an efficient and effective defense that operates on text prompts to indirectly purify poisoned VLMs: contrastive learning with carefully crafted positive and negative samples inverts the trigger the attacker is likely to have used, and efficient prompt tuning then optimizes class-wise text prompts to modify the model's decision boundary and reclassify the feature regions of backdoor triggers. Experiments show that CBPT significantly mitigates backdoor threats while preserving model utility.
Link: https://arxiv.org/abs/2502.19269
Authors: Jiawei Kong, Hao Fang, Sihang Guo, Chenxi Qing, Bin Chen, Bin Wang, Shu-Tao Xia
Affiliations: Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit excellent representational capabilities for multimodal data, recent studies have shown that they are vulnerable to backdoor attacks. To alleviate the threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model, yet offer only marginal resistance to state-of-the-art attacks and often result in a decrease in clean accuracy, particularly in data-limited scenarios. Their failure may be attributed to the mismatch between insufficient fine-tuning data and massive parameters in VLMs. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT) defense, an efficient and effective method that operates on the text prompts to indirectly purify the poisoned VLMs. Specifically, we first employ the advanced contrastive learning via our carefully crafted positive and negative samples, to effectively invert the backdoor triggers that are potentially adopted by the attacker. Once the dummy trigger is established, we utilize the efficient prompt tuning technique to optimize these class-wise text prompts for modifying the model’s decision boundary to further reclassify the feature regions of backdoor triggers. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g. an average Clean Accuracy (CA) of 58.86% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt purifying design to strengthen model robustness against backdoor attacks.
[CV-7] EMT: A Visual Multi-Task Benchmark Dataset for Autonomous Driving in the Arab Gulf Region
【Quick Read】: This paper addresses the lack of publicly available datasets for autonomous driving in the Arab Gulf region. The key is the Emirates Multi-Task (EMT) dataset, which contains over 30,000 dash-camera frames and 570,000 annotated bounding boxes covering approximately 150 km of driving routes. EMT supports three primary tasks (multi-object tracking, trajectory forecasting, and intention prediction), each with a corresponding benchmark: multi-agent tracking experiments, trajectory forecasting with deep sequential and interaction-aware models, and intention prediction from observed trajectories. The dataset, preprocessing scripts, and evaluation models are publicly available at the linked URLs.
Link: https://arxiv.org/abs/2502.19260
Authors: Nadya Abdel Madjid, Murad Mebrahtu, Abdelmoamen Nasser, Bilal Hassan, Naoufel Werghi, Jorge Dias, Majid Khonji
Affiliations: Khalifa University of Science and Technology; Computer Science, Khalifa University; New York University; Computer and Information Engineering, Khalifa University; KUCARS-KU Center for Autonomous Robotic Systems, Khalifa University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures
Abstract:This paper introduces the Emirates Multi-Task (EMT) dataset - the first publicly available dataset for autonomous driving collected in the Arab Gulf region. The EMT dataset captures the unique road topology, high traffic congestion, and distinctive characteristics of the Gulf region, including variations in pedestrian clothing and weather conditions. It contains over 30,000 frames from a dash-camera perspective, along with 570,000 annotated bounding boxes, covering approximately 150 kilometers of driving routes. The EMT dataset supports three primary tasks: tracking, trajectory forecasting and intention prediction. Each benchmark dataset is complemented with corresponding evaluations: (1) multi-agent tracking experiments, focusing on multi-class scenarios and occlusion handling; (2) trajectory forecasting evaluation using deep sequential and interaction-aware models; and (3) intention benchmark experiments conducted for predicting agents intentions from observed trajectories. The dataset is publicly available at this https URL, and pre-processing scripts along with evaluation models can be accessed at this https URL.
[CV-8] ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
【Quick Read】: This paper addresses the difficulty robots have in generalizing learned dexterous manipulation skills to unseen objects of the same category. The key is ObjectVLA, which leverages vision-language pair data so that robots can generalize skills to novel objects without explicit human demonstrations for each new target. By establishing an implicit link between an object and the desired action, the approach offers a lightweight, scalable way to inject knowledge about target objects.
Link: https://arxiv.org/abs/2502.19250
Authors: Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng
Affiliations: Midea Group; East China Normal University; Shanghai University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL
Abstract:Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of human demonstration data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as “hand over the apple,” struggles to transfer its skills to a semantically similar but visually different object, such as “hand over the peach.” This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as \textbfObjectVLA. Our model enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate in selecting objects not seen during training. Furthermore, we propose a more accessible method for enhancing object generalization in VLA models, using a smartphone to capture a few images and fine-tune the pre-trained model. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.
zh
[CV-9] ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
【速读】:该论文旨在解决基于语言指令在实时3D环境中交互时,从RGB-D图像渲染出的点云包含大量冗余背景数据和固有噪声的问题,这些问题干扰了目标区域的流形结构。现有方法通常需要繁琐的过程来改善流形结构,不适用于实时任务。论文的关键解决方案是提出了适用于多模态任务的代理变换(Proxy Transformation),通过可变形点聚类(Deformable Point Clustering)识别目标区域内的点云子流形,并利用代理注意力模块(Proxy Attention)引导点云变换。在此基础上,设计了一个子流形变换生成模块,其中文本信息全局指导不同子流形的平移向量,优化目标区域的相对空间关系;同时,图像信息指导每个子流形内的线性变换,从而精炼目标区域的局部点云流形。实验结果表明,该方法显著优于现有方法,在简单目标上提升了7.49%,在复杂目标上提升了4.60%,并且减少了注意力块的计算开销40.6%。
链接: https://arxiv.org/abs/2502.19247
作者: Qihang Peng,Henry Zheng,Gao Huang
机构: Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures
点击查看摘要
Abstract:Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation, suitable for multimodal tasks, to efficiently improve the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing the computational overhead of attention blocks by 40.6%. These results establish a new SOTA in ego-centric 3D visual grounding, showcasing the effectiveness and robustness of our approach.
zh
[CV-10] Arbitrary Volumetric Refocusing of Dense and Sparse Light Fields
【速读】:该论文旨在解决多区域同时对焦的问题,特别是针对稀疏光场(Sparse Light Field, SLF)在对焦过程中产生的鬼影效应。关键在于提出了一种端到端的流程,利用像素相关的位移与典型的移位和求和方法(Shift-and-Sum)独立地对光场中的每个像素进行对焦。此外,通过采用基于U-Net架构的深度学习模型,几乎完全消除了稀疏光场对焦过程中出现的鬼影伪影,从而显著提升了结构相似性指数(Structural Similarity Index, SSIM),即使在数据量仅为密集光场(Dense Light Field, DLF)20%的情况下也能达到SSIM高于0.9的效果。
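作为补充,下面给出一个逐像素移位求和(shift-and-sum)重对焦的极简 NumPy 草图,用于说明“像素相关位移”的思路;其中视差图 disparity_map、最近邻取整等均为示意性假设,并非论文的官方实现,也未包含用于消除鬼影的 U-Net 部分。

```python
import numpy as np

def refocus_shift_and_sum(lf, disparity_map):
    """逐像素移位求和重对焦(示意实现)。

    lf: 形状 (U, V, H, W) 的灰度光场,(U, V) 为子孔径网格坐标;
    disparity_map: 形状 (H, W),每个像素期望对焦深度对应的视差(假设输入)。
    """
    U, V, H, W = lf.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0  # 参考中心子孔径
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    out = np.zeros((H, W), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            # 像素相关位移:该像素的视差 × 子孔径相对中心的偏移
            sy = np.clip(np.round(ys + disparity_map * (u - uc)).astype(int), 0, H - 1)
            sx = np.clip(np.round(xs + disparity_map * (v - vc)).astype(int), 0, W - 1)
            out += lf[u, v, sy, sx]
    return out / (U * V)
```

由于每个像素拥有独立的视差,不同区域可以同时处于对焦或失焦状态,这正是多区域任意重对焦的核心。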
链接: https://arxiv.org/abs/2502.19238
作者: Tharindu Samarakoon,Kalana Abeywardena,Chamira U. S. Edussooriya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 9 pages, 7 figures, 3 tables
点击查看摘要
Abstract:A four-dimensional light field (LF) captures both textural and geometrical information of a scene in contrast to a two-dimensional image that captures only the textural information of a scene. Post-capture refocusing is an exciting application of LFs enabled by the geometric information captured. Previously proposed LF refocusing methods are mostly limited to the refocusing of a single planar or volumetric region of a scene corresponding to a depth range and cannot simultaneously generate in-focus and out-of-focus regions having the same depth range. In this paper, we propose an end-to-end pipeline to simultaneously refocus multiple arbitrary planar or volumetric regions of a dense or a sparse LF. We employ pixel-dependent shifts with the typical shift-and-sum method to refocus an LF. The pixel-dependent shifts enable each pixel of an LF to be refocused independently. For sparse LFs, the shift-and-sum method introduces ghosting artifacts due to the spatial undersampling. We employ a deep learning model based on U-Net architecture to almost completely eliminate the ghosting artifacts. The experimental results obtained with several LF datasets confirm the effectiveness of the proposed method. In particular, sparse LFs refocused with the proposed method achieve a structural similarity index higher than 0.9 despite having only 20% of the data compared to dense LFs.
zh
[CV-11] A Lightweight and Extensible Cell Segmentation and Classification Model for Whole Slide Images
【速读】:该论文旨在解决数字病理学中细胞级分析工具开发面临的挑战,包括数据集粒度限制、标注不一致、高计算需求以及新技术整合难题。解决方案的关键在于构建一个轻量且可扩展的细胞分割与分类模型,通过更新数据标签、利用H-Optimus基础模型改进特征表示、知识蒸馏以减小模型大小和复杂性,并将其集成到QuPath平台中,从而提升模型性能和实用性。
链接: https://arxiv.org/abs/2502.19217
作者: Nikita Shvetsov,Thomas K. Kilvaer,Masoud Tafavvoghi,Anders Sildnes,Kajsa Møllersen,Lill-Tove Rasmussen Busund,Lars Ailo Bongo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 11 figures
点击查看摘要
Abstract:Developing clinically useful cell-level analysis tools in digital pathology remains challenging due to limitations in dataset granularity, inconsistent annotations, high computational demands, and difficulties integrating new technologies into workflows. To address these issues, we propose a solution that enhances data quality, model performance, and usability by creating a lightweight, extensible cell segmentation and classification model. First, we update data labels through cross-relabeling to refine annotations of PanNuke and MoNuSAC, producing a unified dataset with seven distinct cell types. Second, we leverage the H-Optimus foundation model as a fixed encoder to improve feature representation for simultaneous segmentation and classification tasks. Third, to address foundation models’ computational demands, we distill knowledge to reduce model size and complexity while maintaining comparable performance. Finally, we integrate the distilled model into QuPath, a widely used open-source digital pathology platform. Results demonstrate improved segmentation and classification performance using the H-Optimus-based model compared to a CNN-based model. Specifically, average R^2 improved from 0.575 to 0.871, and average PQ score improved from 0.450 to 0.492, indicating better alignment with actual cell counts and enhanced segmentation quality. The distilled model maintains comparable performance while reducing parameter count by a factor of 48. By reducing computational complexity and integrating into workflows, this approach may significantly impact diagnostics, reduce pathologist workload, and improve outcomes. Although the method shows promise, extensive validation is necessary prior to clinical deployment.
zh
[CV-12] Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator
【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation, MDE)中的伪标签蒸馏效果不佳的问题。当前方法依赖全局归一化处理深度信息,这在处理噪声伪标签时效果有限。论文的关键解决方案是提出了跨上下文蒸馏(Cross-Context Distillation),该方法结合全局和局部深度线索以提升伪标签质量,并引入多教师蒸馏框架利用不同深度估计模型的互补优势,从而实现更鲁棒和精确的深度预测。
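为直观说明“全局归一化会放大噪声伪标签、需结合局部上下文”的动机,下面给出一个跨上下文蒸馏损失的 PyTorch 草图;窗口大小、L1 损失形式与归一化方式(中位数/平均绝对偏差)均为常见做法下的假设,具体细节以论文为准。

```python
import torch
import torch.nn.functional as F

def norm_depth(d, eps=1e-6):
    # 尺度-平移不变归一化:减中位数、除以平均绝对偏差
    med = d.median()
    mad = (d - med).abs().mean().clamp_min(eps)
    return (d - med) / mad

def cross_context_distill_loss(student, teacher, win=64, n_crops=4):
    """student/teacher: (H, W) 深度图(假设 H、W 均大于 win)。

    全局项 + 随机局部窗口项,示意性地融合全局与局部深度线索。
    """
    loss = F.l1_loss(norm_depth(student), norm_depth(teacher))  # 全局上下文
    H, W = student.shape
    for _ in range(n_crops):  # 局部上下文:在随机窗口内各自归一化后对齐
        y = torch.randint(0, H - win, (1,)).item()
        x = torch.randint(0, W - win, (1,)).item()
        s = student[y:y + win, x:x + win]
        t = teacher[y:y + win, x:x + win]
        loss = loss + F.l1_loss(norm_depth(s), norm_depth(t)) / n_crops
    return loss
```

局部窗口内各自归一化后再对齐,可以抑制个别离群伪标签对全局尺度估计的干扰。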
链接: https://arxiv.org/abs/2502.19204
作者: Xiankang He,Dongyan Guo,Hongji Li,Ruibo Li,Ying Cui,Chi Zhang
机构: Zhejiang University of Technology (浙江工业大学); AGI Lab, Westlake University (西湖大学AGI实验室); Lanzhou University (兰州大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation, relying on global normalization, can amplify noisy pseudo-labels, reducing distillation effectiveness. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
zh
[CV-13] HDM: Hybrid Diffusion Model for Unified Image Anomaly Detection
【速读】:该论文旨在解决图像异常检测中复杂且多样的异常模式难以处理的问题,特别是现有方法在生成与判别任务分离的情况下,难以有效协调异常样本生成与异常区域检测。论文的关键解决方案在于提出了一种新颖的混合扩散模型(Hybrid Diffusion Model, HDM),该模型将生成与判别任务整合到一个统一框架中。HDM 包含三个核心模块:扩散异常生成模块(Diffusion Anomaly Generation Module, DAGM)、扩散判别模块(Diffusion Discriminative Module, DDM)以及概率优化模块(Probability Optimization Module, POM)。DAGM 生成逼真且多样化的异常样本,DDM 利用逆向扩散过程捕捉生成样本与正常样本之间的差异,从而实现基于概率分布的精确异常区域检测与定位,而 POM 在生成和判别阶段优化概率分布,确保高质量样本用于训练。
链接: https://arxiv.org/abs/2502.19200
作者: Zekang Weng,Jinjin Shi,Jinwei Wang,Zeming Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image anomaly detection plays a vital role in applications such as industrial quality inspection and medical imaging, where it directly contributes to improving product quality and system reliability. However, existing methods often struggle with complex and diverse anomaly patterns. In particular, the separation between generation and discrimination tasks limits the effective coordination between anomaly sample generation and anomaly region detection. To address these challenges, we propose a novel hybrid diffusion model (HDM) that integrates generation and discrimination into a unified framework. The model consists of three key modules: the Diffusion Anomaly Generation Module (DAGM), the Diffusion Discriminative Module (DDM), and the Probability Optimization Module (POM). DAGM generates realistic and diverse anomaly samples, improving their representativeness. DDM then applies a reverse diffusion process to capture the differences between generated and normal samples, enabling precise anomaly region detection and localization based on probability distributions. POM refines the probability distributions during both the generation and discrimination phases, ensuring high-quality samples are used for training. Extensive experiments on multiple industrial image datasets demonstrate that our method outperforms state-of-the-art approaches, significantly improving both image-level and pixel-level anomaly detection performance, as measured by AUROC.
zh
[CV-14] EGR-Net: A Novel Embedding Gramian Representation CNN for Intelligent Fault Diagnosis
【速读】:该论文旨在解决旋转机械智能故障诊断中特征提取的问题。现有的一维信号到二维图像的表征方法存在计算复杂度高和可分离性差的问题,同时基于二维卷积神经网络(2D-CNN)的方法在转换过程中不可避免地导致信息丢失。为了解决这些问题,论文提出了两种关键方案:首先,提出了一种新的表征方法Embedding Gramian Representation (EGR),它易于计算且具有良好的可分离性;其次,设计了一种双分支的基于EGR的卷积神经网络EGR-Net,通过桥接连接改进两分支之间的特征学习交互,以从原始信号特征图及其对应的EGR中共同学习故障特征。
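EGR 的精确定义以论文为准;下面以“时延嵌入 + Gram 矩阵”的常见组合给出一个示意草图,嵌入维度 dim 与延迟 delay 均为假设参数,仅用于说明如何以较低计算量把一维振动信号映射为能反映内在周期性的二维表征。

```python
import numpy as np

def embedding_gramian(signal, dim=64, delay=2):
    """将一维信号做时延嵌入后计算 Gram 矩阵(示意实现)。

    signal: 一维振动信号;返回 (dim, dim) 的二维表征。
    """
    n = len(signal) - (dim - 1) * delay
    assert n > 0, "信号长度不足以完成嵌入"
    # 时延嵌入:每列是一条延迟轨迹,形状 (n, dim)
    X = np.stack([signal[i * delay:i * delay + n] for i in range(dim)], axis=1)
    X = (X - X.mean(0)) / (X.std(0) + 1e-8)  # 逐维标准化
    G = X.T @ X / n                          # Gram 矩阵,捕捉各延迟轨迹间的相关性
    return G
```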
链接: https://arxiv.org/abs/2502.19199
作者: Linshan Jia
机构: Department of Electrical Engineering, City University of Hong Kong (香港城市大学电气工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Feature extraction is crucial in intelligent fault diagnosis of rotating machinery. It is easier for convolutional neural networks (CNNs) to visually recognize and learn fault features by converting the complicated one-dimensional (1D) vibrational signals into two-dimensional (2D) images with simple textures. However, the existing representation methods for encoding 1D signals as images have two main problems, including complicated computation and low separability. Meanwhile, the existing 2D-CNN fault diagnosis methods taking 2D images as the only inputs still suffer from the inevitable information loss because of the conversion process. Considering the above issues, this paper proposes a new 1D-to-2D conversion method called Embedding Gramian Representation (EGR), which is easy to calculate and shows good separability. In EGR, 1D signals are projected in the embedding space and the intrinsic periodicity of vibrational signals is captured, enabling the faulty characteristics contained in raw signals to be uncovered. Second, aiming at the information loss problem of existing CNN models with the single input of converted images, a double-branch EGR-based CNN, called EGR-Net, is proposed to learn faulty features from both raw signal feature maps and their corresponding EGRs. The bridge connection is designed to improve the feature learning interaction between the two branches. Widely used open-domain gearbox and bearing datasets are used to verify the effectiveness and efficiency of the proposed methods. EGR-Net is compared with traditional and state-of-the-art approaches, and the results show that the proposed method can deliver enhanced performance.
zh
[CV-15] Self-supervised conformal prediction for uncertainty quantification in Poisson imaging problems
【速读】:该论文旨在解决图像复原过程中不确定性量化不足的问题。对于许多不适定的图像恢复问题,传统的图像恢复方法难以准确评估重建图像的不确定性。论文提出的关键解决方案是一种自监督的符合预测方法,该方法利用Poisson无偏风险估计器消除了对真实标签数据的需求。这种方法适用于任何条件数不良的Poisson线性成像问题,并且在与现代直接基于测量数据训练的自监督图像恢复技术结合使用时尤为有效。
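下面给出分裂式符合预测(split conformal prediction)校准流程的一个极简草图:用无标签校准集上的自监督分数替代依赖真值的残差。其中 pure_like_score 只是一个占位的泊松标准化残差,真实方法使用 Poisson 无偏风险估计器(PURE),此处仅作示意;校准数据也是随机生成的假设数据。

```python
import numpy as np

def pure_like_score(restored, measured):
    """自监督误差代理(假设形式):泊松观测下的逐像素标准化残差。"""
    return np.abs(measured - restored) / np.sqrt(np.maximum(restored, 1.0))

def conformal_quantile(cal_scores, alpha=0.1):
    """分裂式符合预测:带有限样本修正的 (1 - alpha) 经验分位数。"""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

# 用法示意:cal_pairs 为若干 (复原图, 泊松观测图) 的校准对(假设数据)
cal_pairs = [(np.full((8, 8), 10.0), np.random.poisson(10.0, (8, 8)).astype(float))
             for _ in range(100)]
scores = np.array([pure_like_score(r, y).mean() for r, y in cal_pairs])
q_hat = conformal_quantile(scores)  # 新图像的不确定性区间可由 q_hat 缩放得到
```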
链接: https://arxiv.org/abs/2502.19194
作者: Bernardin Tamo Amougou,Marcelo Pereyra,Barbara Pascal
机构: Heriot-Watt University(赫瑞-瓦特大学), Edinburgh, UK; Université de Paris Cité(巴黎城市大学), Paris, France; Nantes Université(南特大学), École Centrale Nantes(中央南特大学), CNRS, LS2N, UMR 6004, F-44000 Nantes, France
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Image restoration problems are often ill-posed, leading to significant uncertainty in reconstructed images. Accurately quantifying this uncertainty is essential for the reliable interpretation of reconstructed images. However, image restoration methods often lack uncertainty quantification capabilities. Conformal prediction offers a rigorous framework to augment image restoration methods with accurate uncertainty quantification estimates, but it typically requires abundant ground truth data for calibration. This paper presents a self-supervised conformal prediction method for Poisson imaging problems which leverages the Poisson Unbiased Risk Estimator to eliminate the need for ground truth data. The resulting self-calibrating conformal prediction approach is applicable to any Poisson linear imaging problem that is ill-conditioned, and is particularly effective when combined with modern self-supervised image restoration techniques trained directly on measurement data. The proposed method is demonstrated through numerical experiments on image denoising and deblurring; its performance is comparable to supervised conformal prediction methods relying on ground truth data.
zh
[CV-16] Knowledge Distillation for Semantic Segmentation: A Label Space Unification Approach
【速读】:该论文旨在解决由于不同数据集在分类法和/或标注策略上的不一致性,导致难以训练更大更优模型的问题。解决方案的关键在于提出了一种知识蒸馏方法,同时作为标签空间统一的方法应用于语义分割任务。具体而言,首先使用源数据集训练教师模型,然后利用该教师模型为具有相关标签空间的真实标签数据生成伪标签。通过将相关的分类法映射到源分类法,创建模型预测伪标签的约束条件。基于改进后的伪标签,训练的学生模型在城市和非公路驾驶这两个具有挑战性的领域中始终优于其教师模型。
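其中“将相关分类法映射到源分类法并据此约束伪标签”一步,可以用下面的玩具示例说明:教师 logits 先被掩码到目标数据粗标签所允许的源类别子集,再在子集内取 argmax。映射表 TAXONOMY_MAP 与类别编号均为假设示例,并非论文使用的真实分类法。

```python
import torch

# 假设的分类法映射:目标数据集的粗类 -> 源分类法中允许的细类编号集合
TAXONOMY_MAP = {
    "vehicle": [3, 4, 5],  # 例如源类: car, truck, bus
    "terrain": [7, 8],     # 例如源类: grass, dirt
}

def constrained_pseudo_label(teacher_logits, coarse_label):
    """teacher_logits: (C, H, W);coarse_label: 该区域在目标数据集中的粗标签名。"""
    allowed = TAXONOMY_MAP[coarse_label]
    masked = torch.full_like(teacher_logits, float("-inf"))
    masked[allowed] = teacher_logits[allowed]  # 仅在允许的源类内比较
    return masked.argmax(dim=0)                # (H, W) 的细粒度伪标签

logits = torch.randn(10, 4, 4)
print(constrained_pseudo_label(logits, "vehicle"))  # 取值必落在 {3, 4, 5}
```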
链接: https://arxiv.org/abs/2502.19177
作者: Anton Backhaus,Thorsten Luettel,Mirko Maehlisch
机构: Chair of Machine Perception for Autonomous Driving, Department of Aerospace Engineering, University of the Bundeswehr Munich(德国联邦国防军大学航天工程学院自主驾驶机器感知系), Germany(德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:An increasing number of datasets sharing similar domains for semantic segmentation have been published over the past few years. But despite the growing amount of overall data, it is still difficult to train bigger and better models due to inconsistency in taxonomy and/or labeling policies of different datasets. To this end, we propose a knowledge distillation approach that also serves as a label space unification method for semantic segmentation. In short, a teacher model is trained on a source dataset with a given taxonomy, then used to pseudo-label additional data for which ground truth labels of a related label space exist. By mapping the related taxonomies to the source taxonomy, we create constraints within which the model can predict pseudo-labels. Using the improved pseudo-labels we train student models that consistently outperform their teachers in two challenging domains, namely urban and off-road driving. Our ground truth-corrected pseudo-labels span over 12 and 7 public datasets with 388,230 and 18,558 images for the urban and off-road domains, respectively, creating the largest compound datasets for autonomous driving to date.
zh
[CV-17] A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLM s
【速读】:该论文旨在解决在资源受限场景下加速推断的同时保持模型性能的问题。深度剪枝(Depth-wise pruning)相比宽度剪枝(Width-wise pruning)能显著加速推断,但以整个Transformer层为最小剪枝单元会不加区分地丢弃整层信息,从而损害模型性能。论文的关键在于通过分析大型语言模型中不同层输出在再生核希尔伯特空间(Reproducing Kernel Hilbert Space, RKHS)中的相关性,揭示层与层之间的“补丁状”特征关系。在此基础上,论文提出了滑动层合并方法,该方法根据预定义的相似度阈值自顶向下动态选择并融合连续的层,从而简化模型结构同时保持其性能。实验结果表明,所提方法在零样本推理性能和剪枝后的再训练恢复质量方面均优于现有剪枝技术。
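下面给出滑动层合并的一个高度简化 PyTorch 草图:相似度以各层输入-输出的余弦代替论文中基于 RKHS 的相关性,合并方式取参数平均,二者均为示意性假设,并非论文的确切算法。

```python
import torch

@torch.no_grad()
def sliding_layer_merge(layers, x, threshold=0.9):
    """自顶向下滑动合并高度相似的连续层(示意实现)。

    layers: 结构相同的层模块列表;x: 一批校准样本的输入隐状态;
    以各层输入-输出余弦相似度近似 RKHS 相关性(假设性替代)。
    """
    sims, h = [], x
    for layer in layers:
        out = layer(h)
        sims.append(torch.cosine_similarity(h.flatten(), out.flatten(), dim=0).item())
        h = out

    kept, i = [], len(layers) - 1
    while i >= 0:
        j = i
        while j > 0 and sims[j] >= threshold:
            j -= 1  # 相似度高于阈值,继续向下扩大合并窗口
        group = layers[j:i + 1]  # 窗口底部的层作为融合基座
        fused = group[0]
        if len(group) > 1:
            for name, p in fused.named_parameters():
                p.copy_(torch.stack(
                    [dict(m.named_parameters())[name] for m in group]).mean(0))
        kept.append(fused)
        i = j - 1
    return kept[::-1]  # 恢复由浅到深的顺序
```

实际方法中的相似度度量与融合策略更精细,且合并后通常需要少量再训练以恢复性能。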
链接: https://arxiv.org/abs/2502.19159
作者: Xuan Ding,Yao Zhu,Yunjian Zhang,Chuanlong Xie
机构: Beijing Normal University(北京师范大学); Zhejiang University(浙江大学); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding the entire information of the layer. This paper reveals the “Patch-like” feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35% pruning on the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at this https URL.
zh
[CV-18] SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation ICRA2025
【速读】:该论文旨在解决跨模态3D检索任务中由于3D数据稀缺和昂贵导致现有方法性能受限的问题。关键在于引入SCA3D,这是一种新颖的3D形状和标题在线数据增强方法。SCA3D利用LLaVA模型创建组件库,并为数据集中每个3D形状的每一部分生成描述,从而生成包含新语义特征的大量新的3D-文本对。此外,通过使用单模态编码器提取基于丰富数据集的3D形状和文本嵌入,并采用地球移动者距离(EMD)计算细粒度跨模态相似性,结合对比学习增强跨模态匹配,实现了文本和3D形状之间的双向检索。
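其中“以 EMD 计算细粒度跨模态相似度”一步可以粗略示意如下:对 3D 形状的局部特征与文本词特征构造余弦距离矩阵,再以最优一一指派近似等权 EMD;论文的具体实现(如特征权重设置)可能不同,此处仅为假设性草图。

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_similarity(shape_feats, text_feats):
    """shape_feats: (n, d) 3D 局部特征;text_feats: (m, d) 文本词特征。

    等权情形下 EMD 可用最优一一指派近似(n == m 时精确,否则仅为示意)。
    """
    a = shape_feats / np.linalg.norm(shape_feats, axis=1, keepdims=True)
    b = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                      # 余弦距离矩阵 (n, m)
    rows, cols = linear_sum_assignment(cost)  # 最小代价匹配
    return 1.0 - cost[rows, cols].mean()      # 转回相似度,越大越匹配

s = emd_similarity(np.random.randn(16, 128), np.random.randn(12, 128))
```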
链接: https://arxiv.org/abs/2502.19128
作者: Junlong Ren,Hao Wu,Hui Xiong,Hao Wang
机构: AI Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025
点击查看摘要
Abstract:The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover’s Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in this https URL.
zh
[CV-19] he NeRF Signature: Codebook-Aided Watermarking for Neural Radiance Fields
【速读】:该论文旨在解决NeRF (Neural Radiance Fields) 模型的版权保护问题。现有的嵌入数字水印方法通常忽视了模型层面的关键考虑因素,并且存在显著的时间开销,导致不可感知性和鲁棒性降低,以及用户不便。论文的关键解决方案是提出了一种名为NeRF签名(NeRF Signature)的新颖水印方法。该方法采用基于码本的签名嵌入(CSE),不改变模型结构,从而保持了不可感知性并增强了模型层面的鲁棒性。此外,通过优化后可以嵌入任何所需的签名,且在NeRF所有者使用新的二进制签名时无需微调。同时,引入了联合姿态-补丁加密水印策略以提高鲁棒性,并探索了复杂度感知密钥选择(CAKS)方案以增强不可感知性。实验结果表明,所提方法在不可感知性和鲁棒性方面优于其他基线方法。
链接: https://arxiv.org/abs/2502.19125
作者: Ziyuan Luo,Anderson Rocha,Boxin Shi,Qing Guo,Haoliang Li,Renjie Wan
机构: Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China (香港浸会大学计算机科学系); Institute of Computing, University of Campinas, Brazil (巴西坎皮纳斯大学计算研究所); State Key Laboratory of Multimedia Information Processing and National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, 100871, China (北京大学多媒体信息处理国家重点实验室及视觉技术国家工程研究中心,计算机科学学院); A*STAR, Singapore (新加坡科技研究局); Department of Electrical Engineering, City University of Hong Kong, Hong Kong (香港城市大学电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, accepted by TPAMI
点击查看摘要
Abstract:Neural Radiance Fields (NeRF) have been gaining attention as a significant form of 3D content representation. With the proliferation of NeRF-based creations, the need for copyright protection has emerged as a critical issue. Although some approaches have been proposed to embed digital watermarks into NeRF, they often neglect essential model-level considerations and incur substantial time overheads, resulting in reduced imperceptibility and robustness, along with user inconvenience. In this paper, we extend the previous criteria for image watermarking to the model level and propose NeRF Signature, a novel watermarking method for NeRF. We employ a Codebook-aided Signature Embedding (CSE) that does not alter the model structure, thereby maintaining imperceptibility and enhancing robustness at the model level. Furthermore, after optimization, any desired signatures can be embedded through the CSE, and no fine-tuning is required when NeRF owners want to use new binary signatures. Then, we introduce a joint pose-patch encryption watermarking strategy to hide signatures into patches rendered from a specific viewpoint for higher robustness. In addition, we explore a Complexity-Aware Key Selection (CAKS) scheme to embed signatures in high visual complexity patches to enhance imperceptibility. The experimental results demonstrate that our method outperforms other baseline methods in terms of imperceptibility and robustness. The source code is available at: this https URL.
zh
[CV-20] A Survey on Foundation-Model-Based Industrial Defect Detection
【速读】:该论文旨在系统性地调研与对比基于基础模型(Foundation Model, FM)的方法,并简要回顾近期发表的非基础模型(Non-Foundation Model, NFM)方法。论文的关键在于从训练目标、模型结构与规模、性能表现等方面比较FM与NFM方法的差异,并讨论未来的研究方向。比较结果表明,FM方法更适合少样本学习(few-shot learning)和零样本学习(zero-shot learning),更符合实际工业应用场景的需求,值得深入研究。
链接: https://arxiv.org/abs/2502.19106
作者: Tianle Yang,Luyao Chang,Jiadong Yan,Juntao Li,Zhi Wang,Ke Zhang
机构: Soochow University (苏州大学), Suzhou, China; Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院), Beijing, China; Wuhan University of Science and Technology (武汉科技大学), Wuhan, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures
点击查看摘要
Abstract:As industrial products become abundant and sophisticated, visual industrial defect detection receives much attention, including two-dimensional and three-dimensional visual feature modeling. Traditional methods use statistical analysis, abnormal data synthesis modeling, and generation-based models to separate product defect features and complete defect detection. Recently, the emergence of foundation models has brought visual and textual semantic prior knowledge. Many methods are based on foundation models (FM) to improve the accuracy of detection, but at the same time, increase model complexity and slow down inference speed. Some FM-based methods have begun to explore lightweight modeling ways, which have gradually attracted attention and deserve to be systematically analyzed. In this paper, we conduct a systematic survey with comparisons and discussions of foundation model methods from different aspects and briefly review non-foundation model (NFM) methods recently published. Furthermore, we discuss the differences between FM and NFM methods from training objectives, model structure and scale, model performance, and potential directions for future exploration. Through comparison, we find FM methods are more suitable for few-shot and zero-shot learning, which are more in line with actual industrial application scenarios and worthy of in-depth research.
zh
[CV-21] An anatomically-informed correspondence initialisation method to improve learning-based registration for radiotherapy
【速读】:该论文旨在解决跨患者CT图像非刚性配准(NRR)中的初始定位问题。论文的关键解决方案在于提出了一种基于学习模型的解剖学信息初始化方法,通过预测器官结构之间的对应关系来设置薄板样条(TPS)变形,从而改进非刚性配准的初始定位。这种方法在保持显著速度优势的同时,提升了基于学习的方法的配准性能,使其更接近传统的迭代算法。
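基于对应点的 TPS 初始化可以直接用 SciPy 的 RBFInterpolator(thin_plate_spline 核)搭建;下面草图中的对应点与网格均为随机生成的假设数据,仅演示“先 TPS 预对齐、再交给第二步 NRR”的流程。

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# 假设:学习模型预测出的 N 对器官结构对应点(固定图像 <-> 移动图像),单位 mm
fixed_pts = np.random.rand(50, 3) * 100.0
moving_pts = fixed_pts + np.random.randn(50, 3) * 2.0  # 模拟解剖差异

# 薄板样条:学习从固定图像坐标到移动图像坐标的映射
tps = RBFInterpolator(fixed_pts, moving_pts, kernel="thin_plate_spline")

# 对固定图像网格点求形变后的采样坐标,用于重采样移动图像(作为 NRR 初始化)
grid = np.stack(np.meshgrid(*[np.linspace(0, 100, 20)] * 3, indexing="ij"), -1)
warped_coords = tps(grid.reshape(-1, 3)).reshape(grid.shape)
```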
链接: https://arxiv.org/abs/2502.19101
作者: Edward G. A. Henderson,Marcel van Herk,Andrew F. Green,Eliana M. Vasquez Osorio
机构: The University of Manchester(曼彻斯特大学); European Bioinformatics Institute, EMBL-EBI(欧洲生物信息学研究所, 欧洲分子生物学实验室-欧洲生物信息学研究所), Cambridge(剑桥)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the XXth International Conference on the use of Computers in Radiation therapy. Pages 99-102 in XXth ICCR Proceedings, found here this https URL
点击查看摘要
Abstract:We propose an anatomically-informed initialisation method for interpatient CT non-rigid registration (NRR), using a learning-based model to estimate correspondences between organ structures. A thin plate spline (TPS) deformation, set up using the correspondence predictions, is used to initialise the scans before a second NRR step. We compare two established NRR methods for the second step: a B-spline iterative optimisation-based algorithm and a deep learning-based approach. Registration performance is evaluated with and without the initialisation by assessing the similarity of propagated structures. Our proposed initialisation improved the registration performance of the learning-based method to more closely match the traditional iterative algorithm, with the mean distance-to-agreement reduced by 1.8mm for structures included in the TPS and 0.6mm for structures not included, while maintaining a substantial speed advantage (5 vs. 72 seconds).
zh
[CV-22] EndoMamba: An Efficient Foundation Model for Endoscopic Videos
【速读】:该论文旨在解决内窥镜视频任务中的两个主要问题:计算效率低下以及由于预训练数据有限导致的性能不佳。为了解决这些问题,论文提出的关键方案是EndoMamba模型,它设计用于实时推理同时学习泛化的时空表示。具体而言,EndoMamba通过引入优化的EndoMamba主干网络来缓解计算效率问题,并采用双向Mamba块进行空间建模,以及单向Mamba块进行时间域的过去到现在的推理,从而实现高效的在线视频流处理。此外,为了增强表示学习,论文提出了一个自监督分层预训练框架,结合掩码重建与辅助监督,利用低层次重建捕捉时空结构,并通过高层次对齐从预训练的通用视频领域基础模型中转移更广泛的知识。
链接: https://arxiv.org/abs/2502.19090
作者: Qingyao Tian,Huai Liao,Xinyan Huang,Bingyu Yang,Dongdong Lei,Sebastien Ourselin,Hongbin Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference in online video streams. Second, we propose a self-supervised hierarchical pre-training paradigm to enhance EndoMamba’s representation learning using endoscopic videos and incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatial-temporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video domain foundation model. Extensive experiments on four downstream tasks (classification, segmentation, surgical phase recognition, and localization) demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code will be released upon acceptance.
zh
[CV-23] Dynamic Degradation Decomposition Network for All-in-One Image Restoration
【速读】:该论文旨在解决使用单一模型从多种退化类型中恢复清晰图像的挑战。现有的一体化图像恢复方法在处理复杂且定义不明确的退化类型时存在困难。论文的关键解决方案是提出动态退化分解网络D^3Net,通过跨域交互和动态退化分解,在提示引导下实现退化自适应的图像恢复。其中,交叉域退化分析器(Cross-Domain Degradation Analyzer, CDDA)通过频率域退化特征与空间域图像特征之间的深度交互,识别并建模不同退化类型在图像流形上的变化,生成校正提示和策略提示,指导后续的分解过程;基于提示的动态分解机制(Dynamic Decomposition Mechanism, DDM)则利用CDDA生成的两级提示,促使网络自适应地选择恢复策略,进行渐进式退化分解。CDDA与DDM的协同使D^3Net在处理未知退化时兼具灵活性与可扩展性,并有效减少了不必要的计算开销。
链接: https://arxiv.org/abs/2502.19068
作者: Huiqiang Wang,Mingchen Song,Guoqiang Zhong
机构: College of Computer Science and Technology, Ocean University of China (海洋大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Currently, restoring clean images from a variety of degradation types using a single model is still a challenging task. Existing all-in-one image restoration approaches struggle with addressing complex and ambiguously defined degradation types. In this paper, we introduce a dynamic degradation decomposition network for all-in-one image restoration, named D^3Net. D^3Net achieves degradation-adaptive image restoration with guided prompt through cross-domain interaction and dynamic degradation decomposition. Concretely, in D^3Net, the proposed Cross-Domain Degradation Analyzer (CDDA) engages in deep interaction between frequency domain degradation characteristics and spatial domain image features to identify and model variations of different degradation types on the image manifold, generating degradation correction prompt and strategy prompt, which guide the following decomposition process. Furthermore, the prompt-based Dynamic Decomposition Mechanism (DDM) performs progressive degradation decomposition, encouraging the network to adaptively select restoration strategies utilizing the two-level prompt generated by CDDA. Thanks to the synergistic cooperation between CDDA and DDM, D^3Net achieves superior flexibility and scalability in handling unknown degradation, while effectively reducing unnecessary computational overhead. Extensive experiments on multiple image restoration tasks demonstrate that D^3Net significantly outperforms the state-of-the-art approaches, especially improving PSNR by 5.47dB and 3.30dB on the SOTS-Outdoor and GoPro datasets, respectively.
zh
[CV-24] An Improved 3D Skeletons UP-Fall Dataset: Enhancing Data Quality for Efficient Impact Fall Detection
【速读】:该论文旨在解决UP-Fall数据集在跌倒检测中的局限性,特别是其在数据准确性与全面性方面的不足,这导致难以区分滑动等非冲击事件与实际有冲击的跌倒。关键解决方案在于通过引入三维骨骼数据对UP-Fall数据集进行增强,并采用预处理技术以确保高数据质量和完整性,从而实现更可靠的冲击跌倒检测。实验结果表明,基于改进后的三维骨骼数据集训练的跌倒检测模型性能显著提升。
链接: https://arxiv.org/abs/2502.19048
作者: Tresor Y. Koffi,Youssef Mourchid,Mohammed Hindawi,Yohan Dupuis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17th International Conference on Machine Vision (ICMV 2024) will take place in Edinburgh, UK during October 10-13, 2024
点击查看摘要
Abstract:Detecting impact where an individual makes contact with the ground within a fall event is crucial in fall detection systems, particularly for elderly care where prompt intervention can prevent serious injuries. The UP-Fall dataset, a key resource in fall detection research, has proven valuable but suffers from limitations in data accuracy and comprehensiveness. These limitations cause confusion in distinguishing between non-impact events, such as sliding, and real falls with impact, where the person actually hits the ground. This confusion compromises the effectiveness of current fall detection systems. This study presents enhancements to the UP-Fall dataset aimed at improving it for impact fall detection by incorporating 3D skeleton data. Our preprocessing techniques ensure high data accuracy and comprehensiveness, enabling a more reliable impact fall detection. Extensive experiments were conducted using various machine learning and deep learning algorithms to benchmark the improved 3D skeletons dataset. The results demonstrate substantial improvements in the performance of fall detection models trained on the enhanced dataset. This contribution aims to enhance the safety and well-being of the elderly population at risk. To support further research and development of building more reliable impact fall detection systems, we have made the improved 3D skeletons UP-Fall dataset publicly available at this https URL.
zh
[CV-25] A Dual-Purpose Framework for Backdoor Defense and Backdoor Amplification in Diffusion Models
【速读】:该论文旨在解决扩散模型在面对后门攻击时的脆弱性问题。论文的关键解决方案是提出了一种名为PureDiffusion的双用途框架,该框架能够同时实现后门防御和后门攻击增强。关键创新点在于引入了两种新的损失函数,用于反转嵌入在扩散模型中的后门触发器。这两种损失函数分别利用了触发器引起的分布偏移和去噪一致性效应。通过实现准确的触发器反转,论文进一步开发了一种基于反转触发器和生成的后门目标进行后门检测的方法。此外,在攻击场景下,论文描述了如何利用触发器反转算法来强化嵌入在后门扩散模型中的原始触发器,从而显著提升攻击性能并减少所需的后门训练时间。
链接: https://arxiv.org/abs/2502.19047
作者: Vu Tuan Truong Long,Bao Le
机构: INRS, University of Québec (魁北克大学INRS), Montréal, QC H5A 1K6, Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models have emerged as state-of-the-art generative frameworks, excelling in producing high-quality multi-modal samples. However, recent studies have revealed their vulnerability to backdoor attacks, where backdoored models generate specific, undesirable outputs called backdoor target (e.g., harmful images) when a pre-defined trigger is embedded to their inputs. In this paper, we propose PureDiffusion, a dual-purpose framework that simultaneously serves two contrasting roles: backdoor defense and backdoor attack amplification. For defense, we introduce two novel loss functions to invert backdoor triggers embedded in diffusion models. The first leverages trigger-induced distribution shifts across multiple timesteps of the diffusion process, while the second exploits the denoising consistency effect when a backdoor is activated. Once an accurate trigger inversion is achieved, we develop a backdoor detection method that analyzes both the inverted trigger and the generated backdoor targets to identify backdoor attacks. In terms of attack amplification with the role of an attacker, we describe how our trigger inversion algorithm can be used to reinforce the original trigger embedded in the backdoored diffusion model. This significantly boosts attack performance while reducing the required backdoor training time. Experimental results demonstrate that PureDiffusion achieves near-perfect detection accuracy, outperforming existing defenses by a large margin, particularly against complex trigger patterns. Additionally, in an attack scenario, our attack amplification approach elevates the attack success rate (ASR) of existing backdoor attacks to nearly 100% while reducing training time by up to 20x.
zh
[CV-26] FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach
【速读】:该论文旨在解决大型视觉语言模型(Vision-Language Models, VLMs)在零样本分类任务中的有效性依赖于广泛且对齐良好的文本图像数据集的问题。论文的关键解决方案是引入两个互补的数据源:一个是由大规模语言模型(Large Language Models, LLMs)生成的描述真菌生长阶段的文本,另一个是一组包含多种合成真菌图像的数据集。通过将这些数据投影到CLIP的共享表示空间中,并使用LLaMA3.2生成跨模态的文本来填补模态差距,从而增强CLIP在真菌相关任务上的零样本分类能力。此外,论文还通过比较不同LLM技术生成的文本输出,以优化各生长阶段的分类效果。
链接: https://arxiv.org/abs/2502.19038
作者: Anju Rani,Daniel O. Arroyo,Petar Durdevic
机构: Aalborg University (奥胡斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 Figures, 1 Table
点击查看摘要
Abstract:The effectiveness of zero-shot classification in large vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), depends on access to extensive, well-aligned text-image datasets. In this work, we introduce two complementary data sources, one generated by large language models (LLMs) to describe the stages of fungal growth and another comprising a diverse set of synthetic fungi images. These datasets are designed to enhance CLIP's zero-shot classification capabilities for fungi-related tasks. To ensure effective alignment between text and image data, we project them into CLIP's shared representation space, focusing on different fungal growth stages. We generate text using LLaMA3.2 to bridge modality gaps and synthetically create fungi images. Furthermore, we investigate knowledge transfer by comparing text outputs from different LLM techniques to refine classification across growth stages.
zh
[CV-27] Enhanced Neuromorphic Semantic Segmentation Latency through Stream Event
【速读】:该论文旨在解决使用帧式视觉传感器进行实时语义分割时面临的挑战,特别是在无人机(UAVs)和自动驾驶汽车等实时系统中的低延迟、高精度和能量效率之间的平衡问题。传统帧式方法难以同时满足这些需求。为了解决这些问题,论文提出利用基于事件的相机(event-based cameras)产生的事件流,这是一种生物启发式的传感器,在场景发生变化时触发事件。关键解决方案在于采用一种基于脉冲神经网络(Spiking Neural Network, SNN)的方法,利用事件信息进行语义分割。实验结果表明,这种方法在保持较低延迟的同时,仅有限度地降低了准确性,并且由于SNN具有低能耗特性,使得该方法适用于能量受限的实时应用。
链接: https://arxiv.org/abs/2502.18982
作者: D. Hareb,J. Martinet,B. Miramond
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Achieving optimal semantic segmentation with frame-based vision sensors poses significant challenges for real-time systems like UAVs and self-driving cars, which require rapid and precise processing. Traditional frame-based methods often struggle to balance latency, accuracy, and energy efficiency. To address these challenges, we leverage event streams from event-based cameras, bio-inspired sensors that trigger events in response to changes in the scene. Specifically, we analyze the number of events triggered between successive frames, with a high number indicating significant changes and a low number indicating minimal changes. We exploit this event information to solve the semantic segmentation task by employing a Spiking Neural Network (SNN), a bio-inspired computing paradigm known for its low energy consumption. Our experiments on the DSEC dataset show that our approach significantly reduces latency with only a limited drop in accuracy. Additionally, by using SNNs, we achieve low power consumption, making our method suitable for energy-constrained real-time applications. To the best of our knowledge, our approach is the first to effectively balance reduced latency, minimal accuracy loss, and energy efficiency using event streams to enhance semantic segmentation in dynamic and resource-limited environments.
zh
[CV-28] Brain-inspired analogical mixture prototypes for few-shot class-incremental learning
【速读】:该论文旨在解决小样本类别增量学习(FSCIL)中人工神经网络面临的挑战,即如何在有限数据下高效学习的同时保留先前任务的知识。论文的关键解决方案是提出了脑启发类比混合原型(Brain-inspired Analogical Mixture Prototypes, BAMP),其包含三个组件:混合原型特征学习、统计类比和软投票。通过从预训练的视觉Transformer(ViT)开始,BAMP利用混合原型表示每个类别,并在基础阶段微调这些表示。统计类比根据新类与基础类之间的相似性校准原型的均值和协方差矩阵,并使用马氏距离计算分类得分。软投票结合了统计类比和现有FSCIL方法的优点。实验结果表明,BAMP在传统的大样本启动FSCIL设置和具有挑战性的小样本启动FSCIL设置中均优于现有技术。研究表明,脑启发类比混合原型可以缓解FSCIL中的灾难性遗忘和过拟合问题。
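其中“统计类比”可示意为:按新类原型与基类均值的相似度加权借用基类统计量,校准新类的均值与协方差,再以马氏距离打分。下面的 NumPy 草图中,softmax 权重形式与收缩系数 shrink 均为假设,且要求每个新类至少有 2 个少样本特征。

```python
import numpy as np

def calibrate_new_class(new_feats, base_means, base_covs, shrink=0.5):
    """new_feats: (k, d) 新类少样本特征(k >= 2);base_means: (B, d);base_covs: (B, d, d)。"""
    mu_raw = new_feats.mean(0)
    sims = base_means @ mu_raw / (
        np.linalg.norm(base_means, axis=1) * np.linalg.norm(mu_raw) + 1e-8)
    w = np.exp(sims) / np.exp(sims).sum()  # 相似度 softmax 权重(假设形式)
    # 新类统计量 = 少样本估计与基类加权统计量的凸组合
    mu = shrink * mu_raw + (1 - shrink) * (w @ base_means)
    cov = shrink * np.cov(new_feats.T) + (1 - shrink) * np.einsum("b,bij->ij", w, base_covs)
    return mu, cov

def mahalanobis_score(x, mu, cov, eps=1e-3):
    inv = np.linalg.inv(cov + eps * np.eye(len(mu)))  # 正则化保证可逆
    diff = x - mu
    return -float(diff @ inv @ diff)  # 负马氏距离作为分类得分
```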
链接: https://arxiv.org/abs/2502.18923
作者: Wanyi Li,Wei Wei,Yongkang Luo,Peng Wang
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院多模态人工智能系统国家重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院); CAS Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences(中国科学院脑科学与智能技术卓越创新中心); Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences(香港创新科技研究院人工智能与机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: under review
点击查看摘要
Abstract:Few-shot class-incremental learning (FSCIL) poses significant challenges for artificial neural networks due to the need to efficiently learn from limited data while retaining knowledge of previously learned tasks. Inspired by the brain’s mechanisms for categorization and analogical learning, we propose a novel approach called Brain-inspired Analogical Mixture Prototypes (BAMP). BAMP has three components: mixed prototypical feature learning, statistical analogy, and soft voting. Starting from a pre-trained Vision Transformer (ViT), mixed prototypical feature learning represents each class using a mixture of prototypes and fine-tunes these representations during the base session. The statistical analogy calibrates the mean and covariance matrix of prototypes for new classes according to similarity to the base classes, and computes classification scores with Mahalanobis distance. Soft voting combines the merits of both statistical analogy and an off-the-shelf FSCIL method. Our experiments on benchmark datasets demonstrate that BAMP outperforms the state-of-the-art on both the traditional big-start FSCIL setting and the challenging small-start FSCIL setting. The study suggests that brain-inspired analogical mixture prototypes can alleviate catastrophic forgetting and over-fitting problems in FSCIL.
zh
[CV-29] Inscanner: Dual-Phase Detection and Classification of Auxiliary Insulation Using YOLOv8 Models
【速读】:该论文旨在解决结构组件中辅助绝缘检测与分类的问题。解决方案的关键在于提出了一种两阶段的方法:第一阶段使用YOLOv8x模型在完整的结构蓝图数据集上进行绝缘区域的检测,第二阶段利用检测到的绝缘区域训练YOLOv8x-CLS模型以确定绝缘的存在或缺失。这一方法通过精确的标注、数据增强以及适当的裁剪步骤优化了输入数据,并取得了82%的平均精度均值(mAP)和98%的分类准确率。
链接: https://arxiv.org/abs/2502.18871
作者: Youngtae Kim,Soonju Jeong,Sardar Arslan,Dhananjay Agnihotri,Yahya Ahmed,Ali Nawaz,Jinhee Song,Hyewon Kim
机构: Doaz Corporation; R&D Institute, Lotte E&C
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This study proposes a two-phase methodology for detecting and classifying auxiliary insulation in structural components. In the detection phase, a YOLOv8x model is trained on a dataset of complete structural blueprints, each annotated with bounding boxes indicating areas that should contain insulation. In the classification phase, these detected insulation patches are cropped and categorized into two classes: present or missing. These are then used to train a YOLOv8x-CLS model that determines the presence or absence of auxiliary insulation. Preprocessing steps for both datasets included annotation, augmentation, and appropriate cropping of the insulation regions. The detection model achieved a mean average precision (mAP) score of 82%, while the classification model attained an accuracy of 98%. These findings demonstrate the effectiveness of the proposed approach in automating insulation detection and classification, providing a foundation for further advancements in this domain.
zh
[CV-30] Enhanced Transformer-Based Tracking for Skiing Events: Overcoming Multi-Camera Challenges Scale Variations and Rapid Motion – SkiTB Visual Tracking Challenge 2025
【速读】:该论文旨在解决在高山滑雪运动中,传统跟踪方法因遮挡、动态运动及多变环境条件而效果受限的问题。关键解决方案在于采用基于变换器的STARK(Spatio-Temporal Transformer Network for Visual Tracking)模型,并通过优化模型架构和超参数来适应滑雪场景中的特定挑战,如相机移动、相机切换及遮挡等。
链接: https://arxiv.org/abs/2502.18867
作者: Akhil Penta,Vaibhav Adwani,Ankush Chopra
机构: Tredence Analytics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate skier tracking is essential for performance analysis, injury prevention, and optimizing training strategies in alpine sports. Traditional tracking methods often struggle with occlusions, dynamic movements, and varying environmental conditions, limiting their effectiveness. In this work, we used STARK (Spatio-Temporal Transformer Network for Visual Tracking), a transformer-based model, to track skiers. We adapted STARK to address domain-specific challenges such as camera movements, camera changes, occlusions, etc. by optimizing the model’s architecture and hyperparameters to better suit the dataset.
zh
[CV-31] Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM
【速读】:该论文旨在解决视频异常事件检测(Video Anomaly Detection, VAD)中仅关注单帧图像是否异常而忽视结构化视频语义信息的问题。论文提出了一个新的任务——多场景视频异常事件提取与定位(M-VAE),目标是从视频中提取异常事件四元组(即主体、事件类型、对象和场景)并定位这些事件。为了解决这一任务中的两个关键挑战——全局-局部空间建模和全局-局部空间平衡,论文提出了一种名为Sherlock的全局-局部空间敏感大型语言模型(Global-local Spatial-sensitive Large Language Model, LLM)。具体而言,该模型设计了一个全局-局部空间增强的专家混合模块(Global-local Spatial-enhanced MoE, GSM)和一个空间不平衡调节器(Spatial Imbalance Regulator, SIR)来分别应对这两个挑战。
链接: https://arxiv.org/abs/2502.18863
作者: Junxiao Ma,Jingjing Wang,Jiamin Luo,Peiying Yu,Guodong Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Prior studies on Video Anomaly Detection (VAD) mainly focus on detecting whether each video frame is abnormal or not in the video, which largely ignore the structured video semantic information (i.e., what, when, and where does the abnormal event happen). With this in mind, we propose a new chat-paradigm Multi-scene Video Abnormal Event Extraction and Localization (M-VAE) task, aiming to extract the abnormal event quadruples (i.e., subject, event type, object, scene) and localize such event. Further, this paper believes that this new task faces two key challenges, i.e., global-local spatial modeling and global-local spatial balancing. To this end, this paper proposes a Global-local Spatial-sensitive Large Language Model (LLM) named Sherlock, i.e., acting like Sherlock Holmes to track down the criminal events, for this M-VAE task. Specifically, this model designs a Global-local Spatial-enhanced MoE (GSM) module and a Spatial Imbalance Regulator (SIR) to address the two challenges respectively. Extensive experiments on our M-VAE instruction dataset show the significant advantages of Sherlock over several advanced Video-LLMs. This justifies the importance of global-local spatial information for the M-VAE task and the effectiveness of Sherlock in capturing such information.
zh
[CV-32] BarkXAI: A Lightweight Post-Hoc Explainable Method for Tree Species Classification with Quantifiable Concepts
【速读】:该论文旨在解决高精度树种分类模型在实际应用中的可解释性不足问题。现有方法依赖局部特征或需要大量外部概念图像数据集,且概念难以精确量化。论文的关键解决方案在于提出了一种轻量级的后验方法,通过算子和可量化概念来解释基于树皮视觉模型的树种分类,从而消除计算开销,实现复杂概念的量化,并评估概念重要性和模型推理过程。这种方法首次实现了基于全局视觉特征和概念来解释树皮视觉模型,实验结果表明其显著优于TCAV和Llama3.2。
链接: https://arxiv.org/abs/2502.18844
作者: Yunmei Huang,Songlin Hou,Zachary Nelson Horve,Songlin Fei
机构: Purdue University (普渡大学); Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The precise identification of tree species is fundamental to forestry, conservation, and environmental monitoring. Though many studies have demonstrated that high accuracy can be achieved using bark-based species classification, these models often function as “black boxes”, limiting interpretability, trust, and adoption in critical forestry applications. Attribution-based Explainable AI (XAI) methods have been used to address this issue in related works. However, XAI applications are often dependent on local features (such as a head shape or paw in animal applications) and cannot describe global visual features (such as ruggedness or smoothness) that are present in texture-dominant images such as tree bark. Concept-based XAI methods, on the other hand, offer explanations based on global visual features with concepts, but they tend to require large overhead in building external concept image datasets and the concepts can be vague and subjective without good means of precise quantification. To address these challenges, we propose a lightweight post-hoc method to interpret visual models for tree species classification using operators and quantifiable concepts. Our approach eliminates computational overhead, enables the quantification of complex concepts, and evaluates both concept importance and the model’s reasoning process. To the best of our knowledge, our work is the first study to explain bark vision models in terms of global visual features with concepts. Using a human-annotated dataset as ground truth, our experiments demonstrate that our method significantly outperforms TCAV and Llama3.2 in concept importance ranking based on Kendall’s Tau, highlighting its superior alignment with human perceptions.
zh
[CV-33] Attention-Guided Integration of CLIP and SAM for Precise Object Masking in Robotic Manipulation
【速读】:该论文旨在解决在便利店产品操作中物体掩膜精度不足的问题。解决方案的关键在于整合CLIP(Contrastive Language-Image Pre-training)和SAM(Segment Anything Model)两个先进的人工智能模型,并利用梯度引导注意力机制及定制数据集进行微调,以实现更精确和自适应的对象操作。
链接: https://arxiv.org/abs/2502.18842
作者: Muhammad A. Muttaqien,Tomohiro Motoda,Ryo Hanai,Domae Yukiyasu
机构: National Institute of AIST(国立先进工业科学技术研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 2025 IEEE/SICE International Symposium on System Integration
点击查看摘要
Abstract:This paper introduces a novel pipeline to enhance the precision of object masking for robotic manipulation within the specific domain of masking products in convenience stores. The approach integrates two advanced AI models, CLIP and SAM, focusing on their synergistic combination and the effective use of multimodal data (image and text). Emphasis is placed on utilizing gradient-based attention mechanisms and customized datasets to fine-tune performance. While CLIP, SAM, and Grad-CAM are established components, their integration within this structured pipeline represents a significant contribution to the field. The resulting segmented masks, generated through this combined approach, can be effectively utilized as inputs for robotic systems, enabling more precise and adaptive object manipulation in the context of convenience store products.
zh
[CV-34] Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
【速读】:该论文旨在解决CLIP(Contrastive Language-Image Pre-training)模型解释性不足的问题。论文的关键在于提出了一种基于梯度的视觉和文本解释方法Grad-ECLIP(Gradient-based visual and textual Explanation method for CLIP),通过分解编码器架构并发现匹配相似度与中间空间特征之间的关系,生成有效的热图以展示图像区域或词汇对CLIP结果的影响。不同于先前仅依赖稀疏自注意力图的Transformer解释方法,Grad-ECLIP通过对标记特征应用通道和空间权重来生成高质量的视觉解释。这种方法验证了其在有效性及优越性方面的优势,并进一步分析了图像-文本匹配的工作机制以及CLIP在属性识别中的优缺点,同时探索了词汇具体性和抽象性与其在CLIP中使用的关系。最后,基于Grad-ECLIP能够指示输入图像特定文本显著区域的能力,提出了一种应用于增强CLIP微调中的细粒度对齐的方法。
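Grad-ECLIP 热图生成的核心可示意为:对图文匹配相似度关于中间视觉 token 特征求梯度,再用梯度导出的通道权重与空间权重加权特征。下面的 PyTorch 草图对两种权重的具体形式做了简化假设,并要求 token_feats 位于计算图中(requires_grad=True)。

```python
import torch

def grad_eclip_heatmap(similarity, token_feats):
    """similarity: 标量图文匹配相似度;token_feats: (N, d) 的中间视觉 token 特征。

    返回每个 token(对应一个图像 patch)的重要性权重(示意实现)。
    """
    grads = torch.autograd.grad(similarity, token_feats, retain_graph=True)[0]  # (N, d)
    channel_w = grads.mean(dim=0, keepdim=True)  # 通道权重:梯度的逐维均值(假设形式)
    spatial_w = grads.norm(dim=1, keepdim=True)  # 空间权重:逐 token 梯度幅值(假设形式)
    heat = (token_feats * channel_w).sum(dim=1, keepdim=True) * spatial_w
    return torch.relu(heat.squeeze(1))           # 仅保留正贡献,形状 (N,)
```

将 (N,) 的权重按 patch 网格重排并上采样,即可得到叠加在原图上的热图。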
链接: https://arxiv.org/abs/2502.18816
作者: Chenyang Zhao,Kun Wang,Janet H. Hsiao,Antoni B. Chan
机构: Department of Computer Science, City University of Hong Kong(计算机科学系, 香港城市大学); Division of Social Science and Department of Computer Science & Engineering, Hong Kong University of Science & Technology(社会科学部和社会科学与工程系, 香港科技大学); SenseTime Group Ltd.(商汤集团有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analyses are conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of the explanation map to indicate the text-specific saliency region of an input image, we also propose an application of Grad-ECLIP, which is adopted to boost the fine-grained alignment in CLIP fine-tuning. The code of Grad-ECLIP is available here: this https URL.
zh
[CV-35] Spectral-Enhanced Transformers: Leverag ing Large-Scale Pretrained Models for Hyperspectral Object Tracking
【速读】:该论文旨在解决在使用快照拼接相机进行高光谱目标跟踪时,由于数据集的限制导致大型变压器模型无法充分发挥其潜力的问题。关键解决方案在于提出了一种可适应的、可学习的空间-光谱标记融合模块,该模块可以扩展到任何基于变压器的主干网络以学习高光谱数据中的固有空间-光谱特征,并引入了一种跨模态训练管道,以促进不同传感器模态收集的高光谱数据集之间的有效学习,从而实现从额外模态中提取互补知识,即使这些模态在测试时不被包含。这种方法使得模型能够在少量训练迭代下实现卓越性能。
链接: https://arxiv.org/abs/2502.18748
作者: Shaheer Mohamed,Tharindu Fernando,Sridha Sridharan,Peyman Moghadam,Clinton Fookes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)
点击查看摘要
Abstract:Hyperspectral object tracking using snapshot mosaic cameras is emerging as it provides enhanced spectral information alongside spatial data, contributing to a more comprehensive understanding of material properties. Transformers, which have consistently outperformed convolutional neural networks (CNNs) in learning feature representations, would therefore be expected to be effective for hyperspectral object tracking. However, training large transformers necessitates extensive datasets and prolonged training periods. This is particularly critical for complex tasks like object tracking, and the scarcity of large datasets in the hyperspectral domain acts as a bottleneck in achieving the full potential of powerful transformer models. This paper proposes an effective methodology that adapts large pretrained transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone for learning inherent spatial-spectral features in hyperspectral data. Furthermore, our model incorporates a cross-modality training pipeline that facilitates effective learning across hyperspectral datasets collected with different sensor modalities. This enables the extraction of complementary knowledge from additional modalities, whether or not they are present during testing. Our proposed model also achieves superior performance with minimal training iterations.
zh
[CV-36] MaskPlanner: Learning-Based Object-Centric Motion Generation from 3D Point Clouds
【速读】:该论文旨在解决对象中心运动生成(Object-Centric Motion Generation, OCMG)在工业应用中的挑战,特别是针对自由形式3D物体进行高效、可扩展且具泛化能力的多轨迹规划。现有方法依赖于专门的启发式算法、昂贵的优化过程或限制性的几何假设,这些局限性阻碍了它们在实际场景中的适应性。论文的关键解决方案是提出了一种全新的全数据驱动框架MaskPlanner,它直接从3D点云学习,预测局部路径段并同时推断“路径掩码”以将这些段分组成不同的路径。这种设计使得网络能够在单次前向传递中捕捉局部几何模式和全局任务需求,从而实现高效的轨迹生成和执行。
链接: https://arxiv.org/abs/2502.18745
作者: Gabriele Tiboni,Raffaello Camoriano,Tatiana Tommasi
机构: Department of Control and Computer Engineering, Politecnico di Torino (都灵理工大学), Italy; Istituto Italiano di Tecnologia (意大利技术研究院), Genoa, Italy
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this https URL
点击查看摘要
Abstract:Object-Centric Motion Generation (OCMG) plays a key role in a variety of industrial applications, such as robotic spray painting and welding, requiring efficient, scalable, and generalizable algorithms to plan multiple long-horizon trajectories over free-form 3D objects. However, existing solutions rely on specialized heuristics, expensive optimization routines, or restrictive geometry assumptions that limit their adaptability to real-world scenarios. In this work, we introduce a novel, fully data-driven framework that tackles OCMG directly from 3D point clouds, learning to generalize expert path patterns across free-form surfaces. We propose MaskPlanner, a deep learning method that predicts local path segments for a given object while simultaneously inferring “path masks” to group these segments into distinct paths. This design induces the network to capture both local geometric patterns and global task requirements in a single forward pass. Extensive experimentation on a realistic robotic spray painting scenario shows that our approach attains near-complete coverage (above 99%) for unseen objects, while it remains task-agnostic and does not explicitly optimize for paint deposition. Moreover, our real-world validation on a 6-DoF specialized painting robot demonstrates that the generated trajectories are directly executable and yield expert-level painting quality. Our findings crucially highlight the potential of the proposed learning method for OCMG to reduce engineering overhead and seamlessly adapt to several industrial use cases.
zh
[CV-37] QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries
【速读】:该论文旨在解决视觉语言模型(Vision-Language Model, VLM)在从大规模互联网数据训练到适应机器人收集的真实场景图像流过程中存在的领域偏移(domain shift)问题。现有方法依赖于封闭集类别的定义,这在面对机器人需要响应多样化的自然语言查询时显得不切实际。论文提出了一种名为QueryAdapter的新框架,通过利用先前部署期间收集的未标记数据来对齐与查询相关的语义类别,从而快速适应预训练的VLM。关键解决方案在于优化可学习提示词令牌以及主动选择对象进行训练,能够在数分钟内生成适应模型,并且采用对象标题作为负类标签以处理与查询无关的对象,从而提高校准后的置信度评分。实验结果表明,QueryAdapter显著提升了物体检索性能,相比现有的无监督VLM适配器和3D场景图方法具有明显优势,并且展示了对抽象功能查询和不同数据集的强大泛化能力。
链接: https://arxiv.org/abs/2502.18735
作者: Nicolas Harvey Chapman,Feras Dayoub,Will Browne,Christopher Lehnert
机构: School of Electrical Engineering and Robotics, Queensland University of Technology (电气工程与机器人学院, 昆士兰科技大学); University of Adelaide (阿德莱德大学); School of Computer Science and the Australian Institute of Machine Learning (计算机科学学院和澳大利亚机器学习研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:A domain shift exists between the large-scale, internet data used to train a Vision-Language Model (VLM) and the raw image streams collected by a robot. Existing adaptation strategies require the definition of a closed set of classes, which is impractical for a robot that must respond to diverse natural language queries. In response, we present QueryAdapter, a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query. QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. By optimising learnable prompt tokens and actively selecting objects for training, an adapted model can be produced in a matter of minutes. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation. In turn, we propose the use of object captions as negative class labels, helping to produce better calibrated confidence scores during adaptation. Extensive experiments on ScanNet++ demonstrate that QueryAdapter significantly enhances object retrieval performance compared to state-of-the-art unsupervised VLM adapters and 3D scene graph methods. Furthermore, the approach exhibits robust generalization to abstract affordance queries and other datasets, such as Ego4D.
zh
[CV-38] Adversarial Universal Stickers: Universal Perturbation Attacks on Traffic Sign using Stickers
【速读】:该论文旨在解决针对深度学习模型的普遍对抗性攻击问题。传统方法需要为每张图像添加不同的对抗扰动以使其被误分类,而这种方法在实际应用中效率低下。论文的关键解决方案在于提出了一种新型的普遍对抗性贴纸(universal adversarial stickers),这些贴纸外观为简单的黑白图案,并可以应用于任何交通标志上,使深度学习模型对其产生错误预测。通过利用街景视图中的交通标志图像进行虚拟实验,验证了这些贴纸能够在多种交通标志上持续误导常用的交通标志识别深度学习模型,从而证明了其实用性和有效性。
链接: https://arxiv.org/abs/2502.18724
作者: Anthony Etim,Jakub Szefer
机构: Yale University(耶鲁大学); Northwestern University(西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
点击查看摘要
Abstract:Adversarial attacks on deep learning models have proliferated in recent years. In many cases, a different adversarial perturbation is required to be added to each image to cause the deep learning model to misclassify it. This is inefficient, as each image has to be modified in a different way. Meanwhile, research on universal perturbations focuses on designing a single perturbation that can be applied to all images in a data set, and cause a deep learning model to misclassify the images. This work advances the field of universal perturbations by exploring universal perturbations in the context of traffic signs and autonomous vehicle systems. This work introduces a novel method for generating universal perturbations that visually look like simple black and white stickers, and using them to cause incorrect street sign predictions. Unlike traditional adversarial perturbations, the adversarial universal stickers are designed to be applicable to any street sign: the same sticker, or stickers, can be applied in the same location to any street sign and cause it to be misclassified. Further, to enable safe experimentation with adversarial images and street signs, this work presents a virtual setting that leverages Street View images of street signs, rather than the need to physically modify street signs, to test the attacks. The experiments in the virtual setting demonstrate that these stickers can consistently mislead deep learning models used commonly in street sign recognition, and achieve high attack success rates on a dataset of US traffic signs. The findings highlight the practical security risks posed by simple stickers applied to traffic signs, and the ease with which adversaries can generate adversarial universal stickers that can be applied to many street signs.
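为帮助理解“同一贴纸、同一位置”这类通用攻击的评估流程,下面给出一个假设性的 PyTorch 示意(非论文原实现):将同一块贴纸图案贴到一批标志图像的相同位置,并统计模型误分类比例。其中 apply_sticker、attack_success_rate 等名称均为本文为说明而自拟。

```python
import torch

def apply_sticker(images, sticker, top_left):
    """Paste a universal sticker patch onto every image at a fixed location.

    images: (N, 3, H, W) float tensor in [0, 1]
    sticker: (3, h, w) float tensor (e.g., a black-and-white pattern)
    top_left: (row, col) where the sticker is placed on each sign
    """
    r, c = top_left
    h, w = sticker.shape[1:]
    patched = images.clone()
    patched[:, :, r:r + h, c:c + w] = sticker  # same sticker, same spot, every image
    return patched

def attack_success_rate(model, images, labels, sticker, top_left):
    """Fraction of images whose prediction moves away from the true label."""
    model.eval()
    with torch.no_grad():
        preds = model(apply_sticker(images, sticker, top_left)).argmax(dim=1)
    return (preds != labels).float().mean().item()
```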
zh
[CV-39] Enhancing Image Classification with Augmentation: Data Augmentation Techniques for Improved Image Classification
【速读】:该论文旨在解决卷积神经网络(CNNs)在小数据集训练时易过拟合的问题。为解决这一问题,论文提出并评估了多种数据增强技术,包括三种新提出的策略:配对通道传输、新颖的遮挡方法以及新颖的掩膜方法。通过在Caltech-101数据集上微调基础EfficientNet-B0模型,并进行对比分析,论文展示了这些数据增强技术的有效性,证明了多样化数据增强手段是提升图像分类性能的有效途径。关键在于引入新的数据增强方法以扩充数据集,从而减少过拟合现象并提高模型泛化能力。
链接: https://arxiv.org/abs/2502.18691
作者: Saorj Kumar,Prince Asiamah,Oluwatoyin Jolaoso,Ugochukwu Esiowu
机构: Schulich School of Engineering (舒立克工程学院); University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) serve as the workhorse of deep learning, finding applications in various fields that rely on images. Given sufficient data, they exhibit the capacity to learn a wide range of concepts across diverse settings. However, a notable limitation of CNNs is their susceptibility to overfitting when trained on small datasets. The augmentation of such datasets can significantly enhance CNN performance by introducing additional data points for learning. In this study, we explore the effectiveness of 11 different sets of data augmentation techniques, which include three novel sets proposed in this work. The first set of data augmentation employs pairwise channel transfer, transferring Red, Green, Blue, Hue, and Saturation values from randomly selected images in the database to all images in the dataset. The second set introduces a novel occlusion approach, where objects in the images are occluded by randomly selected objects from the dataset. The third set involves a novel masking approach, using vertical, horizontal, circular, and checkered masks to occlude portions of the images. In addition to these novel techniques, we investigate other existing augmentation methods, including rotation, horizontal and vertical flips, resizing, translation, blur, color jitter, and random erasing, and their effects on accuracy and overfitting. We fine-tune a base EfficientNet-B0 model for each augmentation method and conduct a comparative analysis to showcase their efficacy. For the evaluation and comparison of these augmentation techniques, we utilize the Caltech-101 dataset. The ensemble of image augmentation techniques proposed emerges as the most effective on the Caltech-101 dataset. The results demonstrate that diverse data augmentation techniques present a viable means of enhancing datasets for improved image classification.
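下面用一个假设性的 NumPy 示意说明文中“配对通道传输”增强的基本思路(非论文原实现):为每张图像随机选取一张供体图像,并复制其随机一个 RGB 通道;论文中还会传输 Hue 与 Saturation,这需要额外的 RGB 与 HSV 互转,此处从略。

```python
import numpy as np

def pairwise_channel_transfer(images, rng=None):
    """Pairwise channel transfer augmentation (a sketch of the idea).

    For each image, copy one randomly chosen RGB channel from a randomly
    chosen donor image in the batch. The paper also transfers Hue and
    Saturation, which would require an additional RGB<->HSV conversion.

    images: (N, H, W, 3) uint8 or float array
    """
    rng = np.random.default_rng() if rng is None else rng
    out = images.copy()
    n = len(images)
    for i in range(n):
        donor = rng.integers(n)    # random donor image from the dataset/batch
        channel = rng.integers(3)  # random channel: R, G, or B
        out[i, :, :, channel] = images[donor, :, :, channel]
    return out
```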
zh
[CV-40] Diffusion Models for conditional MRI generation
【速读】:该论文旨在解决通过生成模型创建具有病理特征(Healthy, Glioblastoma, Sclerosis, Dementia)和成像模式(T1w, T1ce, T2w, Flair, PD)条件的脑部磁共振成像(MRI)图像的问题。关键解决方案在于提出了一个潜扩散模型(Latent Diffusion Model, LDM),该模型能够生成与真实图像分布相似且视觉保真度和多样性之间保持平衡的图像,并展示了在未出现在训练数据中的配置下生成图像的能力。这验证了模型增加临床数据集样本数量、平衡代表性不足类别以及在医学领域评估AI模型的潜力,从而促进放射学诊断工具的发展,同时不损害患者隐私。
链接: https://arxiv.org/abs/2502.18620
作者: Miguel Herencia García del Castillo,Ricardo Moya Garcia,Manuel Jesús Cerezo Mazón,Ekaitz Arriola Garcia,Pablo Menéndez Fernández-Miranda
机构: Ainovis; Telefónica Innovación Digital
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In this article, we present a Latent Diffusion Model (LDM) for the generation of brain Magnetic Resonance Imaging (MRI), conditioning its generation based on pathology (Healthy, Glioblastoma, Sclerosis, Dementia) and acquisition modality (T1w, T1ce, T2w, Flair, PD). To evaluate the quality of the generated images, the Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity Index (MS-SSIM) metrics were employed. The results indicate that the model generates images with a distribution similar to real ones, maintaining a balance between visual fidelity and diversity. Additionally, the model demonstrates extrapolation capability, enabling the generation of configurations that were not present in the training data. The results validate the potential of the model to increase the number of samples in clinical datasets, balance underrepresented classes, and evaluate AI models in medicine, contributing to the development of diagnostic tools in radiology without compromising patient privacy.
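下面给出一个假设性的 PyTorch 示意(非论文原实现),说明“按病理与模态做条件生成”的一种常见做法:把两类标签的嵌入加到扩散模型的时间步嵌入上;类别数与求和方式均为本文为说明而设的假设。

```python
import torch
import torch.nn as nn

class ConditionEmbedding(nn.Module):
    """Embed (pathology, modality) labels for a conditional denoiser (a sketch).

    Summing label embeddings into the timestep embedding is one common way to
    condition a diffusion model; it is not necessarily the paper's exact design.
    """
    def __init__(self, dim, n_pathologies=4, n_modalities=5):
        super().__init__()
        self.pathology = nn.Embedding(n_pathologies, dim)  # Healthy/GBM/Sclerosis/Dementia
        self.modality = nn.Embedding(n_modalities, dim)    # T1w/T1ce/T2w/Flair/PD

    def forward(self, t_emb, pathology_id, modality_id):
        # Condition the diffusion step embedding on both labels.
        return t_emb + self.pathology(pathology_id) + self.modality(modality_id)
```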
zh
[CV-41] DeBUGCN – Detecting Backdoors in CNNs Using Graph Convolutional Networks
【速读】:该论文旨在解决深度神经网络(DNNs)在关键应用中的后门(特洛伊)攻击检测问题。解决方案的关键在于引入了一种新颖的后门攻击检测流程——通过图卷积网络进行检测(DeBUGCN)。具体而言,该方法利用DNN静态权重构建其各层的图形结构,并使用GCN作为二分类器来判断该DNN模型是否受到攻击。
链接: https://arxiv.org/abs/2502.18592
作者: Akash Vartak,Khondoker Murad Hossain,Tim Oates
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 11 tables, 8 figures
点击查看摘要
Abstract:Deep neural networks (DNNs) are becoming commonplace in critical applications, making their susceptibility to backdoor (trojan) attacks a significant problem. In this paper, we introduce a novel backdoor attack detection pipeline, detecting attacked models using graph convolution networks (DeBUGCN). To the best of our knowledge, ours is the first use of GCNs for trojan detection. We use the static weights of a DNN to create a graph structure of its layers. A GCN is then used as a binary classifier on these graphs, yielding a trojan or clean determination for the DNN. To demonstrate the efficacy of our pipeline, we train hundreds of clean and trojaned CNN models on the MNIST handwritten digits and CIFAR-10 image datasets, and show the DNN classification results using DeBUGCN. For a true In-the-Wild use case, our pipeline is evaluated on the TrojAI dataset which consists of various CNN architectures, thus showing the robustness and model-agnostic behaviour of DeBUGCN. Furthermore, on comparing our results on several datasets with state-of-the-art trojan detection algorithms, DeBUGCN is faster and more accurate.
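下面用一个假设性的 PyTorch 示意(非论文原实现)说明“由 DNN 静态权重构图、再用 GCN 做二分类”的思路:节点特征取各层权重的统计量,邻接矩阵按层的先后顺序连成链;实际 DeBUGCN 的构图与网络结构可能不同。

```python
import torch
import torch.nn as nn

def layer_graph(model):
    """Build a chain graph over a CNN's layers from its static weights (a sketch).

    Node i holds summary statistics of layer i's weight tensor; edges connect
    consecutive layers. The actual DeBUGCN graph construction may differ.
    """
    feats = []
    for p in model.parameters():
        if p.dim() > 1:  # weight matrices/kernels only, skip biases
            w = p.detach().flatten()
            feats.append(torch.stack([w.mean(), w.std(), w.abs().max(), w.norm()]))
    x = torch.stack(feats)            # (L, 4) node features
    n = x.size(0)
    adj = torch.eye(n)                # self-loops
    idx = torch.arange(n - 1)
    adj[idx, idx + 1] = 1.0           # connect layer i to layer i+1
    adj[idx + 1, idx] = 1.0           # symmetric (undirected) edges
    return x, adj

class TinyGCN(nn.Module):
    """One-layer GCN plus mean pooling as a trojan/clean binary classifier."""
    def __init__(self, in_dim=4, hidden=16):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True)
        h = torch.relu(self.lin((adj / deg) @ x))  # normalized neighborhood average
        return self.head(h.mean(dim=0))            # graph-level logits: [clean, trojan]
```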
zh
[CV-42] Autonomous Vision-Guided Resection of Central Airway Obstruction
【速读】:该论文旨在解决现有气管肿瘤切除方法精度不足的问题,以实现有效的气道清理。解决方案的关键在于提出了一种基于视觉引导的自主气管肿瘤姑息性切除方法。该系统通过五阶多项式模型来规划工具轨迹,并利用定制的Faster R-CNN分割管道识别气管和肿瘤边界。此外,通过手持手术演示优化电凝刀角度,并规划路径以保持1毫米的安全距离,从而确保安全切除。实验结果表明,在离体动物组织模型上的连续五次实验中,成功清除了气道阻塞且未发生气管穿孔(肿瘤体积去除率超过90%),证明了该自主切除平台的可行性。
链接: https://arxiv.org/abs/2502.18586
作者: M. E. Smith,N. Yilmaz,T. Watts,P. M. Scheikl,J. Ge,A. Deguet,A. Kuntz,A. Krieger
机构: Department of Mechanical Engineering, Johns Hopkins University (约翰霍普金斯大学机械工程系), USA; Robotics Center and Kahlert School of Computing, University of Utah (犹他大学机器人中心和卡勒特计算学院), USA; Malone Center for Engineering and Healthcare, Johns Hopkins University (约翰霍普金斯大学马洛恩工程与医疗中心), USA
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to World Scientific, Journal of Medical Robotics Research (JMRR) 2025. 10 pages, 11 figures
点击查看摘要
Abstract:Existing tracheal tumor resection methods often lack the precision required for effective airway clearance, and robotic advancements offer new potential for autonomous resection. We present a vision-guided, autonomous approach for palliative resection of tracheal tumors. This system models the tracheal surface with a fifth-degree polynomial to plan tool trajectories, while a custom Faster R-CNN segmentation pipeline identifies the trachea and tumor boundaries. The electrocautery tool angle is optimized using handheld surgical demonstrations, and trajectories are planned to maintain a 1 mm safety clearance from the tracheal surface. We validated the workflow successfully in five consecutive experiments on ex-vivo animal tissue models, successfully clearing the airway obstruction without trachea perforation in all cases (with more than 90% volumetric tumor removal). These results support the feasibility of an autonomous resection platform, paving the way for future developments in minimally-invasive autonomous resection.
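下面给出一个假设性的 NumPy 示意(非论文原实现),说明“用五阶多项式建模气管表面、并保持 1 毫米安全距离”的计算方式;坐标轴约定与测量值均为示例数据。

```python
import numpy as np

# Fit a fifth-degree polynomial to sampled tracheal surface points (a sketch;
# coordinates are illustrative, e.g. axial position z vs. surface radius r in mm).
z = np.linspace(0.0, 40.0, 50)                 # tool travel axis (mm)
r_surface = 9.0 + 0.05 * np.sin(z / 5.0)       # stand-in surface measurements
coeffs = np.polyfit(z, r_surface, deg=5)       # degree-5 model of the surface

# Plan the tool path with a 1 mm safety clearance from the modeled surface.
clearance_mm = 1.0
r_tool = np.polyval(coeffs, z) - clearance_mm  # stay clear of the surface by 1 mm
```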
zh
[CV-43] Application of Attention Mechanism with Bidirectional Long Short-Term Memory (BiLSTM) and CNN for Human Conflict Detection using Computer Vision
【速读】:该论文旨在解决通过视频自动检测人类冲突的问题,这一任务由于公共数据集的稀缺性和人类交互的复杂性而具有挑战性。论文的关键解决方案在于结合先进的深度学习技术,包括注意力机制(Attention Mechanism)、卷积神经网络(CNNs)和双向长短期记忆网络(BiLSTM),以提升视频中暴力行为检测的准确性与鲁棒性。研究特别强调了注意力机制如何有助于聚焦视频中最相关的部分,从而增强模型性能。实验结果表明,CNNs与BiLSTM及注意力机制的结合为冲突监控提供了有前景的方案。
链接: https://arxiv.org/abs/2502.18555
作者: Erick da Silva Farias,Eduardo Palhares Junior
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The automatic detection of human conflicts through videos is a crucial area in computer vision, with significant applications in monitoring and public safety policies. However, the scarcity of public datasets and the complexity of human interactions make this task challenging. This study investigates the integration of advanced deep learning techniques, including Attention Mechanism, Convolutional Neural Networks (CNNs), and Bidirectional Long Short-Term Memory (BiLSTM), to improve the detection of violent behaviors in videos. The research explores how the use of the attention mechanism can help focus on the most relevant parts of the video, enhancing the accuracy and robustness of the model. The experiments indicate that the combination of CNNs with BiLSTM and the attention mechanism provides a promising solution for conflict monitoring, offering insights into the effectiveness of different strategies. This work opens new possibilities for the development of automated surveillance systems that can operate more efficiently in real-time detection of violent events.
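下面用一个假设性的 PyTorch 示意(非论文原实现)说明“CNN 提取帧特征、BiLSTM 建模时序、注意力加权汇聚”的组合结构;各层尺寸均为本文自设的示例参数。

```python
import torch
import torch.nn as nn

class CNNBiLSTMAttention(nn.Module):
    """CNN per-frame features -> BiLSTM -> additive attention pooling (a sketch).

    Layer sizes are illustrative assumptions, not the paper's configuration.
    """
    def __init__(self, feat_dim=128, hidden=64, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(                     # tiny per-frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # scores each time step
        self.head = nn.Linear(2 * hidden, n_classes)  # violent / non-violent

    def forward(self, video):                         # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        f = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        h, _ = self.bilstm(f)                         # (B, T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)        # attention over time steps
        return self.head((w * h).sum(dim=1))          # weighted temporal pooling
```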
zh
[CV-44] Multi-class Seismic Building Damage Assessment from InSAR Imagery using Quadratic Variational Causal Bayesian Inference
【速读】:该论文旨在解决从InSAR数据中提取多类别建筑损坏分类的挑战,主要面对的问题包括重叠的损坏特征与环境噪声、多类别场景下的计算复杂性以及快速区域规模处理的需求。解决方案的关键在于提出了一种新型的多类别变分因果贝叶斯推理框架,并结合USGS地面失效模型和建筑易损性函数,通过整合InSAR观测数据来分离建筑物损坏信号,同时通过策略性剪枝保持计算效率。这一方法在五个主要地震事件中的评估显示,其分类准确性(AUC:0.94-0.96)显著优于现有方法,同时保持高精度且减少了超过40%的计算开销。
链接: https://arxiv.org/abs/2502.18546
作者: Xuechun Li,Susu Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to Remote Sensing and Environment
点击查看摘要
Abstract:Interferometric Synthetic Aperture Radar (InSAR) technology uses satellite radar to detect surface deformation patterns and monitor earthquake impacts on buildings. While vital for emergency response planning, extracting multi-class building damage classifications from InSAR data faces challenges: overlapping damage signatures with environmental noise, computational complexity in multi-class scenarios, and the need for rapid regional-scale processing. Our novel multi-class variational causal Bayesian inference framework with quadratic variational bounds provides rigorous approximations while ensuring efficiency. By integrating InSAR observations with USGS ground failure models and building fragility functions, our approach separates building damage signals while maintaining computational efficiency through strategic pruning. Evaluation across five major earthquakes (Haiti 2021, Puerto Rico 2020, Zagreb 2020, Italy 2016, Ridgecrest 2019) shows improved damage classification accuracy (AUC: 0.94-0.96), achieving up to 35.7% improvement over existing methods. Our approach maintains high accuracy (AUC 0.93) across all damage categories while reducing computational overhead by over 40% without requiring extensive ground truth data.
zh
[CV-45] Convolutional neural networks for mineral prospecting through alteration mapping with remote sensing data
【速读】:该论文旨在解决传统地质填图方法在连续空间映射矿物蚀变带方面的低效问题。解决方案的关键在于利用卷积神经网络(Convolutional Neural Networks, CNNs)分析遥感数据,自动提取特征以进行分类和回归任务。通过使用Landsat 8、Landsat 9和ASTER数据,并结合地面实况数据与选择性主成分分析(Principal Component Analysis, PCA)的自动化训练方法,CNNs能够更精确地识别与矿化相关的微妙矿物学变化,从而提高地质填图的精度。
链接: https://arxiv.org/abs/2502.18533
作者: Ehsan Farahbakhsh,Dakshi Goel,Dhiraj Pimparkar,R. Dietmar Muller,Rohitash Chandra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Traditional geological mapping, based on field observations and rock sample analysis, is inefficient for continuous spatial mapping of features like alteration zones. Deep learning models, such as convolutional neural networks (CNNs), have revolutionised remote sensing data analysis by automatically extracting features for classification and regression tasks. CNNs can detect specific mineralogical changes linked to mineralisation by identifying subtle features in remote sensing data. This study uses CNNs with Landsat 8, Landsat 9, and ASTER data to map alteration zones north of Broken Hill, New South Wales, Australia. The model is trained using ground truth data and an automated approach with selective principal component analysis (PCA). We compare CNNs with traditional machine learning models, including k-nearest neighbours, support vector machines, and multilayer perceptron. Results show that ground truth-based training yields more reliable maps, with CNNs slightly outperforming conventional models in capturing spatial patterns. Landsat 9 outperforms Landsat 8 in mapping iron oxide areas using ground truth-trained CNNs, while ASTER data provides the most accurate argillic and propylitic alteration maps. This highlights CNNs’ effectiveness in improving geological mapping precision, especially for identifying subtle mineralisation-related alterations.
zh
[CV-46] IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents
【速读】:该论文旨在解决自动化机器学习(Machine Learning, ML)流水线设计过程中存在的优化不稳定和收敛缓慢的问题。现有方法通常尝试一次性优化整个流程,导致难以归因具体改进,并且缺乏细粒度调整,从而影响了整体性能。为了解决这一问题,论文提出了一种名为“迭代精化”(Iterative Refinement)的新策略。该策略模仿人类专家逐步细化模型的过程,通过系统性地根据实际训练反馈更新各个组件,提升了稳定性、可解释性和模型的整体性能。
链接: https://arxiv.org/abs/2502.18530
作者: Eric Xue,Zeyi Huang,Yuyang Ji,Haohan Wang
机构: University of Toronto (多伦多大学); University of Wisconsin - Madison (威斯康星大学麦迪逊分校); New York University (纽约大学); University of Illinois at Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Computer vision is a critical component in a wide range of real-world applications, including plant monitoring in agriculture and handwriting classification in digital systems. However, developing high-performance computer vision models traditionally demands both machine learning (ML) expertise and domain-specific knowledge, making the process costly, labor-intensive, and inaccessible to many. Large language model (LLM) agents have emerged as a promising solution to automate this workflow, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves stability, interpretability, and overall model performance. We implement this strategy in IMPROVE, an end-to-end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, including standard benchmarks and Kaggle competition datasets, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches. These findings establish Iterative Refinement as an effective new strategy for LLM-driven ML automation and position IMPROVE as an accessible solution for building high-quality computer vision models without requiring ML expertise.
zh
[CV-47] Optimized Custom CNN for Real-Time Tomato Leaf Disease Detection
【速读】:该论文旨在解决番茄作物在生长过程中因病害导致产量和品质下降的问题。解决方案的关键在于开发一种基于卷积神经网络(Convolutional Neural Networks, CNNs)的自动化病害检测系统。通过收集并预处理来自Brahmanbaria地区的番茄叶片数据集,并将其应用于包括YOLOv5、MobileNetV2、ResNet18以及自定义CNN模型在内的多种深度学习模型进行比较分析,研究发现自定义CNN模型达到了95.2%的高精度,显著优于其他模型。这表明深度学习技术在早期病害检测方面具有巨大潜力,能够为农民提供有价值的早期疾病检测手段,从而提高管理效率,促进番茄产量的提升及农业生产的可持续性。
链接: https://arxiv.org/abs/2502.18521
作者: Mangsura Kabir Oni,Tabia Tanzin Prama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In Bangladesh, tomatoes are a staple vegetable, prized for their versatility in various culinary applications. However, the cultivation of tomatoes is often hindered by a range of diseases that can significantly reduce crop yields and quality. Early detection of these diseases is crucial for implementing timely interventions and ensuring the sustainability of tomato production. Traditional manual inspection methods, while effective, are labor-intensive and prone to human error. To address these challenges, this research paper sought to develop an automated disease detection system using Convolutional Neural Networks (CNNs). A comprehensive dataset of tomato leaves was collected from the Brahmanbaria district, preprocessed to enhance image quality, and then applied to various deep learning models. Comparative performance analysis was conducted between YOLOv5, MobileNetV2, ResNet18, and our custom CNN model. In our study, the Custom CNN model achieved an impressive accuracy of 95.2%, significantly outperforming the other models, which achieved accuracies of 77%, 89.38%, and 71.88%, respectively. While other models showed solid performance, our Custom CNN demonstrated superior results specifically tailored for the task of tomato leaf disease detection. These findings highlight the strong potential of deep learning techniques for improving early disease detection in tomato crops. By leveraging these advanced technologies, farmers can gain valuable insights to detect diseases at an early stage, allowing for more effective management practices. This approach not only promises to boost tomato yields but also contributes to the sustainability and resilience of the agricultural sector, helping to mitigate the impact of plant diseases on crop production.
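论文未公开其自定义 CNN 的具体结构,下面给出一个假设性的 PyTorch 示意,展示这类小型分类网络的常见写法;层数与通道数均为本文假设,仅用于说明。

```python
import torch.nn as nn

class TomatoLeafCNN(nn.Module):
    """A small custom CNN for leaf-disease classification (a sketch; the paper's
    exact architecture is not specified here, so layer sizes are assumptions)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # global pooling -> (B, 128)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):          # x: (B, 3, H, W) leaf images
        return self.classifier(self.features(x))
```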
zh
[CV-48] CipherFace: A Fully Homomorphic Encryption-Driven Framework for Secure Cloud-Based Facial Recognition
【速读】:该论文旨在解决面部识别系统在利用云资源进行距离计算时,如何在确保嵌入向量(Embeddings)隐私的前提下,实现安全高效的面部识别。关键解决方案在于引入了CipherFace框架,该框架基于全同态加密(Fully Homomorphic Encryption, FHE),能够在不解密的情况下直接对加密数据进行距离计算。此外,论文还提出了一种新的加密距离计算方法,适用于欧氏距离和余弦距离,从而解决了在加密数据上进行安全相似性计算的关键挑战。
链接: https://arxiv.org/abs/2502.18514
作者: Sefik Serengil,Alper Ozpinar
机构: Vorboss Limited (维博斯有限公司); Ibn Haldun University (伊本·哈勒敦大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Facial recognition systems rely on embeddings to represent facial images and determine identity by verifying if the distance between embeddings is below a pre-tuned threshold. While embeddings are not reversible to original images, they still contain sensitive information, making their security critical. Traditional encryption methods like AES are limited in securely utilizing cloud computational power for distance calculations. Homomorphic Encryption, allowing calculations on encrypted data, offers a robust alternative. This paper introduces CipherFace, a homomorphic encryption-driven framework for secure cloud-based facial recognition, which we have open-sourced at this http URL. By leveraging FHE, CipherFace ensures the privacy of embeddings while utilizing the cloud for efficient distance computation. Furthermore, we propose a novel encrypted distance computation method for both Euclidean and Cosine distances, addressing key challenges in performing secure similarity calculations on encrypted data. We also conducted experiments with different facial recognition models, various embedding sizes, and cryptosystem configurations, demonstrating the scalability and effectiveness of CipherFace in real-world applications.
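下面给出一个基于 TenSEAL 库的最小示意(非 CipherFace 的实际代码),演示在不解密的情况下计算两个加密人脸嵌入之间的平方欧氏距离;嵌入数值为示例数据,实际系统中应为人脸识别模型的输出向量。

```python
# A minimal sketch with the TenSEAL library (not CipherFace's actual code):
# squared Euclidean distance between two encrypted face embeddings.
import tenseal as ts

ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()  # needed for the rotations inside dot products

emb1 = [0.12, -0.53, 0.98, 0.04]   # stand-ins for real model embeddings
emb2 = [0.10, -0.50, 1.01, 0.00]
enc1, enc2 = ts.ckks_vector(ctx, emb1), ts.ckks_vector(ctx, emb2)

diff = enc1 - enc2                  # computed without decryption (cloud side)
enc_sq_dist = diff.dot(diff)        # encrypted squared Euclidean distance
sq_dist = enc_sq_dist.decrypt()[0]  # only the key holder can read the result
print(sq_dist)                      # approx 0.0038; compare against threshold**2
```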
zh
[CV-49] FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
【速读】:该论文旨在解决高分辨率文本导向视觉大语言模型(Vision Large Language Models, VLLMs)训练和部署效率低下的问题,特别是在视觉tokens压缩过程中导致的任务性能显著下降。解决方案的关键在于引入了一种轻量级的自蒸馏预训练阶段来压缩视觉tokens,此阶段仅需少量图像-文本配对数据和极小的学习参数量。此外,通过构建高质量的后训练阶段以缓解压缩token模型可能存在的性能退化问题。实验结果表明,该方法在保持计算开销显著降低的同时,在多种文本导向基准测试中超越了基线模型。
链接: https://arxiv.org/abs/2502.18512
作者: Jianjian Li,Junquan Fan,Feng Tang,Gang Huang,Shitao Zhu,Songlin Liu,Nian Xie,Wulong Liu,Yong Liao
机构: University of Science and Technology of China (中国科学技术大学); Huawei Noah’s Ark Lab. (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 18 figures, 6 tables
点击查看摘要
Abstract:The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a light-weight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-training stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
zh
[CV-50] Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition AAAI-2025
【速读】:该论文旨在解决多教师知识蒸馏(Multi-teacher Knowledge Distillation, MTKD)中教师权重平衡的问题。现有方法通常从单个教师表现或教师-学生差距的角度出发设计加权策略,缺乏全面的信息指导。论文提出了一种结合强化学习的多教师知识蒸馏方法(MTKD-RL),其关键是通过构建状态信息来优化多教师权重,并利用奖励反馈更新权重,从而实现更有效的学生与教师之间的交互,提升模型性能。实验结果表明,MTKD-RL在图像分类、目标检测及语义分割等视觉识别任务上达到了当前最先进的性能。
链接: https://arxiv.org/abs/2502.18510
作者: Chuanguang Yang,Xinqiang Yu,Han Yang,Zhulin An,Chengqing Yu,Libo Huang,Yongjun Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI-2025
点击查看摘要
Abstract:Multi-teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi-teacher KD is how to balance distillation strengths among various teachers. Most existing methods often develop weighting strategies from an individual perspective of teacher performance or teacher-student gaps, lacking comprehensive information for guidance. This paper proposes Multi-Teacher Knowledge Distillation with Reinforcement Learning (MTKD-RL) to optimize multi-teacher weights. In this framework, we construct both teacher performance and teacher-student gaps as state information to an agent. The agent outputs the teacher weight and can be updated by the return reward from the student. MTKD-RL reinforces the interaction between the student and teacher using an agent in an RL-based decision mechanism, achieving better matching capability with more meaningful weights. Experimental results on visual recognition tasks, including image classification, object detection, and semantic segmentation tasks, demonstrate that MTKD-RL achieves state-of-the-art performance compared to the existing multi-teacher KD works.
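下面用一个假设性的 PyTorch 示意(非论文原实现)说明多教师加权蒸馏损失的一般形式;MTKD-RL 中的权重由强化学习智能体输出,此处仅用“教师表现 + 师生差距”构造的简单状态做占位。

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, T=4.0):
    """Weighted multi-teacher distillation loss (a sketch of the general form).

    `weights` would come from the RL agent in MTKD-RL; here it is any
    non-negative vector summing to 1, one entry per teacher.
    """
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_t = F.softmax(t_logits / T, dim=1)
        loss = loss + w * F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
    return loss

# Illustrative agent state: per-teacher accuracy and teacher-student gap.
teacher_acc = torch.tensor([0.78, 0.81, 0.75])
gap = torch.tensor([0.10, 0.07, 0.13])
state = torch.stack([teacher_acc, 1.0 - gap], dim=1)  # (n_teachers, 2)
weights = torch.softmax(state.sum(dim=1), dim=0)      # stand-in for the policy output
```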
zh
[CV-51] REFINE: Inversion-Free Backdoor Defense via Model Reprogramming ICLR2025
【速读】:该论文旨在解决深度神经网络(DNNs)中的后门攻击问题,这种攻击允许对手在模型训练阶段植入隐藏的恶意行为。论文的关键解决方案是提出了一种名为REFINE的新方法,这是一种基于模型重新编程的无反转后门防御技术。REFINE包含两个关键组件:一是输入转换模块,用于破坏良性模式和后门模式,生成新的良性特征;二是输出重映射模块,重新定义模型的输出域以有效指导输入转换。通过进一步整合监督对比损失(supervised contrastive loss),REFINE增强了防御能力同时保持了模型的实用性。
链接: https://arxiv.org/abs/2502.18508
作者: Yukun Chen,Shuo Shao,Enhao Huang,Yiming Li,Pin-Yu Chen,Zhan Qin,Kui Ren
机构: State Key Laboratory of Blockchain and Data Security, Zhejiang University (浙江大学区块链与数据安全重点实验室); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新技术产业开发区(滨江)区块链与数据安全研究所); Nanyang Technological University (南洋理工大学); IBM Research (IBM研究院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper is accept by ICLR 2025. The first two authors contributed equally to this work. Our code is available at BackdoorBox ( this https URL ) and Github repository ( this https URL ). 28 pages
点击查看摘要
Abstract:Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model’s output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.
zh
[CV-52] Physical Depth-aware Early Accident Anticipation: A Multi-dimensional Visual Feature Fusion Framework
【速读】:该论文旨在解决早期交通事故预测任务中,现有方法在粗略二维图像空间中建模交通参与者(如车辆、行人等)交互时存在的不足,这些方法可能无法充分捕捉其真实位置和交互关系。关键解决方案在于提出了一种物理深度感知学习框架,该框架引入了由大型模型Depth-Anything生成的单目深度特征,以引入更精细的空间三维信息,并结合视觉交互特征和视觉动态特征,从而提供更全面的场景感知。此外,该框架通过分析序列帧中物体间的交互关系来捕获事故的早期指标,并引入重构邻接矩阵以应对被遮挡的关键交通参与者,从而减轻遮挡对象对图学习的影响,保持时空连续性。
链接: https://arxiv.org/abs/2502.18496
作者: Hongpu Huang,Wei Zhou,Chen Wang
机构: School of Transportation, Southeast University, Nanjing 211189, China (东南大学交通学院,中国南京211189); Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China (香港理工大学工业及系统工程系,中国香港)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Early accident anticipation from dashcam videos is a highly desirable yet challenging task for improving the safety of intelligent vehicles. Existing advanced accident anticipation approaches commonly model the interaction among traffic agents (e.g., vehicles, pedestrians, etc.) in the coarse 2D image space, which may not adequately capture their true positions and interactions. To address this limitation, we propose a physical depth-aware learning framework that incorporates the monocular depth features generated by a large model named Depth-Anything to introduce more fine-grained spatial 3D information. Furthermore, the proposed framework also integrates visual interaction features and visual dynamic features from traffic scenes to provide a more comprehensive perception towards the scenes. Based on these multi-dimensional visual features, the framework captures early indicators of accidents through the analysis of interaction relationships between objects in sequential frames. Additionally, the proposed framework introduces a reconstruction adjacency matrix for key traffic participants that are occluded, mitigating the impact of occluded objects on graph learning and maintaining the spatio-temporal continuity. Experimental results on public datasets show that the proposed framework attains state-of-the-art performance, highlighting the effectiveness of incorporating visual depth features and the superiority of the proposed framework.
zh
[CV-53] A Comprehensive Survey on Composed Image Retrieval
【速读】:该论文旨在综述组合图像检索(Composed Image Retrieval, CIR)领域,并系统性地分类现有的监督学习CIR和零样本CIR模型。关键在于通过细粒度的分类法整理超过120篇来自顶级会议和期刊的文献,同时讨论与CIR相关的任务方法,并分析多种基准数据集上的实验结果。这为研究者提供了对该领域的全面理解以及未来研究方向的实用见解。
链接: https://arxiv.org/abs/2502.18495
作者: Xuemeng Song,Haoqiang Lin,Haokun Wen,Bohan Hou,Mingzhu Xu,Liqiang Nie
机构: Shandong University(山东大学); Qingdao(青岛), China; Shandong University(山东大学); Qingdao(青岛), China; Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)); Shenzhen(深圳), China; City University of Hong Kong(香港城市大学); Hong Kong(香港), China; Shandong University(山东大学); Qingdao(青岛), China; Shandong University(山东大学); Ji’nan(济南), China; Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)); Shenzhen(深圳), China
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user’s desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration.
zh
[CV-54] Event-based Solutions for Human-centered Applications: A Comprehensive Review
【速读】:该论文旨在解决人本应用领域中事件相机(Event Cameras)研究分散的问题,并提供一个全面的综述以统一面部和身体任务的研究。论文的关键在于首次系统地回顾了事件相机在人体中心应用中的进展、挑战与机遇,并探讨了较少被研究的领域,如事件压缩技术和仿真框架,这些对于事件相机的广泛应用至关重要。
链接: https://arxiv.org/abs/2502.18490
作者: Mira Adra,Simone Melcarne,Nelida Mirabet-Herranz,Jean-Luc Dugelay
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Event cameras, often referred to as dynamic vision sensors, are groundbreaking sensors capable of capturing changes in light intensity asynchronously, offering exceptional temporal resolution and energy efficiency. These attributes make them particularly suited for human-centered applications, as they capture both the most intricate details of facial expressions and the complex motion dynamics of the human body. Despite growing interest, research in human-centered applications of event cameras remains scattered, with no comprehensive overview encompassing both body and face tasks. This survey bridges that gap by being the first to unify these domains, presenting an extensive review of advancements, challenges, and opportunities. We also examine less-explored areas, including event compression techniques and simulation frameworks, which are essential for the broader adoption of event cameras. This survey is designed to serve as a foundational reference that helps both new and experienced researchers understand the current state of the field and identify promising directions for future work in human-centered event camera applications. A summary of this survey can be found at this https URL
zh
[CV-55] Multi-modal Contrastive Learning for Tumor-specific Missing Modality Synthesis
【速读】:该论文旨在解决临床环境中难以获得高质量多模态磁共振成像(MRI)的问题,通过开发一种能够从现有源模态图像中合成缺失目标模态图像的生成模型来克服这一难题。解决方案的关键在于将多模态对比学习与针对关键肿瘤区域的集成方法相结合,并在对比学习过程中基于熵选择特征以增强其有效性。此外,该网络不仅能生成缺失的目标模态图像,还能同时预测分割输出,从而提高生成肿瘤区域的精确度,最终提升下游分割任务的性能。
链接: https://arxiv.org/abs/2502.19390
作者: Minjoo Lim,Bogyeong Kang,Tae-Eui Kam
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-modal magnetic resonance imaging (MRI) is essential for providing complementary information about brain anatomy and pathology, leading to more accurate diagnoses. However, obtaining high-quality multi-modal MRI in a clinical setting is difficult due to factors such as time constraints, high costs, and patient movement artifacts. To overcome this difficulty, there is increasing interest in developing generative models that can synthesize missing target modality images from the available source ones. Therefore, we design a generative model for missing MRI that integrates multi-modal contrastive learning with a focus on critical tumor regions. Specifically, we integrate multi-modal contrastive learning, tailored for multiple source modalities, and enhance its effectiveness by selecting features based on entropy during the contrastive learning process. Additionally, our network not only generates the missing target modality images but also predicts segmentation outputs simultaneously. This approach improves the generator’s capability to precisely generate tumor regions, ultimately improving performance in downstream segmentation tasks. By leveraging a combination of contrastive, segmentation, and additional self-representation losses, our model effectively reflects target-specific information and generates high-quality target images. Consequently, our results in the Brain MR Image Synthesis challenge demonstrate that the proposed model excelled in generating the missing modality.
zh
[CV-56] Deep Learning-Based Transfer Learning for Classification of Cassava Disease
链接: https://arxiv.org/abs/2502.19351
作者: Ademir G. Costa Junior,Fábio S. da Silva,Ricardo Rios
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, in Portuguese language, 3 figures
[CV-57] Deep learning and classical computer vision techniques in medical image analysis: Case studies on brain MRI tissue segmentation lung CT COPD registration and skin lesion classification
【速读】:该论文旨在系统性评估多种医学成像模态下的分割、配准及分类任务,并探索经典方法与深度学习(Deep Learning, DL)方法在这些任务中的互补优势。研究涵盖了脑部MRI组织分割、肺部CT图像配准及皮肤病变分类等应用。关键解决方案在于采用不同的模型架构,如对于脑组织分割,3D DL模型优于2D和基于补丁的模型,其中nnU-Net达到了0.9397的Dice系数;在皮肤病变分类中,深度学习模型的集成方法表现出色,InceptionResNetV2和ResNet50分别达到了90.44%和93.62%的分类精度。这些结果突显了不同方法在特定任务中的有效性及其相互补充的能力。
链接: https://arxiv.org/abs/2502.19258
作者: Anyimadu Daniel Tweneboah,Suleiman Taofik Ahmed,Hossain Mohammad Imran
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 18 figures
点击查看摘要
Abstract:Medical imaging spans diverse tasks and modalities which play a pivotal role in disease diagnosis, treatment planning, and monitoring. This study presents a novel exploration, the first to systematically evaluate segmentation, registration, and classification tasks across multiple imaging modalities. Integrating both classical and deep learning (DL) approaches in addressing brain MRI tissue segmentation, lung CT image registration, and skin lesion classification from dermoscopic images, we demonstrate the complementary strengths of these methodologies in diverse applications. For brain tissue segmentation, 3D DL models outperformed 2D and patch-based models, with nnU-Net achieving a Dice score of 0.9397 and 3D U-Net models on a ResNet34 backbone offering competitive results with a Dice score of 0.8946. Multi-Atlas methods provided robust alternatives for cases where DL methods are not feasible, achieving an average Dice score of 0.7267. In lung CT registration, classical Elastix-based methods outperformed DL models, achieving a minimum Target Registration Error (TRE) of 6.68 mm, highlighting the effectiveness of parameter tuning. HighResNet performed best among DL models with a TRE of 7.40 mm. For skin lesion classification, ensembles of DL models like InceptionResNetV2 and ResNet50 excelled, achieving accuracies of up to 90.44% and 93.62% for binary and multiclass classification, respectively. Also, adopting the One-vs-All method, DL attained accuracies of 94.64% (mel vs. others), 95.35% (bcc vs. others), and 96.93% (scc vs. others), while ML models, specifically a Multi-Layer Perceptron (MLP) on handcrafted features, offered interpretable alternatives with 85.04% accuracy using SMOTE for class imbalance correction on the multi-class task and 83.27% on the binary-class task. Links to source code are available on request.
zh
[CV-58] Multi-level Attention-guided Graph Neural Network for Image Restoration
【速读】:该论文旨在解决单尺度卷积神经网络方法在图像恢复任务中忽略多尺度信息整合的问题。论文的关键在于提出了一种多级注意力引导图神经网络,通过构建特征图中的元素块图和元素图,并利用多注意力机制提取局部结构特征和全局表示信息。这种方法能够实时学习动态连接,并通过图卷积算法传播和聚合信息,从而更有效地补充和纠正图像中缺失或不足的信息。
链接: https://arxiv.org/abs/2502.19181
作者: Jiatao Jiang,Zhen Cui,Chunyan Xu,Jian Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, deep learning has achieved remarkable success in the field of image restoration. However, most convolutional neural network-based methods typically focus on a single scale, neglecting the incorporation of multi-scale information. In image restoration tasks, local features of an image are often insufficient, necessitating the integration of global features to complement them. Although recent neural network algorithms have made significant strides in feature extraction, many models do not explicitly model global features or consider the relationship between global and local features. This paper proposes a multi-level attention-guided graph neural network. The proposed network explicitly constructs element block graphs and element graphs within feature maps using multi-attention mechanisms to extract both local structural features and global representation information of the image. Since the network struggles to effectively extract global information during image degradation, the structural information of local feature blocks can be used to correct and supplement the global information. Similarly, when element block information in the feature map is missing, it can be refined using global element representation information. The graph within the network learns real-time dynamic connections through the multi-attention mechanism, and information is propagated and aggregated via graph convolution algorithms. By combining local element block information and global element representation information from the feature map, the algorithm can more effectively restore missing information in the image. Experimental results on several classic image restoration tasks demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance.
zh
[CV-59] RetinaRegen: A Hybrid Model for Readability and Detail Restoration in Fundus Images
【速读】:该论文旨在解决由于现实条件导致的眼底图像模糊或难以读取的问题,从而增加诊断不确定性。为应对这些挑战,论文提出了一种名为RetinaRegen的混合模型,该模型集成了可读性分类模型、扩散模型(Diffusion Model)和变分自编码器(Variational Autoencoder, VAE)。关键在于这种多模型集成方法能够显著提升眼底图像的质量,特别是在视盘(Optic Disc, OD)区域的恢复效果,实验结果显示其在可读性标签(RO)上的峰值信噪比(PSNR)达到27.4521,结构相似性指数(SSIM)为0.9556,以及LPIPS值为0.1911,表明该方法在关键区域恢复方面表现出色,提供了有效增强眼底图像质量和辅助临床诊断的解决方案。
链接: https://arxiv.org/abs/2502.19153
作者: Yuhan Tang,Yudian Wang,Weizhen Li,Ye Yue,Chengchang Pan,Honggang Qi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Fundus image quality is crucial for diagnosing eye diseases, but real-world conditions often result in blurred or unreadable images, increasing diagnostic uncertainty. To address these challenges, this study proposes RetinaRegen, a hybrid model for retinal image restoration that integrates a readability classification model, a Diffusion Model, and a Variational Autoencoder (VAE). Experiments on the SynFundus-1M dataset show that the proposed method achieves a PSNR of 27.4521, an SSIM of 0.9556, and an LPIPS of 0.1911 for the readability labels of the optic disc (RO) region. These results demonstrate superior performance in restoring key regions, offering an effective solution to enhance fundus image quality and support clinical diagnosis.
zh
[CV-60] From Traditional to Deep Learning Approaches in Whole Slide Image Registration: A Methodological Review
【速读】:该论文旨在解决全幻灯片图像(Whole Slide Image, WSI)配准在分析肿瘤微环境(Tumor Microenvironment, TME)中的挑战。WSI配准任务涉及同一组织切片的不同染色或连续切片之间的空间信息对齐。论文的关键在于提供现有方法及其局限性的全面理解,并强调当前深度学习方法在WSI配准中的多样化方法。此外,论文还探讨了可用的数据集以及领域内使用的工具和软件,最终识别出该研究领域的开放性挑战和未来趋势。
链接: https://arxiv.org/abs/2502.19123
作者: Behnaz Elhaminia,Abdullah Alsalemi,Esha Nasir,Mostafa Jahanifar,Ruqayya Awan,Lawrence S. Young,Nasir M. Rajpoot,Fayyaz Minhas,Shan E Ahmed Raza
机构: Tissue Image Analytics (TIA) Centre, Department of Computer Science, University of Warwick, UK (华威大学); Division of Biomedical Sciences, Warwick Medical School, University of Warwick, UK (华威大学); Histofy Ltd, Coventry, UK
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Whole slide image (WSI) registration is an essential task for analysing the tumour microenvironment (TME) in histopathology. It involves the alignment of spatial information between WSIs of the same section or serial sections of a tissue sample. The tissue sections are usually stained with single or multiple biomarkers before imaging, and the goal is to identify neighbouring nuclei along the Z-axis for creating a 3D image or identifying subclasses of cells in the TME. This task is considerably more challenging compared to radiology image registration, such as magnetic resonance imaging or computed tomography, due to various factors. These include gigapixel size of images, variations in appearance between differently stained tissues, changes in structure and morphology between non-consecutive sections, and the presence of artefacts, tears, and deformations. Currently, there is a noticeable gap in the literature regarding a review of the current approaches and their limitations, as well as the challenges and opportunities they present. We aim to provide a comprehensive understanding of the available approaches and their application for various purposes. Furthermore, we investigate current deep learning methods used for WSI registration, emphasising their diverse methodologies. We examine the available datasets and explore tools and software employed in the field. Finally, we identify open challenges and potential future trends in this area of research.
zh
[CV-61] Max360IQ: Blind Omnidirectional Image Quality Assessment with Multi-axis Attention
【速读】:该论文旨在解决非均匀失真条件下全景图像感知质量评估的挑战。解决方案的关键在于提出了一种新颖且有效的盲向全景图像质量评估(BOIQA)模型——多轴注意力全景图像质量评估模型(Max360IQ)。Max360IQ 主要由叠加的多轴注意力模块组成,能够捕捉全局和局部的空间交互,并通过多尺度特征集成模块融合多尺度特征,最终结合深度语义引导的质量回归模块来预测全景图像的质量。
链接: https://arxiv.org/abs/2502.19046
作者: Jiebin Yan,Ziwen Tan,Yuming Fang,Jiale Rao,Yifan Zuo
机构: School of Information Technology, Jiangxi University of Finance and Economics (江西财经大学信息学院), China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Omnidirectional image, also called 360-degree image, is able to capture the entire 360-degree scene, thereby providing more realistic immersive feelings for users than general 2D image and stereoscopic image. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users’ quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6% in terms of SRCC on the JUFE database with non-uniform distortion, and gains improvement of 0.4% and 0.8% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at this https URL.
zh
[CV-62] PolypFlow: Reinforcing Polyp Segmentation with Flow-Driven Dynamics
【速读】:该论文旨在解决结肠镜图像中息肉分割面临的挑战,包括不规则的病变形态、边界模糊以及成像条件的异质性。论文的关键解决方案是提出了PolypFLow架构,通过注入物理学启发的优化动力学到分割细化过程中,利用流匹配增强机制来显式建模分割置信度在不确定性下的动态演变。不同于传统的级联网络,PolypFLow框架通过学习的速度场解决常微分方程(ODE),逐步将粗略的初始预测与真实掩膜对齐。这种方法提供了两个关键优势:1)可解释的优化:中间流步骤可视化模型如何修正欠分割区域并锐化边界;2)边界感知鲁棒性:流动力学显式建模沿息肉边缘的梯度方向,增强了对低对比度区域和运动伪影的抵抗力。
链接: https://arxiv.org/abs/2502.19037
作者: Pu Wang,Huaizhi Ma,Zhihua Zhang,Zhuoran Zheng
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate polyp segmentation remains challenging due to irregular lesion morphologies, ambiguous boundaries, and heterogeneous imaging conditions. While U-Net variants excel at local feature fusion, they often lack explicit mechanisms to model the dynamic evolution of segmentation confidence under uncertainty. Inspired by the interpretable nature of flow-based models, we present PolypFLow, a flow-matching enhanced architecture that injects physics-inspired optimization dynamics into segmentation refinement. Unlike conventional cascaded networks, our framework solves an ordinary differential equation (ODE) to progressively align coarse initial predictions with ground truth masks through learned velocity fields. This trajectory-based refinement offers two key advantages: 1) Interpretable Optimization: Intermediate flow steps visualize how the model corrects under-segmented regions and sharpens boundaries at each ODE-solver iteration, demystifying the “black-box” refinement process; 2) Boundary-Aware Robustness: The flow dynamics explicitly model gradient directions along polyp edges, enhancing resilience to low-contrast regions and motion artifacts. Extensive experimental results show that PolypFLow achieves state-of-the-art results while maintaining consistent performance across different lighting scenarios.
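下面给出一个假设性的 PyTorch 示意(非论文原实现),说明“求解 ODE 逐步精化分割掩膜”的欧拉积分形式;velocity_net 的输入输出约定为本文假设,中间轨迹可用于可视化边界如何被逐步锐化。

```python
import torch
import torch.nn as nn

def refine_mask(velocity_net: nn.Module, image: torch.Tensor,
                coarse_mask: torch.Tensor, n_steps: int = 8):
    """Euler integration of a learned velocity field over mask logits (a sketch).

    Starting from a coarse prediction, each ODE-solver step moves the mask
    toward the target distribution; intermediate states can be saved to
    visualize how under-segmented regions are corrected step by step.
    The velocity_net call signature is an illustrative assumption.
    """
    m = coarse_mask.clone()
    dt = 1.0 / n_steps
    trajectory = [m]
    for step in range(n_steps):
        t = torch.full(m.shape[:1], step * dt, device=m.device)  # current "time"
        v = velocity_net(torch.cat([image, m], dim=1), t)        # velocity at (m, t)
        m = m + dt * v                                           # Euler update
        trajectory.append(m)
    return m, trajectory
```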
zh
[CV-63] InternVQA: Advancing Compressed Video QualityAssessment with Distilling Large Foundation Model ISCAS2025
链接: https://arxiv.org/abs/2502.19026
作者: Fengbin Guan,Zihao Yu,Yiting Lu,Xin Li,Zhibo Chen
机构: University of Science and Technology of China (中国科学技术大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ISCAS 2025(Lecture)
[CV-64] Subclass Classification of Gliomas Using MRI Fusion Technique
【速读】:该论文旨在解决胶质瘤(Glioma)的精确分类问题,以提高治疗规划和预后预测的准确性。胶质瘤具有不同的侵袭性和预后,因此其分类至关重要。论文的关键解决方案在于开发了一种算法,通过融合来自T1、T2、T1ce和液体衰减反转恢复(FLAIR)序列的MRI图像,并使用最大最小归一化(max-min normalization)进行预处理,以确保像素强度值的一致性。采用UNET架构分别在二维和三维图像上进行分割,随后利用加权平均技术融合多模态MRI图像中的分割区域。此外,将二维和三维分割输出结合,以捕捉肿瘤形状、边界和强度分布等详细特征,并提供脑体积内的全面视图。这些融合后的图像被用作输入,通过预训练的ResNet50模型进行胶质瘤亚类分类。该方法实现了99.25%的分类准确率、99.30%的精度、99.10%的召回率、99.19%的F1得分、84.49%的交并比(Intersection Over Union)以及99.76%的特异性,显著优于现有技术。
链接: https://arxiv.org/abs/2502.18775
作者: Kiranmayee Janardhan,Christy Bobby Thomas
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 7 figures, 1 algorithm, 4 tables, journal paper
点击查看摘要
Abstract:Glioma, the prevalent primary brain tumor, exhibits diverse aggressiveness levels and prognoses. Precise classification of glioma is paramount for treatment planning and predicting prognosis. This study aims to develop an algorithm to fuse the MRI images from T1, T2, T1ce, and fluid-attenuated inversion recovery (FLAIR) sequences to enhance the efficacy of glioma subclass classification as no tumor, necrotic core, peritumoral edema, and enhancing tumor. The MRI images from BraTS datasets were used in this work. The images were pre-processed using max-min normalization to ensure consistency in pixel intensity values across different images. The segmentation of the necrotic core, peritumoral edema, and enhancing tumor was performed on 2D and 3D images separately using the UNET architecture. Further, the segmented regions from multimodal MRI images were fused using the weighted averaging technique. Integrating 2D and 3D segmented outputs enhances classification accuracy by capturing detailed features like tumor shape, boundaries, and intensity distribution in slices, while also providing a comprehensive view of spatial extent, shape, texture, and localization within the brain volume. The fused images were used as input to the pre-trained ResNet50 model for glioma subclass classification. The network is trained on 80% and validated on 20% of the data. The proposed method achieved a classification accuracy of 99.25%, precision of 99.30%, recall of 99.10%, F1 score of 99.19%, Intersection over Union of 84.49%, and specificity of 99.76%, significantly outperforming existing techniques. These findings emphasize the significance of glioma segmentation and classification in aiding accurate diagnosis.
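下面用一个假设性的 NumPy 示意(非论文原实现)说明多模态分割结果的加权平均融合:对每个模态的概率图按权重求加权和,再逐像素取 argmax 得到融合后的类别标签。

```python
import numpy as np

def fuse_segmentations(prob_maps, weights):
    """Weighted-average fusion of per-modality segmentation outputs (a sketch).

    prob_maps: list of arrays shaped (H, W, n_classes), one per MRI sequence
               (e.g., T1, T2, T1ce, FLAIR), each a softmax probability map.
    weights:   one non-negative weight per modality.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()               # normalize to sum to 1
    fused = sum(w * p for w, p in zip(weights, prob_maps))
    return fused.argmax(axis=-1)                    # per-pixel class labels
```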
zh
[CV-65] TerraTrace: Temporal Signature Land Use Mapping System
链接: https://arxiv.org/abs/2502.18704
作者: Angela Busheska,Vikram Iyer,Bruno Silva,Peder Olsen,Ranveer Chandra,Vaishnavi Ranganathan
机构: Microsoft Research (微软研究); Lafayette College (拉法耶特学院); University of Washington (华盛顿大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-66] A Comparative Review of the Histogram-based Image Segmentation Methods
【速读】:该论文旨在回顾和比较基于直方图的图像分割技术的历史及其最新进展,并将其分为四类:基于均值的方法、基于高斯混合模型的方法、基于熵的方法以及基于特征点的方法。论文首先描述了经典基于直方图的图像分割方法的原理,然后客观比较了它们的性能。此外,论文还对比了基于直方图的图像分割方法与通用深度学习方法在分割具有均匀或简单背景的对象时的效果。论文的关键在于强调基于直方图的图像分割方法在处理多种类型的图像时,相较于未经特殊训练的通用深度学习方法更为准确。
链接: https://arxiv.org/abs/2502.18550
作者: ZhenZhou Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The histogram of an image is the accurate graphical representation of the numerical grayscale distribution and it is also an estimate of the probability distribution of image pixels. Therefore, histograms have been widely adopted to calculate the clustering means and partitioning thresholds for image segmentation. Many classical histogram-based image segmentation methods have been proposed and have played important roles in both academia and industry. In this article, the histories and recent advances of the histogram-based image segmentation techniques are first reviewed and then they are divided into four categories: (1) the means-based method; (2) the Gaussian-mixture-model-based method; (3) the entropy-based method; and (4) the feature-points-based method. The principles of the classical histogram-based image segmentation methods are described first and then their performances are compared objectively. In addition, the histogram-based image segmentation methods are compared with general-purpose deep learning methods in segmenting objects with uniform or simple backgrounds. The histogram-based image segmentation methods are more accurate than universal deep-learning methods without special training in segmenting many types of images.
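作为文中“基于均值的方法”一类的代表,下面给出经典 Otsu 阈值法直接从灰度直方图计算分割阈值的 Python 实现示意(标准算法,非综述原文代码)。

```python
import numpy as np

def otsu_threshold(image):
    """Classic Otsu threshold computed directly from the grayscale histogram.

    Picks the gray level that maximizes between-class variance, one of the
    histogram-based segmentation techniques the review covers.
    """
    hist, _ = np.histogram(image.ravel(), bins=256, range=(0, 256))
    p = hist / hist.sum()                  # histogram as a probability estimate
    omega = np.cumsum(p)                   # class-0 probability up to each level
    mu = np.cumsum(p * np.arange(256))     # cumulative mean
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold (guard 0/0 cases).
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# Usage: binary = image > otsu_threshold(image)
```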
zh
[CV-67] End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework
【速读】:该论文旨在解决传统脑成像分析流程中各步骤独立处理导致的高成本注释和质量控制问题。解决方案的关键在于提出UniBrain,一个统一的端到端框架,将所有处理步骤整合为单一优化过程,实现任务间的相互作用和精炼,从而在减少标注需求的同时提升精度和计算效率。
链接: https://arxiv.org/abs/2502.18523
作者: Yao Su,Keqi Han,Mingjie Zeng,Lichao Sun,Liang Zhan,Carl Yang,Lifang He,Xiangnan Kong
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Brain imaging analysis is fundamental in neuroscience, providing valuable insights into brain structure and function. Traditional workflows follow a sequential pipeline-brain extraction, registration, segmentation, parcellation, network generation, and classification-treating each step as an independent task. These methods rely heavily on task-specific training data and expert intervention to correct intermediate errors, making them particularly burdensome for high-dimensional neuroimaging data, where annotations and quality control are costly and time-consuming. We introduce UniBrain, a unified end-to-end framework that integrates all processing steps into a single optimization process, allowing tasks to interact and refine each other. Unlike traditional approaches that require extensive task-specific annotations, UniBrain operates with minimal supervision, leveraging only low-cost labels (i.e., classification and extraction) and a single labeled atlas. By jointly optimizing extraction, registration, segmentation, parcellation, network generation, and classification, UniBrain enhances both accuracy and computational efficiency while significantly reducing annotation effort. Experimental results demonstrate its superiority over existing methods across multiple tasks, offering a more scalable and reliable solution for neuroimaging analysis. Our code and data can be found at this https URL
zh
[CV-68] Rewards-based image analysis in microscopy
链接: https://arxiv.org/abs/2502.18522
作者: Kamyar Barakati,Yu Liu,Utkarsh Pratiush,Boris N. Slautin,Sergei V. Kalinin
机构: 未知
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
备注: 38 pages, 11 figures
[CV-69] FreeTumor: Large-Scale Generative Tumor Synthesis in Computed Tomography Images for Improving Tumor Recognition
【速读】:该论文旨在解决肿瘤识别领域中因标注数据稀缺而限制AI技术发展的挑战。论文的关键解决方案是引入FreeTumor,这是一种创新的生成式AI(Generative AI, GAI)框架,通过有效利用有限的标注数据和大规模未标注数据进行肿瘤合成训练,从而实现大规模肿瘤图像的合成以缓解数据稀缺问题。这种方法显著扩展了训练数据集的规模,提升了肿瘤识别的精度和效率。
链接: https://arxiv.org/abs/2502.18519
作者: Linshan Wu,Jiaxin Zhuang,Yanning Zhou,Sunan He,Jiabo Ma,Luyang Luo,Xi Wang,Xuefeng Ni,Xiaoling Zhong,Mingxiang Wu,Yinghua Zhao,Xiaohui Duan,Varut Vardhanabhuti,Pranav Rajpurkar,Hao Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tumor is a leading cause of death worldwide, with an estimated 10 million deaths attributed to tumor-related diseases every year. AI-driven tumor recognition unlocks new possibilities for more precise and intelligent tumor screening and diagnosis. However, the progress is heavily hampered by the scarcity of annotated datasets, which demands extensive annotation efforts by radiologists. To tackle this challenge, we introduce FreeTumor, an innovative Generative AI (GAI) framework to enable large-scale tumor synthesis for mitigating data scarcity. Specifically, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. Unleashing the power of large-scale data, FreeTumor is capable of synthesizing a large number of realistic tumors on images for augmenting training datasets. To this end, we create the largest training dataset for tumor synthesis and recognition by curating 161,310 publicly available Computed Tomography (CT) volumes from 33 sources, with only 2.3% containing annotated tumors. To validate the fidelity of synthetic tumors, we engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors, as they achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing our synthetic tumors from real ones. Through high-quality tumor synthesis, FreeTumor scales up the recognition training datasets by over 40 times, showcasing a notable superiority over state-of-the-art AI methods including various synthesis methods and foundation models. These findings indicate promising prospects of FreeTumor in clinical applications, potentially advancing tumor treatments and improving the survival rates of patients.
zh
[CV-70] Gradient entropy (GradEn): The two dimensional version of slope entropy for image analysis
【速读】:该论文旨在解决复杂系统或信号中量化不规则性的问题,并特别关注于二维图像数据的特征提取。论文的关键解决方案是引入了一种名为梯度熵(Gradient Entropy, GradEn)的新方法,它是斜率熵(slope entropy)的二维扩展,能够同时考虑符号模式和幅度信息。实验结果表明,GradEn在区分具有不同特性的图像时表现出色,并且计算成本低,优于其他二维熵方法。
链接: https://arxiv.org/abs/2502.18516
作者: Runze Jiang,Pengjian Shang
机构: School of mathematics and statistics, Beijing Jiaotong University(北京交通大学数学与统计学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Information theory and Shannon entropy are essential for quantifying irregularity in complex systems or signals. Recently, two-dimensional entropy methods, such as two-dimensional sample entropy, distribution entropy, and permutation entropy, have been proposed for analyzing 2D texture or image data. This paper introduces Gradient entropy (GradEn), an extension of slope entropy to 2D, which considers both symbolic patterns and amplitude information, enabling better feature extraction from image data. We evaluate GradEn with simulated data, including 2D colored noise, 2D mixed processes, and the logistic map. Results show the ability of GradEn to distinguish images with various characteristics while maintaining low computational cost. Real-world datasets, consisting of texture, fault gear, and railway corrugation signals, demonstrate the superior performance of GradEn in classification tasks compared to other 2D entropy methods. In conclusion, GradEn is an effective tool for image characterization, offering a novel approach for image processing and recognition.
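Since the paper's exact formulation is not reproduced here, the following sketch only extrapolates the 1D slope-entropy recipe to 2D: symbolize horizontal and vertical intensity differences with two thresholds, then take the Shannon entropy of the resulting symbol pairs. The thresholds and the symbol scheme are assumptions, not the authors' definition:

```python
import numpy as np

def symbolize(diff: np.ndarray, delta: float, gamma: float) -> np.ndarray:
    """Map differences to 5 symbols {-2,-1,0,1,2}, as in slope entropy."""
    s = np.zeros_like(diff, dtype=np.int8)
    s[diff > gamma] = 2
    s[(diff > delta) & (diff <= gamma)] = 1
    s[diff < -gamma] = -2
    s[(diff < -delta) & (diff >= -gamma)] = -1
    return s

def grad_entropy(img: np.ndarray, delta: float = 0.001, gamma: float = 1.0) -> float:
    """Shannon entropy (nats) of joint horizontal/vertical slope symbols."""
    f = img.astype(np.float64)
    dx = np.diff(f, axis=1)[:-1, :]        # horizontal slopes, (H-1, W-1)
    dy = np.diff(f, axis=0)[:, :-1]        # vertical slopes,   (H-1, W-1)
    pairs = (symbolize(dx, delta, gamma) + 2) * 5 + (symbolize(dy, delta, gamma) + 2)
    counts = np.bincount(pairs.ravel(), minlength=25).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())
```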
zh
[CV-71] Exploring Patient Data Requirements in Training Effective AI Models for MRI-based Breast Cancer Classification MICCAI2024
链接: https://arxiv.org/abs/2502.18506
作者: Solha Kang,Wesley De Neve,Francois Rameau,Utku Ozbulak
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in MICCAI 2024 Deep Breast Workshop on AI and Imaging for Diagnostic and Treatment Challenges in Breast Care
[CV-72] Deciphering Functions of Neurons in Vision-Language Models
链接: https://arxiv.org/abs/2502.18485
作者: Jiaqi Xu,Cuiling Lan,Xuejin Chen,Yan Lu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 23 figures
人工智能
[AI-0] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
链接: https://arxiv.org/abs/2502.19417
作者: Lucy Xiaoyang Shi,Brian Ichter,Michael Equi,Liyiming Ke,Karl Pertsch,Quan Vuong,James Tanner,Anna Walling,Haohuan Wang,Niccolo Fusai,Adrian Li-Bell,Danny Driess,Lachy Groom,Sergey Levine,Chelsea Finn
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., “Could you make me a vegetarian sandwich?” or “I don’t like that one”) require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands (“pick up the cup”), our system can reason through complex prompts and incorporate situated feedback during task execution (“that’s not trash”). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping.
[AI-1] Less or More: Towards Glanceable Explanations for LLM Recommendations Using Ultra-Small Devices
链接: https://arxiv.org/abs/2502.19410
作者: Xinru Wang,Mengjie Yu,Hannah Nguyen,Michael Iuzzolino,Tianyi Wang,Peiqi Tang,Natasha Lynova,Co Tran,Ting Zhang,Naveen Sendhilnathan,Hrvoje Benko,Haijun Xia,Tanya Jonker
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown remarkable potential in recommending everyday actions as personal AI assistants, while Explainable AI (XAI) techniques are being increasingly utilized to help users understand why a recommendation is given. Personal AI assistants today are often located on ultra-small devices such as smartwatches, which have limited screen space. The verbosity of LLM-generated explanations, however, makes it challenging to deliver glanceable LLM explanations on such ultra-small devices. To address this, we explored 1) spatially structuring an LLM’s explanation text using defined contextual components during prompting and 2) presenting temporally adaptive explanations to users based on confidence levels. We conducted a user study to understand how these approaches impacted user experiences when interacting with LLM recommendations and explanations on ultra-small devices. The results showed that structured explanations reduced users’ time to action and cognitive load when reading an explanation. Always-on structured explanations increased users’ acceptance of AI recommendations. However, users were less satisfied with structured explanations compared to unstructured ones due to their lack of sufficient, readable details. Additionally, adaptively presenting structured explanations was less effective at improving user perceptions of the AI compared to the always-on structured explanations. Together with users’ interview feedback, the results led to design implications to be mindful of when personalizing the content and timing of LLM explanations that are displayed on ultra-small devices.
[AI-2] Efficient 4D fMRI ASD Classification using Spatial-Temporal-Omics-based Learning Framework
链接: https://arxiv.org/abs/2502.19386
作者: Ziqiao Weng,Weidong Cai,Bo Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at 2025 IEEE International Symposium on Biomedical Imaging (ISBI)
点击查看摘要
Abstract:Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder impacting social and behavioral development. Resting-state fMRI, a non-invasive tool for capturing brain connectivity patterns, aids in early ASD diagnosis and differentiation from typical controls (TC). However, previous methods, which rely on either mean time series or full 4D data, are limited by a lack of spatial information or by high computational costs. This underscores the need for an efficient solution that preserves both spatial and temporal information. In this paper, we propose a novel, simple, and efficient spatial-temporal-omics learning framework designed to efficiently extract spatio-temporal features from fMRI for ASD classification. Our approach addresses these limitations by utilizing 3D time-domain derivatives as the spatial-temporal inter-voxel omics, which preserve full spatial resolution while capturing diverse statistical characteristics of the time series at each voxel. Meanwhile, functional connectivity features serve as the spatial-temporal inter-regional omics, capturing correlations across brain regions. Extensive experiments and ablation studies on the ABIDE dataset demonstrate that our framework significantly outperforms previous methods while maintaining computational efficiency. We believe our research offers valuable insights that will inform and advance future ASD studies, particularly in the realm of spatial-temporal-omics-based learning.
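A simplified rendering of the two feature families described above (the concrete statistics used by the authors are not specified here; the ones below are illustrative):

```python
import numpy as np

def voxel_derivative_omics(fmri: np.ndarray) -> np.ndarray:
    """Inter-voxel omics: stats of the time-domain derivative.

    fmri: (T, X, Y, Z) 4D series -> (3, X, Y, Z) stacked 3D feature maps.
    """
    d = np.diff(fmri, axis=0)                    # temporal derivative
    return np.stack([d.mean(axis=0),             # example statistics only
                     d.std(axis=0),
                     np.abs(d).max(axis=0)], axis=0)

def functional_connectivity(roi_ts: np.ndarray) -> np.ndarray:
    """Inter-regional omics: roi_ts (T, R) -> (R, R) correlation matrix."""
    return np.corrcoef(roi_ts.T)
```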
[AI-3] Preference-Based Gradient Estimation for ML-Based Approximate Combinatorial Optimization
链接: https://arxiv.org/abs/2502.19377
作者: Arman Mielke,Uwe Bauknecht,Thilo Strauss,Mathias Niepert
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preliminary work, under review
点击查看摘要
Abstract:Combinatorial optimization (CO) problems arise in a wide range of fields from medicine to logistics and manufacturing. While exact solutions are often not necessary, many applications require finding high-quality solutions quickly. For this purpose, we propose a data-driven approach to improve existing non-learned approximation algorithms for CO. We parameterize the approximation algorithm and train a graph neural network (GNN) to predict parameter values that lead to the best possible solutions. Our pipeline is trained end-to-end in a self-supervised fashion using gradient estimation, treating the approximation algorithm as a black box. We propose a novel gradient estimation scheme for this purpose, which we call preference-based gradient estimation. Our approach combines the benefits of the neural network and the non-learned approximation algorithm: The GNN leverages the information from the dataset to allow the approximation algorithm to find better solutions, while the approximation algorithm guarantees that the solution is feasible. We validate our approach on two well-known combinatorial optimization problems, the travelling salesman problem and the minimum k-cut problem, and show that our method is competitive with state of the art learned CO solvers.
[AI-4] Physics-Based Hybrid Machine Learning for Critical Heat Flux Prediction with Uncertainty Quantification
链接: https://arxiv.org/abs/2502.19357
作者: Aidan Furlong,Xingang Zhao,Robert Salko,Xu Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to the International Journal of Heat and Mass Transfer
点击查看摘要
Abstract:Critical heat flux is a key quantity in boiling system modeling due to its impact on heat transfer and component temperature and performance. This study investigates the development and validation of an uncertainty-aware hybrid modeling approach that combines machine learning with physics-based models in the prediction of critical heat flux in nuclear reactors for cases of dryout. Two empirical correlations, Biasi and Bowring, were employed with three machine learning uncertainty quantification techniques: deep neural network ensembles, Bayesian neural networks, and deep Gaussian processes. A pure machine learning model without a base model served as a baseline for comparison. This study examines the performance and uncertainty of the models under both plentiful and limited training data scenarios using parity plots, uncertainty distributions, and calibration curves. The results indicate that the Biasi hybrid deep neural network ensemble achieved the most favorable performance (with a mean absolute relative error of 1.846% and stable uncertainty estimates), particularly in the plentiful data scenario. The Bayesian neural network models showed slightly higher error and uncertainty but superior calibration. By contrast, deep Gaussian process models underperformed by most metrics. All hybrid models outperformed pure machine learning configurations, demonstrating resistance against data scarcity.
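The hybrid idea can be pictured as residual learning on top of a physics-based estimate, with an ensemble supplying the uncertainty. This is a schematic under assumed interfaces, not the study's code; the Biasi/Bowring base prediction is treated as a precomputed input:

```python
import torch
import torch.nn as nn

class CorrectionNet(nn.Module):
    """Small network that learns a residual on top of the base correlation."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def ensemble_predict(models: list, x: torch.Tensor, chf_base: torch.Tensor):
    """Hybrid prediction = physics-based estimate + learned residual.

    The spread across ensemble members serves as the uncertainty estimate.
    """
    preds = torch.stack([chf_base + m(x).squeeze(-1) for m in models])
    return preds.mean(dim=0), preds.std(dim=0)
```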
[AI-5] Joint Optimal Transport and Embedding for Network Alignment
链接: https://arxiv.org/abs/2502.19334
作者: Qi Yu,Zhichen Zeng,Yuchen Yan,Lei Ying,R. Srikant,Hanghang Tong
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 12 pages, 7 figures
点击查看摘要
Abstract:Network alignment, which aims to find node correspondence across different networks, is the cornerstone of various downstream multi-network and Web mining tasks. Most of the embedding-based methods indirectly model cross-network node relationships by contrasting positive and negative node pairs sampled from hand-crafted strategies, which are vulnerable to graph noises and lead to potential misalignment of nodes. Another line of work based on the optimal transport (OT) theory directly models cross-network node relationships and generates noise-reduced alignments. However, OT methods heavily rely on fixed, pre-defined cost functions that prohibit end-to-end training and are hard to generalize. In this paper, we aim to unify the embedding and OT-based methods in a mutually beneficial manner and propose a joint optimal transport and embedding framework for network alignment named JOENA. For one thing (OT for embedding), through a simple yet effective transformation, the noise-reduced OT mapping serves as an adaptive sampling strategy directly modeling all cross-network node pairs for robust embedding learning. For another (embedding for OT), on top of the learned embeddings, the OT cost can be gradually trained in an end-to-end fashion, which further enhances the alignment quality. With a unified objective, the mutual benefits of both methods can be achieved by an alternating optimization schema with guaranteed convergence. Extensive experiments on real-world networks validate the effectiveness and scalability of JOENA, achieving up to 16% improvement in MRR and 20x speedup compared with the state-of-the-art alignment methods.
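The OT side of such a framework is typically instantiated with entropic regularization and Sinkhorn iterations. The sketch below uses a fixed cosine cost built from node embeddings purely for illustration; JOENA itself trains the cost end-to-end:

```python
import numpy as np

def sinkhorn(C: np.ndarray, eps: float = 0.05, iters: int = 200) -> np.ndarray:
    """Entropic-regularized OT plan for uniform marginals and cost matrix C."""
    n, m = C.shape
    K = np.exp(-C / eps)                  # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m) / m
    for _ in range(iters):                # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # transport plan diag(u) K diag(v)

def soft_alignment(emb1: np.ndarray, emb2: np.ndarray) -> np.ndarray:
    """Soft node correspondence from cosine distances between embeddings."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    return sinkhorn(1.0 - e1 @ e2.T)
```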
[AI-6] Partition Tree Weighting for Non-Stationary Stochastic Bandits
链接: https://arxiv.org/abs/2502.19325
作者: Joel Veness,Marcus Hutter,Andras Gyorgy,Jordi Grau-Moya
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper considers a generalisation of universal source coding for interaction data, namely data streams that have actions interleaved with observations. Our goal will be to construct a coding distribution that is both universal and can be used as a control policy. Allowing for action generation needs careful treatment, as naive approaches which do not distinguish between actions and observations run into the self-delusion problem in universal settings. We showcase our perspective in the context of the challenging non-stationary stochastic Bernoulli bandit problem. Our main contribution is an efficient and high-performing algorithm for this problem that generalises the Partition Tree Weighting universal source coding technique for passive prediction to the control setting.
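As background, the original Partition Tree Weighting recursion for passive prediction (Veness et al., 2013) mixes a base model ρ over all binary temporal partitions of a length-n = 2^d sequence; the paper generalises this construction to the control setting:

```latex
% PTW mixture over depth-d binary temporal partitions, base model rho:
\mathrm{PTW}_d(x_{1:n}) \;=\; \tfrac{1}{2}\,\rho(x_{1:n})
  \;+\; \tfrac{1}{2}\,\mathrm{PTW}_{d-1}\!\left(x_{1:n/2}\right)
        \mathrm{PTW}_{d-1}\!\left(x_{n/2+1:n}\right)
```

Each recursion level hedges between "one stationary segment" and "two independently modeled halves", which is what lets the mixture track the non-stationary regime changes of the bandit problem.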
[AI-7] Faithful Logic Embeddings in HOL – A recipe to have it all: deep and shallow automated and interactive heavy and light proofs and counterexamples meta and object level
链接: https://arxiv.org/abs/2502.19311
作者: Christoph Benzmüller
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Logic (math.LO)
*备注: 22 pages, 9 figures
点击查看摘要
Abstract:Deep and shallow embeddings of non-classical logics in classical higher-order logic have been explored, implemented, and used in various automated reasoning tools in recent years. This paper presents a recipe for the simultaneous deployment of different forms of deep and shallow embeddings in classical higher-order logic, enabling not only flexible interactive and automated theorem proving and counterexample finding at meta and object level, but also automated faithfulness proofs between the logic embeddings. The approach, which is fruitful for logic education, research and application, is deliberately illustrated here using simple propositional modal logic. However, the work presented is conceptual in nature and not limited to such a simple logic context.
[AI-8] WOFOSTGym: A Crop Simulator for Learning Annual and Perennial Crop Management Strategies
链接: https://arxiv.org/abs/2502.19308
作者: William Solow,Sandhya Saisubramanian,Alan Fern
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We introduce WOFOSTGym, a novel crop simulation environment designed to train reinforcement learning (RL) agents to optimize agromanagement decisions for annual and perennial crops in single and multi-farm settings. Effective crop management requires optimizing yield and economic returns while minimizing environmental impact, a complex sequential decision-making problem well suited for RL. However, the lack of simulators for perennial crops in multi-farm contexts has hindered RL applications in this domain. Existing crop simulators also do not support multiple annual crops. WOFOSTGym addresses these gaps by supporting 23 annual crops and two perennial crops, enabling RL agents to learn diverse agromanagement strategies in multi-year, multi-crop, and multi-farm settings. Our simulator offers a suite of challenging tasks for learning under partial observability, non-Markovian dynamics, and delayed feedback. WOFOSTGym’s standard RL interface allows researchers without agricultural expertise to explore a wide range of agromanagement problems. Our experiments demonstrate the learned behaviors across various crop varieties and soil types, highlighting WOFOSTGym’s potential for advancing RL-driven decision support in agriculture.
[AI-9] Anomaly Detection in Complex Dynamical Systems: A Systematic Framework Using Embedding Theory and Physics-Inspired Consistency
链接: https://arxiv.org/abs/2502.19307
作者: Michael Somma,Thomas Gallien,Branka Stojanovic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Anomaly detection in complex dynamical systems is essential for ensuring reliability, safety, and efficiency in industrial and cyber-physical infrastructures. Predictive maintenance helps prevent costly failures, while cybersecurity monitoring has become critical as digitized systems face growing threats. Many of these systems exhibit oscillatory behaviors and bounded motion, requiring anomaly detection methods that capture structured temporal dependencies while adhering to physical consistency principles. In this work, we propose a system-theoretic approach to anomaly detection, grounded in classical embedding theory and physics-inspired consistency principles. We build upon the Fractal Whitney Embedding Prevalence Theorem, extending traditional embedding techniques to complex system dynamics. Additionally, we introduce state-derivative pairs as an embedding strategy to capture system evolution. To enforce temporal coherence, we develop a Temporal Differential Consistency Autoencoder (TDC-AE), incorporating a TDC-Loss that aligns the approximated derivatives of latent variables with their dynamic representations. We evaluate our method on the C-MAPSS dataset, a benchmark for turbofan aeroengine degradation. TDC-AE outperforms LSTMs and Transformers while achieving a 200x reduction in MAC operations, making it particularly suited for lightweight edge computing. Our findings support the hypothesis that anomalies disrupt stable system dynamics, providing a robust, interpretable signal for anomaly detection.
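The TDC-Loss can be pictured as a finite-difference consistency penalty in latent space. A minimal sketch, assuming the encoder emits both a latent state and a predicted latent derivative (the exact pairing in the paper may differ):

```python
import torch
import torch.nn.functional as F

def tdc_loss(z_t: torch.Tensor, z_next: torch.Tensor,
             z_dot_pred: torch.Tensor, dt: float = 1.0) -> torch.Tensor:
    """Penalize mismatch between the latent finite difference and the
    predicted latent derivative (dz/dt), enforcing temporal coherence."""
    z_dot_fd = (z_next - z_t) / dt        # finite-difference derivative
    return F.mse_loss(z_dot_pred, z_dot_fd)

# Combined objective (weighting is a placeholder):
# total = reconstruction_loss + lambda_tdc * tdc_loss(z_t, z_next, z_dot_pred)
```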
[AI-10] Corporate Fraud Detection in Rich-yet-Noisy Financial Graph
链接: https://arxiv.org/abs/2502.19305
作者: Shiqi Wang,Zhibo Zhang,Libing Fang,Cam-Tu Nguyen,Wenzhon Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM); Statistical Finance (q-fin.ST)
*备注:
点击查看摘要
Abstract:Corporate fraud detection aims to automatically recognize companies that conduct wrongful activities such as fraudulent financial statements or illegal insider trading. Previous learning-based methods fail to effectively integrate rich interactions in the company network. To close this gap, we collect 18-year financial records in China to form three graph datasets with fraud labels. We analyze the characteristics of the financial graphs, highlighting two pronounced issues: (1) information overload: the dominance of (noisy) non-company nodes over company nodes hinders the message-passing process in Graph Convolution Networks (GCN); and (2) hidden fraud: there exists a large percentage of possible undetected violations in the collected data. The hidden fraud problem will introduce noisy labels in the training dataset and compromise fraud detection results. To handle such challenges, we propose a novel graph-based method, namely, Knowledge-enhanced GCN with Robust Two-stage Learning (KeGCN_R), which leverages Knowledge Graph Embeddings to mitigate the information overload and effectively learns rich representations. The proposed model adopts a two-stage learning method to enhance robustness against hidden frauds. Extensive experimental results not only confirm the importance of interactions but also show the superiority of KeGCN_R over a number of strong baselines in terms of fraud detection effectiveness and robustness.
[AI-11] Combining Planning and Reinforcement Learning for Solving Relational Multiagent Domains
链接: https://arxiv.org/abs/2502.19297
作者: Nikhilesh Prabhakar,Ranveer Singh,Harsha Kokel,Sriraam Natarajan,Prasad Tadepalli
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multiagent Reinforcement Learning (MARL) poses significant challenges due to the exponential growth of state and action spaces and the non-stationary nature of multiagent environments. This results in notable sample inefficiency and hinders generalization across diverse tasks. The complexity is further pronounced in relational settings, where domain knowledge is crucial but often underutilized by existing MARL algorithms. To overcome these hurdles, we propose integrating relational planners as centralized controllers with efficient state abstractions and reinforcement learning. This approach proves to be sample-efficient and facilitates effective task transfer and generalization.
[AI-12] Complex LLM Planning via Automated Heuristics Discovery
链接: https://arxiv.org/abs/2502.19295
作者: Hongyi Ling,Shubham Parashar,Sambhav Khurana,Blake Olson,Anwesha Basu,Gaurangi Sinha,Zhengzhong Tu,James Caverlee,Shuiwang Ji
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We consider enhancing large language models (LLMs) for complex planning tasks. While existing methods allow LLMs to explore intermediate steps to make plans, they either depend on unreliable self-verification or external verifiers to evaluate these steps, which demand significant data and computations. Here, we propose automated heuristics discovery (AutoHD), a novel approach that enables LLMs to explicitly generate heuristic functions to guide inference-time search, allowing accurate evaluation of intermediate states. These heuristic functions are further refined through a heuristic evolution process, improving their robustness and effectiveness. Our proposed method requires no additional model training or fine-tuning, and the explicit definition of heuristic functions generated by the LLMs provides interpretability and insights into the reasoning process. Extensive experiments across diverse benchmarks demonstrate significant gains over multiple baselines, including nearly twice the accuracy on some datasets, establishing our approach as a reliable and interpretable solution for complex planning tasks.
[AI-13] Multiview graph dual-attention deep learning and contrastive learning for multi-criteria recommender systems
链接: https://arxiv.org/abs/2502.19271
作者: Saman Forouzandeh,Pavel N. Krivitsky,Rohitash Chandra
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Recommender systems leveraging deep learning models have been crucial for assisting users in selecting items aligned with their preferences and interests. However, a significant challenge persists in single-criteria recommender systems, which often overlook the diverse attributes of items that have been addressed by Multi-Criteria Recommender Systems (MCRS). Prior MCRS approaches share an embedding vector across multi-criteria item ratings but have struggled to capture the nuanced relationships between users and items based on specific criteria. In this study, we present a novel representation for MCRS based on a multi-edge bipartite graph, where each edge represents one criterion rating of items by users, and Multiview Dual Graph Attention Networks (MDGAT). Employing MDGAT is beneficial and important for adequately considering all relations between users and items, given the presence of both local (criterion-based) and global (multi-criteria) relations. Additionally, we define anchor points in each view based on similarity and employ local and global contrastive learning to distinguish between positive and negative samples across each view and the entire graph. We evaluate our method on two real-world datasets and assess its performance based on item rating predictions. The results demonstrate that our method achieves higher accuracy compared to the baseline methods for predicting item ratings on the same datasets. MDGAT effectively captures the local and global impact of neighbours and the similarity between nodes.
[AI-14] Poster: Long PHP webshell files detection based on sliding window attention NDSS2025
链接: https://arxiv.org/abs/2502.19257
作者: Zhiqiang Wang,Haoyu Wang,Lu Hao
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 3 pages (including 1-page poster), 1 figure. Accepted as a poster at NDSS 2025. Poster list: this http URL. Dataset/code available at this http URL
点击查看摘要
Abstract:Webshell is a type of backdoor, and web applications are widely exposed to webshell injection attacks. Therefore, it is important to study webshell detection techniques. In this study, we propose a webshell detection method. We first convert PHP source code to opcodes and then extract Opcode Double-Tuples (ODTs). Next, we combine CodeBert and FastText models for feature representation and classification. To address the challenge that deep learning methods have difficulty detecting long webshell files, we introduce a sliding window attention mechanism. This approach effectively captures malicious behavior within long files. Experimental results show that our method reaches high accuracy in webshell detection, solving the problem of traditional methods that struggle to address new webshell variants and anti-detection techniques.
[AI-15] Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
链接: https://arxiv.org/abs/2502.19255
作者: Jiawei Huang,Bingcong Li,Christoph Dann,Niao He
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 35 Pages
点击查看摘要
Abstract:Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property of the KL-regularized RLHF objective: a policy's ability to cover the optimal policy is captured by its sub-optimality. Building on this insight, we propose a theoretical transfer learning algorithm with provable benefits compared to standard online learning. Our approach achieves low regret in the early stage by quickly adapting to the best available source reward models without prior knowledge of their quality, and over time, it attains an $\tilde{O}(\sqrt{T})$ regret bound independent of structural complexity measures. Inspired by our theoretical findings, we develop an empirical algorithm with improved computational efficiency, and demonstrate its effectiveness empirically in summarization tasks.
[AI-16] GraphBridge: Towards Arbitrary Transfer Learning in GNNs ICLR2025
链接: https://arxiv.org/abs/2502.19252
作者: Li Ju,Xingyi Yang,Qi Li,Xinchao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures, 6 tables, to be published in ICLR 2025
点击查看摘要
Abstract:Graph neural networks (GNNs) are conventionally trained on a per-domain, per-task basis. It creates a significant barrier in transferring the acquired knowledge to different, heterogeneous data setups. This paper introduces GraphBridge, a novel framework to enable knowledge transfer across disparate tasks and domains in GNNs, circumventing the need for modifications to task configurations or graph structures. Specifically, GraphBridge allows for the augmentation of any pre-trained GNN with prediction heads and a bridging network that connects the input to the output layer. This architecture not only preserves the intrinsic knowledge of the original model but also supports outputs of arbitrary dimensions. To mitigate the negative transfer problem, GraphBridge merges the source model with a concurrently trained model, thereby reducing the source bias when applied to the target domain. Our method is thoroughly evaluated across diverse transfer learning scenarios, including Graph2Graph, Node2Node, Graph2Node, and graph2point-cloud. Empirical validation, conducted over 16 datasets representative of these scenarios, confirms the framework's capacity for task- and domain-agnostic transfer learning within graph-like data, marking a significant advancement in the field of GNNs.
[AI-17] Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms
链接: https://arxiv.org/abs/2502.19193
作者: Jinyu Cai,Yusei Ishimizu,Mingyue Zhang,Munan Li,Jialong Li,Kenji Tei
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: The manuscript has been submitted to IEEE Transactions on Computational Social Systems
点击查看摘要
Abstract:Social media platforms frequently impose restrictive policies to moderate user content, prompting the emergence of creative evasion language strategies. This paper presents a multi-agent framework based on Large Language Models (LLMs) to simulate the iterative evolution of language strategies under regulatory constraints. In this framework, participant agents, as social media users, continuously evolve their language expression, while supervisory agents emulate platform-level regulation by assessing policy violations. To achieve a more faithful simulation, we employ a dual design of language strategies (constraint and expression) to differentiate conflicting goals and utilize an LLM-driven GA (Genetic Algorithm) for the selection, mutation, and crossover of language strategies. The framework is evaluated using two distinct scenarios: an abstract password game and a realistic simulated illegal pet trade scenario. Experimental results demonstrate that as the number of dialogue rounds increases, both the number of uninterrupted dialogue turns and the accuracy of information transmission improve significantly. Furthermore, a user study with 40 participants validates the real-world relevance of the generated dialogues and strategies. Moreover, ablation studies validate the importance of the GA, emphasizing its contribution to long-term adaptability and improved overall results.
[AI-18] Provocations from the Humanities for Generative AI Research
链接: https://arxiv.org/abs/2502.19190
作者: Lauren Klein,Meredith Martin,André Brock,Maria Antoniak,Melanie Walsh,Jessica Marie Johnson,Lauren Tilton,David Mimno
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: working draft; final draft in preparation
点击查看摘要
Abstract:This paper presents a set of provocations for considering the uses, impact, and harms of generative AI from the perspective of humanities researchers. We provide a working definition of humanities research, summarize some of its most salient theories and methods, and apply these theories and methods to the current landscape of AI. Drawing from foundational work in critical data studies, along with relevant humanities scholarship, we elaborate eight claims with broad applicability to current conversations about generative AI: 1) Models make words, but people make meaning; 2) Generative AI requires an expanded definition of culture; 3) Generative AI can never be representative; 4) Bigger models are not always better models; 5) Not all training data is equivalent; 6) Openness is not an easy fix; 7) Limited access to compute enables corporate capture; and 8) AI universalism creates narrow human subjects. We conclude with a discussion of the importance of resisting the extraction of humanities research by computer science and related fields.
[AI-19] AutoML for Multi-Class Anomaly Compensation of Sensor Drift
链接: https://arxiv.org/abs/2502.19180
作者: Melanie Schaller,Mathis Kruse,Antonio Ortega,Marius Lindauer,Bodo Rosenhahn
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: To be published in Measurement Journal
点击查看摘要
Abstract:Addressing sensor drift is essential in industrial measurement systems, where precise data output is necessary for maintaining accuracy and reliability in monitoring processes, as it progressively degrades the performance of machine learning models over time. Our findings indicate that the standard cross-validation method used in existing model training overestimates performance by inadequately accounting for drift. This is primarily because typical cross-validation techniques allow data instances to appear in both training and testing sets, thereby distorting the accuracy of the predictive evaluation. As a result, these models are unable to precisely predict future drift effects, compromising their ability to generalize and adapt to evolving data conditions. This paper presents two solutions: (1) a novel sensor drift compensation learning paradigm for validating models, and (2) automated machine learning (AutoML) techniques to enhance classification performance and compensate sensor drift. By employing strategies such as data balancing, meta-learning, automated ensemble learning, hyperparameter optimization, feature selection, and boosting, our AutoML-DC (Drift Compensation) model significantly improves classification performance against sensor drift. AutoML-DC further adapts effectively to varying drift severities.
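The evaluation pitfall identified above is easy to reproduce: shuffled K-fold lets samples from the same drifted period land on both sides of the split, while a time-ordered split keeps future drift out of training. A minimal contrast with scikit-learn (the data loader is hypothetical):

```python
from sklearn.model_selection import KFold, TimeSeriesSplit

# Hypothetical loader returning features/labels sorted by acquisition time
X, y = load_sensor_data_sorted_by_time()

random_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # leaks drift
drift_cv = TimeSeriesSplit(n_splits=5)                       # train precedes test

for train_idx, test_idx in drift_cv.split(X):
    # Strictly causal evaluation: every training sample predates the test set
    assert train_idx.max() < test_idx.min()
```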
[AI-20] Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems AAAI2025
链接: https://arxiv.org/abs/2502.19145
作者: Pierre Peigne-Lefebvre,Mikolaj Kniejski,Filip Sondej,Matthieu David,Jason Hoelscher-Obermaier,Christian Schroeder de Witt,Esben Kran
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Accepted to AAAI 2025 Conference
点击查看摘要
Abstract:As AI agents are increasingly adopted to collaborate on complex objectives, ensuring the security of autonomous multi-agent systems becomes crucial. We develop simulations of agents collaborating on shared objectives to study these security risks and security trade-offs. We focus on scenarios where an attacker compromises one agent, using it to steer the entire system toward misaligned outcomes by corrupting other agents. In this context, we observe infectious malicious prompts - the multi-hop spreading of malicious instructions. To mitigate this risk, we evaluated several strategies: two “vaccination” approaches that insert false memories of safely handling malicious input into the agents’ memory stream, and two versions of a generic safety instruction strategy. While these defenses reduce the spread and fulfillment of malicious instructions in our experiments, they tend to decrease collaboration capability in the agent network. Our findings illustrate potential trade-off between security and collaborative efficiency in multi-agent systems, providing insights for designing more secure yet effective AI collaborations.
[AI-21] A Temporal Planning Framework for Multi-Agent Systems via LLM -Aided Knowledge Base Management
链接: https://arxiv.org/abs/2502.19135
作者: Enrico Saccon,Ahmet Tikna,Davide De Martini,Edoardo Lamon,Luigi Palopoli,Marco Roveri
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:This paper presents a novel framework, called PLANTOR (PLanning with Natural language for Task-Oriented Robots), that integrates Large Language Models (LLMs) with Prolog-based knowledge management and planning for multi-robot tasks. The system employs a two-phase generation of a robot-oriented knowledge base, ensuring reusability and compositional reasoning, as well as a three-step planning procedure that handles temporal dependencies, resource constraints, and parallel task execution via mixed-integer linear programming. The final plan is converted into a Behaviour Tree for direct use in ROS2. We tested the framework in multi-robot assembly tasks within a block world and an arch-building scenario. Results demonstrate that LLMs can produce accurate knowledge bases with modest human feedback, while Prolog guarantees formal correctness and explainability. This approach underscores the potential of LLM integration for advanced robotics tasks requiring flexible, scalable, and human-understandable planning.
[AI-22] Voting or Consensus? Decision-Making in Multi-Agent Debate
链接: https://arxiv.org/abs/2502.19130
作者: Lars Benedikt Kaesberg,Jonas Becker,Jan Philip Wahle,Terry Ruas,Bela Gipp
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Much of the success of multi-agent debates depends on carefully choosing the right parameters. Among them, the decision-making protocol stands out. Systematic comparison of decision protocols is difficult because studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making addresses the challenges of different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time (i.e., decision protocol) to analyze how different methods affect the collaboration between agents and test different protocols on knowledge (MMLU, MMLU-Pro, GPQA) and reasoning datasets (StrategyQA, MuSR, SQuAD 2.0). Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks over the other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
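Stripped of the debate rounds, the two protocol families compared above reduce to simple terminal decision rules. A sketch of the decision step only, not the paper's full pipeline:

```python
from collections import Counter

def majority_vote(answers: list) -> str:
    """Return the most common final answer among the agents."""
    return Counter(answers).most_common(1)[0][0]

def unanimity_consensus(answers: list):
    """Return the answer only if all agents agree; None signals another
    debate round is needed."""
    return answers[0] if len(set(answers)) == 1 else None
```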
[AI-23] Chemical knowledge-informed framework for privacy-aware retrosynthesis learning
链接: https://arxiv.org/abs/2502.19119
作者: Guikun Chen,Xu Zhang,Yi Yang,Wenguan Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Chemical reaction data is a pivotal asset, driving advances in competitive fields such as pharmaceuticals, materials science, and industrial chemistry. Its proprietary nature renders it sensitive, as it often includes confidential insights and competitive advantages organizations strive to protect. However, in contrast to this need for confidentiality, the current standard training paradigm for machine learning-based retrosynthesis gathers reaction data from multiple sources into one single edge to train prediction models. This paradigm poses considerable privacy risks as it necessitates broad data availability across organizational boundaries and frequent data transmission between entities, potentially exposing proprietary information to unauthorized access or interception during storage and transfer. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models. CKIF enables distributed training across multiple chemical organizations without compromising the confidentiality of proprietary reaction data. Instead of gathering raw reaction data, CKIF learns retrosynthesis models through iterative, chemical knowledge-informed aggregation of model parameters. In particular, the chemical properties of predicted reactants are leveraged to quantitatively assess the observable behaviors of individual models, which in turn determines the adaptive weights used for model aggregation. On a variety of reaction datasets, CKIF outperforms several strong baselines by a clear margin (e.g., ~20% performance improvement over FedAvg on USPTO-50K), showing its feasibility and superiority to stimulate further research on privacy-preserving retrosynthesis.
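The aggregation step at the heart of this paradigm can be sketched as score-weighted parameter averaging in the style of federated learning; abstracting the chemistry-informed quality assessment into plain float scores is an assumption made for illustration:

```python
import torch

def weighted_aggregate(state_dicts: list, scores: list) -> dict:
    """Average model parameters with adaptive weights derived from each
    site's quality score (e.g., from properties of predicted reactants)."""
    w = torch.tensor(scores, dtype=torch.float64)
    w = w / w.sum()                       # normalize adaptive weights
    agg = {}
    for key in state_dicts[0]:
        agg[key] = sum(wi * sd[key].double() for wi, sd in zip(w, state_dicts))
    return agg

# global_params = weighted_aggregate([site1.state_dict(), site2.state_dict()],
#                                    scores=[0.8, 0.6])  # hypothetical scores
```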
[AI-24] he Shady Light of Art Automation
链接: https://arxiv.org/abs/2502.19107
作者: Dejan Grba
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Accepted to ISEA 2025
点击查看摘要
Abstract:Generative artificial intelligence (generative AI) has entered the mainstream culture and become a subject of extensive academic investigation. However, the character and background of its impact on art require subtler scrutiny and more nuanced contextualization. This paper summarizes a broader study of the roles that AI’s conceptual and ideological substrata play in influencing art notions. The focus is on divergent but coalescing and often questionable ideas, values, and political views that generative AI and other art-related AI technologies propagate from the computer science and AI/tech industry to the contemporary art and culture. The paper maps the main areas of this complex relationship and concisely critiques their key aspects.
[AI-25] XSS Adversarial Attacks Based on Deep Reinforcement Learning: A Replication and Extension Study
链接: https://arxiv.org/abs/2502.19095
作者: Samuele Pasini,Gianluca Maragliano,Jinhan Kim,Paolo Tonella
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Cross-site scripting (XSS) poses a significant threat to web application security. While Deep Learning (DL) has shown remarkable success in detecting XSS attacks, it remains vulnerable to adversarial attacks due to the discontinuous nature of its input-output mapping. These adversarial attacks employ mutation-based strategies for different components of XSS attack vectors, allowing adversarial agents to iteratively select mutations to evade detection. Our work replicates a state-of-the-art XSS adversarial attack, highlighting threats to validity in the reference work and extending it toward a more effective evaluation strategy. Moreover, we introduce an XSS Oracle to mitigate these threats. The experimental results show that our approach achieves an escape rate above 96% when the threats to validity of the replicated technique are addressed.
[AI-26] Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation
链接: https://arxiv.org/abs/2502.19091
作者: Humza Sami,Mubashir ul Islam,Samy Charas,Asav Gandhi,Pierre-Emmanuel Gaillardon,Valerio Tenace
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Recent advancements in Large Language Models (LLMs) have substantially evolved Multi-Agent Systems (MASs) capabilities, enabling systems that not only automate tasks but also leverage near-human reasoning capabilities. To achieve this, LLM-based MASs need to be built around two critical principles: (i) a robust architecture that fully exploits LLM potential for specific tasks – or related task sets – and (ii) an effective methodology for equipping LLMs with the necessary capabilities to perform tasks and manage information efficiently. It goes without saying that a priori architectural designs can limit the scalability and domain adaptability of a given MAS. To address these challenges, in this paper we introduce Nexus: a lightweight Python framework designed to easily build and manage LLM-based MASs. Nexus introduces the following innovations: (i) a flexible multi-supervisor hierarchy, (ii) a simplified workflow design, and (iii) easy installation and open-source flexibility: Nexus can be installed via pip and is distributed under a permissive open-source license, allowing users to freely modify and extend its capabilities. Experimental results demonstrate that architectures built with Nexus exhibit state-of-the-art performance across diverse domains. In coding tasks, Nexus-driven MASs achieve a 99% pass rate on HumanEval and a flawless 100% on VerilogEval-Human, outperforming cutting-edge reasoning language models such as o3-mini and DeepSeek-R1. Moreover, these architectures display robust proficiency in complex reasoning and mathematical problem solving, achieving correct solutions for all randomly selected problems from the MATH dataset. In the realm of multi-objective optimization, Nexus-based architectures successfully address challenging timing closure tasks on designs from the VTR benchmark suite, while guaranteeing, on average, a power saving of nearly 30%.
[AI-27] Dealing with Inconsistency for Reasoning over Knowledge Graphs: A Survey
链接: https://arxiv.org/abs/2502.19023
作者: Anastasios Nentidis,Charilaos Akasiadis,Angelos Charalambidis,Alexander Artikis
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In Knowledge Graphs (KGs), where the schema of the data is usually defined by particular ontologies, reasoning is a necessity to perform a range of tasks, such as retrieval of information, question answering, and the derivation of new knowledge. However, information to populate KGs is often extracted (semi-) automatically from natural language resources, or by integrating datasets that follow different semantic schemas, resulting in KG inconsistency. This, however, hinders the process of reasoning. In this survey, we focus on how to perform reasoning on inconsistent KGs, by analyzing the state of the art towards three complementary directions: a) the detection of the parts of the KG that cause the inconsistency, b) the fixing of an inconsistent KG to render it consistent, and c) the inconsistency-tolerant reasoning. We discuss existing work from a range of relevant fields focusing on how, and in which cases they are related to the above directions. We also highlight persisting challenges and future directions.
[AI-28] Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning ICLR2025
链接: https://arxiv.org/abs/2502.19009
作者: Jaehyeon Son,Soochan Lee,Gunhee Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025
点击查看摘要
Abstract:Recent studies have shown that Transformers can perform in-context reinforcement learning (RL) by imitating existing RL algorithms, enabling sample-efficient adaptation to unseen tasks without parameter updates. However, these models also inherit the suboptimal behaviors of the RL algorithms they imitate. This issue primarily arises due to the gradual update rule employed by those algorithms. Model-based planning offers a promising solution to this limitation by allowing the models to simulate potential outcomes before taking action, providing an additional mechanism to deviate from the suboptimal behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework where Transformers simultaneously learn environment dynamics and improve policy in-context. We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods.
[AI-29] A Multi-Agent DRL-Based Framework for Optimal Resource Allocation and Twin Migration in the Multi-Tier Vehicular Metaverse
链接: https://arxiv.org/abs/2502.19004
作者: Nahom Abishu Hayla,A. Mohammed Seid,Aiman Erbad,Tilahun M. Getu,Ala Al-Fuqaha,Mohsen Guizani
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注: 15 pages, 16 figures
点击查看摘要
Abstract:Although the multi-tier vehicular Metaverse promises to transform vehicles into essential nodes – within an interconnected digital ecosystem – using efficient resource allocation and seamless vehicular twin (VT) migration, this can hardly be achieved by the existing techniques operating in a highly dynamic vehicular environment, since they can hardly balance multi-objective optimization problems such as latency reduction, resource utilization, and user experience (UX). To address these challenges, we introduce a novel multi-tier resource allocation and VT migration framework that integrates Graph Convolutional Networks (GCNs), a hierarchical Stackelberg game-based incentive mechanism, and Multi-Agent Deep Reinforcement Learning (MADRL). The GCN-based model captures both spatial and temporal dependencies within the vehicular network; the Stackelberg game-based incentive mechanism fosters cooperation between vehicles and infrastructure; and the MADRL algorithm jointly optimizes resource allocation and VT migration in real time. By modeling this dynamic and multi-tier vehicular Metaverse as a Markov Decision Process (MDP), we develop a MADRL-based algorithm dubbed the Multi-Objective Multi-Agent Deep Deterministic Policy Gradient (MO-MADDPG), which can effectively balance the various conflicting objectives. Extensive simulations validate the effectiveness of this algorithm that is demonstrated to enhance scalability, reliability, and efficiency while considerably improving latency, resource utilization, migration cost, and overall UX by 12.8%, 9.7%, 14.2%, and 16.1%, respectively.
[AI-30] he Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
链接: https://arxiv.org/abs/2502.19002
作者: Jinbo Wang,Mingze Wang,Zhanpeng Zhou,Junchi Yan,Weinan E,Lei Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 23 pages
点击查看摘要
Abstract:Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly 2× speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B and datasets of OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined 2× speedup and 2× memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
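In practice, a blockwise LR can be wired into AdamW through parameter groups. A sketch under assumed block names and placeholder scale factors (the paper derives its scales from measured sharpness; the values below are not theirs):

```python
import torch

def blockwise_adamw(model: torch.nn.Module, base_lr: float = 3e-4):
    """Assign per-block learning rates via AdamW parameter groups."""
    scales = {"embed": 1.0, "attn": 0.5, "mlp": 0.5, "norm": 2.0}  # assumed
    groups, rest = [], []
    for name, p in model.named_parameters():
        for block, s in scales.items():
            if block in name:                       # crude name-based routing
                groups.append({"params": [p], "lr": base_lr * s})
                break
        else:
            rest.append(p)                          # unmatched params
    groups.append({"params": rest, "lr": base_lr})
    return torch.optim.AdamW(groups)
```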
[AI-31] DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
链接: https://arxiv.org/abs/2502.18952
作者: Lei Zhao,Sizhou Chen,Linfeng Feng,Xiao-Lei Zhang,Xuelong Li
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Text-to-audio (TTA), which generates audio signals from textual descriptions, has received huge attention in recent years. However, recent works focused on text to monaural audio only. As we know, spatial audio provides a more immersive auditory experience than monaural audio, e.g. in virtual reality. To address this issue, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) for extracting the latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model from the latent acoustic representations and text features for the spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. Particularly, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features. One is the Mel spectrograms, which are good for improving the synthesis quality, and the other is the short-time Fourier transform spectrograms, which are good at improving the azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts, for the training of the VAEs and diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
[AI-32] SLAM in the Dark: Self-Supervised Learning of Pose, Depth and Loop-Closure from Thermal Images
链接: https://arxiv.org/abs/2502.18932
作者: Yangfan Xu,Qu Hao,Lilian Zhang,Jun Mao,Xiaofeng He,Wenqi Wu,Changhao Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Visual SLAM is essential for mobile robots, drone navigation, and VR/AR, but traditional RGB camera systems struggle in low-light conditions, driving interest in thermal SLAM, which excels in such environments. However, thermal imaging faces challenges like low contrast, high noise, and limited large-scale annotated datasets, restricting the use of deep learning in outdoor scenarios. We present DarkSLAM, a novel deep learning-based monocular thermal SLAM system designed for large-scale localization and reconstruction in complex lighting conditions. Our approach incorporates the Efficient Channel Attention (ECA) mechanism in visual odometry and the Selective Kernel Attention (SKA) mechanism in depth estimation to enhance pose accuracy and mitigate thermal depth degradation. Additionally, the system includes thermal depth-based loop closure detection and pose optimization, ensuring robust performance in low-texture thermal scenes. Extensive outdoor experiments demonstrate that DarkSLAM significantly outperforms existing methods like SC-Sfm-Learner and Shin et al., delivering precise localization and 3D dense mapping even in challenging nighttime environments.
[AI-33] Talking like Piping and Instrumentation Diagrams (PIDs)
链接: https://arxiv.org/abs/2502.18928
作者: Achmad Anggawirya Alimin,Dominik P. Goldstein,Lukas Schulze Balhorn,Artur M. Schweidtmann
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We propose a methodology that allows communication with Piping and Instrumentation Diagrams (PIDs) using natural language. In particular, we represent PIDs through the DEXPI data model as labeled property graphs and integrate them with Large Language Models (LLMs). The approach consists of three main parts: 1) casting PIDs into a graph representation from the DEXPI format using our pyDEXPI Python package; 2) a tool for generating PID knowledge graphs from pyDEXPI; and 3) integration of the PID knowledge graph into LLMs using graph-based retrieval augmented generation (graph-RAG). This approach allows users to communicate with PIDs using natural language. It extends the LLM’s ability to retrieve contextual data from PIDs and mitigates hallucinations. Leveraging the LLM’s large corpus, the model is also able to interpret process information in PIDs, which could help engineers in their daily tasks. In the future, this work will also open up opportunities in the context of other generative Artificial Intelligence (genAI) solutions on PIDs, and AI-assisted HAZOP studies.
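A toy version of the graph-RAG step might look as follows: facts about matching nodes and their edges are serialized from a labeled property graph and prepended to the prompt. The node schema and retrieval rule are assumptions, not the pyDEXPI data model.

```python
import networkx as nx

# A minimal graph-RAG sketch over a labeled property graph, assuming a
# toy P&ID; node attributes and the retrieval rule are illustrative,
# not the pyDEXPI schema.
g = nx.DiGraph()
g.add_node("P-101", type="pump", medium="water")
g.add_node("V-12", type="valve", state="normally-closed")
g.add_edge("P-101", "V-12", relation="feeds")

def retrieve_context(graph, query_terms):
    """Return serialized facts about nodes matching the query."""
    facts = []
    for node, attrs in graph.nodes(data=True):
        if any(t.lower() in str(attrs).lower() or t == node for t in query_terms):
            facts.append(f"{node}: {attrs}")
            for _, dst, e in graph.out_edges(node, data=True):
                facts.append(f"{node} -{e['relation']}-> {dst}")
    return "\n".join(facts)

context = retrieve_context(g, ["pump"])
prompt = f"Answer using only these P&ID facts:\n{context}\n\nQ: What does the pump feed?"
# response = llm.generate(prompt)  # hypothetical LLM call
print(prompt)
```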
[AI-34] BeamVQ: Beam Search with Vector Quantization to Mitigate Data Scarcity in Physical Spatiotemporal Forecasting
链接: https://arxiv.org/abs/2502.18925
作者: Weiyan Wang,Xingjian Shi,Ruiqi Shu,Yuan Gao,Rui Ray Chen,Kun Wang,Fan Xu,Jinbao Xue,Shuaipeng Li,Yangyu Tao,Di Wang,Hao Wu,Xiaomeng Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In practice, physical spatiotemporal forecasting can suffer from data scarcity, because collecting large-scale data is non-trivial, especially for extreme events. Hence, we propose BeamVQ, a novel probabilistic framework to realize iterative self-training with new self-ensemble strategies, achieving better physical consistency and generalization on extreme events. Following any base forecasting model, we can encode its deterministic outputs into a latent space and retrieve multiple codebook entries to generate probabilistic outputs. Then BeamVQ extends the beam search from discrete spaces to the continuous state spaces in this field. We can further employ domain-specific metrics (e.g., Critical Success Index for extreme events) to filter out the top-k candidates and develop the new self-ensemble strategy by combining the high-quality candidates. The self-ensemble can not only improve the inference quality and robustness but also iteratively augment the training datasets during continuous self-training. Consequently, BeamVQ realizes the exploration of rare but critical phenomena beyond the original dataset. Comprehensive experiments on different benchmarks and backbones show that BeamVQ consistently reduces forecasting MSE (up to 39%), enhancing extreme event detection and proving its effectiveness in handling data scarcity.
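To make the retrieve-filter-ensemble loop concrete, here is a minimal numpy sketch: the deterministic latent retrieves its nearest codebook entries, a domain metric scores them, and the top candidates are averaged. The codebook, scorer and beam width are toy assumptions, not the BeamVQ implementation.

```python
import numpy as np

# A minimal sketch of beam search over codebook entries, filtered by a
# domain metric; everything here is a toy stand-in for the real system.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))      # 64 latent entries of dim 16

def nearest_entries(latent, k=8):
    d = np.linalg.norm(codebook - latent, axis=1)
    return codebook[np.argsort(d)[:k]]    # k candidate latents

def critical_success_index(candidate):
    return float(candidate.max())         # stand-in for a real CSI metric

latent = rng.normal(size=16)              # deterministic encoder output
candidates = nearest_entries(latent, k=8)
scores = [critical_success_index(c) for c in candidates]
top_k = candidates[np.argsort(scores)[-3:]]
ensemble = top_k.mean(axis=0)             # self-ensemble of best candidates
print(ensemble.shape)
```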
[AI-35] Dynamic Classification: Leveraging Self-Supervised Classification to Enhance Prediction Performance
链接: https://arxiv.org/abs/2502.18891
作者: Ziyuan Zhong,Junyang Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 6 figures
点击查看摘要
Abstract:In this paper, we propose an innovative dynamic classification algorithm designed to achieve the objective of zero missed detections and minimal false positives. The algorithm partitions the data into N equivalent training subsets and N prediction subsets using a supervised model, followed by independent predictions from N separate predictive models. This enables each predictive model to operate within a smaller data range, thereby improving overall accuracy. Additionally, the algorithm leverages data generated through supervised learning to further refine prediction results, filtering out predictions that do not meet accuracy requirements without the need to introduce additional models. Experimental results demonstrate that, when data partitioning errors are minimal, the dynamic classification algorithm achieves exceptional performance with zero missed detections and minimal false positives, significantly outperforming existing model ensembles. Even in cases where classification errors are larger, the algorithm remains comparable to state-of-the-art models. The key innovations of this study include self-supervised classification learning, the use of small-range subset predictions, and the direct rejection of substandard predictions. While the current algorithm still has room for improvement in terms of automatic parameter tuning and classification model efficiency, it has demonstrated outstanding performance across multiple datasets. Future research will focus on optimizing the classification component to further enhance the algorithm’s robustness and adaptability.
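A minimal sketch of the partition-then-predict idea follows: the data is split into N subsets, one model is fit per subset, test points are routed to their subset's model, and low-confidence predictions are rejected. The KMeans router and the 0.8 acceptance bar are assumptions; the paper partitions with a supervised model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data; in practice the subsets come from a learned partitioner.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 5)), rng.integers(0, 2, 300)

N = 3
router = KMeans(n_clusters=N, n_init=10).fit(X)
models = []
for i in range(N):
    mask = router.labels_ == i
    models.append(LogisticRegression().fit(X[mask], y[mask]))

X_test = rng.normal(size=(10, 5))
routes = router.predict(X_test)
proba = np.array([models[r].predict_proba(x[None])[0] for r, x in zip(routes, X_test)])
accepted = proba.max(axis=1) >= 0.8   # reject predictions below the bar
print(accepted)
```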
[AI-36] A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops ICLR2025
链接: https://arxiv.org/abs/2502.18865
作者: Shi Fu,Yingjie Wang,Yuzhu Chen,Xinmei Tian,Dacheng Tao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ICLR 2025
点击查看摘要
Abstract:High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). However, the empirical results have been strikingly inconsistent: some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding to explain this discrepancy. This paper introduces the intriguing notion of recursive stability and presents the first theoretical generalization analysis, revealing how both model architecture and the proportion between real and synthetic data influence the success of STLs. We further extend this analysis to transformers in in-context learning, showing that even a constant-sized proportion of real data ensures convergence, while also providing insights into optimal synthetic data sizing.
[AI-37] Investigating Generalization of One-shot LLM Steering Vectors
链接: https://arxiv.org/abs/2502.18862
作者: Jacob Dunefsky,Arman Cohan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 7 figures. Code is available at this https URL
点击查看摘要
Abstract:Steering vectors have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing steering vectors through gradient descent on a single training example, and systematically investigate how these vectors generalize. We consider several steering optimization techniques, including multiple novel ones, and find that the resulting vectors effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot steering vectors that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized steering vectors can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, to quantitatively assess steering effectiveness in instruction-tuned models, we develop a novel evaluation framework using sequence probabilities from the corresponding base model. With this framework, we analyze how steering vectors modulate an instruction-tuned LLM’s ability to recover from outputting false information, and find that this ability derives from the base model. Overall, our findings suggest that optimizing steering vectors on a single example can mediate misaligned behavior in LLMs, and provide a path toward better understanding the relationship between LLM behavior and activation space structure.
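The core optimization is simple to sketch: a single vector added at one residual layer, trained by gradient descent on one example. The model API below is hypothetical and stands in for a real LLM with hooks.

```python
import torch

# A minimal sketch of optimizing a steering vector on one example by
# gradient descent; model, layer index and loss target are assumptions
# standing in for a real LLM and behavior objective.
hidden_dim, layer = 768, 6
steer = torch.zeros(hidden_dim, requires_grad=True)
opt = torch.optim.Adam([steer], lr=1e-2)

def forward_with_steering(model, ids, vec):
    h = model.embed(ids)                      # hypothetical model API
    for i, block in enumerate(model.blocks):
        h = block(h)
        if i == layer:
            h = h + vec                       # inject the steering vector
    return model.head(h)

# Training loop, assuming a target token encodes the desired behavior:
# for step in range(200):
#     logits = forward_with_steering(model, input_ids, steer)
#     loss = -torch.log_softmax(logits[0, -1], -1)[target_token]
#     opt.zero_grad(); loss.backward(); opt.step()
```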
[AI-38] Intelligence Test
链接: https://arxiv.org/abs/2502.18858
作者: Jingtao Zhan,Jiahao Zhao,Jiayu Li,Yiqun Liu,Bo Zhang,Qingyao Ai,Jiaxin Mao,Hongning Wang,Min Zhang,Shaoping Ma
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:How does intelligence emerge? We propose that intelligence is not a sudden gift or random occurrence, but rather a necessary trait for species to survive through Natural Selection. If a species passes the test of Natural Selection, it demonstrates the intelligence to survive in nature. Extending this perspective, we introduce Intelligence Test, a method to quantify the intelligence of any subject on any task. Like how species evolve by trial and error, Intelligence Test quantifies intelligence by the number of failed attempts before success. Fewer failures correspond to higher intelligence. When the expectation and variance of failure counts are both finite, it signals the achievement of an autonomous level of intelligence. Using Intelligence Test, we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve a level of autonomy in simple tasks, they are still far from autonomous in more complex tasks, such as vision, search, recommendation, and language. While scaling model size might help, this would come at an astronomical cost. Projections suggest that achieving general autonomy would require an unimaginable 10^26 parameters. Even if Moore’s Law continues to hold, reaching such a parameter scale would take 70 years. This staggering cost highlights the complexity of human tasks and the inadequacies of current AI. To further understand this phenomenon, we conduct a theoretical analysis. Our simulations suggest that human tasks possess a criticality property. As a result, autonomy requires a deep understanding of the task’s underlying mechanisms. Current AI, however, does not fully grasp these mechanisms and instead relies on superficial mimicry, making it difficult to reach an autonomous level. We believe Intelligence Test can not only guide the future development of AI but also offer profound insights into the intelligence of humans ourselves.
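The failure-counting measure is easy to simulate. The sketch below draws failure counts from a toy Bernoulli task and checks that their mean and variance are finite; the simulator is an assumption standing in for real task evaluations.

```python
import numpy as np

# A minimal sketch of the failure-counting idea: quantify intelligence
# by trials-before-success. The Bernoulli task is a toy assumption.
rng = np.random.default_rng(0)

def failures_before_success(p_success):
    fails = 0
    while rng.random() > p_success:
        fails += 1
    return fails

counts = [failures_before_success(0.2) for _ in range(1000)]
mean, var = np.mean(counts), np.var(counts)
# Finite mean and variance of failure counts signal autonomy on the task.
print(f"E[failures]={mean:.2f}, Var[failures]={var:.2f}")
```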
[AI-39] Reimagining Personal Data: Unlocking the Potential of AI-Generated Images in Personal Data Meaning-Making
链接: https://arxiv.org/abs/2502.18853
作者: Soobin Park,Hankyung Kim,Youn-kyung Lim
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 21 pages excluding reference and appendix. Accepted at ACM CHI 2025
点击查看摘要
Abstract:Image-generative AI provides new opportunities to transform personal data into alternative visual forms. In this paper, we illustrate the potential of AI-generated images in facilitating meaningful engagement with personal data. In a formative autobiographical design study, we explored the design and use of AI-generated images derived from personal data. Informed by this study, we designed a web-based application as a probe that represents personal data through generative images utilizing OpenAI’s GPT-4 model and DALL-E 3. We then conducted a 21-day diary study and interviews using the probe with 16 participants to investigate users’ in-depth experiences with images generated by AI in their everyday lives. Our findings reveal new qualities of experiences in users’ engagement with data, highlighting how participants constructed personal meaning from their data through imagination and speculation on AI-generated images. We conclude by discussing the potential and concerns of leveraging image-generative AI for personal data meaning-making.
[AI-40] Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code
链接: https://arxiv.org/abs/2502.18851
作者: Jungin Kim,Shinwoo Park,Yo-Sub Han
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Code watermarking identifies AI-generated code by embedding patterns into the code during generation. Effective watermarking requires meeting two key conditions: the watermark should be reliably detectable, and the code should retain its original functionality. However, existing methods often modify tokens that are critical for program logic, such as keywords in conditional expressions or operators in arithmetic computations. These modifications can cause syntax errors or functional failures, limiting the practical use of watermarking. We present STONE, a method that preserves functional integrity by selectively inserting watermarks only into non-syntax tokens. By excluding tokens essential for code execution, STONE minimizes the risk of functional degradation. In addition, we introduce CWEM, a comprehensive metric that evaluates watermarking techniques based on correctness, detectability, and naturalness. While correctness and detectability have been widely used, naturalness remains underexplored despite its importance. Unnatural patterns can reveal the presence of a watermark, making it easier for adversaries to remove it. We evaluate STONE using CWEM and compare its performance with the state-of-the-art approach. The results show that STONE achieves an average improvement of 7.69% in CWEM across Python, C++, and Java. Our code is available in this https URL.
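Detection restricted to non-syntax tokens can be sketched with Python's own tokenizer: only identifier tokens that are not keywords are tested against a keyed green-list. The hash-based rule is a generic watermarking scheme used for illustration, not STONE's exact method.

```python
import hashlib
import io
import keyword
import tokenize

# A minimal sketch of watermark detection restricted to non-syntax
# tokens; the keyed green-list rule is an illustrative assumption.
def is_green(tok: str, key: str = "secret") -> bool:
    return hashlib.sha256((key + tok).encode()).digest()[0] % 2 == 0

def green_fraction(code: str) -> float:
    hits, total = 0, 0
    for t in tokenize.generate_tokens(io.StringIO(code).readline):
        # Skip keywords and operators: only non-syntax tokens carry the mark.
        if t.type == tokenize.NAME and not keyword.iskeyword(t.string):
            total += 1
            hits += is_green(t.string)
    return hits / max(total, 1)

sample = "def area(width, height):\n    return width * height\n"
# A fraction well above 0.5 over many tokens suggests a watermark.
print(green_fraction(sample))
```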
[AI-41] REALM-Bench: A Real-World Planning Benchmark for LLM s and Multi-Agent Systems
链接: https://arxiv.org/abs/2502.18836
作者: Longling Geng,Edward Y. Chang
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 4 figures, 9 tables
点击查看摘要
Abstract:This benchmark suite provides a comprehensive evaluation framework for assessing both individual LLMs and multi-agent systems in real-world planning scenarios. The suite encompasses eleven designed problems that progress from basic to highly complex, incorporating key aspects such as multi-agent coordination, inter-agent dependencies, and dynamic environmental disruptions. Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter-dependencies, and the frequency of unexpected disruptions requiring real-time adaptation. The benchmark includes detailed specifications, evaluation metrics, and baseline implementations using contemporary frameworks like LangGraph, enabling rigorous testing of both single-agent and multi-agent planning capabilities. Through standardized evaluation criteria and scalable complexity, this benchmark aims to drive progress in developing more robust and adaptable AI planning systems for real-world applications.
[AI-42] Data-Efficient Multi-Agent Spatial Planning with LLM s
链接: https://arxiv.org/abs/2502.18822
作者: Huangyuan Su,Aaron Walsman,Daniel Garces,Sham Kakade,Stephanie Gil
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:In this project, our goal is to determine how to leverage the world-knowledge of pretrained large language models for efficient and robust learning in multiagent decision making. We examine this in a taxi routing and assignment problem where agents must decide how to best pick up passengers in order to minimize overall waiting time. While this problem is situated on a graphical road network, we show that with proper prompting, zero-shot performance is quite strong on this task. Furthermore, with limited fine-tuning along with the one-at-a-time rollout algorithm for lookahead, LLMs can out-compete existing approaches with 50 times fewer environmental interactions. We also explore the benefits of various linguistic prompting approaches and show that including certain easy-to-compute information in the prompt significantly improves performance. Finally, we highlight the LLM’s built-in semantic understanding, showing its ability to adapt to environmental factors through simple prompts.
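A minimal sketch of one-at-a-time rollout on a toy instance is shown below: each taxi commits its choice in turn, scoring candidates by simulating a greedy base policy for the remaining taxis. Positions and the base heuristic are toy assumptions.

```python
# One-at-a-time rollout sketch for taxi assignment; toy coordinates.
taxis = {"t1": (0, 0), "t2": (5, 5)}
passengers = {"p1": (1, 1), "p2": (6, 4)}

def dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def base_cost(assignment, remaining_taxis, remaining_passengers):
    # Base policy: greedily match what's left, summing wait distances.
    cost, free = 0, dict(remaining_passengers)
    for t, pos in remaining_taxis.items():
        if not free:
            break
        p = min(free, key=lambda q: dist(pos, free[q]))
        cost += dist(pos, free.pop(p))
    return cost + sum(dist(taxis[t], passengers[p]) for t, p in assignment.items())

assignment, free_p = {}, dict(passengers)
for t in taxis:  # each agent decides in turn, with base-policy lookahead
    rest = {k: v for k, v in taxis.items() if k != t and k not in assignment}
    best = min(free_p, key=lambda p: base_cost(
        {**assignment, t: p}, rest, {k: v for k, v in free_p.items() if k != p}))
    assignment[t] = best
    free_p.pop(best)
print(assignment)
```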
[AI-43] BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction
链接: https://arxiv.org/abs/2502.18807
作者: Ruifeng Tan,Weixiang Hong,Jiayue Tang,Xibin Lu,Ruijun Ma,Xiang Zheng,Jia Li,Jiaqiang Huang,Tong-Yi Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
*备注: Under review
点击查看摘要
Abstract:Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested under a narrow range of diversity in labs, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear if models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering 2.4 times the sample size of the previous largest dataset, and provides the most diverse battery life resource with batteries from 8 formats, 80 chemical systems, 12 operating temperatures, and 646 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in a series of neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and CyclePatch consistently improves model performance, establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at this https URL.
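As a sketch of what a cycle-wise plug-in might look like, the module below turns each charge/discharge cycle into one patch token for a downstream sequence model. Shapes and the linear projection are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

# A minimal cycle-wise patching sketch in the spirit of CyclePatch:
# each charge/discharge cycle becomes one patch embedding.
class CyclePatchEmbed(nn.Module):
    def __init__(self, points_per_cycle=128, d_model=64):
        super().__init__()
        self.proj = nn.Linear(points_per_cycle, d_model)

    def forward(self, x):
        # x: (batch, n_cycles, points_per_cycle) degradation measurements
        return self.proj(x)  # (batch, n_cycles, d_model) tokens

batch = torch.randn(4, 200, 128)   # 4 cells, 200 cycles each
tokens = CyclePatchEmbed()(batch)  # feed into any sequence backbone
print(tokens.shape)
```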
[AI-44] NeuroTree: Hierarchical Functional Brain Pathway Decoding for Mental Health Disorders
链接: https://arxiv.org/abs/2502.18786
作者: Jun-En Ding,Dongsheng Luo,Anna Zilverstand,Feng Liu
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:Analyzing functional brain networks using functional magnetic resonance imaging (fMRI) is crucial for understanding psychiatric disorders and addictive behaviors. While existing fMRI-based graph convolutional networks (GCNs) show considerable promise for feature extraction, they often fall short in characterizing complex relationships between brain regions and demographic factors and accounting for interpretable variables linked to psychiatric conditions. We propose NeuroTree to overcome these limitations, integrating a k-hop AGE-GCN with neural ordinary differential equations (ODEs). This framework leverages an attention mechanism to optimize functional connectivity (FC), thereby enhancing dynamic FC feature learning for brain disease classification. Furthermore, NeuroTree effectively decodes fMRI network features into tree structures, which improves the capture of high-order brain regional pathway features and enables the identification of hierarchical neural behavioral patterns essential for understanding disease-related brain subnetworks. Our empirical evaluations demonstrate that NeuroTree achieves state-of-the-art performance across two distinct mental disorder datasets and provides valuable insights into age-related deterioration patterns. These findings underscore the model’s efficacy in predicting psychiatric disorders and elucidating their underlying neural mechanisms.
[AI-45] Research on Edge Computing and Cloud Collaborative Resource Scheduling Optimization Based on Deep Reinforcement Learning
链接: https://arxiv.org/abs/2502.18773
作者: Yuqing Wang,Xiao Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:This study addresses the challenge of resource scheduling optimization in edge-cloud collaborative computing using deep reinforcement learning (DRL). The proposed DRL-based approach improves task processing efficiency, reduces overall processing time, enhances resource utilization, and effectively controls task migrations. Experimental results demonstrate the superiority of DRL over traditional scheduling algorithms, particularly in managing complex task allocation, dynamic workloads, and multiple resource constraints. Despite its advantages, further improvements are needed to enhance learning efficiency, reduce training time, and address convergence issues. Future research should focus on increasing the algorithm’s fault tolerance to handle more complex and uncertain scheduling scenarios, thereby advancing the intelligence and efficiency of edge-cloud computing systems.
[AI-46] Online Prototypes and Class-Wise Hypergradients for Online Continual Learning with Pre-Trained Models
链接: https://arxiv.org/abs/2502.18762
作者: Nicolas Michel,Maorong Wang,Jiangpeng He,Toshihiko Yamasaki
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review
点击查看摘要
Abstract:Continual Learning (CL) addresses the problem of learning from a data sequence where the distribution changes over time. Recently, efficient solutions leveraging Pre-Trained Models (PTM) have been widely explored in the offline CL (offCL) scenario, where the data corresponding to each incremental task is known beforehand and can be seen multiple times. However, such solutions often rely on 1) prior knowledge regarding task changes and 2) hyper-parameter search, particularly regarding the learning rate. Both assumptions remain unavailable in online CL (onCL) scenarios, where incoming data distribution is unknown and the model can observe each datum only once. Therefore, existing offCL strategies fall largely behind performance-wise in onCL, with some proving difficult or impossible to adapt to the online scenario. In this paper, we tackle both problems by leveraging Online Prototypes (OP) and Class-Wise Hypergradients (CWH). OP leverages the stable output representations of the PTM, updating prototype values on the fly to act as replay samples without requiring task boundaries or storing past data. CWH learns class-dependent gradient coefficients during training to improve over sub-optimal learning rates. We show through experiments that both introduced strategies allow for a consistent gain in accuracy when integrated with existing approaches. We will make the code fully available upon acceptance.
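The online-prototype idea reduces to a running class mean over the frozen backbone's features, updated one sample at a time. A minimal sketch, assuming 512-dimensional features:

```python
import torch

# Running per-class mean of a frozen PTM's features, updated on the
# fly without task boundaries. Feature source and dims are assumptions.
protos, counts = {}, {}

def update_prototype(label: int, feat: torch.Tensor):
    if label not in protos:
        protos[label], counts[label] = feat.clone(), 1
    else:
        counts[label] += 1
        protos[label] += (feat - protos[label]) / counts[label]

for _ in range(5):                       # stream of (feature, label) pairs
    update_prototype(0, torch.randn(512))
# Prototypes can then serve as replay-free class anchors at inference.
print(protos[0].shape)
```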
[AI-47] Learning Autonomy: Off-Road Navigation Enhanced by Human Input
链接: https://arxiv.org/abs/2502.18760
作者: Akhil Nagariya,Dimitar Filev,Srikanth Saripalli,Gaurav Pandey
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the area of autonomous driving, navigating off-road terrains presents a unique set of challenges, from unpredictable surfaces like grass and dirt to unexpected obstacles such as bushes and puddles. In this work, we present a novel learning-based local planner that addresses these challenges by directly capturing human driving nuances from real-world demonstrations using only a monocular camera. The key features of our planner are its ability to navigate in challenging off-road environments with various terrain types and its fast learning capabilities. By utilizing minimal human demonstration data (5-10 mins), it quickly learns to navigate in a wide array of off-road conditions. The local planner significantly reduces the real world data required to learn human driving preferences. This allows the planner to apply learned behaviors to real-world scenarios without the need for manual fine-tuning, demonstrating quick adjustment and adaptability in off-road autonomous driving technology.
[AI-48] AgentSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms WWW’25
链接: https://arxiv.org/abs/2502.18754
作者: Yuwei Yan,Yu Shang,Qingbin Zeng,Yu Li,Keyu Zhao,Zhiheng Zheng,Xuefei Ning,Tianji Wu,Shengen Yan,Yu Wang,Fengli Xu,Yong Li
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 8 pages, 10 figures, in Proceedings of the ACM Web Conference 2025 (WWW '25)
点击查看摘要
Abstract:The AgentSociety Challenge is the first competition in the Web Conference that aims to explore the potential of Large Language Model (LLM) agents in modeling user behavior and enhancing recommender systems on web platforms. The Challenge consists of two tracks: the User Modeling Track and the Recommendation Track. Participants are tasked to utilize a combined dataset from Yelp, Amazon, and Goodreads, along with an interactive environment simulator, to develop innovative LLM agents. The Challenge has attracted 295 teams across the globe and received over 1,400 submissions in total over the course of 37 official competition days. The participants have achieved 21.9% and 20.3% performance improvement for Track 1 and Track 2 in the Development Phase, and 9.1% and 15.9% in the Final Phase, representing a significant accomplishment. This paper discusses the detailed designs of the Challenge, analyzes the outcomes, and highlights the most successful LLM agent designs. To support further research and development, we have open-sourced the benchmark environment at this https URL.
[AI-49] Intent Tagging: Exploring Micro-Prompting Interactions for Supporting Granular Human-GenAI Co-Creation Workflows
链接: https://arxiv.org/abs/2502.18737
作者: Frederic Gmeiner,Nicolai Marquardt,Michael Bentley,Hugo Romat,Michel Pahud,David Brown,Asta Roseway,Nikolas Martelaro,Kenneth Holstein,Ken Hinckley,Nathalie Riche
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 31 pages, 30 figures, 3 tables. To appear in the Proceedings of the 2025 ACM CHI Conference on Human Factors in Computing Systems, Yokohama, Japan
点击查看摘要
Abstract:Despite Generative AI (GenAI) systems’ potential for enhancing content creation, users often struggle to effectively integrate GenAI into their creative workflows. Core challenges include misalignment of AI-generated content with user intentions (intent elicitation and alignment), user uncertainty around how to best communicate their intents to the AI system (prompt formulation), and insufficient flexibility of AI systems to support diverse creative workflows (workflow flexibility). Motivated by these challenges, we created IntentTagger: a system for slide creation based on the notion of Intent Tags - small, atomic conceptual units that encapsulate user intent - for exploring granular and non-linear micro-prompting interactions for Human-GenAI co-creation workflows. Our user study with 12 participants provides insights into the value of flexibly expressing intent across varying levels of ambiguity, meta-intent elicitation, and the benefits and challenges of intent tag-driven workflows. We conclude by discussing the broader implications of our findings and design considerations for GenAI-supported content creation workflows.
[AI-50] AI-Instruments: Embodying Prompts as Instruments to Abstract & Reflect Graphical Interface Commands as General-Purpose Tools
链接: https://arxiv.org/abs/2502.18736
作者: Nathalie Riche,Anna Offenwanger,Frederic Gmeiner,David Brown,Hugo Romat,Michel Pahud,Nicolai Marquardt,Kori Inkpen,Ken Hinckley
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 18 pages, 10 figures. To appear in the Proceedings of the 2025 ACM CHI Conference on Human Factors in Computing Systems, Yokohama, Japan. this https URL
点击查看摘要
Abstract:Chat-based prompts respond with verbose linear-sequential texts, making it difficult to explore and refine ambiguous intents, back up and reinterpret, or shift directions in creative AI-assisted design work. AI-Instruments instead embody “prompts” as interface objects via three key principles: (1) Reification of user-intent as reusable direct-manipulation instruments; (2) Reflection of multiple interpretations of ambiguous user-intents (Reflection-in-intent) as well as the range of AI-model responses (Reflection-in-response) to inform design “moves” towards a desired result; and (3) Grounding to instantiate an instrument from an example, result, or extrapolation directly from another instrument. Further, AI-Instruments leverage LLMs to suggest, vary, and refine new instruments, enabling a system that goes beyond hard-coded functionality by generating its own instrumental controls from content. We demonstrate four technology probes, applied to image generation, and qualitative insights from twelve participants, showing how AI-Instruments address challenges of intent formulation, steering via direct manipulation, and non-linear iterative workflows to reflect and resolve ambiguous intents.
[AI-51] Cross-Modality Investigation on WESAD Stress Classification
链接: https://arxiv.org/abs/2502.18733
作者: Eric Oliver,Sagnik Dakshit
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deep learning’s growing prevalence has driven its widespread use in healthcare, where AI and sensor advancements enhance diagnosis, treatment, and monitoring. In mobile health, AI-powered tools enable early diagnosis and continuous monitoring of conditions like stress. Wearable technologies and multimodal physiological data have made stress detection increasingly viable, but model efficacy depends on data quality, quantity, and modality. This study develops transformer models for stress detection using the WESAD dataset, training on electrocardiogram (ECG), electrodermal activity (EDA), electromyography (EMG), respiration rate (RESP), temperature (TEMP), and 3-axis accelerometer (ACC) signals. The results demonstrate the effectiveness of single-modality transformers in analyzing physiological signals, achieving state-of-the-art performance with accuracy, precision and recall values in the range of 99.73% to 99.95% for stress detection. Furthermore, this study explores cross-modal performance and explains it using 2D visualization of the learned embedding space and quantitative analysis based on data variance. Despite the large body of work on stress detection and monitoring, the robustness and generalization of these models across different modalities have not been explored. This research represents one of the initial efforts to interpret embedding spaces for stress detection, providing valuable information on cross-modal performance.
[AI-52] Deep-Bench: Deep Learning Benchmark Dataset for Code Generation
链接: https://arxiv.org/abs/2502.18726
作者: Alireza Daghighfarsoodeh,Chung-Yu Wang,Hamed Taherkhani,Melika Sepidband,Mohammad Abdollahi,Hadi Hemmati,Hung Viet Pham
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep learning (DL) has revolutionized areas such as computer vision, natural language processing, and more. However, developing DL systems is challenging due to the complexity of DL workflows. Large Language Models (LLMs), such as GPT, Claude, Llama, Mistral, etc., have emerged as promising tools to assist in DL code generation, offering potential solutions to these challenges. Despite this, existing benchmarks such as DS-1000 are limited, as they primarily focus on small DL code snippets related to pre/post-processing tasks and lack a comprehensive coverage of the full DL pipeline, including different DL phases and input data types. To address this, we introduce DeepBench, a novel benchmark dataset designed for function-level DL code generation. DeepBench categorizes DL problems based on three key aspects: phases such as pre-processing, model construction, and training; tasks, including classification, regression, and recommendation; and input data types such as tabular, image, and text. GPT-4o – the state-of-the-art LLM – achieved 31% accuracy on DeepBench, significantly lower than its 60% on DS-1000. We observed similar difficulty for other LLMs (e.g., 28% vs. 54% for Claude, 21% vs. 41% for LLaMA, and 15% vs. 20% for Mistral). This result underscores DeepBench’s greater complexity. We also construct a taxonomy of issues and bugs found in LLM-generated DL code, which highlights the distinct challenges that LLMs face when generating DL code compared to general code. Furthermore, our analysis also reveals substantial performance variations across categories, with differences of up to 7% among phases and 37% among tasks. These disparities suggest that DeepBench offers valuable insights into the LLMs’ performance and areas for potential improvement in the DL domain.
[AI-53] TrajLLM: A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation WWW2025
链接: https://arxiv.org/abs/2502.18712
作者: Chenlu Ju,Jiaxin Liu,Shobhit Sinha,Hao Xue,Flora Salim
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: Accepted WWW2025 Demo Paper
点击查看摘要
Abstract:This work leverages Large Language Models (LLMs) to simulate human mobility, addressing challenges like high costs and privacy concerns in traditional models. Our hierarchical framework integrates persona generation, activity selection, and destination prediction, using real-world demographic and psychological data to create realistic movement patterns. Both physical models and language models are employed to explore and demonstrate different methodologies for human mobility simulation. By structuring data with summarization and weighted density metrics, the system ensures scalable memory management while retaining actionable insights. Preliminary results indicate that LLM-driven simulations align with observed real-world patterns, offering scalable, interpretable insights for social problems such as urban planning, traffic management, and public health. The framework’s ability to dynamically generate personas and activities enables it to provide adaptable and realistic daily routines. This study demonstrates the transformative potential of LLMs in advancing mobility modeling for societal and urban applications. The source code and interactive demo for our framework are available at this https URL.
[AI-54] H-FLTN: A Privacy-Preserving Hierarchical Framework for Electric Vehicle Spatio-Temporal Charge Prediction
链接: https://arxiv.org/abs/2502.18697
作者: Robert Marlin,Raja Jurdak,Alsharif Abuadbba
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 14 pages, 7 tables, 2 figures, Journal Paper
点击查看摘要
Abstract:The widespread adoption of Electric Vehicles (EVs) poses critical challenges for energy providers, particularly in predicting charging time (temporal prediction), ensuring user privacy, and managing resources efficiently in mobility-driven networks. This paper introduces the Hierarchical Federated Learning Transformer Network (H-FLTN) framework to address these challenges. H-FLTN employs a three-tier hierarchical architecture comprising EVs, community Distributed Energy Resource Management Systems (DERMS), and the Energy Provider Data Centre (EPDC) to enable accurate spatio-temporal predictions of EV charging needs while preserving privacy. Temporal prediction is enhanced using Transformer-based learning, capturing complex dependencies in charging behavior. Privacy is ensured through Secure Aggregation, Additive Secret Sharing, and Peer-to-Peer (P2P) Sharing with Augmentation, which allow only secret shares of model weights to be exchanged while securing all transmissions. To improve training efficiency and resource management, H-FLTN integrates Dynamic Client Capping Mechanism (DCCM) and Client Rotation Management (CRM), ensuring that training remains both computationally and temporally efficient as the number of participating EVs increases. DCCM optimises client participation by limiting excessive computational loads, while CRM balances training contributions across epochs, preventing imbalanced participation. Our simulation results based on large-scale empirical vehicle mobility data reveal that DCCM and CRM reduce the training time complexity with increasing EVs from linear to constant. Its integration into real-world smart city infrastructure enhances energy demand forecasting, resource allocation, and grid stability, ensuring reliability and sustainability in future mobility ecosystems.
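Additive secret sharing, one of the privacy building blocks named above, can be sketched in a few lines: each client splits its weight into random shares modulo a prime, and only sums of shares are ever revealed. Scalar weights stand in for tensors.

```python
import random

# A minimal additive secret-sharing sketch for weight aggregation:
# the aggregator only ever sees sums, never an individual weight.
PRIME = 2**61 - 1

def make_shares(secret: int, n: int):
    shares = [random.randrange(PRIME) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

clients = [11, 22, 33]                    # private "weights"
all_shares = [make_shares(w, 3) for w in clients]
# Each party sums one share from every client; no single share leaks a weight.
partials = [sum(col) % PRIME for col in zip(*all_shares)]
total = sum(partials) % PRIME
print(total == sum(clients))              # True: aggregate recovered
```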
[AI-55] Policy-as-Prompt: Rethinking Content Moderation in the Age of Large Language Models
链接: https://arxiv.org/abs/2502.18695
作者: Konstantina Palla,José Luis Redondo García,Claudia Hauff,Francesco Fabbri,Henrik Lindström,Daniel R. Taber,Andreas Damianou,Mounia Lalmas
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:Content moderation plays a critical role in shaping safe and inclusive online environments, balancing platform standards, user expectations, and regulatory frameworks. Traditionally, this process involves operationalising policies into guidelines, which are then used by downstream human moderators for enforcement, or to further annotate datasets for training machine learning moderation models. However, recent advancements in large language models (LLMs) are transforming this landscape. These models can now interpret policies directly as textual inputs, eliminating the need for extensive data curation. This approach offers unprecedented flexibility, as moderation can be dynamically adjusted through natural language interactions. This paradigm shift raises important questions about how policies are operationalised and the implications for content moderation practices. In this paper, we formalise the emerging policy-as-prompt framework and identify five key challenges across four domains: Technical Implementation (1. translating policy to prompts, 2. sensitivity to prompt structure and formatting), Sociotechnical (3. the risk of technological determinism in policy formation), Organisational (4. evolving roles between policy and machine learning teams), and Governance (5. model governance and accountability). Through analysing these challenges across technical, sociotechnical, organisational, and governance dimensions, we discuss potential mitigation approaches. This research provides actionable insights for practitioners and lays the groundwork for future exploration of scalable and adaptive content moderation systems in digital ecosystems.
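The policy-as-prompt pattern itself is compact. A minimal sketch, with a stubbed llm() standing in for any chat-completion API:

```python
# Policy-as-prompt: the policy text is passed directly as the model's
# instructions instead of being distilled into training labels.
POLICY = """Remove content that (1) targets a person with slurs or
(2) threatens violence. Otherwise allow."""

def llm(prompt: str) -> str:
    # Stub standing in for a real chat-completion call.
    return "allow: no policy violation found (stub response)"

def moderate(text: str) -> str:
    prompt = (f"Policy:\n{POLICY}\n\nContent:\n{text}\n\n"
              "Answer 'remove' or 'allow' with a one-line reason.")
    return llm(prompt)

print(moderate("What a lovely day for a picnic."))
```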
[AI-56] Hybrid Voting-Based Task Assignment in Role-Playing Games
链接: https://arxiv.org/abs/2502.18690
作者: Daniel Weiner,Raj Korpan
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted for presentation at Dungeons, Neurons, and Dialogues: Social Interaction Dynamics in Contextual Games Workshop at 20th Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI 2025)
点击查看摘要
Abstract:In role-playing games (RPGs), the level of immersion is critical, especially when an in-game agent conveys tasks, hints, or ideas to the player. For an agent to accurately interpret the player’s emotional state and contextual nuances, a foundational level of understanding is required, which can be achieved using a Large Language Model (LLM). Maintaining the LLM’s focus across multiple context changes, however, necessitates a more robust approach, such as integrating the LLM with a dedicated task allocation model to guide its performance throughout gameplay. In response to this need, we introduce Voting-Based Task Assignment (VBTA), a framework inspired by human reasoning in task allocation and completion. VBTA assigns capability profiles to agents and task descriptions to tasks, then generates a suitability matrix that quantifies the alignment between an agent’s abilities and a task’s requirements. Leveraging six distinct voting methods, a pre-trained LLM, and integrating conflict-based search (CBS) for path planning, VBTA efficiently identifies and assigns the most suitable agent to each task. While existing approaches focus on generating individual aspects of gameplay, such as single quests, or combat encounters, our method shows promise when generating both unique combat encounters and narratives because of its generalizable nature.
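A toy version of the suitability-matrix-plus-voting pipeline: several scoring rules each vote for the best agent per task and plurality decides. The capability/requirement vectors and the three rules are illustrative assumptions, not VBTA's six voting methods.

```python
import numpy as np

# Voting-based assignment sketch: each rule votes for the best agent
# per task; plurality picks the winner. All vectors are toy values.
agents = {"guard": np.array([0.9, 0.2, 0.4]),   # combat, lore, stealth
          "bard":  np.array([0.2, 0.9, 0.5])}
tasks = {"defend_gate": np.array([1.0, 0.1, 0.3]),
         "gather_rumors": np.array([0.1, 0.8, 0.7])}

def dot_rule(a, t): return a @ t
def cosine_rule(a, t): return a @ t / (np.linalg.norm(a) * np.linalg.norm(t))
def gap_rule(a, t): return -np.abs(a - t).sum()

for task, req in tasks.items():
    votes = {}
    for rule in (dot_rule, cosine_rule, gap_rule):
        winner = max(agents, key=lambda n: rule(agents[n], req))
        votes[winner] = votes.get(winner, 0) + 1
    print(task, "->", max(votes, key=votes.get))
```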
[AI-57] AI Mismatches: Identifying Potential Algorithmic Harms Before AI Development
链接: https://arxiv.org/abs/2502.18682
作者: Devansh Saxena,Ji-Youn Jung,Jodi Forlizzi,Kenneth Holstein,John Zimmerman
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: CHI Conference on Human Factors in Computing Systems (CHI '25), April 26-May 1, 2025, Yokohama, Japan
点击查看摘要
Abstract:AI systems are often introduced with high expectations, yet many fail to deliver, resulting in unintended harm and missed opportunities for benefit. We frequently observe significant “AI Mismatches”, where the system’s actual performance falls short of what is needed to ensure safety and co-create value. These mismatches are particularly difficult to address once development is underway, highlighting the need for early-stage intervention. Navigating complex, multi-dimensional risk factors that contribute to AI Mismatches is a persistent challenge. To address it, we propose an AI Mismatch approach to anticipate and mitigate risks early on, focusing on the gap between realistic model performance and required task performance. Through an analysis of 774 AI cases, we extracted a set of critical factors, which informed the development of seven matrices that map the relationships between these factors and highlight high-risk areas. Through case studies, we demonstrate how our approach can help reduce risks in AI development.
[AI-58] Comparing Native and Non-native English Speakers’ Behaviors in Collaborative Writing through Visual Analytics
链接: https://arxiv.org/abs/2502.18681
作者: Yuexi Chen,Yimin Xiao,Kazi Tasnim Zinat,Naomi Yamashita,Ge Gao,Zhicheng Liu
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: accepted by CHI 2025
点击查看摘要
Abstract:Understanding collaborative writing dynamics between native speakers (NS) and non-native speakers (NNS) is critical for enhancing collaboration quality and team inclusivity. In this paper, we partnered with communication researchers to develop visual analytics solutions for comparing NS and NNS behaviors in 162 writing sessions across 27 teams. The primary challenges in analyzing writing behaviors are data complexity and the uncertainties introduced by automated methods. In response, we present COALA, a novel visual analytics tool that improves model interpretability by displaying uncertainties in author clusters, generating behavior summaries using large language models, and visualizing writing-related actions at multiple granularities. We validated the effectiveness of COALA through user studies with domain experts (N=2+2) and researchers with relevant experience (N=8). We present the insights discovered by participants using COALA, suggest features for future AI-assisted collaborative writing tools, and discuss the broader implications for analyzing collaborative processes beyond writing.
[AI-59] Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support
链接: https://arxiv.org/abs/2502.18658
作者: Kevin Pu,Daniel Lazaro,Ian Arawjo,Haijun Xia,Ziang Xiao,Tovi Grossman,Yan Chen
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade-offs between increasingly salient AI support: prompt-only, proactive agent, and proactive agent with presence and context (Codellaborator). In a within-subject study (N=18), we find that proactive agents increase efficiency compared to the prompt-only paradigm, but also incur workflow disruptions. However, presence indicators and interaction context support alleviated disruptions and improved users’ awareness of AI processes. We underscore trade-offs of Codellaborator on user control, ownership, and code understanding, emphasizing the need to adapt proactivity to programming processes. Our research contributes to the design exploration and evaluation of proactive AI systems, presenting design implications on AI-integrated programming workflow.
[AI-60] Independent Mobility GPT (IDM-GPT): A Self-Supervised Multi-Agent Large Language Model Framework for Customized Traffic Mobility Analysis Using Machine Learning Models
链接: https://arxiv.org/abs/2502.18652
作者: Fengze Yang,Xiaoyue Cathy Liu,Lingjiu Lu,Bingzhang Wang,Chenxi (Dylan) Liu
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 24 pages, 4 figures, TRR accepted
点击查看摘要
Abstract:With the urbanization process, an increasing number of sensors are being deployed in transportation systems, leading to an explosion of big data. To harness the power of this vast transportation data, various machine learning (ML) and artificial intelligence (AI) methods have been introduced to address numerous transportation challenges. However, these methods often require significant investment in data collection, processing, storage, and the employment of professionals with expertise in transportation and ML. Additionally, privacy issues are a major concern when processing data for real-world traffic control and management. To address these challenges, the research team proposes an innovative Multi-agent framework named Independent Mobility GPT (IDM-GPT) based on large language models (LLMs) for customized traffic analysis, management suggestions, and privacy preservation. IDM-GPT efficiently connects users, transportation databases, and ML models economically. IDM-GPT trains, customizes, and applies various LLM-based AI agents for multiple functions, including user query comprehension, prompts optimization, data analysis, model selection, and performance evaluation and enhancement. With IDM-GPT, users without any background in transportation or ML can efficiently and intuitively obtain data analysis and customized suggestions in near real-time based on their questions. Experimental results demonstrate that IDM-GPT delivers satisfactory performance across multiple traffic-related tasks, providing comprehensive and actionable insights that support effective traffic management and urban mobility improvement.
[AI-61] WhatELSE: Shaping Narrative Spaces at Configurable Level of Abstraction for AI-bridged Interactive Storytelling
链接: https://arxiv.org/abs/2502.18641
作者: Zhuoran Lu,Qian Zhou,Yi Wang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: In Proceedings of CHI 2025
点击查看摘要
Abstract:Generative AI significantly enhances player agency in interactive narratives (IN) by enabling just-in-time content generation that adapts to player actions. While delegating generation to AI makes IN more interactive, it becomes challenging for authors to control the space of possible narratives - within which the final story experienced by the player emerges from their interaction with AI. In this paper, we present WhatELSE, an AI-bridged IN authoring system that creates narrative possibility spaces from example stories. WhatELSE provides three views (narrative pivot, outline, and variants) to help authors understand the narrative space and corresponding tools leveraging linguistic abstraction to control the boundaries of the narrative space. Taking innovative LLM-based narrative planning approaches, WhatELSE further unfolds the narrative space into executable game events. Through a user study (N=12) and technical evaluations, we found that WhatELSE enables authors to perceive and edit the narrative space and generates engaging interactive narratives at play-time.
[AI-62] Quantum Machine Learning in Precision Medicine and Drug Discovery – A Game Changer for Tailored Treatments?
链接: https://arxiv.org/abs/2502.18639
作者: Markus Bertl,Alan Mott,Salvatore Sinno,Bhavika Bhalgamiya
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: presented at AISoLA 2024
点击查看摘要
Abstract:The digitization of healthcare presents numerous challenges, including the complexity of biological systems, vast data generation, and the need for personalized treatment plans. Traditional computational methods often fall short, leading to delayed and sometimes ineffective diagnoses and treatments. Quantum Computing (QC) and Quantum Machine Learning (QML) offer transformative advancements with the potential to revolutionize medicine. This paper summarizes areas where QC promises unprecedented computational power, enabling faster, more accurate diagnostics, personalized treatments, and enhanced drug discovery processes. However, integrating quantum technologies into precision medicine also presents challenges, including errors in algorithms and high costs. We show that mathematically-based techniques for specifying, developing, and verifying software (formal methods) can enhance the reliability and correctness of QC. By providing a rigorous mathematical framework, formal methods help to specify, develop, and verify systems with high precision. In genomic data analysis, formal specification languages can precisely (1) define the behavior and properties of quantum algorithms designed to identify genetic markers associated with diseases. Model checking tools can systematically explore all possible states of the algorithm to (2) ensure it behaves correctly under all conditions, while theorem proving techniques provide mathematical (3) proof that the algorithm meets its specified properties, ensuring accuracy and reliability. Additionally, formal optimization techniques can (4) enhance the efficiency and performance of quantum algorithms by reducing resource usage, such as the number of qubits and gate operations. Therefore, we posit that formal methods can significantly contribute to enabling QC to realize its full potential as a game changer in precision medicine.
[AI-63] Differentially Private Iterative Screening Rules for Linear Regression
链接: https://arxiv.org/abs/2502.18578
作者: Amol Khanna,Fred Lu,Edward Raff
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Proceedings of the 15th ACM Conference on Data and Application Security and Privacy
点击查看摘要
Abstract:Linear L1-regularized models have remained one of the simplest and most effective tools in data science. Over the past decade, screening rules have risen in popularity as a way to eliminate features when producing the sparse regression weights of L1 models. However, despite the increasing need of privacy-preserving models for data analysis, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first private screening rule for linear regression. We initially find that this screening rule is too strong: it screens too many coefficients as a result of the private screening step. However, a weakened implementation of private screening reduces overscreening and improves performance.
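A rough sketch of what a private screening step can look like: per-feature correlations are perturbed with Laplace noise before thresholding. The clipping, sensitivity estimate and threshold below are illustrative assumptions, not a calibrated mechanism from the paper.

```python
import numpy as np

# Illustrative noisy screening for L1 regression: perturb per-feature
# correlations, then discard features below a threshold. This is a
# sketch, not a formally calibrated differentially private mechanism.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, 0] * 2.0 + rng.normal(size=500)        # only feature 0 matters

X = np.clip(X, -1, 1)                            # bound per-example influence
corr = X.T @ y / len(y)
eps, sensitivity = 1.0, 2.0 * np.abs(y).max() / len(y)
noisy = corr + rng.laplace(scale=sensitivity / eps, size=corr.shape)

lam = 0.3
keep = np.abs(noisy) >= lam                      # screen out small coefficients
print(np.where(keep)[0])
```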
[AI-64] What is the Alignment Objective of GRPO?
链接: https://arxiv.org/abs/2502.18548
作者: Milan Vojnovic,Se-Young Yun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this note, we examine the aggregation of preferences achieved by the Group Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function, which we show to essentially correspond to the reverse Kullback-Leibler (KL) divergence between the aggregation policy and the reference policy. Interestingly, we demonstrate that for groups of size two, the reward preference model corresponds to pairwise comparison preferences, similar to those in other alignment methods based on pairwise comparison feedback. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size. This provides insights into the dependence of the aggregate preference on parameters such as the regularisation constant and the confidence margin of question answers. Finally, we discuss the aggregation of preferences obtained by modifying the GRPO algorithm to use direct KL divergence as the penalty or to use rewards without scale normalisation.
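The shift-and-scale normalisation is easy to state in code, and it also makes the group-of-two observation visible: normalised rewards for a pair always come out near ±1, i.e., a pairwise comparison. A minimal sketch with toy rewards (the KL penalty term is omitted):

```python
import numpy as np

# Shift-and-scale normalisation of rewards within one sampled group.
rewards = np.array([1.0, 0.2, 0.7, 0.4])        # one group of 4 samples
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv)  # per-sample advantages used to weight policy-gradient updates

# For a group of size two, normalisation always yields roughly [+1, -1]:
# a pairwise comparison, as the note observes.
pair = np.array([0.9, 0.3])
print((pair - pair.mean()) / (pair.std() + 1e-8))
```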
[AI-65] Steganography Beyond Space-Time With Chain of Multimodal AI Agents
链接: https://arxiv.org/abs/2502.18547
作者: Ching-Chun Chang,Isao Echizen
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:Steganography is the art and science of covert writing, with a broad range of applications interwoven within the realm of cybersecurity. As artificial intelligence continues to evolve, its ability to synthesise realistic content emerges as a threat in the hands of cybercriminals who seek to manipulate and misrepresent the truth. Such synthetic content introduces a non-trivial risk of overwriting the subtle changes made for the purpose of steganography. When the signals in both the spatial and temporal domains are vulnerable to unforeseen overwriting, it calls for reflection on what can remain invariant after all. This study proposes a paradigm in steganography for audiovisual media, where messages are concealed beyond both spatial and temporal domains. A chain of multimodal agents is developed to deconstruct audiovisual content into a cover text, embed a message within the linguistic domain, and then reconstruct the audiovisual content through synchronising both aural and visual modalities with the resultant stego text. The message is encoded by biasing the word sampling process of a language generation model and decoded by analysing the probability distribution of word choices. The accuracy of message transmission is evaluated under both zero-bit and multi-bit capacity settings. Fidelity is assessed through both biometric and semantic similarities, capturing the identities of the recorded face and voice, as well as the core ideas conveyed through the media. Secrecy is examined through statistical comparisons between cover and stego texts. Robustness is tested across various scenarios, including audiovisual compression, face-swapping, voice-cloning and their combinations.
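The bias-the-sampling encoding can be sketched directly: candidate words are split into two buckets and the hidden bit selects the bucket. Length parity stands in for the keyed, model-probability-aware scheme a real system would use.

```python
# Encoding bits by biasing word choice during generation; length
# parity is a toy bucketing rule so the sketch runs deterministically.
def bucket(word: str) -> int:
    return len(word) % 2

def encode_step(candidates, bit):
    viable = [w for w in candidates if bucket(w) == bit]
    return viable[0] if viable else candidates[0]  # fallback loses the bit

message_bits = [1, 0, 1]
steps = [["quick", "fast"], ["fox", "wolf"], ["jumps", "hops"]]
stego = [encode_step(c, b) for c, b in zip(steps, message_bits)]
decoded = [bucket(w) for w in stego]
print(stego, decoded == message_bits)  # ['quick', 'wolf', 'jumps'] True
```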
[AI-66] MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications
Link: https://arxiv.org/abs/2502.18540
Authors: Zike Yuan, Ming Liu, Hui Wang, Bing Qin
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph-theoretic problems arise in real-world applications like logistics, communication networks, and traffic optimization. These problems are often complex, noisy, and irregular, posing challenges for traditional algorithms. Large language models (LLMs) offer potential solutions but face challenges, including limited accuracy and input length constraints. To address these challenges, we propose MA-GTS (Multi-Agent Graph Theory Solver), a multi-agent framework that decomposes these complex problems through agent collaboration. MA-GTS maps the implicitly expressed text-based graph data into clear, structured graph representations and dynamically selects the most suitable algorithm based on problem constraints and graph structure scale. This approach ensures that the solution process remains efficient and the resulting reasoning path is interpretable. We validate MA-GTS using the G-REAL dataset, a real-world-inspired graph theory dataset we created. Experimental results show that MA-GTS outperforms state-of-the-art approaches in terms of efficiency, accuracy, and scalability, with strong results across multiple benchmarks (G-REAL 94.2%, GraCoRe 96.9%, NLGraph 98.4%).MA-GTS is open-sourced at this https URL.
[AI-67] Revisiting Convolution Architecture in the Realm of DNA Foundation Models
Link: https://arxiv.org/abs/2502.18538
Authors: Yu Bo, Weian Mao, Yanjun Shao, Weiqiang Bai, Peng Ye, Xinzhu Ma, Junbo Zhao, Hao Chen, Chunhua Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, a variety of methods based on Transformer and state space model (SSM) architectures have been proposed, advancing foundational DNA language models. However, there is a lack of comparison between these recent approaches and the classical convolutional neural network (CNN) architecture on foundation model benchmarks. This raises the question: are CNNs truly being surpassed by these recent transformer- and SSM-based approaches? In this paper, we develop a simple but well-designed CNN-based method termed ConvNova. ConvNova identifies and proposes three effective designs: 1) dilated convolutions, 2) gated convolutions, and 3) a dual-branch framework for gating mechanisms. Through extensive empirical experiments, we demonstrate that ConvNova significantly outperforms recent methods on more than half of the tasks across several foundation model benchmarks. For example, in histone-related tasks, ConvNova exceeds the second-best method by an average of 5.8%, while generally utilizing fewer parameters and enabling faster computation. In addition, the experiments yield findings that may relate to biological characteristics. This indicates that CNNs remain a strong competitor to Transformers and SSMs. We anticipate that this work will spark renewed interest in CNN-based methods for DNA foundation models.
[AI-68] A Survey of Zero-Knowledge Proof Based Verifiable Machine Learning
Link: https://arxiv.org/abs/2502.18535
Authors: Zhizhi Peng, Taotao Wang, Chonghe Zhao, Guofu Liao, Zibin Lin, Yifeng Liu, Bin Cao, Long Shi, Qing Yang, Shengli Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 5 figures, 3 tables
Abstract:As machine learning technologies advance rapidly across various domains, concerns over data privacy and model security have grown significantly. These challenges are particularly pronounced when models are trained and deployed on cloud platforms or third-party servers due to the computational resource limitations of users’ end devices. In response, zero-knowledge proof (ZKP) technology has emerged as a promising solution, enabling effective validation of model performance and authenticity in both training and inference processes without disclosing sensitive data. Thus, ZKP ensures the verifiability and security of machine learning models, making it a valuable tool for privacy-preserving AI. Although some research has explored the verifiable machine learning solutions that exploit ZKP, a comprehensive survey and summary of these efforts remain absent. This survey paper aims to bridge this gap by reviewing and analyzing all the existing Zero-Knowledge Machine Learning (ZKML) research from June 2017 to December 2024. We begin by introducing the concept of ZKML and outlining its ZKP algorithmic setups under three key categories: verifiable training, verifiable inference, and verifiable testing. Next, we provide a comprehensive categorization of existing ZKML research within these categories and analyze the works in detail. Furthermore, we explore the implementation challenges faced in this field and discuss the improvement works to address these obstacles. Additionally, we highlight several commercial applications of ZKML technology. Finally, we propose promising directions for future advancements in this domain.
[AI-69] MAFE: Multi-Agent Fair Environments for Decision-Making Systems
Link: https://arxiv.org/abs/2502.18534
Authors: Zachary McBride Lazri, Anirudh Nakra, Ivan Brugere, Danial Dervovic, Antigoni Polychroniadou, Furong Huang, Dana Dachman-Soled, Min Wu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fairness constraints applied to machine learning (ML) models in static contexts have been shown to potentially produce adverse outcomes among demographic groups over time. To address this issue, emerging research focuses on creating fair solutions that persist over time. While many approaches treat this as a single-agent decision-making problem, real-world systems often consist of multiple interacting entities that influence outcomes. Explicitly modeling these entities as agents enables more flexible analysis of their interventions and the effects they have on a system’s underlying dynamics. A significant challenge in conducting research on multi-agent systems is the lack of realistic environments that leverage the limited real-world data available for analysis. To address this gap, we introduce the concept of a Multi-Agent Fair Environment (MAFE) and present and analyze three MAFEs that model distinct social systems. Experimental results demonstrate the utility of our MAFEs as testbeds for developing multi-agent fair algorithms.
[AI-70] CuDIP: Enhancing Theorem Proving in LLM s via Curriculum Learning-based Direct Preference Optimization
Link: https://arxiv.org/abs/2502.18532
Authors: Shuming Shi, Ruobing Zuo, Gaolei He, Jianlin Wang, Chenyang Xu, Zhengfeng Yang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Automated theorem proving (ATP) is one of the most challenging mathematical reasoning tasks for Large Language Models (LLMs). Most existing LLM-based ATP methods rely on supervised fine-tuning, which results in a limited alignment between the theorem proving process and human preferences. Direct Preference Optimization (DPO), which aligns LLMs with human preferences, has shown positive effects for certain tasks. However, the lack of high-quality preference data for theorem proving presents a significant challenge. In this paper, we apply DPO to formal automated theorem proving and introduce a Curriculum Learning-based DPO Iterative Theorem Proving (CuDIP) method. Specifically, we propose a method for constructing preference data which utilizes LLMs and existing theorem proving data to enhance the diversity of the preference data while reducing the reliance on human preference annotations. We then integrate this preference data construction method with curriculum learning to iteratively fine-tune the theorem proving model through DPO. Experimental results on the MiniF2F and ProofNet datasets demonstrate the effectiveness of the proposed method.
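For readers unfamiliar with the DPO objective that CuDIP builds on, here is the standard loss (Rafailov et al., 2023) in a short PyTorch sketch; the tensor values are toy numbers, and CuDIP's curriculum and preference-data construction are not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on sequence log-probabilities under the
    trained policy (pi_*) and a frozen reference model (ref_*)."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# Toy log-probs of a preferred vs. dispreferred proof step:
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-6.0]), torch.tensor([-6.5]))
print(loss)  # scalar training loss
```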
[AI-71] Heterogeneous Decision Making in Mixed Traffic: Uncertainty-aware Planning and Bounded Rationality
Link: https://arxiv.org/abs/2502.18529
Authors: Hang Wang, Qiaoyi Fang, Junshan Zhang
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: CPAL 2025
Abstract:The past few years have witnessed rapid growth in the deployment of automated vehicles (AVs). Clearly, AVs and human-driven vehicles (HVs) will co-exist for many years, and AVs will have to operate around HVs, pedestrians, cyclists, and more, calling for fundamental breakthroughs in AI designed for mixed traffic to achieve mixed autonomy. Thus motivated, we study heterogeneous decision making by AVs and HVs in a mixed traffic environment, aiming to capture the interactions between human and machine decision-making and develop an AI foundation that enables vehicles to operate safely and efficiently. There are a number of challenges in achieving mixed autonomy, including: 1) human drivers make driving decisions with bounded rationality, and developing accurate models of HVs' decision making remains an open problem; and 2) uncertainty-aware planning plays a critical role for AVs taking safety maneuvers in response to human behavior. In this paper, we introduce a formulation of AV-HV interaction, where the HV makes decisions with bounded rationality and the AV employs uncertainty-aware planning based on predictions of the HV's future actions. We conduct a comprehensive analysis of the AV's and HV's learning regret to answer the questions: 1) How does the learning performance depend on the HV's bounded rationality and the AV's planning? 2) How do different decision making strategies impact the overall learning performance? Our findings reveal some intriguing phenomena, such as Goodhart's Law in the AV's learning performance and compounding effects in the HV's decision making process. By examining the dynamics of the regrets, we gain insights into the interplay between human and machine decision making.
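Bounded rationality is often modeled with a Boltzmann (quantal-response) choice rule, where a rationality parameter interpolates between random and perfectly rational behavior. A small numpy sketch of that standard model (an illustration only; the paper's exact HV model may differ):

```python
import numpy as np

def boltzmann_choice(utilities, rationality):
    """Choose actions with probability proportional to
    exp(rationality * utility). rationality -> infinity recovers a
    perfectly rational argmax; rationality = 0 is uniform random."""
    z = rationality * np.asarray(utilities, dtype=float)
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

print(boltzmann_choice([1.0, 0.5, -2.0], rationality=0.0))   # near-uniform
print(boltzmann_choice([1.0, 0.5, -2.0], rationality=10.0))  # near-argmax
```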
[AI-72] ARACNE: An LLM -Based Autonomous Shell Pentesting Agent
Link: https://arxiv.org/abs/2502.18528
Authors: Tomas Nieponice, Veronica Valeros, Sebastian Garcia
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 7 pages, 2 figures, 3 tables
Abstract:We introduce ARACNE, a fully autonomous LLM-based pentesting agent tailored for SSH services that can execute commands on real Linux shell systems. ARACNE introduces a new agent architecture with multi-LLM support. Experiments show that ARACNE can reach a 60% success rate against the autonomous defender ShelLM and a 57.58% success rate against the Over The Wire Bandit CTF challenges, improving over the state-of-the-art. When winning, the average number of actions taken by the agent to accomplish the goals was fewer than 5. The results show that using multiple LLMs is a promising approach to increasing the accuracy of the agent's actions.
[AI-73] GOD model: Privacy Preserved AI School for Personal Assistant
Link: https://arxiv.org/abs/2502.18527
Authors: PIN AI Team: Bill Qingyun Sun, Laura Florescu, Boliang Zhang, Regan Peng, Smile Hu, Shouqiao Wang, Ben Wu, Xi Wang, Davide Crapis, Gavin Zhen Guo
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Personal AI assistants (e.g., Apple Intelligence, Meta AI) offer proactive recommendations that simplify everyday tasks, but their reliance on sensitive user data raises concerns about privacy and trust. To address these challenges, we introduce the Guardian of Data (GOD), a secure, privacy-preserving framework for training and evaluating AI assistants directly on-device. Unlike traditional benchmarks, the GOD model measures how well assistants can anticipate user needs-such as suggesting gifts-while protecting user data and autonomy. Functioning like an AI school, it addresses the cold start problem by simulating user queries and employing a curriculum-based approach to refine the performance of each assistant. Running within a Trusted Execution Environment (TEE), it safeguards user data while applying reinforcement and imitation learning to refine AI recommendations. A token-based incentive system encourages users to share data securely, creating a data flywheel that drives continuous improvement. By integrating privacy, personalization, and trust, the GOD model provides a scalable, responsible path for advancing personal AI assistants. For community collaboration, part of the framework is open-sourced at this https URL.
[AI-74] Reinforcement Learning-based Approach for Vehicle-to-Building Charging with Heterogeneous Agents and Long Term Rewards
Link: https://arxiv.org/abs/2502.18526
Authors: Fangqi Liu, Rishav Sen, Jose Paolo Talusan, Ava Pettet, Aaron Kandel, Yoshinori Suzue, Ayan Mukhopadhyay, Abhishek Dubey
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Strategic aggregation of electric vehicle batteries as energy reservoirs can optimize power grid demand, benefiting smart and connected communities, especially large office buildings that offer workplace charging. This involves optimizing charging and discharging to reduce peak energy costs and net peak demand, monitored over extended periods (e.g., a month). The task requires sequential decisions under uncertainty, with delayed and sparse rewards, a continuous action space, and the complexity of ensuring generalization across diverse conditions. Existing algorithmic approaches, e.g., heuristic-based strategies, fall short in addressing real-time decision-making under dynamic conditions, and traditional reinforcement learning (RL) models struggle with large state-action spaces, multi-agent settings, and the need for long-term reward optimization. To address these challenges, we introduce a novel RL framework that combines the Deep Deterministic Policy Gradient approach (DDPG) with action masking and efficient MILP-driven policy guidance. Our approach balances exploration of the continuous action space with meeting user charging demands. Using real-world data from a major electric vehicle manufacturer, we show that our approach comprehensively outperforms many well-established baselines and several scalable heuristic approaches, achieving significant cost savings while meeting all charging requirements. Our results show that the proposed approach is one of the first scalable and general approaches to solving the V2B energy management challenge.
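Action masking over a continuous charge/discharge action is straightforward to illustrate. The sketch below projects a raw actor output onto the battery's feasible set; the state-of-charge bookkeeping and parameter names are our simplifying assumptions, and the MILP-driven policy guidance from the abstract is not modeled.

```python
import numpy as np

def mask_action(raw_action, soc, capacity_kwh, max_rate_kw, dt_h=1.0):
    """Project a raw actor output (kW; positive = charge, negative =
    discharge) onto the feasible interval implied by the battery's
    state of charge and rate limit."""
    headroom = (capacity_kwh - soc) / dt_h   # max energy we can still store
    available = soc / dt_h                   # max energy we can still drain
    lo = -min(max_rate_kw, available)
    hi = min(max_rate_kw, headroom)
    return float(np.clip(raw_action, lo, hi))

# A nearly full battery cannot absorb the actor's requested 9 kW:
print(mask_action(9.0, soc=58.0, capacity_kwh=60.0, max_rate_kw=7.2))  # -> 2.0
```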
[AI-75] Class-Conditional Neural Polarizer: A Lightweight and Effective Backdoor Defense by Purifying Poisoned Features
Link: https://arxiv.org/abs/2502.18520
Authors: Mingli Zhu, Shaokui Wei, Hongyuan Zha, Baoyuan Wu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent studies have highlighted the vulnerability of deep neural networks to backdoor attacks, where models are manipulated to rely on embedded triggers within poisoned samples, despite the presence of both benign and trigger information. While several defense methods have been proposed, they often struggle to balance backdoor mitigation with maintaining benign performance. In this work, inspired by the concept of an optical polarizer, which allows light waves of specific polarizations to pass while filtering others, we propose a lightweight backdoor defense approach, NPD. This method integrates a neural polarizer (NP) as an intermediate layer within the compromised model, implemented as a lightweight linear transformation optimized via bi-level optimization. The learnable NP filters trigger information from poisoned samples while preserving benign content. Despite its effectiveness, we identify through empirical studies that NPD's performance degrades when the target labels (required for purification) are inaccurately estimated. To address this limitation while harnessing the potential of targeted adversarial mitigation, we propose a class-conditional neural polarizer-based defense (CNPD). The key innovation is a fusion module that integrates the backdoored model's predicted label with the features to be purified. This architecture inherently mimics targeted adversarial defense mechanisms without requiring the label estimation used in NPD. We propose three implementations of CNPD: the first is r-CNPD, which trains a replicated NP layer for each class and, during inference, selects the appropriate NP layer for defense based on the predicted class from the backdoored model. To efficiently handle a large number of classes, two variants are designed: e-CNPD, which embeds class information as additional features, and a-CNPD, which directs network attention using class information.
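The "lightweight linear transformation inserted as an intermediate layer" is simple to picture in PyTorch. The sketch below freezes the surrounding model and trains only the NP layer; the bi-level optimization objective from the paper is not shown, and the identity initialization is our assumption.

```python
import torch.nn as nn

class NeuralPolarizer(nn.Module):
    """A learnable linear layer inserted between two frozen blocks of a
    possibly backdoored model; only this layer is trained."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        nn.init.eye_(self.linear.weight)    # start as an identity map
        nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        return self.linear(x)

def insert_polarizer(front: nn.Module, back: nn.Module, dim: int) -> nn.Module:
    """Freeze the compromised model and splice the NP layer in between."""
    for p in front.parameters():
        p.requires_grad_(False)
    for p in back.parameters():
        p.requires_grad_(False)
    return nn.Sequential(front, NeuralPolarizer(dim), back)
```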
[AI-76] Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLM s
Link: https://arxiv.org/abs/2502.18518
Authors: Peng Yifeng, Wu Zhizheng, Chen Chen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern large language models (LLMs) exhibit critical vulnerabilities to poison pill attacks: localized data poisoning that alters specific factual knowledge while preserving overall model utility. We systematically demonstrate that these attacks exploit inherent architectural properties of LLMs, achieving a 54.6% increase in retrieval inaccuracy on long-tail knowledge versus dominant topics and up to a 25.5% increase in retrieval inaccuracy on compressed models versus original architectures. Through controlled mutations (e.g., temporal/spatial/entity alterations), our method induces localized memorization deterioration with negligible impact on models' performance on standard benchmarks (e.g., a 2% performance drop on MMLU/GPQA), leading to potential detection evasion. Our findings suggest: (1) Disproportionate vulnerability in long-tail knowledge may result from reduced parameter redundancy; (2) Model compression may increase attack surfaces, with pruned/distilled models requiring 30% fewer poison samples for equivalent damage; (3) Associative memory enables both the spread of collateral damage to related concepts and the amplification of damage from simultaneous attacks, particularly for dominant topics. These findings raise concerns over current scaling paradigms since attack costs are lowering while defense complexity is rising. Our work establishes poison pills as both a security threat and a diagnostic tool, revealing critical security-efficiency trade-offs in language model compression that challenge prevailing safety assumptions.
[AI-77] RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis
Link: https://arxiv.org/abs/2502.18517
Authors: Jianwei Wang, Junyao Yang, Haoran Li, Huiping Zhuang, Cen Chen, Ziqian Zeng
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:The success of large language models (LLMs) has attracted many individuals to fine-tune them for domain-specific tasks by uploading their data. However, in sensitive areas like healthcare and finance, privacy concerns often arise. One promising solution is to sample synthetic data with Differential Privacy (DP) guarantees to replace private data. However, such synthetic data contain a significant amount of flawed samples, which can be considered noise. Existing solutions typically rely on naive filtering by comparing ROUGE-L scores or embedding similarities, which are ineffective in addressing the noise. To address this issue, we propose RewardDS, a novel privacy-preserving framework that fine-tunes a reward proxy model and uses reward signals to guide the synthetic data generation. Our RewardDS introduces two key modules, Reward Guided Filtering and Self-Optimizing Refinement, to both filter and refine the synthetic data, effectively mitigating the noise. Extensive experiments across medical, financial, and code generation domains demonstrate the effectiveness of our method.
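Reward-guided filtering, the first of the two modules, can be sketched in a few lines: score each synthetic sample with the reward proxy and keep the top fraction. The reward callable and keep ratio below are hypothetical stand-ins, and the Self-Optimizing Refinement step is not shown.

```python
def reward_guided_filter(candidates, reward_model, keep_ratio=0.5):
    """Keep only the synthetic samples the reward proxy scores highest.
    `reward_model` is any callable mapping a sample to a scalar score."""
    scored = sorted(candidates, key=reward_model, reverse=True)
    return scored[: max(1, int(len(scored) * keep_ratio))]

# Toy usage with a stand-in reward: prefer longer, QA-shaped texts.
toy_reward = lambda s: len(s.split()) + (5 if "?" in s else 0)
data = ["Q: dosage? A: 10 mg twice daily", "asdf", "Q: symptoms? A: fever, cough"]
print(reward_guided_filter(data, toy_reward))
```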
[AI-78] A Multi-Agent Framework for Automated Vulnerability Detection and Repair in Solidity and Move Smart Contracts
Link: https://arxiv.org/abs/2502.18515
Authors: Rabimba Karanjai, Sam Blackshear, Lei Xu, Weidong Shi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments:
Abstract:The rapid growth of the blockchain ecosystem and the increasing value locked in smart contracts necessitate robust security measures. While languages like Solidity and Move aim to improve smart contract security, vulnerabilities persist. This paper presents Smartify, a novel multi-agent framework leveraging Large Language Models (LLMs) to automatically detect and repair vulnerabilities in Solidity and Move smart contracts. Unlike traditional methods that rely solely on vast pre-training datasets, Smartify employs a team of specialized agents working on different specially fine-tuned LLMs to analyze code based on underlying programming concepts and language-specific security principles. We evaluated Smartify on a dataset for Solidity and a curated dataset for Move, demonstrating its effectiveness in fixing a wide range of vulnerabilities. Our results show that Smartify (Gemma2+codegemma) achieves state-of-the-art performance, surpassing existing LLMs and enhancing general-purpose models’ capabilities, such as Llama 3.1. Notably, Smartify can incorporate language-specific knowledge, such as the nuances of Move, without requiring massive language-specific pre-training datasets. This work offers a detailed analysis of various LLMs’ performance on smart contract repair, highlighting the strengths of our multi-agent approach and providing a blueprint for developing more secure and reliable decentralized applications in the growing blockchain landscape. We also provide a detailed recipe for extending this to other similar use cases.
[AI-79] ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
Link: https://arxiv.org/abs/2502.18511
Authors: Xuxu Liu, Siyuan Liang, Mengya Han, Yong Luo, Aishan Liu, Xiantao Cai, Zheng He, Dacheng Tao
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative large language models are crucial in natural language processing, but they are vulnerable to backdoor attacks, where subtle triggers compromise their behavior. Although backdoor attacks against LLMs are constantly emerging, existing benchmarks remain limited in terms of attack coverage, metric-system integrity, and backdoor attack alignment. Moreover, existing pre-trained backdoor attacks are idealized in practice due to resource access constraints. Therefore we establish ELBA-Bench, a comprehensive and unified framework that allows attackers to inject backdoors through parameter-efficient fine-tuning (e.g., LoRA) or without fine-tuning techniques (e.g., in-context learning). ELBA-Bench provides over 1300 experiments encompassing the implementations of 12 attack methods, 18 datasets, and 12 LLMs. Extensive experiments provide new invaluable findings into the strengths and limitations of various attack strategies. For instance, PEFT attacks consistently outperform approaches without fine-tuning on classification tasks and show strong cross-dataset generalization, with optimized triggers boosting robustness; task-relevant backdoor optimization techniques or attack prompts along with clean and adversarial demonstrations can enhance backdoor attack success while preserving model performance on clean samples. Additionally, we introduce a universal toolbox designed for standardized backdoor attack research, with the goal of propelling further progress in this vital area.
[AI-80] Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Link: https://arxiv.org/abs/2502.18509
Authors: Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 22 pages, 2 figures
Abstract:Conversational agents are increasingly woven into individuals' personal lives, yet users often underestimate the privacy risks involved. The moment users share information with these agents (e.g., LLMs), their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLMs. It aims to minimize privacy risks by ensuring that users (senders) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LLMs (untrusted receivers). Through a formative design user study, we observe how even "privacy-conscious" users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally-deployable framework that operates between users and LLMs, and identifies and reformulates out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user's intended interaction goals, using different approaches to classify information relevant to those goals.
[AI-81] Deep Learning-based Dual Watermarking for Image Copyright Protection and Authentication
Link: https://arxiv.org/abs/2502.18501
Authors: Sudev Kumar Padhi, Archana Tiwari, Sk. Subidh Ali
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: IEEE Transactions on Artificial Intelligence. 2024 Oct 24
Abstract:Advancements in digital technologies make it easy to modify the content of digital images. Hence, ensuring the integrity and authenticity of digital images is necessary to protect them against various attacks that manipulate them. We present a Deep Learning (DL) based dual invisible watermarking technique for performing source authentication, content authentication, and protecting the digital content copyright of images sent over the internet. Beyond securing images, the proposed technique demonstrates robustness to content-preserving image manipulations. It is also impossible to imitate or overwrite the watermarks because the cryptographic hash of the image and the dominant features of the image in the form of a perceptual hash are used as watermarks. We highlighted the need for source authentication to safeguard image integrity and authenticity, along with identifying similar content for copyright protection. After exhaustive testing, we obtained a high peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), which implies only a minute change to the original image after embedding our watermarks. Our trained model achieves high watermark extraction accuracy and, to the best of our knowledge, this is the first deep learning-based dual watermarking technique proposed in the literature.
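The two watermark payloads named in the abstract, a cryptographic hash and a perceptual hash, behave very differently under edits, which is the point of using both. A minimal numpy sketch follows (our own average-hash variant, not the paper's exact feature extractor):

```python
import hashlib
import numpy as np

def crypto_hash(image: np.ndarray) -> str:
    """Bit-exact hash: flips if any pixel changes (content authentication)."""
    return hashlib.sha256(image.tobytes()).hexdigest()

def perceptual_hash(image: np.ndarray, hash_size: int = 8) -> str:
    """Average-hash sketch: stable under mild, content-preserving edits
    (useful for copyright / similar-content matching). Downsample by
    block means, then threshold against the global mean."""
    h, w = image.shape[:2]
    gray = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    small = gray[: h - h % hash_size, : w - w % hash_size]
    small = small.reshape(hash_size, small.shape[0] // hash_size,
                          hash_size, small.shape[1] // hash_size).mean(axis=(1, 3))
    bits = (small > small.mean()).astype(int).flatten()
    return "".join(map(str, bits))

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
payload = crypto_hash(img) + perceptual_hash(img)  # dual watermark payload
print(payload[:16], "...")
```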
[AI-82] Rule-based autocorrection of Piping and Instrumentation Diagrams (PIDs) on graphs
Link: https://arxiv.org/abs/2502.18493
Authors: Lukas Schulze Balhorn, Niels Seijsener, Kevin Dao, Minji Kim, Dominik P. Goldstein, Ge H. M. Driessen, Artur M. Schweidtmann
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:
Abstract:A piping and instrumentation diagram (PID) is a central reference document in chemical process engineering. Currently, chemical engineers manually review PIDs through visual inspection to find and rectify errors. However, engineering projects can involve hundreds to thousands of PID pages, creating a significant revision workload. This study proposes a rule-based method to support engineers with error detection and correction in PIDs. The method is based on a graph representation of PIDs, enabling automated error detection and correction, i.e., autocorrection, through rule graphs. We use our pyDEXPI Python package to generate PID graphs from DEXPI-standard PIDs. In this study, we developed 33 rules based on chemical engineering knowledge and heuristics, with five selected rules demonstrated as examples. A case study on an illustrative PID validates the reliability and effectiveness of the rule-based autocorrection method in revising PIDs.
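On a graph representation, such rules become short graph queries. A hypothetical example rule in networkx, in the spirit of, but not taken from, the paper's 33 rules:

```python
import networkx as nx

def check_pump_has_downstream_check_valve(g: nx.DiGraph):
    """Illustrative rule: every pump should have a check valve somewhere
    downstream. Node 'type' attributes stand in for DEXPI equipment
    classes; this is a demonstration rule, not one of the paper's 33."""
    errors = []
    for node, data in g.nodes(data=True):
        if data.get("type") == "pump":
            downstream = nx.descendants(g, node)
            if not any(g.nodes[d].get("type") == "check_valve" for d in downstream):
                errors.append(f"Pump {node}: no downstream check valve")
    return errors

g = nx.DiGraph()
g.add_node("P-101", type="pump")
g.add_node("V-12", type="gate_valve")
g.add_edge("P-101", "V-12")
print(check_pump_has_downstream_check_valve(g))  # one flagged error
```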
[AI-83] LLM4EFFI: Leveraging Large Language Models to Enhance Code Efficiency and Correctness
Link: https://arxiv.org/abs/2502.18489
Authors: Tong Ye, Weigang Huang, Xuhong Zhang, Tengfei Ma, Peiyu Liu, Jianwei Yin, Wenhai Wang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs), particularly Code LLMs, have demonstrated impressive performance in code generation. Current research primarily focuses on the correctness of generated code, while efficiency remains less explored. Recent works have focused on modifying the initial version of the code to improve its efficiency. However, such refinements are limited by the algorithmic design and overall logic of the initial code, resulting in only incremental improvements. In contrast, when human developers write high-quality code, they typically begin by designing several potential solutions at the logical level, evaluating various algorithms and their complexities, and then proceeding to implement and optimize the solution. In this study, we introduce LLM4EFFI: Large Language Model for Code Efficiency, a novel framework that enables LLMs to generate code that balances both efficiency and correctness. Specifically, LLM4EFFI divides the efficiency optimization process into two domains: algorithmic exploration in the logic domain and implementation optimization in the code domain. The correctness of the code is then guaranteed through a synthetic test case refinement process. This approach, which prioritizes efficiency before ensuring correctness, offers a new paradigm for efficient code generation. Experiments demonstrate that LLM4EFFI consistently improves both efficiency and correctness, achieving new state-of-the-art performance in code efficiency benchmarks across various LLM backbones.
[AI-84] AI Enhanced Ontology Driven NLP for Intelligent Cloud Resource Query Processing Using Knowledge Graphs KR
Link: https://arxiv.org/abs/2502.18484
Authors: Krishna Chaitanya Sunkara (Independent Researcher), Krishnaiah Narukulla (Independent Researcher)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 4 tables. This paper has not been published elsewhere yet. The experimental setup has the potential to be revised using real-time resources. Authors: Krishna Chaitanya Sunkara (IEEE Senior Member, Raleigh, NC, USA, Independent Researcher), Krishnaiah Narukulla (IEEE Senior Member, San Jose, CA, USA, Independent Researcher)
Abstract:Conventional resource search in cloud infrastructure relies on keyword-based searches or GUIDs, which demand exact matches and significant user effort to locate resources. These conventional approaches often fail to interpret the intent behind natural language queries, making resource discovery inefficient and inaccessible to users. Although some NLP-based search engines exist, they are limited and focus on analyzing the NLP query itself and extracting identifiers to find resources. They fail to search resources based on their behavior, operations, capabilities, relationships, features, business relevance, dynamically changing state, or the knowledge these resources hold. Search criteria have been changing with the inundation of AI-based services, which involve discovering not just the requested resources and identifiers but seeking insights. The real intent of a search has never been merely to list resources; it carries actual context, such as understanding the causes of some system behavior, compliance checks, capacity estimation, network constraints, troubleshooting, or business insights. This paper proposes an advanced Natural Language Processing (NLP) framework enhanced by ontology-based semantics to enable intuitive, human-readable queries, allowing users to discover the intent of the search itself. By constructing an ontology of cloud resources, their interactions, and behaviors, the proposed framework enables dynamic intent extraction and relevance ranking using Latent Semantic Indexing (LSI) and AI models. It introduces an automated pipeline that integrates ontology extraction by AI-powered data crawlers, building a semantic knowledge base for context-aware resource discovery.
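Since the abstract names Latent Semantic Indexing (LSI) for relevance ranking, a small scikit-learn sketch shows the mechanics: TF-IDF, truncated SVD, then cosine similarity between the query and resource descriptions. The documents, query, and component count are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy resource descriptions drawn from a hypothetical cloud ontology.
docs = [
    "virtual machine compute instance runs workloads",
    "load balancer distributes traffic across instances",
    "object storage bucket holds backups and logs",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)   # the latent semantic space
doc_vecs = lsi.fit_transform(tfidf)

query = ["which resources handle incoming traffic"]
q = lsi.transform(vec.transform(query))
print(cosine_similarity(q, doc_vecs))   # relevance ranking over resources
```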
[AI-85] Modeling Churn in Recommender Systems with Aggregated Preferences
Link: https://arxiv.org/abs/2502.18483
Authors: Gur Keinan, Omer Ben-Porat
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:While recommender systems (RSs) traditionally rely on extensive individual user data, regulatory and technological shifts necessitate reliance on aggregated user information. This shift significantly impacts the recommendation process, requiring RSs to engage in intensive exploration to identify user preferences. However, this approach risks user churn due to potentially unsatisfactory recommendations. In this paper, we propose a model that addresses the dual challenges of leveraging aggregated user information and mitigating churn risk. Our model assumes that the RS operates with a probabilistic prior over user types and aggregated satisfaction levels for various content types. We demonstrate that optimal policies naturally transition from exploration to exploitation in finite time, develop a branch-and-bound algorithm for computing these policies, and empirically validate its effectiveness.
[AI-86] MDE: Modality Discrimination Enhancement for Multi-modal Recommendation
Link: https://arxiv.org/abs/2502.18481
Authors: Hang Zhou, Yucheng Wang, Huijing Zhan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-modal recommendation systems aim to enhance performance by integrating an item’s content features across various modalities with user behavior data. Effective utilization of features from different modalities requires addressing two challenges: preserving semantic commonality across modalities (modality-shared) and capturing unique characteristics for each modality (modality-specific). Most existing approaches focus on aligning feature spaces across modalities, which helps represent modality-shared features. However, modality-specific distinctions are often neglected, especially when there are significant semantic variations between modalities. To address this, we propose a Modality Distinctiveness Enhancement (MDE) framework that prioritizes extracting modality-specific information to improve recommendation accuracy while maintaining shared features. MDE enhances differences across modalities through a novel multi-modal fusion module and introduces a node-level trade-off mechanism to balance cross-modal alignment and differentiation. Extensive experiments on three public datasets show that our approach significantly outperforms other state-of-the-art methods, demonstrating the effectiveness of jointly considering modality-shared and modality-specific features.
[AI-87] Beyond Self-Consistency: Loss-Balanced Perturbation-Based Regularization Improves Industrial-Scale Ads Ranking
Link: https://arxiv.org/abs/2502.18478
Authors: Ilqar Ramazanli, Hamid Eghbalzadeh, Xiaoyi Liu, Yang Wang, Jiaxiang Fu, Kaushik Rangadurai, Sem Park, Bo Long, Xue Feng
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Perturbation-based regularization techniques address many challenges in industrial-scale large models, particularly with sparse labels, and emphasize consistency and invariance of model predictions under perturbation. Among the popular regularization techniques are various forms of self-consistency, which involve making small modifications to input data while preserving contextual information and enforcing similar predictions through auxiliary loss functions. In this work, we explore the first successful application of perturbation-based regularization algorithms in large-scale ads ranking models, and further propose a novel regularization algorithm, namely Loss-Balanced Small Perturbation Regularization (LSPR), which can potentially be used in any deep learning model. We demonstrate that both Self-Consistency Regularization (SCR) approaches and LSPR are scalable and can improve ads delivery systems. Through industrial-scale experiments and numerical analysis, we additionally show that the proposed LSPR performs consistently better than SCR across various groups and signal-availability setups. Finally, we report a successful application of the proposed LSPR in a billion-scale industrial ranking system, which, to the best of our knowledge, is the first of its kind; it is specially designed to address various scalability challenges (e.g., different surfaces, geographical locations, clients, and so on), as discussed in this paper.
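The general shape of a perturbation-based consistency regularizer is easy to sketch. Below is a generic clean-vs-perturbed consistency loss with a balancing weight, in the spirit of SCR/LSPR; LSPR's specific loss-balancing scheme is not detailed in the abstract, so sigma and alpha here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def perturbation_consistency_loss(model, x, y, sigma=0.01, alpha=0.1):
    """Task loss plus a weighted consistency term between predictions on
    clean and slightly perturbed inputs. Assumes `model` returns logits
    for a binary task and `y` is a float tensor of the same shape."""
    clean = model(x)
    noisy = model(x + sigma * torch.randn_like(x))  # small input perturbation
    task = F.binary_cross_entropy_with_logits(clean, y)
    consistency = F.mse_loss(noisy, clean.detach())  # enforce invariance
    return task + alpha * consistency
```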
[AI-88] Recommendations Beyond Catalogs: Diffusion Models for Personalized Generation
Link: https://arxiv.org/abs/2502.18477
Authors: Gabriel Patron, Zhiwei Xu, Ishan Kapnadak, Felipe Maia Polo
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Modern recommender systems follow the guiding principle of serving the right user, the right item at the right time. One of their main limitations is that they are typically limited to items already in the catalog. We propose REcommendations BEyond CAtalogs, REBECA, a new class of probabilistic diffusion-based recommender systems that synthesize new items tailored to individual tastes rather than retrieve items from the catalog. REBECA combines efficient training in embedding space with a novel diffusion prior that only requires users’ past ratings of items. We evaluate REBECA on real-world data and propose novel personalization metrics for generative recommender systems. Extensive experiments demonstrate that REBECA produces high-quality, personalized recommendations, generating images that align with users’ unique preferences.
[AI-89] A Contemporary Survey of Large Language Model Assisted Program Analysis
Link: https://arxiv.org/abs/2502.18474
Authors: Jiayimei Wang, Tao Ni, Wei-Bin Lee, Qingchuan Zhao
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:The increasing complexity of software systems has driven significant advancements in program analysis, as traditional methods are unable to meet the demands of modern software development. To address these limitations, deep learning techniques, particularly Large Language Models (LLMs), have gained attention due to their context-aware capabilities in code comprehension. Recognizing the potential of LLMs, researchers have extensively explored their application in program analysis since their introduction. Despite existing surveys on LLM applications in cybersecurity, comprehensive reviews specifically addressing their role in program analysis remain scarce. In this survey, we systematically review the application of LLMs in program analysis, categorizing the existing work into static analysis, dynamic analysis, and hybrid approaches. Moreover, by examining and synthesizing recent studies, we identify future directions and challenges in the field. This survey aims to demonstrate the potential of LLMs in advancing program analysis practices and offer actionable insights for security researchers seeking to enhance detection frameworks or develop domain-specific models.
[AI-90] SOK: Exploring Hallucinations and Security Risks in AI-Assisted Software Development with Insights for LLM Deployment
Link: https://arxiv.org/abs/2502.18468
Authors: Ariful Haque, Sunzida Siddique, Md. Mahfuzur Rahman, Ahmed Rafi Hasan, Laxmi Rani Das, Marufa Kamal, Tasnim Masura, Kishor Datta Gupta
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:The integration of Large Language Models (LLMs) such as GitHub Copilot, ChatGPT, Cursor AI, and Codeium AI into software development has revolutionized the coding landscape, offering significant productivity gains, automation, and enhanced debugging capabilities. These tools have proven invaluable for generating code snippets, refactoring existing code, and providing real-time support to developers. However, their widespread adoption also presents notable challenges, particularly in terms of security vulnerabilities, code quality, and ethical concerns. This paper provides a comprehensive analysis of the benefits and risks associated with AI-powered coding tools, drawing on user feedback, security analyses, and practical use cases. We explore the potential for these tools to replicate insecure coding practices, introduce biases, and generate incorrect or non-sensical code (hallucinations). In addition, we discuss the risks of data leaks, intellectual property violations and the need for robust security measures to mitigate these threats. By comparing the features and performance of these tools, we aim to guide developers in making informed decisions about their use, ensuring that the benefits of AI-assisted coding are maximized while minimizing associated risks.
[AI-91] ChatGPT vs. DeepSeek : A Comparative Study on AI-Based Code Generation
Link: https://arxiv.org/abs/2502.18467
Authors: Md Motaleb Hossen Manik
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Background: AI-powered code generation, fueled by Large Language Models (LLMs), is revolutionizing software development. Models like OpenAI's Codex and GPT-4, alongside DeepSeek, leverage vast code and natural language datasets. However, ensuring code quality and correctness and managing complex tasks remains challenging, necessitating thorough evaluation. Methodology: This research compares ChatGPT (version o1) and DeepSeek (version R1) for Python code generation using online judge coding challenges. It evaluates correctness (online judge verdicts, up to three attempts), code quality (Pylint/Flake8), and efficiency (execution time/memory usage). Results: DeepSeek demonstrated higher correctness, particularly on algorithmic tasks, often achieving 'Accepted' on the first attempt. ChatGPT sometimes required multiple attempts or failed outright. However, ChatGPT encountered fewer issues, used comparable or slightly less memory, consumed less execution time, and wrote fewer lines of code. Conclusion: DeepSeek exhibited superior correctness in Python code generation, often requiring fewer attempts, suggesting an advantage in algorithmic problem-solving. Both models showed almost similar efficiency in execution time and memory use. Finally, this research provides insights for developers choosing AI coding assistants and informs future AI-driven software development research.
[AI-92] MLScent A tool for Anti-pattern detection in ML projects
Link: https://arxiv.org/abs/2502.18466
Authors: Karthik Shivashankar, Antonio Martini
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 4th International Conference on AI Engineering - Software Engineering for AI, CAIN 2025
Abstract:Machine learning (ML) codebases face unprecedented challenges in maintaining code quality and sustainability as their complexity grows exponentially. While traditional code smell detection tools exist, they fail to address ML-specific issues that can significantly impact model performance, reproducibility, and maintainability. This paper introduces MLScent, a novel static analysis tool that leverages sophisticated Abstract Syntax Tree (AST) analysis to detect anti-patterns and code smells specific to ML projects. MLScent implements 76 distinct detectors across major ML frameworks including TensorFlow (13 detectors), PyTorch (12 detectors), Scikit-learn (9 detectors), and Hugging Face (10 detectors), along with data science libraries like Pandas and NumPy (8 detectors each). The tool's architecture also integrates general ML smell detection (16 detectors) and specialized analysis for data preprocessing and model training workflows. Our evaluation demonstrates MLScent's effectiveness through both quantitative classification metrics and qualitative assessment via user-study feedback from ML practitioners. Results show high accuracy in identifying framework-specific anti-patterns, data handling issues, and general ML code smells across real-world projects.
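AST-based smell detection of this kind is easy to demonstrate with Python's ast module. The detector below flags train_test_split calls without a random_state, a reproducibility smell; it is an illustrative example, not one of MLScent's 76 detectors verbatim.

```python
import ast

class MissingSeedDetector(ast.NodeVisitor):
    """Flag calls to train_test_split that omit random_state, which
    makes experiments non-reproducible."""
    def __init__(self):
        self.smells = []

    def visit_Call(self, node):
        # Handles both `train_test_split(...)` and `module.train_test_split(...)`.
        name = getattr(node.func, "id", getattr(node.func, "attr", ""))
        if name == "train_test_split":
            if not any(kw.arg == "random_state" for kw in node.keywords):
                self.smells.append(
                    f"line {node.lineno}: train_test_split without random_state")
        self.generic_visit(node)

code = "X_tr, X_te = train_test_split(X)\n"
detector = MissingSeedDetector()
detector.visit(ast.parse(code))
print(detector.smells)  # one smell reported
```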
[AI-93] Integrating Biological and Machine Intelligence: Attention Mechanisms in Brain-Computer Interfaces
Link: https://arxiv.org/abs/2502.19281
Authors: Jiyuan Wang, Weishan Ye, Jialin He, Li Zhang, Gan Huang, Zhuliang Yu, Zhen Liang
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:With the rapid advancement of deep learning, attention mechanisms have become indispensable in electroencephalography (EEG) signal analysis, significantly enhancing Brain-Computer Interface (BCI) applications. This paper presents a comprehensive review of traditional and Transformer-based attention mechanisms, their embedding strategies, and their applications in EEG-based BCI, with a particular emphasis on multimodal data fusion. By capturing EEG variations across time, frequency, and spatial channels, attention mechanisms improve feature extraction, representation learning, and model robustness. These methods can be broadly categorized into traditional attention mechanisms, which typically integrate with convolutional and recurrent networks, and Transformer-based multi-head self-attention, which excels in capturing long-range dependencies. Beyond single-modality analysis, attention mechanisms also enhance multimodal EEG applications, facilitating effective fusion between EEG and other physiological or sensory data. Finally, we discuss existing challenges and emerging trends in attention-based EEG modeling, highlighting future directions for advancing BCI technology. This review aims to provide valuable insights for researchers seeking to leverage attention mechanisms for improved EEG interpretation and application.
[AI-94] AI-Powered Bayesian Inference
Link: https://arxiv.org/abs/2502.19231
Authors: Veronika Ročková, Sean O'Hagan
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Research note, 27 pages, 3 figures
Abstract:The advent of Generative Artificial Intelligence (GAI) has heralded an inflection point that changed how society thinks about knowledge acquisition. While GAI cannot be fully trusted for decision-making, it may still provide valuable information that can be integrated into a decision pipeline. Rather than seeing the lack of certitude and inherent randomness of GAI as a problem, we view it as an opportunity. Indeed, variable answers to given prompts can be leveraged to construct a prior distribution which reflects assuredness of AI predictions. This prior distribution may be combined with tailored datasets for a fully Bayesian analysis with an AI-driven prior. In this paper, we explore such a possibility within a non-parametric Bayesian framework. The basic idea consists of assigning a Dirichlet process prior distribution on the data-generating distribution with AI generative model as its baseline. Hyper-parameters of the prior can be tuned out-of-sample to assess the informativeness of the AI prior. Posterior simulation is achieved by computing a suitably randomized functional on an augmented data that consists of observed (labeled) data as well as fake data whose labels have been imputed using AI. This strategy can be parallelized and rapidly produces iid samples from the posterior by optimization as opposed to sampling from conditionals. Our method enables (predictive) inference and uncertainty quantification leveraging AI predictions in a coherent probabilistic manner.
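The "randomized functional on augmented data" idea admits a compact numpy sketch: Dirichlet (Bayesian-bootstrap) weights over observed plus AI-imputed data, with the AI portion down-weighted by a tunable hyper-parameter. This is our simplified reading of the strategy, not the paper's exact sampler, and all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(1.0, 1.0, size=20)     # labeled data
ai_imputed = rng.normal(1.4, 1.0, size=80)   # 'fake' data with AI-imputed labels

def posterior_mean_draws(obs, fake, n_draws=1000, fake_weight=0.25):
    """Draw Dirichlet weights over the augmented data and evaluate one
    functional (the mean) per draw; fake_weight stands in for the
    informativeness hyper-parameter of the AI prior."""
    data = np.concatenate([obs, fake])
    alpha = np.concatenate([np.ones(len(obs)),
                            np.full(len(fake), fake_weight)])
    w = rng.dirichlet(alpha, size=n_draws)   # (n_draws, n) weight vectors
    return w @ data                          # iid draws of the posterior mean

draws = posterior_mean_draws(observed, ai_imputed)
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))
```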
[AI-95] Enhancing the Scalability and Applicability of Kohn-Sham Hamiltonians for Molecular Systems
Link: https://arxiv.org/abs/2502.19227
Authors: Yunyang Li, Zaishuo Xia, Lin Huang, Xinran Wei, Han Yang, Sam Harshe, Zun Wang, Chang Liu, Jia Zhang, Bin Shao, Mark B. Gerstein
Subjects: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Density Functional Theory (DFT) is a pivotal method within quantum chemistry and materials science, with its core involving the construction and solution of the Kohn-Sham Hamiltonian. Despite its importance, the application of DFT is frequently limited by the substantial computational resources required to construct the Kohn-Sham Hamiltonian. In response to these limitations, current research has employed deep-learning models to efficiently predict molecular and solid Hamiltonians, with roto-translational symmetries encoded in their neural networks. However, the scalability of prior models may be problematic when applied to large molecules, resulting in non-physical predictions of ground-state properties. In this study, we generate a substantially larger training set (PubChemQH) than used previously and use it to create a scalable model for DFT calculations with physical accuracy. For our model, we introduce a loss function derived from physical principles, which we call Wavefunction Alignment Loss (WALoss). WALoss involves performing a basis change on the predicted Hamiltonian to align it with the observed one; thus, the resulting differences can serve as a surrogate for orbital energy differences, allowing models to make better predictions for molecular orbitals and total energies than previously possible. WALoss also substantially accelerates self-consistent-field (SCF) DFT calculations. Here, we show it achieves a reduction in total energy prediction error by a factor of 1347 and an SCF calculation speed-up by a factor of 18%. These substantial improvements set new benchmarks for achieving accurate and applicable predictions in larger molecular systems.
[AI-96] Robust Over-the-Air Computation with Type-Based Multiple Access
Link: https://arxiv.org/abs/2502.19014
Authors: Marc Martinez-Gost, Ana Pérez-Neira, Miguel Ángel Lagunas
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Paper submitted to 33rd European Signal Processing Conference (EUSIPCO 2025)
Abstract:This paper utilizes the properties of type-based multiple access (TBMA) to investigate its effectiveness as a robust approach for over-the-air computation (AirComp) in the presence of Byzantine attacks, that is, adversarial strategies where malicious nodes intentionally distort their transmissions to corrupt the aggregated result. Unlike classical direct aggregation (DA) AirComp, which aggregates data in the amplitude of the signals and is highly vulnerable to attacks, TBMA distributes data over multiple radio resources, enabling the receiver to construct a histogram representation of the transmitted data. This structure allows the integration of classical robust estimators and supports the computation of diverse functions beyond the arithmetic mean, which is not feasible with DA. Through extensive simulations, we demonstrate that robust TBMA significantly outperforms DA, maintaining high accuracy even under adversarial conditions, and showcases its applicability in federated learning (FEEL) scenarios. Additionally, TBMA reduces channel state information (CSI) requirements, lowers energy consumption, and enhances resiliency by leveraging the diversity of the transmitted data.
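The key structural advantage, aggregating from a histogram rather than from signal amplitudes, is what lets classical robust estimators slot in. A toy numpy sketch with a trimmed mean follows (the bin layout and attack pattern are invented):

```python
import numpy as np

def tbma_robust_mean(histogram, bin_centers, trim=0.1):
    """Aggregate from a TBMA-style histogram: expand counts back to
    samples, then apply a classical robust estimator (a trimmed mean
    here; a median works the same way). Byzantine outliers inflate
    extreme bins and get trimmed away."""
    samples = np.repeat(bin_centers, histogram)
    lo, hi = np.quantile(samples, [trim, 1 - trim])
    kept = samples[(samples >= lo) & (samples <= hi)]
    return kept.mean()

centers = np.linspace(0, 1, 11)
hist = np.array([0, 1, 3, 8, 12, 9, 4, 1, 0, 0, 5])  # last bin: attackers
print(tbma_robust_mean(hist, centers))  # close to the honest center ~0.4
```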
[AI-97] SE(3)-Equivariant Ternary Complex Prediction Towards Target Protein Degradation
Link: https://arxiv.org/abs/2502.18875
Authors: Fanglei Xue, Meihan Zhang, Shuqi Li, Xinyu Gao, James A. Wohlschlegel, Wenbing Huang, Yi Yang, Weixian Deng
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
[AI-98] Bridging Critical Gaps in Convergent Learning: How Representational Alignment Evolves Across Layers Training and Distribution Shifts
Link: https://arxiv.org/abs/2502.18710
Authors: Chaitanya Kapoor, Sudhanshu Srivastava, Meenakshi Khosla
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Understanding convergent learning – the extent to which artificial and biological neural networks develop similar representations – is crucial for neuroscience and AI, as it reveals shared learning principles and guides brain-like model design. While several studies have noted convergence in early and late layers of vision networks, key gaps remain. First, much existing work relies on a limited set of metrics, overlooking transformation invariances required for proper alignment. We compare three metrics that ignore specific irrelevant transformations: linear regression (ignoring affine transformations), Procrustes (ignoring rotations and reflections), and permutation/soft-matching (ignoring unit order). Notably, orthogonal transformations align representations nearly as effectively as more flexible linear ones, and although permutation scores are lower, they significantly exceed chance, indicating a robust representational basis. A second critical gap lies in understanding when alignment emerges during training. Contrary to expectations that convergence builds gradually with task-specific learning, our findings reveal that nearly all convergence occurs within the first epoch – long before networks achieve optimal performance. This suggests that shared input statistics, architectural biases, or early training dynamics drive convergence rather than the final task solution. Finally, prior studies have not systematically examined how changes in input statistics affect alignment. Our work shows that out-of-distribution (OOD) inputs consistently amplify differences in later layers, while early layers remain aligned for both in-distribution and OOD inputs, suggesting that this alignment is driven by generalizable features stable across distribution shifts. These findings fill critical gaps in our understanding of representational convergence, with implications for neuroscience and AI.
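Of the three metric families compared, Procrustes alignment is the easiest to show compactly: fit an orthogonal map between two activation matrices, then correlate the mapped activations. A scipy sketch with synthetic activations (the scoring convention is our own):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.stats import pearsonr

def procrustes_score(A, B):
    """Alignment that ignores rotations/reflections: fit an orthogonal
    map R minimizing ||A @ R - B||, then correlate A @ R with B."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    R, _ = orthogonal_procrustes(A, B)
    return pearsonr((A @ R).ravel(), B.ravel())[0]

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 16))                 # 100 stimuli x 16 units
Q = np.linalg.qr(rng.normal(size=(16, 16)))[0] # a random rotation
print(procrustes_score(A, A @ Q))              # ~1.0: the rotation is ignored
```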
[AI-99] Mind the Gap: Bridging the Divide Between AI Aspirations and the Reality of Autonomous Characterization
Link: https://arxiv.org/abs/2502.18604
Authors: Grace Guinan, Addison Salvador, Michelle A. Smeaton, Andrew Glaws, Hilary Egan, Brian C. Wyatt, Babak Anasori, Kevin R. Fiedler, Matthew J. Olszta, Steven R. Spurgeon
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 33 pages, 6 figures
[AI-100] Applications of Statistical Field Theory in Deep Learning
Link: https://arxiv.org/abs/2502.18553
Authors: Zohar Ringel, Noa Rubin, Edo Mor, Moritz Helias, Inbar Seroussi
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Machine Learning
[LG-0] Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Link: https://arxiv.org/abs/2502.19414
Authors: Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Technical Report
Abstract:There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only 9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs’ ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.
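Validating a counterexample automatically reduces to running the incorrect and reference programs on the same input and comparing outputs. A minimal subprocess sketch under simplifying assumptions (deterministic output, no special judge; the file paths are hypothetical):

```python
import subprocess

def is_counterexample(input_text, incorrect_path, reference_path, timeout=5):
    """An input is a counterexample if the incorrect and reference
    solutions disagree on it. Assumes both are Python scripts reading
    stdin and writing a deterministic answer to stdout."""
    def run(path):
        result = subprocess.run(["python", path], input=input_text,
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout.strip()
    return run(incorrect_path) != run(reference_path)

# Hypothetical usage against two submitted solutions:
# print(is_counterexample("3\n1 2 3\n", "wrong.py", "accepted.py"))
```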
[LG-1] Verde: Verification via Refereed Delegation for Machine Learning Programs
Link: https://arxiv.org/abs/2502.19405
Authors: Arasu Arun, Adam St. Arnaud, Alexey Titov, Brian Wilcox, Viktor Kolobaric, Marc Brinkmann, Oguzhan Ersoy, Ben Fielding, Joseph Bonneau
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Machine learning programs, such as those performing inference, fine-tuning, and training of LLMs, are commonly delegated to untrusted compute providers. To provide correctness guarantees for the client, we propose adapting the cryptographic notion of refereed delegation to the machine learning setting. This approach enables a computationally limited client to delegate a program to multiple untrusted compute providers, with a guarantee of obtaining the correct result if at least one of them is honest. Refereed delegation of ML programs poses two technical hurdles: (1) an arbitration protocol to resolve disputes when compute providers disagree on the output, and (2) the ability to bitwise reproduce ML programs across different hardware setups. For (1), we design Verde, a dispute arbitration protocol that efficiently handles the large scale and graph-based computational model of modern ML programs. For (2), we build RepOps (Reproducible Operators), a library that eliminates hardware “non-determinism” by controlling the order of floating point operations performed on all hardware. Our implementation shows that refereed delegation achieves both strong guarantees for clients and practical overheads for compute providers.
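The reproducibility hurdle that RepOps addresses stems from floating-point non-associativity: reductions parallelized in different orders on different hardware can disagree at the bit level. The tiny illustration below shows the problem and why fixing one canonical operation order restores bitwise agreement; it is illustrative only, since RepOps works at the level of ML operators rather than Python loops.

```python
# Floating-point addition is not associative, so reductions that parallelize
# in different orders on different hardware can disagree at the bit level.
# Fixing a canonical order (here: strictly sequential, left-to-right) makes
# the result reproducible across runs and machines.
import random

def fixed_order_sum(values):
    acc = 0.0
    for v in values:          # one canonical order, no reassociation
        acc += v
    return acc

def shuffled_sum(values, seed):
    vs = values[:]
    random.Random(seed).shuffle(vs)  # stand-in for a different reduction tree
    return fixed_order_sum(vs)

values = [1e16, 1.0, -1e16, 1.0] * 1000
print(fixed_order_sum(values) == fixed_order_sum(values))  # True: same order, same bits
print(shuffled_sum(values, 0) == shuffled_sum(values, 1))  # often False: order matters
```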
[LG-2] General Reasoning Requires Learning to Reason from the Get-go
Link: https://arxiv.org/abs/2502.19402
Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal
Subjects: Machine Learning (cs.LG)
*Comments: 11 pages
Click to view abstract
Abstract:Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly – the hallmarks of artificial general intelligence (AGI) – remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLM’s reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretraining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a reasoning prior for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing architectures at learning to reason in novel scenarios.
[LG-3] HDEE: Heterogeneous Domain Expert Ensemble
Link: https://arxiv.org/abs/2502.19385
Authors: Oğuzhan Ersoy, Jari Kolehmainen, Gabriel Passamani Andrade
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:
Click to view abstract
Abstract:Training dense LLMs requires enormous amounts of data and centralized compute, which introduces fundamental bottlenecks and ever-growing costs for large models. Several studies aim to reduce this dependency on centralization by reducing the communication overhead of training dense models. Taking this idea of reducing communication overhead to a natural extreme, by training embarrassingly parallelizable ensembles of small independent experts, has been shown to outperform large dense models trained in traditional centralized settings. However, existing studies do not take into account underlying differences amongst data domains and treat them as monolithic, regardless of their underlying complexity, size, or distribution. In this paper, we explore the effects of introducing heterogeneity to these ensembles of domain expert models. Specifically, by allowing models within the ensemble to vary in size–as well as the number of training steps taken depending on the training data’s domain–we study the effect heterogeneity has on these ensembles when evaluated against domains included in, and excluded from, the training set. We use the same compute budget to train heterogeneous ensembles and homogeneous baselines for comparison. We show that the heterogeneous ensembles achieve the lowest perplexity scores in 20 out of the 21 data domains used in the evaluation. Our code is available at this https URL.
[LG-4] dCMF: Learning interpretable evolving patterns from temporal multiway data
Link: https://arxiv.org/abs/2502.19367
Authors: Christos Chatzis, Carla Schenker, Jérémy E. Cohen, Evrim Acar
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Click to view abstract
Abstract:Multiway datasets are commonly analyzed using unsupervised matrix and tensor factorization methods to reveal underlying patterns. Frequently, such datasets include timestamps and could correspond to, for example, health-related measurements of subjects collected over time. The temporal dimension is inherently different from the other dimensions, requiring methods that account for its intrinsic properties. Linear Dynamical Systems (LDS) are specifically designed to capture sequential dependencies in the observed data. In this work, we bridge the gap between tensor factorizations and dynamical modeling by exploring the relationship between LDS, Coupled Matrix Factorizations (CMF) and the PARAFAC2 model. We propose a time-aware coupled factorization model called d(ynamical)CMF that constrains the temporal evolution of the latent factors to adhere to a specific LDS structure. Using synthetic datasets, we compare the performance of dCMF with PARAFAC2 and t(emporal)PARAFAC2 which incorporates temporal smoothness. Our results show that dCMF and PARAFAC2-based approaches perform similarly when capturing smoothly evolving patterns that adhere to the PARAFAC2 structure. However, dCMF outperforms alternatives when the patterns evolve smoothly but deviate from the PARAFAC2 structure. Furthermore, we demonstrate that the proposed dCMF method enables capturing more complex dynamics when additional prior information about the temporal evolution is incorporated.
[LG-5] Deep Learning For Time Series Analysis With Application On Human Motion
Link: https://arxiv.org/abs/2502.19364
Authors: Ali Ismail-Fawaz
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Time series data, defined by equally spaced points over time, is essential in fields like medicine, telecommunications, and energy. Analyzing it involves tasks such as classification, clustering, prototyping, and regression. Classification identifies normal vs. abnormal movements in skeleton-based motion sequences, clustering detects stock market behavior patterns, prototyping expands physical therapy datasets, and regression predicts patient recovery. Deep learning has recently gained traction in time series analysis due to its success in other domains. This thesis leverages deep learning to enhance classification with feature engineering, introduce foundation models, and develop a compact yet state-of-the-art architecture. We also address limited labeled data with self-supervised learning. Our contributions apply to real-world tasks, including human motion analysis for action recognition and rehabilitation. We introduce a generative model for human motion data, valuable for cinematic production and gaming. For prototyping, we propose a shape-based synthetic sample generation method to support regression models when data is scarce. Lastly, we critically evaluate discriminative and generative models, identifying limitations in current methodologies and advocating for a robust, standardized evaluation framework. Our experiments on public datasets provide novel insights and methodologies, advancing time series analysis with practical applications.
[LG-6] Recurrent Auto-Encoders for Enhanced Deep Reinforcement Learning in Wilderness Search and Rescue Planning
Link: https://arxiv.org/abs/2502.19356
Authors: Jan-Hendrik Ewers, David Anderson, Douglas Thomson
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: Submitted to Machine Learning with Applications
Click to view abstract
Abstract:Wilderness search and rescue operations are often carried out over vast landscapes. The search efforts, however, must be undertaken in minimum time to maximize the chance of survival of the victim. Whilst the advent of cheap multicopters in recent years has changed the way search operations are handled, it has not solved the challenges of the massive areas at hand. The problem therefore is not one of complete coverage, but one of maximizing the information gathered in the limited time available. In this work we propose that a combination of a recurrent autoencoder and deep reinforcement learning is a more efficient solution to the search problem than previous pure deep reinforcement learning or optimisation approaches. The autoencoder training paradigm efficiently maximizes the information throughput of the encoder into its latent space representation which deep reinforcement learning is primed to leverage. Without the overhead of independently solving the problem that the recurrent autoencoder is designed for, it is more efficient in learning the control task. We further implement three additional architectures for a comprehensive comparison of the main proposed architecture. Similarly, we apply both soft actor-critic and proximal policy optimisation to provide an insight into the performance of both in a highly non-linear and complex application with a large observation space. Results show that the proposed architecture is vastly superior to the benchmarks, with soft actor-critic achieving the best performance. This model further outperformed work from the literature whilst having below a fifth of the total learnable parameters and training in a quarter of the time.
[LG-7] CryptoPulse: Short-Term Cryptocurrency Forecasting with Dual-Prediction and Cross-Correlated Market Indicators
Link: https://arxiv.org/abs/2502.19349
Authors: Amit Kumar, Taoran Ji
Subjects: Machine Learning (cs.LG); Pricing of Securities (q-fin.PR)
*Comments: 10
Click to view abstract
Abstract:Cryptocurrencies fluctuate in markets with high price volatility, posing significant challenges for investors. To aid in informed decision-making, systems predicting cryptocurrency market movements have been developed, typically focusing on historical patterns. However, these methods often overlook three critical factors influencing market dynamics: 1) the macro investing environment, reflected in major cryptocurrency fluctuations affecting collaborative investor behaviors; 2) overall market sentiment, heavily influenced by news impacting investor strategies; and 3) technical indicators, offering insights into overbought or oversold conditions, momentum, and market trends, which are crucial for short-term price movements. This paper proposes a dual prediction mechanism that forecasts the next day’s closing price by incorporating macroeconomic fluctuations, technical indicators, and individual cryptocurrency price changes. Additionally, a novel refinement mechanism enhances predictions through market sentiment-based rescaling and fusion. Experiments demonstrate that the proposed model achieves state-of-the-art performance, consistently outperforming ten comparison methods.
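Reading the abstract, the dual-prediction-plus-refinement pipeline can be pictured as two forecasts fused and then rescaled by a sentiment signal. The deliberately simplified sketch below is under our own assumptions; the fusion weight and the rescaling rule are illustrative, not the paper's.

```python
# Hedged sketch of the dual-prediction-plus-refinement idea: a macro-driven
# forecast and a coin-specific forecast are fused, then the result is rescaled
# by a market sentiment score. Weights and the rescaling rule are assumptions.

def fuse_and_refine(macro_pred: float, coin_pred: float,
                    sentiment: float, w: float = 0.5) -> float:
    fused = w * macro_pred + (1 - w) * coin_pred
    # sentiment in [-1, 1]: bullish news nudges the forecast up, bearish down.
    return fused * (1.0 + 0.05 * sentiment)

# Toy usage: two next-day close forecasts plus mildly bullish sentiment.
print(fuse_and_refine(macro_pred=101.0, coin_pred=99.0, sentiment=0.4))
```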
[LG-8] I Know What I Don't Know: Improving Model Cascades Through Confidence Tuning
Link: https://arxiv.org/abs/2502.19335
Authors: Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
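A hedged sketch of the confidence-tuning idea follows: keep the standard task loss, and add a term that pushes the small model's confidence down on examples it gets wrong so that they are deferred to the large model. The penalty form and the trade-off weight `alpha` are our assumptions, not the paper's exact Gatekeeper loss.

```python
# Hedged sketch of a cascade confidence-tuning objective: cross-entropy on all
# examples, plus a term penalizing high confidence on wrong predictions so
# those examples end up deferred. The exact form and `alpha` are illustrative.
import torch
import torch.nn.functional as F

def cascade_confidence_loss(logits, targets, alpha=0.5):
    ce = F.cross_entropy(logits, targets, reduction="none")
    probs = F.softmax(logits, dim=-1)
    conf, preds = probs.max(dim=-1)
    wrong = (preds != targets).float()
    # Push confidence down exactly where the small model is wrong.
    defer_term = wrong * conf
    return (ce + alpha * defer_term).mean()

logits = torch.randn(8, 10, requires_grad=True)   # small model outputs
targets = torch.randint(0, 10, (8,))
loss = cascade_confidence_loss(logits, targets)
loss.backward()
print(float(loss))
```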
[LG-9] Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond
Link: https://arxiv.org/abs/2502.19301
Authors: Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, Kilian Q. Weinberger
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Large language models (LLMs) should undergo rigorous audits to identify potential risks, such as copyright and privacy infringements. Once these risks emerge, timely updates are crucial to remove undesirable responses, ensuring legal and safe model usage. It has spurred recent research into LLM unlearning, focusing on erasing targeted undesirable knowledge without compromising the integrity of other, non-targeted responses. Existing studies have introduced various unlearning objectives to pursue LLM unlearning without necessitating complete retraining. However, each of these objectives has unique properties, and no unified framework is currently available to comprehend them thoroughly. To fill the gap, we propose a toolkit of the gradient effect (G-effect), quantifying the impacts of unlearning objectives on model performance from a gradient perspective. A notable advantage is its broad ability to detail the unlearning impacts from various aspects across instances, updating steps, and LLM layers. Accordingly, the G-effect offers new insights into identifying drawbacks of existing unlearning objectives, further motivating us to explore a series of new solutions for their mitigation and improvements. Finally, we outline promising directions that merit further studies, aiming at contributing to the community to advance this important field.
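The gradient perspective can be made concrete with inner products: measure how the gradient of an unlearning objective aligns with the gradients of the forget and retain losses. The sketch below is our own simplification of that idea, not the paper's exact G-effect estimator.

```python
# Hedged sketch of a gradient-perspective diagnostic for unlearning: compare
# the unlearning objective's gradient against the forget- and retain-loss
# gradients. Plain inner products stand in for the paper's G-effect toolkit.
import torch

model = torch.nn.Linear(16, 4)

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def loss_on(x, y):
    return torch.nn.functional.cross_entropy(model(x), y)

x_forget, y_forget = torch.randn(32, 16), torch.randint(0, 4, (32,))
x_retain, y_retain = torch.randn(32, 16), torch.randint(0, 4, (32,))

g_unlearn = flat_grad(-loss_on(x_forget, y_forget))  # gradient-ascent objective
g_forget = flat_grad(loss_on(x_forget, y_forget))
g_retain = flat_grad(loss_on(x_retain, y_retain))

# Negative alignment with the forget gradient is desired; strong negative
# alignment with the retain gradient signals collateral damage.
print("forget effect:", torch.dot(g_unlearn, g_forget).item())
print("retain effect:", torch.dot(g_unlearn, g_retain).item())
```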
[LG-10] Global Graph Propagation with Hierarchical Information Transfer for Incomplete Contrastive Multi-view Clustering
Link: https://arxiv.org/abs/2502.19291
Authors: Guoqing Chao, Kaixin Xu, Xijiong Xie, Yongyong Chen
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Incomplete multi-view clustering has become one of the important research problems due to the extensive missing multi-view data in the real world. Although the existing methods have made great progress, there are still some problems: 1) most methods cannot effectively mine the information hidden in the missing data; 2) most methods typically divide representation learning and clustering into two separate stages, but this may affect the clustering performance as the clustering results directly depend on the learned representation. To address these problems, we propose a novel incomplete multi-view clustering method with hierarchical information transfer. Firstly, we design the view-specific Graph Convolutional Networks (GCN) to obtain the representation encoding the graph structure, which is then fused into the consensus representation. Secondly, considering that one layer of GCN transfers one-order neighbor node information, the global graph propagation with the consensus representation is proposed to handle the missing data and learn deep representation. Finally, we design a weight-sharing pseudo-classifier with contrastive learning to obtain an end-to-end framework that combines view-specific representation learning, global graph propagation with hierarchical information transfer, and contrastive clustering for joint optimization. Extensive experiments conducted on several commonly-used datasets demonstrate the effectiveness and superiority of our method in comparison with other state-of-the-art approaches. The code is available at this https URL.
[LG-11] Efficient Federated Search for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2502.19280
Authors: Rachid Guerraoui, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
*Comments: To appear in the proceedings of EuroMLSys'25
Click to view abstract
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various domains but remain susceptible to hallucinations and inconsistencies, limiting their reliability. Retrieval-augmented generation (RAG) mitigates these issues by grounding model responses in external knowledge sources. Existing RAG workflows often leverage a single vector database, which is impractical in the common setting where information is distributed across multiple repositories. We introduce RAGRoute, a novel mechanism for federated RAG search. RAGRoute dynamically selects relevant data sources at query time using a lightweight neural network classifier. By not querying every data source, this approach significantly reduces query overhead, improves retrieval efficiency, and minimizes the retrieval of irrelevant information. We evaluate RAGRoute using the MIRAGE and MMLU benchmarks and demonstrate its effectiveness in retrieving relevant documents while reducing the number of queries. RAGRoute reduces the total number of queries up to 77.5% and communication volume up to 76.2%.
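A hedged sketch of the routing step: a lightweight classifier scores each data source for a given query embedding, and only sources above a threshold are queried. The architecture, dimensions, and threshold below are illustrative assumptions rather than RAGRoute's actual configuration.

```python
# Hedged sketch of federated RAG routing: a small classifier emits one
# relevance logit per data source, and only sources above a threshold are
# queried, avoiding needless round-trips. Sizes/threshold are illustrative.
import torch
import torch.nn as nn

NUM_SOURCES, EMB_DIM = 5, 128

router = nn.Sequential(
    nn.Linear(EMB_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_SOURCES),  # one relevance logit per source
)

def select_sources(query_emb: torch.Tensor, threshold: float = 0.5):
    scores = torch.sigmoid(router(query_emb))
    return [i for i, s in enumerate(scores.tolist()) if s >= threshold]

query_emb = torch.randn(EMB_DIM)  # would come from the RAG embedding model
picked = select_sources(query_emb)
print(f"querying {len(picked)}/{NUM_SOURCES} sources:", picked)
```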
[LG-12] Set and functional prediction: randomness exchangeability and conformal
Link: https://arxiv.org/abs/2502.19254
Authors: Vladimir Vovk
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments: 15 pages
Click to view abstract
Abstract:This paper continues the study of the efficiency of conformal prediction as compared with more general randomness prediction and exchangeability prediction. It does not restrict itself to the case of classification, and our results will also be applicable to the case of regression. The price to pay is that efficiency will be attained only on average, albeit with respect to a wide range of probability measures on the label space.
[LG-13] INFO-SEDD: Continuous Time Markov Chains as Scalable Information Metrics Estimators
Link: https://arxiv.org/abs/2502.19183
Authors: Alberto Foresti, Giulio Franzese, Pietro Michiardi
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Information-theoretic quantities play a crucial role in understanding non-linear relationships between random variables and are widely used across scientific disciplines. However, estimating these quantities remains an open problem, particularly in the case of high-dimensional discrete distributions. Current approaches typically rely on embedding discrete data into a continuous space and applying neural estimators originally designed for continuous distributions, a process that may not fully capture the discrete nature of the underlying data. We consider Continuous-Time Markov Chains (CTMCs), stochastic processes on discrete state-spaces which have gained popularity due to their generative modeling applications. In this work, we introduce INFO-SEDD, a novel method for estimating information-theoretic quantities of discrete data, including mutual information and entropy. Our approach requires the training of a single parametric model, offering significant computational and memory advantages. Additionally, it seamlessly integrates with pretrained networks, allowing for efficient reuse of pretrained generative models. To evaluate our approach, we construct a challenging synthetic benchmark. Our experiments demonstrate that INFO-SEDD is robust and outperforms neural competitors that rely on embedding techniques. Moreover, we validate our method on a real-world task: estimating the entropy of an Ising model. Overall, INFO-SEDD outperforms competing methods and shows scalability to high-dimensional scenarios, paving the way for new applications where estimating MI between discrete distributions is the focus. The promising results in this complex, high-dimensional scenario highlight INFO-SEDD as a powerful new estimator in the toolkit for information-theoretical analysis.
[LG-14] A Model-Centric Review of Deep Learning for Protein Design
Link: https://arxiv.org/abs/2502.19173
Authors: Gregory W. Kyro, Tianyin Qiu, Victor S. Batista
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments:
Click to view abstract
Abstract:Deep learning has transformed protein design, enabling accurate structure prediction, sequence optimization, and de novo protein generation. Advances in single-chain protein structure prediction via AlphaFold2, RoseTTAFold, ESMFold, and others have achieved near-experimental accuracy, inspiring successive work extended to biomolecular complexes via AlphaFold Multimer, RoseTTAFold All-Atom, AlphaFold 3, Chai-1, Boltz-1 and others. Generative models such as ProtGPT2, ProteinMPNN, and RFdiffusion have enabled sequence and backbone design beyond natural evolution-based limitations. More recently, joint sequence-structure co-design models, including ESM3, have integrated both modalities into a unified framework, resulting in improved designability. Despite these advances, challenges still exist pertaining to modeling sequence-structure-function relationships and ensuring robust generalization beyond the regions of protein space spanned by the training data. Future advances will likely focus on joint sequence-structure-function co-design frameworks that are able to model the fitness landscape more effectively than models that treat these modalities independently. Current capabilities, coupled with the dizzying rate of progress, suggest that the field will soon enable rapid, rational design of proteins with tailored structures and functions that transcend the limitations imposed by natural evolution. In this review, we discuss the current capabilities of deep learning methods for protein design, focusing on some of the most revolutionary and capable models with respect to their functionality and the applications that they enable, leading up to the current challenges of the field and the optimal path forward.
[LG-15] On the Byzantine Fault Tolerance of signSGD with Majority Vote
Link: https://arxiv.org/abs/2502.19170
Authors: Emanuele Mengoli, Luzius Moll, Virgilio Strozzi, El-Mahdi El-Mhamdi
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:In distributed learning, sign-based compression algorithms such as signSGD with majority vote provide a lightweight alternative to SGD with an additional advantage: fault tolerance (almost) for free. However, for signSGD with majority vote, this fault tolerance has been shown to cover only the case of weaker adversaries, i.e., ones that are not omniscient or cannot collude to base their attack on common knowledge and strategy. In this work, we close this gap and provide new insights into how signSGD with majority vote can be resilient against omniscient and colluding adversaries, which craft an attack after communicating with other adversaries, thus having better information to perform the most damaging attack based on a common optimal strategy. Our core contribution is in providing a proof that begins by defining the omniscience framework and the strongest possible damage against signSGD with majority vote without imposing any restrictions on the attacker. Thanks to the filtering effect of the sign-based method, we upper-bound the space of attacks to the optimal strategy for maximizing damage by an attacker. Hence, we derive an explicit probabilistic bound in terms of incorrect aggregation without resorting to unknown constants, providing a convergence bound on signSGD with majority vote in the presence of Byzantine attackers, along with a precise convergence rate. Our findings are supported by experiments on the MNIST dataset in a distributed learning environment with adversaries of varying strength.
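The aggregation rule itself is simple: each worker transmits only the sign of its gradient and the server takes an element-wise majority vote. A minimal sketch with one toy (non-colluding) Byzantine worker that flips its signs:

```python
# Minimal sketch of signSGD with majority vote: workers send only gradient
# signs; the server aggregates by element-wise majority. One worker flips its
# signs to play a toy (non-colluding) Byzantine adversary.
import numpy as np

rng = np.random.default_rng(0)
true_grad = rng.normal(size=10)

honest = [np.sign(true_grad + rng.normal(scale=0.5, size=10)) for _ in range(6)]
byzantine = [-np.sign(true_grad)]            # adversary sends flipped signs

votes = np.stack(honest + byzantine)         # shape: (workers, dims)
aggregate = np.sign(votes.sum(axis=0))       # element-wise majority vote

agreement = float((aggregate == np.sign(true_grad)).mean())
print("fraction of coordinates with correct sign:", agreement)
```

The paper's contribution is the analysis of the harder regime, where adversaries are omniscient and coordinate on a common optimal strategy rather than acting independently as above.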
[LG-16] Generalizable deep learning for photoplethysmography-based blood pressure estimation – A Benchmarking Study ALT
Link: https://arxiv.org/abs/2502.19167
Authors: Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 20 pages, 5 figures, code available at this https URL
Click to view abstract
Abstract:Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of the generalizability of these models to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution MAEs of 9.4 and 6.0 mmHg for systolic and diastolic BP respectively on PulseDB (with subject-specific calibration), and 14.0 and 8.5 mmHg respectively without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that the performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to educate more researchers about the importance and challenges of out-of-distribution generalization.
[LG-17] CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation
Link: https://arxiv.org/abs/2502.19166
Authors: Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Jingyi Xu, Yaonan Gu, Zhoujun Li
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated testing, but also augments developer efficiency through improved maintainability and reusability of code. In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. CodeIF encompasses a broad range of tasks, including function synthesis, error debugging, algorithmic refactoring, and code explanation, thereby providing a comprehensive suite to evaluate model performance across varying complexity levels and programming domains. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks. The experimental results offer valuable insights into how well current models align with human instructions, as well as the extent to which they can generate consistent, maintainable, and contextually relevant code. Our findings not only underscore the critical role that instruction-following LLMs can play in modern software development, but also illuminate pathways for future research aimed at enhancing their adaptability, reliability, and overall effectiveness in automated code generation.
[LG-18] Design of Cavity Backed Slotted Antenna using Machine Learning Regression Model
Link: https://arxiv.org/abs/2502.19164
Authors: Vijay Kumar Sutrakar, Anjana PK, Rohit Bisariya, Soumya KK, Gopal Chawan M
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Applied Physics (physics.app-ph)
*Comments:
Click to view abstract
Abstract:In this paper, a regression-based machine learning model is used for the design of a cavity backed slotted antenna. This type of antenna is commonly used in military and aviation communication systems. Initial reflection coefficient data of the cavity backed slotted antenna are generated using an electromagnetic solver. These reflection coefficient data are then used as input for training the regression-based machine learning model. The model is trained to predict the dimensions of the cavity backed slotted antenna based on the input reflection coefficient for a wide frequency band varying from 1 GHz to 8 GHz. This approach allows for rapid prediction of optimal antenna configurations, reducing the need for repeated physical testing and manual adjustments, which may lead to significant design and development cost savings. The proposed model also demonstrates its versatility in predicting multi-frequency resonance across 1 GHz to 8 GHz. Furthermore, the proposed approach demonstrates the potential for leveraging machine learning in advanced antenna design, enhancing efficiency and accuracy in practical applications such as radar, military identification systems and secure communication networks.
[LG-19] Design of Resistive Frequency Selective Surface based Radar Absorbing Structure-A Deep Learning Approach
Link: https://arxiv.org/abs/2502.19151
Authors: Vijay Kumar Sutrakar, Nikhil Morge, Anjana PK, Abhilash PV
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph)
*Comments:
Click to view abstract
Abstract:In this paper, a deep learning-based approach for the design of a radar absorbing structure using a resistive frequency selective surface is proposed. In the present design, the reflection coefficient is used as the input of the deep learning model and the Jerusalem cross based unit cell dimensions are predicted as the outcome. A sequential neural network based deep learning model with the adaptive moment estimation optimizer is used for designing multi frequency band absorbers. The model is used for designing radar absorbers from L to Ka band depending on unit cell parameters and thickness. The outcome of the deep learning model is further compared with full-wave simulation software and an excellent match is obtained. The proposed model can be used for the low-cost design of various radar absorbing structures using a single unit cell and thickness across the band of frequencies.
[LG-20] Random Similarity Isolation Forests
Link: https://arxiv.org/abs/2502.19122
Authors: Sebastian Chwilczyński, Dariusz Brzezinski
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:With predictive models becoming prevalent, companies are expanding the types of data they gather. As a result, the collected datasets consist not only of simple numerical features but also more complex objects such as time series, images, or graphs. Such multi-modal data have the potential to improve performance in predictive tasks like outlier detection, where the goal is to identify objects deviating from the main data distribution. However, current outlier detection algorithms are dedicated to individual types of data. Consequently, working with mixed types of data requires either fusing multiple data-specific models or transforming all of the representations into a single format, both of which can hinder predictive performance. In this paper, we propose a multi-modal outlier detection algorithm called Random Similarity Isolation Forest. Our method combines the notions of isolation and similarity-based projection to handle datasets with mixtures of features of arbitrary data types. Experiments performed on 47 benchmark datasets demonstrate that Random Similarity Isolation Forest outperforms five state-of-the-art competitors. Our study shows that the use of multiple modalities can indeed improve the detection of anomalies and highlights the need for new outlier detection benchmarks tailored for multi-modal algorithms.
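A hedged sketch of a single similarity-based isolation split in the spirit of the abstract: project every object onto a 1-D axis defined by its similarities to two randomly chosen reference objects, then split at a random threshold. Because only a similarity function is needed, mixed data types can share one model; the kernel choice and split details here are our assumptions.

```python
# Hedged sketch of one split in a similarity-projection isolation tree: the
# 1-D projection asks "how much closer is each object to reference j than to
# reference i?", then a random threshold isolates. Any similarity function
# works, which is what lets mixed data types share one model.
import numpy as np

rng = np.random.default_rng(42)

def rbf_similarity(a, b):
    return np.exp(-np.linalg.norm(a - b) ** 2)

def similarity_split(objects, sim=rbf_similarity):
    i, j = rng.choice(len(objects), size=2, replace=False)
    proj = np.array([sim(o, objects[j]) - sim(o, objects[i]) for o in objects])
    threshold = rng.uniform(proj.min(), proj.max())
    left = [k for k, p in enumerate(proj) if p < threshold]
    right = [k for k, p in enumerate(proj) if p >= threshold]
    return left, right

data = [rng.normal(size=4) for _ in range(20)]
left, right = similarity_split(data)
print(len(left), len(right))  # anomalies tend to be isolated in few splits
```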
[LG-21] Software demodulation of weak radio signals using convolutional neural network
Link: https://arxiv.org/abs/2502.19097
Authors: Mykola Kozlenko, Ihor Lazarovych, Valerii Tkachuk, Vira Vialkova
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 4 pages, 8 figures. This is the preprint version of the paper published in 2020 IEEE 7th International Conference on Energy Smart Systems (ESS). The final version is available at IEEE Xplore: this https URL . arXiv admin note: text overlap with arXiv:2502.16371
Click to view abstract
Abstract:In this paper we proposed the use of the JT65A radio communication protocol for data exchange in wide-area monitoring systems in electric power systems. We investigated the software demodulation of the multiple frequency shift keying weak signals transmitted with the JT65A communication protocol using a deep convolutional neural network. We presented the demodulation performance in the form of symbol and bit error rates. We focused on the interference immunity of the protocol over an additive white Gaussian noise with average signal-to-noise ratios in the range from -30 dB to 0 dB, which was obtained for the first time. We proved that the interference immunity is about 1.5 dB less than the theoretical limit of non-coherent demodulation of orthogonal MFSK signals.
[LG-22] MCLRL: A Multi-Domain Contrastive Learning with Reinforcement Learning Framework for Few-Shot Modulation Recognition
Link: https://arxiv.org/abs/2502.19071
Authors: Dongwei Xu, Yutao Zhu, Yao Lu, Youpeng Feng, Yun Lin, Qi Xuan
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:With the rapid advancements in wireless communication technology, automatic modulation recognition (AMR) plays a critical role in ensuring communication security and reliability. However, numerous challenges, including higher performance demands, difficulty in data acquisition under specific scenarios, limited sample size, and low-quality labeled data, hinder its development. Few-shot learning (FSL) offers an effective solution by enabling models to achieve satisfactory performance with only a limited number of labeled samples. While most FSL techniques are applied in the field of computer vision, they are not directly applicable to wireless signal processing. This study does not propose a new FSL-specific signal model but introduces a framework called MCLRL. This framework combines multi-domain contrastive learning with reinforcement learning. Multi-domain representations of signals enhance feature richness, while integrating contrastive learning and reinforcement learning architectures enables the extraction of deep features for classification. In downstream tasks, the model achieves excellent performance using only a few samples and minimal training cycles. Experimental results show that the MCLRL framework effectively extracts key features from signals, performs well in FSL tasks, and maintains flexibility in signal model selection.
[LG-23] A Sample-Level Evaluation and Generative Framework for Model Inversion Attacks AAAI-25 AAAI
Link: https://arxiv.org/abs/2502.19070
Authors: Haoyang Li, Li Bai, Qingqing Ye, Haibo Hu, Yaxin Xiao, Huadi Zheng, Jianliang Xu
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: Accepted to appear in the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25)
Click to view abstract
Abstract:Model Inversion (MI) attacks, which reconstruct the training dataset of neural networks, pose significant privacy concerns in machine learning. Recent MI attacks have managed to reconstruct realistic label-level private data, such as the general appearance of a target person from all training images labeled on him. Beyond label-level privacy, in this paper we show sample-level privacy, the private information of a single target sample, is also important but under-explored in the MI literature due to the limitations of existing evaluation metrics. To address this gap, this study introduces a novel metric tailored for training-sample analysis, namely, the Diversity and Distance Composite Score (DDCS), which evaluates the reconstruction fidelity of each training sample by encompassing various MI attack attributes. This, in turn, enhances the precision of sample-level privacy assessments. Leveraging DDCS as a new evaluative lens, we observe that many training samples remain resilient against even the most advanced MI attack. As such, we further propose a transfer learning framework that augments the generative capabilities of MI attackers through the integration of entropy loss and natural gradient descent. Extensive experiments verify the effectiveness of our framework on improving state-of-the-art MI attacks over various metrics including DDCS, coverage and FID. Finally, we demonstrate that DDCS can also be useful for MI defense, by identifying samples susceptible to MI attacks in an unsupervised manner.
[LG-24] Fatigue-PINN: Physics-Informed Fatigue-Driven Motion Modulation and Synthesis
Link: https://arxiv.org/abs/2502.19056
Authors: Iliana Loi, Konstantinos Moustakas
Subjects: Graphics (cs.GR); Machine Learning (cs.LG)
*Comments: 13 pages, 9 pages. This work has been submitted to the IEEE for possible publication
Click to view abstract
Abstract:Fatigue modeling is essential for motion synthesis tasks to model human motions under fatigued conditions and biomechanical engineering applications, such as investigating the variations in movement patterns and posture due to fatigue, defining injury risk mitigation and prevention strategies, formulating fatigue minimization schemes and creating improved ergonomic designs. Nevertheless, employing data-driven methods for synthesizing the impact of fatigue on motion, receives little to no attention in the literature. In this work, we present Fatigue-PINN, a deep learning framework based on Physics-Informed Neural Networks, for modeling fatigued human movements, while providing joint-specific fatigue configurations for adaptation and mitigation of motion artifacts on a joint level, resulting in more realistic animations. To account for muscle fatigue, we simulate the fatigue-induced fluctuations in the maximum exerted joint torques by leveraging a PINN adaptation of the Three-Compartment Controller model to exploit physics-domain knowledge for improving accuracy. This model also introduces parametric motion alignment with respect to joint-specific fatigue, hence avoiding sharp frame transitions. Our results indicate that Fatigue-PINN accurately simulates the effects of externally perceived fatigue on open-type human movements being consistent with findings from real-world experimental fatigue studies. Since fatigue is incorporated in torque space, Fatigue-PINN provides an end-to-end encoder-decoder-like architecture, to ensure transforming joint angles to joint torques and vice-versa, thus, being compatible with motion synthesis frameworks operating on joint angles.
[LG-25] Foundation Inference Models for Stochastic Differential Equations: A Transformer-based Approach for Zero-shot Function Estimation
Link: https://arxiv.org/abs/2502.19049
Authors: Patrick Seifner, Kostadin Cvejoski, David Berghaus, Cesar Ojeda, Ramses J. Sanchez
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Stochastic differential equations (SDEs) describe dynamical systems where deterministic flows, governed by a drift function, are superimposed with random fluctuations dictated by a diffusion function. The accurate estimation (or discovery) of these functions from data is a central problem in machine learning, with wide application across natural and social sciences alike. Yet current solutions are brittle, and typically rely on symbolic regression or Bayesian non-parametrics. In this work, we introduce FIM-SDE (Foundation Inference Model for SDEs), a transformer-based recognition model capable of performing accurate zero-shot estimation of the drift and diffusion functions of SDEs, from noisy and sparse observations on empirical processes of different dimensionalities. Leveraging concepts from amortized inference and neural operators, we train FIM-SDE in a supervised fashion, to map a large set of noisy and discretely observed SDE paths to their corresponding drift and diffusion functions. We demonstrate that one and the same (pretrained) FIM-SDE achieves robust zero-shot function estimation (i.e. without any parameter fine-tuning) across a wide range of synthetic and real-world processes, from canonical SDE systems (e.g. double-well dynamics or weakly perturbed Hopf bifurcations) to human motion recordings and oil price and wind speed fluctuations.
[LG-26] A HEART for the environment: Transformer-Based Spatiotemporal Modeling for Air Quality Prediction
Link: https://arxiv.org/abs/2502.19042
Authors: Norbert Bodendorfer
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Accurate and reliable air pollution forecasting is crucial for effective environmental management and policy-making. llull-environment is a sophisticated and scalable forecasting system for air pollution, inspired by previous models currently operational in Madrid and Valladolid (Spain). It contains (among other key components) an encoder-decoder convolutional neural network to forecast mean pollution levels for four key pollutants (NO_2, O_3, PM_10, PM_2.5) using historical data, external forecasts, and other contextual features. This paper investigates the augmentation of this neural network with an attention mechanism to improve predictive accuracy. The proposed attention mechanism pre-processes tensors containing the input features before passing them to the existing mean forecasting model. The resulting model is a combination of several architectures and ideas and can be described as a “Hybrid Enhanced Autoregressive Transformer”, or HEART. The effectiveness of the approach is evaluated by comparing the mean square error (MSE) across different attention layouts against the system without such a mechanism. We observe a significant reduction in MSE of up to 22%, with an average of 7.5% across tested cities and pollutants. The performance of a given attention mechanism turns out to depend on the pollutant, highlighting the differences in their creation and dissipation processes. Our findings are not restricted to optimizing air quality prediction models, but are applicable generally to (fixed length) time series forecasting.
[LG-27] Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits
Link: https://arxiv.org/abs/2502.19006
Authors: Shogo Iwazaki
Subjects: Machine Learning (cs.LG)
*Comments: 14 pages
Click to view abstract
Abstract:We study the noise-free Gaussian Process (GP) bandits problem, in which the learner seeks to minimize regret through noise-free observations of the black-box objective function lying on the known reproducing kernel Hilbert space (RKHS). Gaussian process upper confidence bound (GP-UCB) is the well-known GP-bandits algorithm whose query points are adaptively chosen based on the GP-based upper confidence bound score. Although several existing works have reported the practical success of GP-UCB, the current theoretical results indicate its suboptimal performance. However, GP-UCB tends to perform well empirically compared with other nearly optimal noise-free algorithms that rely on a non-adaptive sampling scheme of query points. This paper resolves this gap between theoretical and empirical performance by showing the nearly optimal regret upper bound of noise-free GP-UCB. Specifically, our analysis shows the first constant cumulative regret in the noise-free settings for the squared exponential kernel and Matérn kernel with some degree of smoothness.
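The GP-UCB rule analyzed in the paper is easy to state: query the point maximizing the posterior mean plus a scaled posterior standard deviation. Below is a minimal noise-free sketch over a candidate grid; the fixed `beta` and the grid are illustrative choices, not the paper's theoretical schedule.

```python
# Minimal GP-UCB sketch for the noise-free setting: fit a GP to exact
# observations and query the candidate maximizing mean + beta * std.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x: np.sin(3 * x) * x           # black-box objective (noise-free)
candidates = np.linspace(0, 2, 200).reshape(-1, 1)

X = np.array([[0.3], [1.5]])              # initial noise-free observations
y = f(X).ravel()

# Tiny alpha = numerical jitter only, since observations are noise-free;
# optimizer=None keeps the illustrative kernel fixed.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3),
                              alpha=1e-8, optimizer=None)
for t in range(10):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    beta = 2.0                             # confidence width (illustrative)
    x_next = candidates[np.argmax(mu + beta * sigma)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))

print("best query found:", X[np.argmax(y)].item(), "value:", y.max())
```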
[LG-28] Long-term Causal Inference via Modeling Sequential Latent Confounding
Link: https://arxiv.org/abs/2502.18994
Authors: Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, José Miguel Hernández-Lobato
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Long-term causal inference is an important but challenging problem across various scientific domains. To solve the latent confounding problem in long-term observational studies, existing methods leverage short-term experimental data. Ghassami et al. propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption, which asserts that the confounding bias in the short-term outcome is equal to that in the long-term outcome, so that the long-term confounding bias and the causal effects can be identified. While effective in certain cases, this assumption is limited to scenarios with a one-dimensional short-term outcome. In this paper, we introduce a novel assumption that extends the CAECB assumption to accommodate temporal short-term outcomes. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes, under which we theoretically establish the identification of long-term causal effects. Based on the identification result, we develop an estimator and conduct a theoretical analysis of its asymptotic properties. Extensive experiments validate our theoretical results and demonstrate the effectiveness of the proposed method.
[LG-29] Evaluating Membership Inference Attacks in heterogeneous-data setups
Link: https://arxiv.org/abs/2502.18986
Authors: Bram van Dartel, Marc Damie, Florian Hahn
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: Accepted in SiMLA workshop 2025 (co-located with ACNS)
Click to view abstract
Abstract:Among all privacy attacks against Machine Learning (ML), membership inference attacks (MIA) attracted the most attention. In these attacks, the attacker is given an ML model and a data point, and they must infer whether the data point was used for training. The attacker also has an auxiliary dataset to tune their inference algorithm. Attack papers commonly simulate setups in which the attacker’s and the target’s datasets are sampled from the same distribution. This setting is convenient to perform experiments, but it rarely holds in practice. ML literature commonly starts with similar simplifying assumptions (i.e., “i.i.d.” datasets), and later generalizes the results to support heterogeneous data distributions. Similarly, our work makes a first step in the generalization of the MIA evaluation to heterogeneous data. First, we design a metric to measure the heterogeneity between any pair of tabular data distributions. This metric provides a continuous scale to analyze the phenomenon. Second, we compare two methodologies to simulate a data heterogeneity between the target and the attacker. These setups provide opposite performances: 90% attack accuracy vs. 50% (i.e., random guessing). Our results show that the MIA accuracy depends on the experimental setup; and even if research on MIA considers heterogeneous data setups, we have no standardized baseline of how to simulate it. The lack of such a baseline for MIA experiments poses a significant challenge to risk assessments in real-world machine learning scenarios.
[LG-30] Invariance Pair-Guided Learning: Enhancing Robustness in Neural Networks
Link: https://arxiv.org/abs/2502.18975
Authors: Martin Surner, Abdelmajid Khelil, Ludwig Bothmann
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Out-of-distribution generalization of machine learning models remains challenging since the models are inherently bound to the training data distribution. This especially manifests when the learned models rely on spurious correlations. Most of the existing approaches apply data manipulation, representation learning, or learning strategies to achieve generalizable models. Unfortunately, these approaches usually require multiple training domains, group labels, specialized augmentation, or pre-processing to reach generalizable models. We propose a novel approach that addresses these limitations by providing a technique to guide the neural network through the training phase. We first establish input pairs, representing the spurious attribute and describing the invariance, a characteristic that should not affect the outcome of the model. Based on these pairs, we form a corrective gradient complementing the traditional gradient descent approach. We further make this correction mechanism adaptive based on a predefined invariance condition. Experiments on ColoredMNIST, Waterbird-100, and CelebA datasets demonstrate the effectiveness of our approach and the robustness to group shifts.
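One way to read the method: each invariance pair differs only in the spurious attribute, and a consistency penalty on the pair supplies the corrective gradient alongside the task loss. The penalty form and weight below are our assumptions for illustration, not the paper's exact adaptive correction.

```python
# Hedged sketch of invariance-pair-guided training: each pair differs only in
# a spurious attribute, and the gradient of a pair-consistency penalty acts as
# the corrective gradient added to ordinary gradient descent.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))
x_pair = x.clone()
x_pair[:, 0] = -x_pair[:, 0]   # toy spurious attribute: flip one coordinate

for step in range(5):
    task_loss = F.cross_entropy(model(x), y)
    # Invariance condition: predictions should match across each pair.
    inv_loss = F.mse_loss(model(x), model(x_pair))
    (task_loss + 1.0 * inv_loss).backward()   # corrective gradient added in
    opt.step(); opt.zero_grad()
    print(step, float(task_loss), float(inv_loss))
```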
[LG-31] One Set to Rule Them All: How to Obtain General Chemical Conditions via Bayesian Optimization over Curried Functions
Link: https://arxiv.org/abs/2502.18966
Authors: Stefan P. Schmid, Ella Miray Rajaonson, Cher Tian Ser, Mohammad Haddadnia, Shi Xuan Leong, Alán Aspuru-Guzik, Agustinus Kristiadi, Kjell Jorner, Felix Strieth-Kalthoff
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:General parameters are highly desirable in the natural sciences - e.g., chemical reaction conditions that enable high yields across a range of related transformations. This has a significant practical impact since those general parameters can be transferred to related tasks without the need for laborious and time-intensive re-optimization. While Bayesian optimization (BO) is widely applied to find optimal parameter sets for specific tasks, it has remained underused in experiment planning towards such general optima. In this work, we consider the real-world problem of condition optimization for chemical reactions to study how performing generality-oriented BO can accelerate the identification of general optima, and whether these optima also translate to unseen examples. This is achieved through a careful formulation of the problem as an optimization over curried functions, as well as systematic evaluations of generality-oriented strategies for optimization tasks on real-world experimental data. We find that for generality-oriented optimization, simple myopic optimization strategies that decouple parameter and task selection perform comparably to more complex ones, and that effective optimization is merely determined by an effective exploration of both parameter and task space.
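The curried-function formulation can be shown in a few lines: currying the objective f(params, task) at a parameter set yields a function over tasks, and a general parameter set is one whose average over tasks is high. In the sketch below, random search stands in for the BO loop and the toy yield surface is entirely made up.

```python
# Hedged sketch of the curried-function view of generality-oriented
# optimization: fix the parameters, obtain a function over tasks, and score
# parameter sets by their average performance across tasks.
import random

def f(params, task):
    # Toy yield surface: each task prefers a slightly shifted temperature.
    temp = params["temperature"]
    return max(0.0, 1.0 - abs(temp - task["optimal_temp"]) / 50.0)

def curry(params):
    return lambda task: f(params, task)    # f(params, .): a function of tasks

tasks = [{"optimal_temp": t} for t in (60, 75, 90)]

def generality(params):
    g = curry(params)
    return sum(g(task) for task in tasks) / len(tasks)

rng = random.Random(0)
best = max(({"temperature": rng.uniform(20, 120)} for _ in range(200)),
           key=generality)
print(best, round(generality(best), 3))    # near the middle of the task optima
```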
[LG-32] Nonparametric Heterogeneous Long-term Causal Effect Estimation via Data Combination
Link: https://arxiv.org/abs/2502.18960
Authors: Weilin Chen, Ruichu Cai, Junjie Wan, Zeqin Yang, José Miguel Hernández-Lobato
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Long-term causal inference has drawn increasing attention in many scientific domains. Existing methods mainly focus on estimating average long-term causal effects by combining long-term observational data and short-term experimental data. However, it is still understudied how to robustly and effectively estimate heterogeneous long-term causal effects, significantly limiting practical applications. In this paper, we propose several two-stage style nonparametric estimators for heterogeneous long-term causal effect estimation, including propensity-based, regression-based, and multiple robust estimators. We conduct a comprehensive theoretical analysis of their asymptotic properties under mild assumptions, with the ultimate goal of building a better understanding of the conditions under which some estimators can be expected to perform better. Extensive experiments across several semi-synthetic and real-world datasets validate the theoretical results and demonstrate the effectiveness of the proposed estimators.
[LG-33] Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential
Link: https://arxiv.org/abs/2502.18959
Authors: Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: Our code and implementation details are available at this https URL
Click to view abstract
Abstract:The two most critical ingredients of a neural network are its structure and the activation function employed, and more importantly, the proper alignment of these two that is conducive to the effective representation and learning in practice. In this work, we introduce a surprisingly effective synergy, termed the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), and demonstrate its surprising adaptability and efficiency in capturing high-frequency components. First, we theoretically establish that FMMNNs have exponential expressive power in terms of approximation capacity. Next, we analyze the optimization landscape of FMMNNs and show that it is significantly more favorable compared to fully connected neural networks. Finally, systematic and extensive numerical experiments validate our findings, demonstrating that FMMNNs consistently achieve superior accuracy and efficiency across various tasks, particularly impressive when high-frequency components are present.
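A hedged sketch of the kind of structure/activation synergy the abstract describes: several parallel components, each a linear map followed by a sinusoidal activation with its own frequency scale, concatenated and mixed. The exact FMMNN architecture may differ; this only illustrates the high-frequency-friendly design.

```python
# Hedged sketch of a Fourier multi-component layer: parallel branches with
# sinusoidal activations at different learnable frequency scales, whose
# outputs are concatenated and linearly mixed. Illustrative, not the paper's
# exact FMMNN construction.
import torch
import torch.nn as nn

class FourierMultiComponent(nn.Module):
    def __init__(self, d_in, d_out, n_components=4, width=32):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Linear(d_in, width) for _ in range(n_components)])
        # Per-component frequency scales, initialized as 1, 2, 4, 8 (learnable).
        self.freqs = nn.Parameter(2.0 ** torch.arange(n_components).float())
        self.mix = nn.Linear(n_components * width, d_out)

    def forward(self, x):
        feats = [torch.sin(f * b(x)) for f, b in zip(self.freqs, self.branches)]
        return self.mix(torch.cat(feats, dim=-1))

layer = FourierMultiComponent(d_in=1, d_out=1)
x = torch.linspace(0, 1, 8).unsqueeze(-1)
print(layer(x).shape)  # torch.Size([8, 1]); stack layers for the full network
```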
[LG-34] Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
Link: https://arxiv.org/abs/2502.18955
Authors: Yiqin Yang, Quanwei Wang, Chenghao Li, Hao Hu, Chengjie Wu, Yuhua Jiang, Dianyu Zhong, Ziyou Zhang, Qianchuan Zhao, Chongjie Zhang, Xu Bo
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Offline reinforcement learning (RL) represents a significant shift in RL research, allowing agents to learn from pre-collected datasets without further interaction with the environment. A key, yet underexplored, challenge in offline RL is selecting an optimal subset of the offline dataset that enhances both algorithm performance and training efficiency. Reducing dataset size can also reveal the minimal data requirements necessary for solving similar problems. In response to this challenge, we introduce ReDOR (Reduced Datasets for Offline RL), a method that frames dataset selection as a gradient approximation optimization problem. We demonstrate that the widely used actor-critic framework in RL can be reformulated as a submodular optimization objective, enabling efficient subset selection. To achieve this, we adapt orthogonal matching pursuit (OMP), incorporating several novel modifications tailored for offline RL. Our experimental results show that the data subsets identified by ReDOR not only boost algorithm performance but also do so with significantly lower computational complexity.
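The gradient-approximation view of dataset selection can be sketched with a greedy, matching-pursuit-style loop: repeatedly pick the sample whose gradient best matches the residual between the full-data gradient and the current subset's gradient. ReDOR's actual algorithm adds offline-RL-specific modifications of OMP; this shows only the core selection idea.

```python
# Hedged sketch of dataset selection as gradient approximation: greedily pick
# samples whose averaged gradients best match the full-data gradient, in the
# spirit of orthogonal matching pursuit.
import numpy as np

rng = np.random.default_rng(0)
per_sample_grads = rng.normal(size=(100, 50))    # one gradient per sample
full_grad = per_sample_grads.mean(axis=0)

def greedy_select(grads, target, k):
    chosen, residual = [], target.copy()
    for _ in range(k):
        # Pick the sample gradient most aligned with the current residual.
        scores = grads @ residual
        scores[chosen] = -np.inf                 # don't re-pick samples
        chosen.append(int(np.argmax(scores)))
        residual = target - grads[chosen].mean(axis=0)
    return chosen

subset = greedy_select(per_sample_grads, full_grad, k=10)
err = np.linalg.norm(per_sample_grads[subset].mean(axis=0) - full_grad)
print("subset:", subset[:5], "... approx error:", round(float(err), 4))
```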
[LG-35] CLLoRA: An Approach to Measure the Effects of the Context Length for LLM Fine-Tuning
Link: https://arxiv.org/abs/2502.18910
Authors: Ping Zhang, Zhaorui Zhang, Sheng Di, Yao Xin, Benben Liu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:
Click to view abstract
Abstract:Large language model fine-tuning has been identified as an efficient approach to applying the pre-trained Large language models to other domains. To guarantee data privacy for different data owners, models are often fine-tuned in federated learning environments across different data owners, which often involve data heterogeneity issues and affect the fine-tuning performance. In addition, the length of the context for the training data has been identified as a major factor that affects the LLM’s model performance. To efficiently measure how the context length affects the LLM’s model performance in heterogeneous federated learning environments, we propose CLLoRA. CLLoRA utilizes the parameter-efficient fine-tuning approach LoRA based on different kinds of LLMs with varying sizes as the fine-tuning approach to investigate whether the quality and length of contexts can serve as standards for measuring non-IID context. The findings indicate that an imbalance in context quality not only affects local training on clients but also impacts the global model’s performance. However, context length has a minimal effect on local training but a more significant influence on the global model. These results provide insights into how context quality and length affect the model performance for LLM fine-tuning in federated learning environments.
[LG-36] A Pipeline of Augmentation and Sequence Embedding for Classification of Imbalanced Network Traffic
链接: https://arxiv.org/abs/2502.18909
作者: Matin Shokri,Ramin Hasibi
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures. arXiv admin note: text overlap with arXiv:1901.00204
点击查看摘要
Abstract:Network Traffic Classification (NTC) is one of the most important tasks in network management. The imbalanced nature of classes on the internet presents a critical challenge in classification tasks. For example, some classes of applications are much more prevalent than others, such as HTTP. As a result, machine learning classification models do not perform well on those classes with fewer data. To address this problem, we propose a pipeline to balance the dataset and classify it using a robust and accurate embedding technique. First, we generate artificial data using Long Short-Term Memory (LSTM) networks and Kernel Density Estimation (KDE). Next, we propose replacing one-hot encoding for categorical features with a novel embedding framework based on the “Flow as a Sentence” perspective, which we name FS-Embedding. This framework treats the source and destination ports, along with the packet’s direction, as one word in a flow, then trains an embedding vector space based on these new features through the learning classification task. Finally, we compare our pipeline with the training of a Convolutional Recurrent Neural Network (CRNN) and Transformers, both with imbalanced and sampled datasets, as well as with the one-hot encoding approach. We demonstrate that the proposed augmentation pipeline, combined with FS-Embedding, increases convergence speed and leads to a significant reduction in the number of model parameters, all while maintaining the same performance in terms of accuracy.
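摘要中 "Flow as a Sentence" 的核心想法是把(源端口、目的端口、包方向)当作流中的一个"词"来学习嵌入。以下是基于该描述的一个 PyTorch 示意(类名、维度与"三个嵌入拼接"的方式均为笔者假设,非论文原始实现):

```python
import torch
import torch.nn as nn

class FSEmbedding(nn.Module):
    """把 (源端口, 目的端口, 方向) 组成的"词"映射为稠密向量,
    一条流(flow)即一个"句子"。维度均为示意性假设。"""
    def __init__(self, n_ports=65536, dim=32):
        super().__init__()
        self.src = nn.Embedding(n_ports, dim)
        self.dst = nn.Embedding(n_ports, dim)
        self.dir = nn.Embedding(2, dim)        # 0: 出方向, 1: 入方向

    def forward(self, src_port, dst_port, direction):
        # 三个嵌入拼接成该"词"的表示, 形状 (batch, seq_len, 3*dim)
        return torch.cat([self.src(src_port),
                          self.dst(dst_port),
                          self.dir(direction)], dim=-1)

emb = FSEmbedding()
src = torch.tensor([[443, 443, 51532]])
dst = torch.tensor([[51532, 51532, 443]])
drc = torch.tensor([[0, 0, 1]])
print(emb(src, dst, drc).shape)  # torch.Size([1, 3, 96])
```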
[LG-37] VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
链接: https://arxiv.org/abs/2502.18906
作者: Jiani Zheng,Lu Wang,Fangkai Yang,Chaoyun Zhang,Lingrui Mei,Wenjie Yin,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan,Qi Zhang
类目: Machine Learning (cs.LG)
*备注: 20pages,5 figures
点击查看摘要
Abstract:Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user’s goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
[LG-38] Automated Code Generation and Validation for Software Components of Microcontrollers MICRO
链接: https://arxiv.org/abs/2502.18905
作者: Sebastian Haug,Christoph Böhm,Daniel Mayer
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Sebastian Haug: This paper, spanning 12 pages with 5 figures, presents my work on automated code generation and validation for STM32F407 microcontroller software components. Developed as part of a research project at Munich University of Applied Sciences and AGSOTEC GmbH, it leverages AST and RAG to streamline embedded development. Includes glossary and bibliography as supplementary materials
点击查看摘要
Abstract:This paper proposes a method for generating software components for embedded systems, integrating seamlessly into existing implementations without developer intervention. We demonstrate this by automatically generating hardware abstraction layer (HAL) code for GPIO operations on the STM32F407 microcontroller. Using Abstract Syntax Trees (AST) for code analysis and Retrieval-Augmented Generation (RAG) for component generation, our approach enables autonomous code completion for embedded applications.
[LG-39] TabGLM: Tabular Graph Language Model for Learning Transferable Representations Through Multi-Modal Consistency Minimization AAAI2025
链接: https://arxiv.org/abs/2502.18847
作者: Anay Majee,Maria Xenochristou,Wei-Peng Chen
类目: Machine Learning (cs.LG)
*备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Handling heterogeneous data in tabular datasets poses a significant challenge for deep learning models. While attention-based architectures and self-supervised learning have achieved notable success, their application to tabular data remains less effective than linear and tree-based models. Although several breakthroughs have been achieved by models that transform tables into uni-modal representations such as images, language, or graphs, these models often underperform in the presence of feature heterogeneity. To address this gap, we introduce TabGLM (Tabular Graph Language Model), a novel multi-modal architecture designed to model both structural and semantic information from a table. TabGLM transforms each row of a table into a fully connected graph and serialized text, which are then encoded using a graph neural network (GNN) and a text encoder, respectively. By aligning these representations through a joint, multi-modal, self-supervised learning objective, TabGLM leverages complementary information from both modalities, thereby enhancing feature learning. TabGLM's flexible graph-text pipeline efficiently processes heterogeneous datasets with significantly fewer parameters than existing deep learning approaches. Evaluations across 25 benchmark datasets demonstrate substantial performance gains, with TabGLM achieving an average AUC-ROC improvement of up to 5.56% over state-of-the-art (SoTA) tabular learning methods.
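按摘要描述,TabGLM 把表格的每一行同时转成全连接图与序列化文本两个视图,再分别送入 GNN 与文本编码器。下面用纯 Python 演示这一行级转换(函数名与序列化格式均为示意性假设):

```python
import itertools

def row_to_views(row: dict):
    """把表格的一行转成两种视图: 全连接图(节点=特征)与序列化文本。
    仅为摘要思路的示意, 非论文官方实现。"""
    feats = list(row.items())
    n = len(feats)
    # 图视图: 每个特征一个节点, 节点间两两相连
    edges = [(i, j) for i, j in itertools.permutations(range(n), 2)]
    node_feats = [f"{k}={v}" for k, v in feats]
    # 文本视图: 序列化为一句话, 交给文本编码器
    text = "; ".join(f"{k} is {v}" for k, v in feats)
    return node_feats, edges, text

nodes, edges, text = row_to_views({"age": 42, "income": "high", "owns_car": True})
print(nodes)
print(len(edges), "edges")   # n*(n-1) 条有向边
print(text)
```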
[LG-40] FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
链接: https://arxiv.org/abs/2502.18834
作者: Yifan Hu,Yuante Li,Peiyuan Liu,Yuxia Zhu,Naiqi Li,Tao Dai,Shu-tao Xia,Dawei Cheng,Changjun Jiang
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Financial time series (FinTS) record the behavior of human-brain-augmented decision-making, capturing valuable historical information that can be leveraged for profitable investment strategies. Not surprisingly, this area has attracted considerable attention from researchers, who have proposed a wide range of methods based on various backbones. However, evaluation in this area often exhibits three systemic limitations: 1. failure to account for the full spectrum of stock movement patterns observed in dynamic financial markets (Diversity Gap); 2. the absence of unified assessment protocols, which undermines the validity of cross-study performance comparisons (Standardization Deficit); and 3. neglect of critical market structure factors, resulting in inflated performance metrics that lack practical applicability (Real-World Mismatch). Addressing these limitations, we propose FinTSB, a comprehensive and practical benchmark for financial time series forecasting (FinTSF). To increase the variety, we categorize movement patterns into four specific parts, tokenize and pre-process the data, and assess the data quality based on some sequence characteristics. To eliminate biases due to different evaluation settings, we standardize the metrics across three dimensions and build a user-friendly, lightweight pipeline incorporating methods from various backbones. To accurately simulate real-world trading scenarios and facilitate practical implementation, we extensively model various regulatory constraints, including transaction fees, among others. Finally, we conduct extensive experiments on FinTSB, highlighting key insights to guide model selection under varying market conditions. Overall, FinTSB provides researchers with a novel and comprehensive platform for improving and evaluating FinTSF methods. The code is available at this https URL.
[LG-41] Optimal Approximate Matrix Multiplication over Sliding Windows
链接: https://arxiv.org/abs/2502.18830
作者: Ziqi Yao,Mingsong Chen,Cheng Chen
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We explore the problem of approximate matrix multiplication (AMM) within the sliding window model, where algorithms utilize limited space to perform large-scale matrix multiplication in a streaming manner. This model has garnered increasing attention in the fields of machine learning and data mining due to its ability to handle time sensitivity and reduce the impact of outdated data. However, despite recent advancements, determining the optimal space bound for this problem remains an open question. In this paper, we introduce the DS-COD algorithm for AMM over sliding windows. This novel and deterministic algorithm achieves optimal performance regarding the space-error tradeoff. We provide theoretical error bounds and the complexity analysis for the proposed algorithm, and establish the corresponding space lower bound for the AMM sliding window problem. Additionally, we present an adaptive version of DS-COD, termed aDS-COD, which improves computational efficiency and demonstrates superior empirical performance. Extensive experiments conducted on both synthetic and real-world datasets validate our theoretical findings and highlight the practical effectiveness of our methods.
[LG-42] Adversarial Combinatorial Semi-bandits with Graph Feedback
链接: https://arxiv.org/abs/2502.18826
作者: Yuxiao Wen
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include *graph feedback*, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde{\Theta}(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde{\Theta}(\sqrt{KST})$ under semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations.
[LG-43] CAMEx: Curvature-aware Merging of Experts ICLR2025
链接: https://arxiv.org/abs/2502.18821
作者: Dung V. Nguyen,Minh H. Nguyen,Luc Q. Nguyen,Rachel S.Y. Teo,Tan M. Nguyen,Linh Duy Tran
类目: Machine Learning (cs.LG)
*备注: ICLR 2025 Poster
点击查看摘要
Abstract:Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (**C**urvature-**A**ware **M**erging of **Ex**perts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method.
[LG-44] Optimal Stochastic Trace Estimation in Generative Modeling AISTATS2025
链接: https://arxiv.org/abs/2502.18808
作者: Xinyang Liu,Hengrong Du,Wei Deng,Ruqi Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by AISTATS 2025
点击查看摘要
Abstract:Hutchinson estimators are widely employed in training divergence-based likelihoods for diffusion models to ensure optimal transport (OT) properties. However, this estimator often suffers from high variance and scalability concerns. To address these challenges, we investigate Hutch++, an optimal stochastic trace estimator for generative models, designed to minimize training variance while maintaining transport optimality. Hutch++ is particularly effective for handling ill-conditioned matrices with large condition numbers, which commonly arise when high-dimensional data exhibits a low-dimensional structure. To mitigate the need for frequent and costly QR decompositions, we propose practical schemes that balance frequency and accuracy, backed by theoretical guarantees. Our analysis demonstrates that Hutch++ leads to generations of higher quality. Furthermore, this method exhibits effective variance reduction in various applications, including simulations, conditional time series forecasts, and image generation.
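Hutch++ 是公开算法:先用约三分之一的 matvec 预算抓取矩阵的低秩主成分并精确求迹,剩余部分再用 Hutchinson 估计。下面给出一个 NumPy 参考实现(矩阵与预算为玩具设定):

```python
import numpy as np

def hutchpp(matvec, n, m, rng):
    """Hutch++ 迹估计: 共 m 次 matvec 预算, 对 PSD 矩阵其相对误差
    约为 O(1/m), 优于 Hutchinson 的 O(1/sqrt(m))。matvec(X) 返回 A @ X。"""
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))   # Rademacher 测试向量
    G = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(matvec(S))             # 捕捉 A 的主要低秩部分
    t_low = np.trace(Q.T @ matvec(Q))          # 低秩部分的迹精确计算
    # 剩余部分用 Hutchinson 估计, 先把 G 投影到 Q 的正交补
    G_perp = G - Q @ (Q.T @ G)
    t_rest = np.trace(G_perp.T @ matvec(G_perp)) / k
    return t_low + t_rest

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200)); A = A @ A.T   # 对称半正定矩阵
est = hutchpp(lambda X: A @ X, n=200, m=60, rng=rng)
print(est, np.trace(A))
```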
[LG-45] Exploring Graph Tasks with Pure LLM s: A Comprehensive Benchmark and Investigation
链接: https://arxiv.org/abs/2502.18771
作者: Yuxiang Wang,Xinnan Dai,Wenqi Fan,Yao Ma
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Graph-structured data has become increasingly prevalent across various domains, raising the demand for effective models to handle graph tasks like node classification and link prediction. Traditional graph learning models like Graph Neural Networks (GNNs) have made significant strides, but their capabilities in handling graph data remain limited in certain contexts. In recent years, large language models (LLMs) have emerged as promising candidates for graph tasks, yet most studies focus primarily on performance benchmarks and fail to address their broader potential, including their ability to handle limited data, their transferability across tasks, and their robustness. In this work, we provide a comprehensive exploration of LLMs applied to graph tasks. We evaluate the performance of pure LLMs, including those without parameter optimization and those fine-tuned with instructions, across various scenarios. Our analysis goes beyond accuracy, assessing the LLMs' ability to perform in few-shot/zero-shot settings, transfer across domains, understand graph structures, and demonstrate robustness in challenging scenarios. We conduct extensive experiments with 16 graph learning models alongside 6 LLMs (e.g., Llama3B, GPT-4o, Qwen-plus), comparing their performance on datasets like Cora, PubMed, ArXiv, and Products. Our findings show that LLMs, particularly those with instruction tuning, outperform traditional models in few-shot settings, exhibit strong domain transferability, and demonstrate excellent generalization and robustness. This work offers valuable insights into the capabilities of LLMs for graph learning, highlighting their advantages and potential for real-world applications, and paving the way for future research in this area. Codes and datasets are released in this https URL.
[LG-46] Bandit and Delayed Feedback in Online Structured Prediction
链接: https://arxiv.org/abs/2502.18709
作者: Yuki Shibukawa,Taira Tsuchiya,Shinsaku Sakaue,Kenji Yamanishi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 33 pages
点击查看摘要
Abstract:Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full information setup, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, bandit and delayed feedback. For the bandit setting, using a standard inverse-weighted gradient estimator, we achieve a surrogate regret bound of $O(\sqrt{KT})$ for the time horizon $T$ and the size of the output set $K$. However, $K$ can be extremely large when outputs are highly complex, making this result less desirable. To address this, we propose an algorithm that achieves a surrogate regret bound of $O(T^{2/3})$, which is independent of $K$. This is enabled with a carefully designed pseudo-inverse matrix estimator. Furthermore, for the delayed full information feedback setup, we obtain a surrogate regret bound of $O(D^{2/3} T^{1/3})$ for the delay time $D$. We also provide algorithms for the delayed bandit feedback setup. Finally, we numerically evaluate the performance of the proposed algorithms in online classification with bandit feedback.
[LG-47] Differentially Private Federated Learning With Time-Adaptive Privacy Spending ICLR
链接: https://arxiv.org/abs/2502.18706
作者: Shahrzad Kiani,Nupur Kulkarni,Adam Dziedzic,Stark Draper,Franziska Boenisch
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: International Conference on Learning Representations (ICLR), April 2025, Singapore
点击查看摘要
Abstract:Federated learning (FL) with differential privacy (DP) provides a framework for collaborative machine learning, enabling clients to train a shared model while adhering to strict privacy constraints. The framework allows each client to have an individual privacy guarantee, e.g., by adding different amounts of noise to each client’s model updates. One underlying assumption is that all clients spend their privacy budgets uniformly over time (learning rounds). However, it has been shown in the literature that learning in early rounds typically focuses on more coarse-grained features that can be learned at lower signal-to-noise ratios while later rounds learn fine-grained features that benefit from higher signal-to-noise ratios. Building on this intuition, we propose a time-adaptive DP-FL framework that expends the privacy budget non-uniformly across both time and clients. Our framework enables each client to save privacy budget in early rounds so as to be able to spend more in later rounds when additional accuracy is beneficial in learning more fine-grained features. We theoretically prove utility improvements in the case that clients with stricter privacy budgets spend budgets unevenly across rounds, compared to clients with more relaxed budgets, who have sufficient budgets to distribute their spend more evenly. Our practical experiments on standard benchmark datasets support our theoretical results and show that, in practice, our algorithms improve the privacy-utility trade-offs compared to baseline schemes.
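摘要的核心思想是让客户端前期"省"隐私预算、后期"多花"。下面用一个示意性的指数递增调度演示这一思路(预算的线性相加与反比噪声尺度仅作直观展示,实际部署需使用更紧的 DP 组合定理;growth 等参数为笔者假设):

```python
import numpy as np

def rising_budget_schedule(eps_total, rounds, growth=1.3):
    """把客户端的总隐私预算 eps_total 非均匀分配到各轮:
    前期省、后期多花, 以便后期学习细粒度特征时信噪比更高。
    注意: 这里按简单求和切分预算, 仅为示意; 严格的隐私核算
    应使用高级组合定理或 RDP/zCDP 会计。"""
    w = growth ** np.arange(rounds)          # 逐轮递增的权重
    return eps_total * w / w.sum()

eps = rising_budget_schedule(eps_total=8.0, rounds=10)
print(np.round(eps, 3), eps.sum())
# 高斯机制下, 每轮噪声尺度可取与该轮预算大致成反比的量(示意):
sigma = 1.0 / eps
print(np.round(sigma, 2))
```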
[LG-48] Tukey Depth Mechanisms for Practical Private Mean Estimation
链接: https://arxiv.org/abs/2502.18698
作者: Gavin Brown,Lydia Zakynthinou
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 17 pages, 10 figures
点击查看摘要
Abstract:Mean estimation is a fundamental task in statistics and a focus within differentially private statistical estimation. While univariate methods based on the Gaussian mechanism are widely used in practice, more advanced techniques such as the exponential mechanism over quantiles offer robustness and improved performance, especially for small sample sizes. Tukey depth mechanisms carry these advantages to multivariate data, providing similar strong theoretical guarantees. However, practical implementations fall behind these theoretical developments. In this work, we take the first step to bridge this gap by implementing the (Restricted) Tukey Depth Mechanism, a theoretically optimal mean estimator for multivariate Gaussian distributions, yielding improved practical methods for private mean estimation. Our implementations enable the use of these mechanisms for small sample sizes or low-dimensional data. Additionally, we implement variants of these mechanisms that use approximate versions of Tukey depth, trading off accuracy for faster computation. We demonstrate their efficiency in practice, showing that they are viable options for modest dimensions. Given their strong accuracy and robustness guarantees, we contend that they are competitive approaches for mean estimation in this regime. We explore future directions for improving the computational efficiency of these algorithms by leveraging fast polytope volume approximation techniques, paving the way for more accurate private mean estimation in higher dimensions.
[LG-49] CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs
链接: https://arxiv.org/abs/2502.18663
作者: A.Chervov,A.Soibelman,S.Lytkin,I.Kiselev,S.Fironov,A.Lukyanenko,A.Dolgorukova,A.Ogurtsov,F.Petrov,S.Krymskii,M.Evseev,L.Grunvald,D.Gorodkov,G.Antiufeev,G.Verbii,V.Zamkovoy,L.Cheldieva,I.Koltsov,A. Sychev,M.Obozov,A.Eliseev,S.Nikolenko,N.Narynbaev,R.Turtayev,N. Rokotyan,S.Kovalev,A.Rozanov,V.Nelin,S.Ermilov,L.Shishina,D.Mamayeva,A.Korolkova,K.Khoruzhii,A.Romanov
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Combinatorics (math.CO); Group Theory (math.GR)
*备注: 28 pages
点击查看摘要
Abstract:This paper is the second in a series of studies on developing efficient artificial intelligence-based approaches to pathfinding on extremely large graphs (e.g. $10^{70}$ nodes) with a focus on Cayley graphs and mathematical applications. The open-source CayleyPy project is a central component of our research. The present paper proposes a novel combination of a reinforcement learning approach with a more direct diffusion distance approach from the first paper. Our analysis includes benchmarking various choices for the key building blocks of the approach: architectures of the neural network, generators for the random walks, and beam search pathfinding. We compared these methods against the classical computer algebra system GAP, demonstrating that they "overcome the GAP" for the considered examples. As a particular mathematical application we examine the Cayley graph of the symmetric group with cyclic shift and transposition generators. We provide strong support, via machine learning and mathematical methods, for the OEIS-A186783 conjecture that the diameter is equal to $n(n-1)/2$. We identify the conjectured longest element and generate its decomposition of the desired length. We prove a diameter lower bound of $n(n-1)/2 - n/2$ and an upper bound of $n(n-1)/2 + 3n$ by presenting an algorithm with the stated complexity. We also present several conjectures motivated by numerical experiments, including observations on the central limit phenomenon (with growth approximated by a Gumbel distribution), the uniform distribution for the spectrum of the graph, and a numerical study of sorting networks. To stimulate crowdsourcing activity, we create challenges on the Kaggle platform and invite contributions to improve and benchmark approaches on Cayley graph pathfinding and other tasks.
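对小规模 n,可以直接对该 Cayley 图做 BFS 来核对直径。下面的示意代码中,生成元取"循环移位、逆循环移位与交换前两个元素的对换",这只是笔者的一种假设取法,未必与论文或 OEIS-A186783 所用的生成元约定完全一致,因此小 n 下 BFS 直径与 $n(n-1)/2$ 未必吻合,打印两者仅供对照:

```python
from collections import deque

def cayley_diameter(n):
    """对带循环移位与对换生成元的 S_n Cayley 图做 BFS, 返回直径。
    生成元取法为示意性假设: 左移、右移、交换前两个元素。"""
    start = tuple(range(n))
    def neighbors(p):
        yield p[1:] + p[:1]            # 循环移位
        yield p[-1:] + p[:-1]          # 逆循环移位
        yield (p[1], p[0]) + p[2:]     # 对换前两个元素
    dist = {start: 0}
    q = deque([start])
    while q:
        p = q.popleft()
        for nb in neighbors(p):
            if nb not in dist:
                dist[nb] = dist[p] + 1
                q.append(nb)
    return max(dist.values())

for n in range(3, 7):
    print(n, cayley_diameter(n), n * (n - 1) // 2)
```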
[LG-50] Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces
链接: https://arxiv.org/abs/2502.18655
作者: Amirhossein Roknilamouki,Arnob Ghosh,Ming Shi,Fatemeh Nourzad,Eylem Ekici,Ness B. Shroff
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}{\tau}\bigr)\sqrt{\log(\tfrac{1}{\tau})\, d^3 H^4 K}\bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective Constraint-Decomposition (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
[LG-51] On the Privacy-Preserving Properties of Spiking Neural Networks with Unique Surrogate Gradients and Quantization Levels
链接: https://arxiv.org/abs/2502.18623
作者: Ayana Moshruba,Shay Snyder,Hamed Poursiami,Maryam Parsa
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:As machine learning models increasingly process sensitive data, understanding their vulnerability to privacy attacks is vital. Membership inference attacks (MIAs) exploit model responses to infer whether specific data points were used during training, posing a significant privacy risk. Prior research suggests that spiking neural networks (SNNs), which rely on event-driven computation and discrete spike-based encoding, exhibit greater resilience to MIAs than artificial neural networks (ANNs). This resilience stems from their non-differentiable activations and inherent stochasticity, which obscure the correlation between model responses and individual training samples. To enhance privacy in SNNs, we explore two techniques: quantization and surrogate gradients. Quantization, which reduces precision to limit information leakage, has improved privacy in ANNs. Given SNNs’ sparse and irregular activations, quantization may further disrupt the activation patterns exploited by MIAs. We assess the vulnerability of SNNs and ANNs under weight and activation quantization across multiple datasets, using the attack model’s receiver operating characteristic (ROC) curve area under the curve (AUC) metric, where lower values indicate stronger privacy, and evaluate the privacy-accuracy trade-off. Our findings show that quantization enhances privacy in both architectures with minimal performance loss, though full-precision SNNs remain more resilient than quantized ANNs. Additionally, we examine the impact of surrogate gradients on privacy in SNNs. Among five evaluated gradients, spike rate escape provides the best privacy-accuracy trade-off, while arctangent increases vulnerability to MIAs. These results reinforce SNNs’ inherent privacy advantages and demonstrate that quantization and surrogate gradient selection significantly influence privacy-accuracy trade-offs in SNNs.
[LG-52] A Distributional Treatment of Real2Sim2Real for Vision-Driven Deformable Linear Object Manipulation
链接: https://arxiv.org/abs/2502.18615
作者: Georgios Kamaras,Subramanian Ramamoorthy
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions for the physical parameters using which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies for a visuomotor DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.
[LG-53] Toward Breaking Watermarks in Distortion-free Large Language Models AAAI'25
链接: https://arxiv.org/abs/2502.18608
作者: Shayleen Reynolds,Saheed Obitayo,Niccolò Dalmasso,Dung Daniel T. Ngo,Vamsi K. Potluru,Manuela Veloso
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 5 pages, AAAI’25 Workshop on Preventing and Detecting LLM Generated Misinformation
点击查看摘要
Abstract:In recent years, LLM watermarking has emerged as an attractive safeguard against AI-generated content, with promising applications in many real-world domains. However, there are growing concerns that the current LLM watermarking schemes are vulnerable to expert adversaries wishing to reverse-engineer the watermarking mechanisms. Prior work in “breaking” or “stealing” LLM watermarks mainly focuses on the distribution-modifying algorithm of Kirchenbauer et al. (2023), which perturbs the logit vector before sampling. In this work, we focus on reverse-engineering the other prominent LLM watermarking scheme, distortion-free watermarking (Kuditipudi et al. 2024), which preserves the underlying token distribution by using a hidden watermarking key sequence. We demonstrate that, even under a more sophisticated watermarking scheme, it is possible to “compromise” the LLM and carry out a “spoofing” attack. Specifically, we propose a mixed integer linear programming framework that accurately estimates the secret key used for watermarking using only a few samples of the watermarked dataset. Our initial findings challenge the current theoretical claims on the robustness and usability of existing LLM watermarking techniques.
[LG-54] Expected Variational Inequalities
链接: https://arxiv.org/abs/2502.18605
作者: Brian Hu Zhang,Ioannis Anagnostides,Emanuel Tewolde,Ratip Emin Berker,Gabriele Farina,Vincent Conitzer,Tuomas Sandholm
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Variational inequalities (VIs) encompass many fundamental problems in diverse areas ranging from engineering to economics and machine learning. However, their considerable expressivity comes at the cost of computational intractability. In this paper, we introduce and analyze a natural relaxation – which we refer to as expected variational inequalities (EVIs) – where the goal is to find a distribution that satisfies the VI constraint in expectation. By adapting recent techniques from game theory, we show that, unlike VIs, EVIs can be solved in polynomial time under general (nonmonotone) operators. EVIs capture the seminal notion of correlated equilibria, but enjoy a greater reach beyond games. We also employ our framework to capture and generalize several existing disparate results, including from settings such as smooth games, and games with coupled constraints or nonconcave utilities.
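按摘要的描述,EVI 把经典变分不等式的逐点条件放松为在某个分布下的期望条件。下面给出与该描述一致的一种示意性形式化(符号与不等号方向为笔者约定,未必与原文完全相同):

```latex
\[
\text{VI}(F,\mathcal{X}):\quad \exists\, x^{\ast}\in\mathcal{X}\ \text{s.t.}\quad
\langle F(x^{\ast}),\, y-x^{\ast}\rangle \ge 0 \qquad \forall\, y\in\mathcal{X};
\]
\[
\text{EVI}(F,\mathcal{X}):\quad \exists\, \mu\in\Delta(\mathcal{X})\ \text{s.t.}\quad
\mathbb{E}_{x\sim\mu}\bigl[\langle F(x),\, y-x\rangle\bigr] \ge 0 \qquad \forall\, y\in\mathcal{X}.
\]
```

把"找一个点 $x^{\ast}$"放松为"找一个分布 $\mu$",正对应摘要中"在一般(非单调)算子下可多项式时间求解"的松弛来源。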
[LG-55] Tighten The Lasso: A Convex Hull Volume-based Anomaly Detection Method
链接: https://arxiv.org/abs/2502.18601
作者: Uri Itai,Asael Bar Ilan,Teddy Lazebnik
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rapid advancements in data-driven methodologies have underscored the critical importance of ensuring data quality. Consequently, detecting out-of-distribution (OOD) data has emerged as an essential task for maintaining the reliability and robustness of data-driven models in general, and machine and deep learning models in particular. In this study, we leveraged the convex hull (CH) of a dataset and the fact that anomalies contribute strongly to the increase of the CH's volume to propose a novel anomaly detection algorithm. Our algorithm computes the CH's volume as an increasing number of data points is removed from the dataset, in order to define a decision line between OOD and in-distribution data points. We compared the proposed algorithm to seven widely used anomaly detection algorithms over ten datasets, showing results comparable to state-of-the-art (SOTA) algorithms. Moreover, we show that with a computationally cheap and simple check, one can detect datasets that are well-suited for the proposed algorithm, on which it outperforms the SOTA anomaly detection algorithms.
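摘要的思路可以用 SciPy 的 ConvexHull 很直接地示意:离群点通常位于凸包顶点上,逐个删除"删掉后体积下降最多"的点,体积曲线会在异常点被移除时骤降。以下为贪心玩具实现(非论文原始算法,数据与删除步数均为假设):

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_volume_drop(X, n_remove):
    """依次删除"对凸包体积贡献最大"的点, 记录体积下降曲线。
    对摘要思路的贪心示意, 非论文原始算法。"""
    X = X.copy()
    volumes = [ConvexHull(X).volume]
    for _ in range(n_remove):
        hull = ConvexHull(X)
        best_i, best_v = None, np.inf
        for i in hull.vertices:        # 只需在凸包顶点中搜索
            v = ConvexHull(np.delete(X, i, axis=0)).volume
            if v < best_v:
                best_i, best_v = i, v
        X = np.delete(X, best_i, axis=0)
        volumes.append(best_v)
    return volumes

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 2))
data[:3] *= 8                      # 人为注入 3 个离群点
print(np.round(hull_volume_drop(data, 6), 2))  # 前几步体积应明显骤降
```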
[LG-56] Error-related Potential driven Reinforcement Learning for adaptive Brain-Computer Interfaces
链接: https://arxiv.org/abs/2502.18594
作者: Aline Xavier Fidêncio,Felix Grün,Christian Klaes,Ioannis Iossifidis
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:Brain-computer interfaces (BCIs) provide alternative communication methods for individuals with motor disabilities by allowing control and interaction with external devices. Non-invasive BCIs, especially those using electroencephalography (EEG), are practical and safe for various applications. However, their performance is often hindered by EEG non-stationarities, caused by changing mental states or device characteristics like electrode impedance. This challenge has spurred research into adaptive BCIs that can handle such variations. In recent years, interest has grown in using error-related potentials (ErrPs) to enhance BCI performance. ErrPs, neural responses to errors, can be detected non-invasively and have been integrated into different BCI paradigms to improve performance through error correction or adaptation. This research introduces a novel adaptive ErrP-based BCI approach using reinforcement learning (RL). We demonstrate the feasibility of an RL-driven adaptive framework incorporating ErrPs and motor imagery. Utilizing two RL agents, the framework adapts dynamically to EEG non-stationarities. Validation was conducted using a publicly available motor imagery dataset and a fast-paced game designed to boost user engagement. Results show the framework's promise, with RL agents learning control policies from user interactions and achieving robust performance across datasets. However, a critical insight from the game-based protocol revealed that motor imagery in a high-speed interaction paradigm was largely ineffective for participants, highlighting task design limitations in real-time BCI applications. These findings underscore the potential of RL for adaptive BCIs while pointing out practical constraints related to task complexity and user responsiveness.
[LG-57] Transported Memory Networks accelerating Computational Fluid Dynamics
链接: https://arxiv.org/abs/2502.18591
作者: Matthias Schulz,Gwendal Jouan,Daniel Berger,Stefan Gavranovic,Dirk Hartmann
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
点击查看摘要
Abstract:In recent years, augmentation of differentiable PDE solvers with neural networks has shown promising results, particularly in fluid simulations. However, most approaches rely on convolutional neural networks and custom solvers operating on Cartesian grids with efficient access to cell data. This particular choice poses challenges for industrial-grade solvers that operate on unstructured meshes, where access is restricted to neighboring cells only. In this work, we address this limitation using a novel architecture, named Transported Memory Networks. The architecture draws inspiration from both traditional turbulence models and recurrent neural networks, and it is fully compatible with generic discretizations. Our results show that it is point-wise and statistically comparable to, or improves upon, previous methods in terms of both accuracy and computational efficiency.
[LG-58] Target Defense with Multiple Defenders and an Agile Attacker via Residual Policy Learning
链接: https://arxiv.org/abs/2502.18549
作者: Jiyue Tao,Tongsheng Shen,Dexin Zhao,Feitian Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:The target defense problem involves intercepting an attacker before it reaches a designated target region using one or more defenders. This letter focuses on a particularly challenging scenario in which the attacker is more agile than the defenders, significantly increasing the difficulty of effective interception. To address this challenge, we propose a novel residual policy framework that integrates deep reinforcement learning (DRL) with the force-based Boids model. In this framework, the Boids model serves as a baseline policy, while DRL learns a residual policy to refine and optimize the defenders’ actions. Simulation experiments demonstrate that the proposed method consistently outperforms traditional interception policies, whether learned via vanilla DRL or fine-tuned from force-based methods. Moreover, the learned policy exhibits strong scalability and adaptability, effectively handling scenarios with varying numbers of defenders and attackers with different agility levels.
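摘要中的残差策略框架可以概括为"最终动作 = Boids 基线力 + DRL 学到的残差修正"。下面是一个二维玩具示意(Boids 各项系数与状态拼接方式均为笔者假设,残差网络以零函数占位):

```python
import numpy as np

def boids_force(defender, mates, attacker, target):
    """基线策略: 简化的 Boids 力模型(内聚 + 分离 + 拦截项, 系数为示意)。"""
    cohesion = mates.mean(axis=0) - defender
    separation = sum((defender - m) / (np.linalg.norm(defender - m) ** 2 + 1e-6)
                     for m in mates)
    intercept = (attacker + target) / 2 - defender   # 朝攻击者与目标连线中点
    return 0.3 * cohesion + 0.2 * separation + 1.0 * intercept

def act(defender, mates, attacker, target, residual_policy):
    """残差策略框架: 最终动作 = 基线力 + 学到的残差修正。"""
    base = boids_force(defender, mates, attacker, target)
    state = np.concatenate([defender, attacker, target])
    return base + residual_policy(state)             # 残差由 DRL 训练得到

# 用零残差(尚未训练)做一次前向示意
zero_residual = lambda s: np.zeros(2)
a = act(np.array([0., 0.]), np.array([[1., 0.], [0., 1.]]),
        np.array([5., 5.]), np.array([2., 2.]), zero_residual)
print(a)
```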
[LG-59] Programming with Pixels: Computer-Use Meets Software Engineering
链接: https://arxiv.org/abs/2502.18525
作者: Pranjal Aggarwal,Sean Welleck
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in software engineering (SWE) agents have largely followed a *tool-based paradigm*, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they require predefined tools for each task and do not scale across programming languages and domains. We introduce *Programming with Pixels* (PwP), an agent environment that unifies software development tasks by enabling *computer-use agents*: agents that operate directly within an IDE through visual perception, typing, and clicking, rather than relying on predefined tool APIs. To systematically evaluate these agents, we propose *PwP-Bench*, a benchmark that unifies existing SWE benchmarks spanning tasks across multiple programming languages, modalities, and domains under a task-agnostic state and action space. Our experiments demonstrate that general-purpose computer-use agents can approach or even surpass specialized tool-based agents on a variety of SWE tasks without the need for hand-engineered tools. However, our analysis shows that current models suffer from limited visual grounding and fail to exploit many IDE tools that could simplify their tasks. When agents can directly access IDE tools, without visual interaction, they show significant performance improvements, highlighting the untapped potential of leveraging built-in IDE capabilities. Our results establish PwP as a scalable testbed for building and evaluating the next wave of software engineering agents. We release code and data at this https URL
[LG-60] Disproving Program Equivalence with LLMs
链接: https://arxiv.org/abs/2502.18473
作者: Miltiadis Allamanis,Pengcheng Yin
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:To evaluate large language models (LLMs) for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.
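ProbeGen 是结合 LLM 与执行反馈的白盒方法;不过"用反例反证等价性"这一底层思想可以用最朴素的黑盒随机测试来示意(示例函数为笔者虚构):

```python
import random

def find_counterexample(f, g, gen_input, trials=10000, seed=0):
    """随机搜索使两个程序行为不同的输入; 找到即反证了等价性。
    仅为黑盒随机测试的示意; ProbeGen 本身是白盒方法。"""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        try:
            if f(x) != g(x):
                return x
        except Exception:
            return x              # 一方抛异常也视为行为不一致
    return None                   # 未找到反例(不代表一定等价)

def f(n):                         # 参考实现: 十进制数位和
    return sum(int(c) for c in str(abs(n)))

def g(n):                         # "看似等价"的实现, 但忘了处理负数
    s = 0
    while n > 0:
        s, n = s + n % 10, n // 10
    return s

print(find_counterexample(f, g, lambda rng: rng.randint(-50, 50)))
```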
[LG-61] Spatial-RAG: Spatial Retrieval Augmented Generation for Real-World Spatial Reasoning Questions
链接: https://arxiv.org/abs/2502.18470
作者: Dazhou Yu,Riyang Bao,Gengchen Mai,Liang Zhao
类目: Information Retrieval (cs.IR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Spatial reasoning remains a challenge for Large Language Models (LLMs), which struggle with spatial data retrieval and reasoning. We propose Spatial Retrieval-Augmented Generation (Spatial-RAG), a framework that extends RAG to spatial tasks by integrating sparse spatial retrieval (spatial databases) and dense semantic retrieval (LLM-based similarity). A multi-objective ranking strategy balances spatial constraints and semantic relevance, while an LLM-guided generator ensures coherent responses. Experiments on a real-world tourism dataset show that Spatial-RAG significantly improves spatial question answering, bridging the gap between LLMs and spatial intelligence.
[LG-62] Modelling Chemical Reaction Networks using Neural Ordinary Differential Equations
链接: https://arxiv.org/abs/2502.19397
作者: Anna C. M. Thöni,William E. Robinson,Yoram Bachrach,Wilhelm T. S. Huck,Tal Kachman
类目: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In chemical reaction network theory, ordinary differential equations are used to model the temporal change of chemical species concentrations. As the functional form of these ordinary differential equation systems is derived from an empirical model of the reaction network, it may be incomplete. Our approach aims to uncover such hidden dynamics in the reaction network by combining dynamic modelling with deep learning in the form of neural ordinary differential equations. Our contributions not only help to identify the shortcomings of existing empirical models but also assist the design of future reaction networks.
[LG-63] Fast and Accurate Antibody Sequence Design via Structure Retrieval
链接: https://arxiv.org/abs/2502.19395
作者: Xingyi Zhang,Kun Xie,Ningqiao Huang,Wei Liu,Peilin Zhao,Sibo Wang,Kangfei Zhao,Biaobin Jiang
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advancements in protein design have leveraged diffusion models to generate structural scaffolds, followed by a process known as protein inverse folding, which involves sequence inference on these scaffolds. However, these methodologies face significant challenges when applied to hyper-variable structures such as antibody Complementarity-Determining Regions (CDRs), where sequence inference frequently results in non-functional sequences due to hallucinations. Distinguished from prevailing protein inverse folding approaches, this paper introduces Igseek, a novel structure-retrieval framework that infers CDR sequences by retrieving similar structures from a natural antibody database. Specifically, Igseek employs a simple yet effective multi-channel equivariant graph neural network to generate high-quality geometric representations of CDR backbone structures. Subsequently, it aligns sequences of structurally similar CDRs and utilizes structurally conserved sequence motifs to enhance inference accuracy. Our experiments demonstrate that Igseek not only proves to be highly efficient in structural retrieval but also outperforms state-of-the-art approaches in sequence recovery for both antibodies and T-Cell Receptors, offering a new retrieval-based perspective for therapeutic protein design.
[LG-64] Towards More Accurate Full-Atom Antibody Co-Design
链接: https://arxiv.org/abs/2502.19391
作者: Jiayang Wu,Xingyi Zhang,Xiangyu Dong,Kun Xie,Ziqi Liu,Wensheng Gan,Sibo Wang,Le Song
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Antibody co-design represents a critical frontier in drug development, where accurate prediction of both 1D sequence and 3D structure of complementarity-determining regions (CDRs) is essential for targeting specific epitopes. Despite recent advances in equivariant graph neural networks for antibody design, current approaches often fall short in capturing the intricate interactions that govern antibody-antigen recognition and binding specificity. In this work, we present Igformer, a novel end-to-end framework that addresses these limitations through innovative modeling of antibody-antigen binding interfaces. Our approach refines the inter-graph representation by integrating personalized propagation with global attention mechanisms, enabling comprehensive capture of the intricate interplay between local chemical interactions and global conformational dependencies that characterize effective antibody-antigen binding. Through extensive validation on epitope-binding CDR design and structure prediction tasks, Igformer demonstrates significant improvements over existing methods, suggesting that explicit modeling of multi-scale residue interactions can substantially advance computational antibody design for therapeutic applications.
[LG-65] Enhancing Gradient-based Discrete Sampling via Parallel Tempering
链接: https://arxiv.org/abs/2502.19240
作者: Luxu Liang,Yuhang Jia,Feng Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 24 pages, 5 figures. arXiv admin note: text overlap with arXiv:2402.17699 by other authors
点击查看摘要
Abstract:While gradient-based discrete samplers are effective in sampling from complex distributions, they are susceptible to getting trapped in local minima, particularly in high-dimensional, multimodal discrete distributions, owing to the discontinuities inherent in these landscapes. To circumvent this issue, we combine parallel tempering, also known as replica exchange, with the discrete Langevin proposal and develop the Parallel Tempering enhanced Discrete Langevin Proposal (PTDLP), in which chains are simulated at a series of temperatures. Significant energy differences prompt sample swaps, which are governed by a Metropolis criterion specifically designed for discrete sampling to ensure detailed balance is maintained. Additionally, we introduce an automatic scheme to determine the optimal temperature schedule and the number of chains, ensuring adaptability across diverse tasks with minimal tuning. Theoretically, we establish that our algorithm converges non-asymptotically to the target energy and exhibits faster mixing compared to a single chain. Empirical results further emphasize the superiority of our method in sampling from complex, multimodal discrete distributions, including synthetic problems, restricted Boltzmann machines, and deep energy-based models.
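并行回火的关键一步是相邻温度链间按 Metropolis 准则交换状态,以保持细致平衡。下面给出这一交换判据的 NumPy 示意(能量函数与温度取值为玩具设定,离散 Langevin 提议本身未包含在内):

```python
import numpy as np

def try_swap(x_i, x_j, E, beta_i, beta_j, rng):
    """并行回火的副本交换: 以概率
    min(1, exp((beta_i - beta_j) * (E(x_i) - E(x_j))))
    交换相邻逆温度链的状态, 满足细致平衡。"""
    delta = (beta_i - beta_j) * (E(x_i) - E(x_j))
    if np.log(rng.random()) < min(0.0, delta):
        return x_j, x_i, True
    return x_i, x_j, False

# 玩具离散能量: 二值向量的 Ising 型能量
E = lambda x: -float(x @ np.roll(x, 1))
rng = np.random.default_rng(0)
x_cold = rng.choice([-1, 1], size=16)  # 低温链(beta 大), 利用性强
x_hot = rng.choice([-1, 1], size=16)   # 高温链(beta 小), 探索性强
print(try_swap(x_cold, x_hot, E, beta_i=2.0, beta_j=0.5, rng=rng))
```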
[LG-66] Langevin Multiplicative Weights Update with Applications in Polynomial Portfolio Management AAAI-2025
链接: https://arxiv.org/abs/2502.19210
作者: Yi Feng,Xiao Wang,Tian Xie
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Accepted for AAAI-2025
[LG-67] Quantum Annealing Feature Selection on Light-weight Medical Image Datasets
链接: https://arxiv.org/abs/2502.19201
作者: Merlin A. Nau,Luca A. Nutricati,Bruno Camino,Paul A. Warburton,Andreas K. Maier
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We investigate the use of quantum computing algorithms on real quantum hardware to tackle the computationally intensive task of feature selection for light-weight medical image datasets. Feature selection is often formulated as a k of n selection problem, where the complexity grows binomially with increasing k and n. As problem sizes grow, classical approaches struggle to scale efficiently. Quantum computers, particularly quantum annealers, are well-suited for such problems, offering potential advantages in specific formulations. We present a method to solve larger feature selection instances than previously presented on commercial quantum annealers. Our approach combines a linear Ising penalty mechanism with subsampling and thresholding techniques to enhance scalability. The method is tested in a toy problem where feature selection identifies pixel masks used to reconstruct small-scale medical images. The results indicate that quantum annealing-based feature selection is effective for this simplified use case, demonstrating its potential in high-dimensional optimization tasks. However, its applicability to broader, real-world problems remains uncertain, given the current limitations of quantum computing hardware.
[LG-68] Forecasting intermittent time series with Gaussian Processes and Tweedie likelihood
链接: https://arxiv.org/abs/2502.19086
作者: Stefano Damato,Dario Azzimonti,Giorgio Corani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: Under review
[LG-69] Efficient and Accurate Spatial Mixing of Machine Learned Interatomic Potentials for Materials Science
链接: https://arxiv.org/abs/2502.19081
作者: Fraser Birks,Thomas D Swinburne,James R Kermode
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 21 pages, 13 figures. To access the ML-MIX GitHub, click this https URL
[LG-70] Blending Optimal Control and Biologically Plausible Learning for Noise-Robust Physical Neural Networks
链接: https://arxiv.org/abs/2502.19053
作者: Satoshi Sunada,Tomoaki Niiyama,Kazutaka Kanno,Rin Nogami,André Röhm,Takato Awano,Atsushi Uchida
类目: Applied Physics (physics.app-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 28 pages, 10 figures
点击查看摘要
Abstract:The rapidly increasing computational demands for artificial intelligence (AI) have spurred the exploration of computing principles beyond conventional digital computers. Physical neural networks (PNNs) offer efficient neuromorphic information processing by harnessing the innate computational power of physical processes; however, training their weight parameters is computationally expensive. We propose a training approach for substantially reducing this training cost. Our training approach merges an optimal control method for continuous-time dynamical systems with a biologically plausible training method–direct feedback alignment. In addition to the reduction of training time, this approach achieves robust processing even under measurement errors and noise without requiring detailed system information. The effectiveness was numerically and experimentally verified in an optoelectronic delay system. Our approach significantly extends the range of physical systems practically usable as PNNs.
[LG-71] Stationary distribution of node2vec random walks on household models
链接: https://arxiv.org/abs/2502.19039
作者: Lars Schroeder,Clara Stegehuis
类目: Probability (math.PR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 19 pages, 6 figures
点击查看摘要
Abstract:The node2vec random walk has proven to be a key tool in network embedding algorithms. These random walks are tuneable, and their transition probabilities depend on the previous visited node and on the triangles containing the current and the previously visited node. Even though these walks are widely used in practice, most mathematical properties of node2vec walks are largely unexplored, including their stationary distribution. We study the node2vec random walk on community-structured household model graphs. We prove an explicit description of the stationary distribution of node2vec walks in terms of the walk parameters. We then show that by tuning the walk parameters, the stationary distribution can interpolate between uniform, size-biased, or the simple random walk stationary distributions, demonstrating the wide range of possible walks. We further explore these effects on some specific graph settings.
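作为背景,node2vec 随机游走的二阶转移概率由返回参数 p 与进出参数 q 控制,恰如摘要所说"依赖于上一个访问的节点以及包含当前与上一节点的三角形"。下面是这一转移权重的简短实现(node2vec 的标准定义,图为玩具示例):

```python
def node2vec_step_weights(prev, curr, adj, p=1.0, q=1.0):
    """node2vec 的二阶转移: 对当前点 curr 的每个邻居 x,
    未归一化权重取决于 x 与上一步 prev 的关系:
    回到 prev 记 1/p; 与 prev 相邻(构成三角形)记 1; 否则记 1/q。"""
    weights = {}
    for x in adj[curr]:
        if x == prev:
            weights[x] = 1.0 / p
        elif prev in adj[x]:
            weights[x] = 1.0
        else:
            weights[x] = 1.0 / q
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

# 小例子: 一个三角形 0-1-2 加一条悬挂边 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(node2vec_step_weights(prev=0, curr=2, adj=adj, p=4.0, q=0.5))
```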
[LG-72] Graph Neural Networks embedded into Margules model for vapor-liquid equilibria prediction
链接: https://arxiv.org/abs/2502.18998
作者: Edgar Ivan Sanchez Medina,Kai Sundmacher
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Predictive thermodynamic models are crucial for the early stages of product and process design. In this paper the performance of Graph Neural Networks (GNNs) embedded into a relatively simple excess Gibbs energy model, the extended Margules model, for predicting vapor-liquid equilibrium is analyzed. Comparing its performance against the established UNIFAC-Dortmund model shows that the GNN-embedded Margules model achieves an overall lower accuracy. However, higher accuracy is observed for various types of binary mixtures. Moreover, since group contribution methods like UNIFAC are limited by the feasibility of molecular fragmentation or the availability of parameters, the GNN-embedded Margules model offers an alternative for VLE estimation. The findings establish a baseline for the predictive accuracy that simple excess Gibbs energy models combined with GNNs trained solely on infinite dilution data can achieve.
[LG-73] Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
链接: https://arxiv.org/abs/2502.18924
作者: Ziyue Jiang,Yi Ren,Ruiqi Li,Shengpeng Ji,Zhenhui Ye,Chen Zhang,Bai Jionghao,Xiaoda Yang,Jialong Zuo,Yu Zhang,Rui Liu,Xiang Yin,Zhou Zhao
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:
[LG-74] Data-Driven and Theory-Guided Pseudo-Spectral Seismic Imaging Using Deep Neural Network Architectures
链接: https://arxiv.org/abs/2502.18852
作者: Christopher Zerafa
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: 163 pages, 91 figures, 16 tables, PhD Thesis, this https URL
点击查看摘要
Abstract:Full Waveform Inversion (FWI) reconstructs high-resolution subsurface models via multi-variate optimization but faces challenges with solver selection and data availability. Deep Learning (DL) offers a promising alternative, bridging data-driven and physics-based methods. While FWI in DL has been explored in the time domain, the pseudo-spectral approach remains underutilized, despite its success in classical FWI. This thesis integrates pseudo-spectral FWI into DL, formulating both data-driven and theory-guided approaches using Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs). These methods were theoretically derived, tested on synthetic and Marmousi datasets, and compared with deterministic and time-domain approaches. Results show that data-driven pseudo-spectral DNNs outperform classical FWI in deeper and over-thrust regions due to their global approximation capability. Theory-guided RNNs yield greater accuracy, with lower error and better fault identification. While DNNs excel in velocity contrast recovery, RNNs provide superior edge definition and stability in shallow and deep sections. Beyond enhancing FWI performance, this research identifies broader applications of DL-based inversion and outlines future directions for these frameworks.
[LG-75] Nonlinear Sparse Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data
链接: https://arxiv.org/abs/2502.18756
作者: Rong Wu,Ziqi Chen,Gen Li,Hai Shu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Motivation: Biomedical studies increasingly produce multi-view high-dimensional datasets (e.g., multi-omics) that demand integrative analysis. Existing canonical correlation analysis (CCA) and generalized CCA methods address at most two of the following three key aspects simultaneously: (i) nonlinear dependence, (ii) sparsity for variable selection, and (iii) generalization to more than two data views. There is a pressing need for CCA methods that integrate all three aspects to effectively analyze multi-view high-dimensional data. Results: We propose three nonlinear, sparse, generalized CCA methods, HSIC-SGCCA, SA-KGCCA, and TS-KGCCA, for variable selection in multi-view high-dimensional data. These methods extend existing SCCA-HSIC, SA-KCCA, and TS-KCCA from two-view to multi-view settings. While SA-KGCCA and TS-KGCCA yield multi-convex optimization problems solved via block coordinate descent, HSIC-SGCCA introduces a necessary unit-variance constraint previously ignored in SCCA-HSIC, resulting in a nonconvex, non-multiconvex problem. We efficiently address this challenge by integrating the block prox-linear method with the linearized alternating direction method of multipliers. Simulations and TCGA-BRCA data analysis demonstrate that HSIC-SGCCA outperforms competing methods in multi-view variable selection.
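The Hilbert-Schmidt Independence Criterion (HSIC) is the nonlinear-dependence measure underlying HSIC-SGCCA (and SCCA-HSIC before it). A minimal sketch of the biased empirical HSIC estimator between two projected views follows; the paper's actual objective adds sparsity penalties, the unit-variance constraint, and multi-view coupling.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """RBF kernel Gram matrix for a 1-D sample: (n,) -> (n, n)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def empirical_hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two 1-D samples of equal length."""
    n = x.size
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
z = rng.normal(size=200)
print(empirical_hsic(z, np.sin(z)))             # dependent: larger value
print(empirical_hsic(z, rng.normal(size=200)))  # independent: near zero
```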
[LG-76] Adaptive conditional latent diffusion maps beam loss to 2D phase space projections
链接: https://arxiv.org/abs/2502.18684
作者: Alexander Scheinker,Alan Williams
类目: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
*备注:
[LG-77] Tight Bounds on the Binomial CDF and the Minimum of i.i.d. Binomials in terms of KL-Divergence
链接: https://arxiv.org/abs/2502.18611
作者: Xiaohan Zhu,Mesrob I. Ohannessian,Nathan Srebro
类目: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[LG-78] Learning and Computation of Φ-Equilibria at the Frontier of Tractability
链接: https://arxiv.org/abs/2502.18582
作者: Brian Hu Zhang,Ioannis Anagnostides,Emanuel Tewolde,Ratip Emin Berker,Gabriele Farina,Vincent Conitzer,Tuomas Sandholm
类目: Machine Learning (stat.ML); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract: $\Phi$-equilibria – and the associated notion of $\Phi$-regret – are a powerful and flexible framework at the heart of online learning and game theory, whereby enriching the set of deviations $\Phi$ begets stronger notions of rationality. Recently, Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '24) – abbreviated as DFFPS – settled the existence of efficient algorithms when $\Phi$ contains only linear maps under a general, $d$-dimensional convex constraint set $\mathcal{X}$. In this paper, we significantly extend their work by resolving the case where $\Phi$ is $k$-dimensional; degree-$\ell$ polynomials constitute a canonical such example with $k = d^{O(\ell)}$. In particular, positing only oracle access to $\mathcal{X}$, we obtain two main positive results: i) a $\text{poly}(n, d, k, \log(1/\epsilon))$-time algorithm for computing $\epsilon$-approximate $\Phi$-equilibria in $n$-player multilinear games, and ii) an efficient online algorithm that incurs average $\Phi$-regret at most $\epsilon$ using $\text{poly}(d, k)/\epsilon^2$ rounds. We also show nearly matching lower bounds in the online learning setting, thereby obtaining for the first time a family of deviations that captures the learnability of $\Phi$-regret. From a technical standpoint, we extend the framework of DFFPS from linear maps to the more challenging case of maps with polynomial dimension. At the heart of our approach is a polynomial-time algorithm for computing an expected fixed point of any $\phi : \mathcal{X} \to \mathcal{X}$ based on the ellipsoid against hope (EAH) algorithm of Papadimitriou and Roughgarden (JACM '08). In particular, our algorithm for computing $\Phi$-equilibria is based on executing EAH in a nested fashion – each step of EAH itself being implemented by invoking a separate call to EAH.
[LG-79] Colored Jones Polynomials and the Volume Conjecture
链接: https://arxiv.org/abs/2502.18575
作者: Mark Hughes,Vishnu Jejjala,P. Ramadevi,Pratik Roy,Vivek Kumar Singh
类目: Geometric Topology (math.GT); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 27 pages, 16 figures
点击查看摘要
Abstract:Using the vertex model approach for braid representations, we compute polynomials for spin-1 placed on hyperbolic knots up to 15 crossings. These polynomials are referred to as 3-colored Jones polynomials or adjoint Jones polynomials. Training a subset of the data using a fully connected feedforward neural network, we predict the volume of the knot complement of hyperbolic knots from the adjoint Jones polynomial or its evaluations with 99.34% accuracy. A function of the adjoint Jones polynomial evaluated at the phase $q = e^{8\pi i/15}$ predicts the volume with nearly the same accuracy as the neural network. From an analysis of 2-colored and 3-colored Jones polynomials, we conjecture the best phase for $n$-colored Jones polynomials, and use this hypothesis to motivate an improved statement of the volume conjecture. This is tested for knots for which closed form expressions for the $n$-colored Jones polynomial are known, and we show improved convergence to the volume.
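As a toy illustration of the evaluation the paper studies, the sketch below evaluates a Laurent polynomial at the phase $q = e^{8\pi i/15}$; the coefficients here are placeholders, not an actual adjoint Jones polynomial.

```python
import numpy as np

# Placeholder Laurent coefficients {exponent: coefficient}; the adjoint
# Jones polynomials of actual knots come from the paper's dataset.
coeffs = {-4: 1, -2: -1, 0: 2, 1: -1, 3: 1}

q = np.exp(8j * np.pi / 15)   # the phase the paper identifies as best
value = sum(c * q**e for e, c in coeffs.items())
print(abs(value))             # this magnitude feeds the volume prediction
```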
[LG-80] Transfer Learning for Transient Classification: From Simulations to Real Data and ZTF to LSST
链接: https://arxiv.org/abs/2502.18558
作者: Rithwik Gupta,Daniel Muthukrishna
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 1 table
[LG-81] Unraveling particle dark matter with Physics-Informed Neural Networks
链接: https://arxiv.org/abs/2502.17597
作者: M.P. Bento,H.B. Câmara,J.F. Seabra
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 20 LaTeX pages, 11 Figures
点击查看摘要
Abstract:We parametrically solve the Boltzmann equations governing freeze-in dark matter (DM) in alternative cosmologies with Physics-Informed Neural Networks (PINNs), a mesh-free method. Through inverse PINNs, using a single DM experimental point – the observed relic density – we determine the physical attributes of the theory, namely power-law cosmologies, inspired by braneworld scenarios, and particle interaction cross sections. The expansion of the Universe in such alternative cosmologies has been parameterized through a switch-like function reproducing the Hubble law at later times. Without loss of generality, we model this transition more realistically with a smooth function. We predict a distinct pair-wise relationship between power-law exponent and particle interactions: for a given cosmology with negative (positive) exponent, smaller (larger) cross sections are required to reproduce the data. Lastly, via Bayesian methods, we quantify the epistemic uncertainty of theoretical parameters found in inverse problems.
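To make the PINN idea concrete, here is a minimal PyTorch sketch for a generic first-order ODE dY/dx = f(x, Y), standing in for the freeze-in Boltzmann equation; the right-hand side is a toy placeholder, not the paper's collision term.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def rhs(x, y):
    # Toy stand-in for the freeze-in collision term; the actual Boltzmann
    # rhs depends on cross sections and the (modified) Hubble rate.
    return -2.0 * x * y

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x0, y0 = torch.zeros(1, 1), torch.ones(1, 1)   # initial condition Y(0)=1

for step in range(2000):
    x = torch.rand(128, 1, requires_grad=True) * 3.0
    y = net(x)
    dy_dx = torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]
    loss_pde = ((dy_dx - rhs(x, y)) ** 2).mean()   # physics residual
    loss_ic = ((net(x0) - y0) ** 2).mean()         # initial condition
    loss = loss_pde + loss_ic
    opt.zero_grad(); loss.backward(); opt.step()

# dY/dx = -2xY with Y(0)=1 has solution exp(-x^2); compare at x=1.
print(net(torch.tensor([[1.0]])).item(), torch.exp(torch.tensor(-1.0)).item())
```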
[LG-82] Classification and reconstruction for single-pixel imaging with classical and quantum neural networks
链接: https://arxiv.org/abs/2407.12506
作者: Sofya Manko,Dmitry Frolovtsev
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optics (physics.optics)
*备注: 11 pages, 7 figures, 1 table
点击查看摘要
Abstract:Single-pixel cameras are an effective solution for imaging outside the visible spectrum, where traditional CMOS/CCD cameras face challenges. Combined with machine learning, they can analyze images quickly enough for practical applications. Solving the problem of high-dimensional single-pixel visualization can potentially be accelerated using quantum machine learning, thereby expanding the range of practical problems. In this work we simulated a single-pixel imaging experiment using Hadamard basis patterns, where images from the MNIST handwritten digit dataset were used as objects. We selected the 64 measurements with maximum variance (6% of the number of pixels in the image). We created algorithms for classifying and reconstructing images based on these measurements using classical fully connected neural networks and parameterized quantum circuits. Classical and quantum classifiers showed accuracies of 96% and 95% respectively after 6 training epochs, which is a quite competitive result. Image reconstruction was also demonstrated using classical and quantum neural networks after 10 training epochs; the structural similarity index measure values were 0.76 and 0.25, respectively, which indicates that, in this formulation and configuration, the problem is still too difficult for quantum neural networks.
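A minimal sketch of the measurement-selection step follows: build Hadamard patterns, simulate single-pixel measurements, keep the 64 highest-variance patterns, and reconstruct with the adjoint. Random images stand in for MNIST (which would be padded from 28x28 to 32x32 to match a power-of-two Hadamard order).

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
N = 1024                      # 32x32 images (MNIST padded from 28x28)
H = hadamard(N)               # Hadamard basis patterns, entries +/-1

# Stand-in dataset: random images; in the paper these are MNIST digits.
images = rng.random((500, N))
Y = images @ H.T              # all N single-pixel measurements per image

# Keep the 64 patterns (~6% of N) with highest measurement variance.
top = np.argsort(Y.var(axis=0))[-64:]
H_sel = H[top]
y_sel = images @ H_sel.T      # the 64 measurements actually acquired

# Naive reconstruction via the adjoint of the selected patterns
# (H is orthogonal up to a factor N on the full basis).
recon = y_sel @ H_sel / N
print(recon.shape)            # (500, 1024)
```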
信息检索
[IR-0] Agent-centric Information Access
链接: https://arxiv.org/abs/2502.19298
作者: Evangelos Kanoulas,Panagiotis Eustratiadis,Yongkang Li,Yougang Lyu,Vaishali Pal,Gabrielle Poerwawinata,Jingfen Qiao,Zihan Wang
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:As large language models (LLMs) become more specialized, we envision a future where millions of expert LLMs exist, each trained on proprietary data and excelling in specific domains. In such a system, answering a query requires selecting a small subset of relevant models, querying them efficiently, and synthesizing their responses. This paper introduces a framework for agent-centric information access, where LLMs function as knowledge agents that are dynamically ranked and queried based on their demonstrated expertise. Unlike traditional document retrieval, this approach requires inferring expertise on the fly, rather than relying on static metadata or predefined model descriptions. This shift introduces several challenges, including efficient expert selection, cost-effective querying, response aggregation across multiple models, and robustness against adversarial manipulation. To address these issues, we propose a scalable evaluation framework that leverages retrieval-augmented generation and clustering techniques to construct and assess thousands of specialized models, with the potential to scale toward millions.
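A minimal sketch of the selection step in such a system: score candidate expert models against a query embedding and query only the top-k. The expertise vectors below are random placeholders, whereas the paper's point is precisely that expertise must be inferred on the fly rather than read from static profiles.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, dim, k = 1000, 64, 5

# Placeholder expertise vectors; the paper infers these dynamically
# (e.g., from clustered responses), not from static metadata.
expert_profiles = rng.normal(size=(n_experts, dim))
expert_profiles /= np.linalg.norm(expert_profiles, axis=1, keepdims=True)

query = rng.normal(size=dim)
query /= np.linalg.norm(query)

scores = expert_profiles @ query           # cosine similarity
top_k = np.argsort(scores)[-k:][::-1]      # the experts to actually query
print(top_k, scores[top_k])
```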
[IR-1] UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
链接: https://arxiv.org/abs/2502.19178
作者: Langming Liu,Shilei Liu,Yujin Yuan,Yizhen Zhang,Bencheng Yan,Zhiyuan Zeng,Zihao Wang,Jiaqi Liu,Di Wang,Wenbo Su,Pengjie Wang,Jian Xu,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 3 figures, 7 tables
点击查看摘要
Abstract:Large language models (LLMs) achieve remarkable success in natural language processing (NLP). In practical scenarios like recommendations, as users increasingly seek personalized experiences, it becomes crucial to incorporate user interaction history into the context of LLMs to enhance personalization. However, from a practical utility perspective, user interactions’ extensive length and noise present challenges when used directly as text prompts. A promising solution is to compress and distill interactions into compact embeddings, serving as soft prompts to assist LLMs in generating personalized responses. Although this approach brings efficiency, a critical concern emerges: Can user embeddings adequately capture valuable information and prompt LLMs? To address this concern, we propose UQABench, a benchmark designed to evaluate the effectiveness of user embeddings in prompting LLMs for personalization. We establish a fair and standardized evaluation process, encompassing pre-training, fine-tuning, and evaluation stages. To thoroughly evaluate user embeddings, we design three dimensions of tasks: sequence understanding, action prediction, and interest perception. These evaluation tasks cover the industry’s demands in traditional recommendation tasks, such as improving prediction accuracy, and its aspirations for LLM-based methods, such as accurately understanding user interests and enhancing the user experience. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings. Additionally, we reveal the scaling laws of leveraging user embeddings to prompt LLMs. The benchmark is available online.
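The mechanism being benchmarked can be sketched as follows: a compressed user embedding is projected into a handful of soft-prompt vectors and prepended to the LLM's token embeddings. The dimensions and the single-linear projection below are illustrative assumptions, not the benchmark's fixed choices.

```python
import torch

class UserSoftPrompt(torch.nn.Module):
    """Project a compressed user embedding into k soft-prompt vectors
    that are prepended to the LLM's token embeddings (a sketch)."""

    def __init__(self, user_dim=128, model_dim=512, k=4):
        super().__init__()
        self.k, self.model_dim = k, model_dim
        self.proj = torch.nn.Linear(user_dim, k * model_dim)

    def forward(self, user_emb, token_embs):
        # user_emb: (B, user_dim); token_embs: (B, T, model_dim)
        B = user_emb.size(0)
        prompts = self.proj(user_emb).view(B, self.k, self.model_dim)
        return torch.cat([prompts, token_embs], dim=1)   # (B, k+T, D)

mod = UserSoftPrompt()
out = mod(torch.randn(2, 128), torch.randn(2, 10, 512))
print(out.shape)   # torch.Size([2, 14, 512])
```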
[IR-2] A 106K Multi-Topic Multilingual Conversational User Dataset with Emoticons
链接: https://arxiv.org/abs/2502.19108
作者: Heng Er Metilda Chee,Jiayin Wang,Zhiqiang Guo,Weizhi Ma,Qinglang Guo,Min Zhang
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:Instant messaging has become a predominant form of communication, with texts and emoticons enabling users to express emotions and ideas efficiently. Emoticons, in particular, have gained significant traction as a medium for conveying sentiments and information, leading to the growing importance of emoticon retrieval and recommendation systems. However, one of the key challenges in this area has been the absence of datasets that capture both the temporal dynamics and user-specific interactions with emoticons, limiting the progress of personalized user modeling and recommendation approaches. To address this, we introduce the emoticon dataset, a comprehensive resource that includes time-based data along with anonymous user identifiers across different conversations. As the largest publicly accessible emoticon dataset to date, it comprises 22K unique users, 370K emoticons, and 8.3M messages. The data was collected from a widely-used messaging platform across 67 conversations and 720 hours of crawling. Strict privacy and safety checks were applied to ensure the integrity of both text and image data. Spanning across 10 distinct domains, the emoticon dataset provides rich insights into temporal, multilingual, and cross-domain behaviors, which were previously unavailable in other emoticon-based datasets. Our in-depth experiments, both quantitative and qualitative, demonstrate the dataset’s potential in modeling user behavior and personalized recommendation systems, opening up new possibilities for research in personalized retrieval and conversational AI. The dataset is freely accessible.
[IR-3] OntologyRAG: Better and Faster Biomedical Code Mapping with Retrieval-Augmented Generation (RAG) Leveraging Ontology Knowledge Graphs and Large Language Models ECIR2025
链接: https://arxiv.org/abs/2502.18992
作者: Hui Feng,Yuntzu Yin,Emiliano Reynares,Jay Nanavati
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted as a workshop paper for KEIR@ECIR 2025
点击查看摘要
Abstract:Biomedical ontologies, which comprehensively define concepts and relations for biomedical entities, are crucial for structuring and formalizing domain-specific information representations. Biomedical code mapping identifies similarity or equivalence between concepts from different ontologies. Obtaining high-quality mapping usually relies on automatic generation of unrefined mapping with ontology domain fine-tuned language models (LMs), followed by manual selections or corrections by coding experts who have extensive domain expertise and familiarity with ontology schemas. The LMs usually provide unrefined code mapping suggestions as a list of candidates without reasoning or supporting evidence, hence coding experts still need to verify each suggested candidate against ontology sources to pick the best matches. This is also a recurring task as ontology sources are updated regularly to incorporate new research findings. Consequently, the need of regular LM retraining and manual refinement make code mapping time-consuming and labour intensive. In this work, we created OntologyRAG, an ontology-enhanced retrieval-augmented generation (RAG) method that leverages the inductive biases from ontological knowledge graphs for in-context-learning (ICL) in large language models (LLMs). Our solution grounds LLMs to knowledge graphs with unrefined mappings between ontologies and processes questions by generating an interpretable set of results that include prediction rational with mapping proximity assessment. Our solution doesn’t require re-training LMs, as all ontology updates could be reflected by updating the knowledge graphs with a standard process. Evaluation results on a self-curated gold dataset show promises of using our method to enable coding experts to achieve better and faster code mapping. The code is available at this https URL.
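A hedged sketch of the retrieval step in an ontology-grounded RAG flow follows: embed a source concept, retrieve candidate target concepts from a vector index, and assemble an in-context prompt. The embedding function and all data below are placeholders; OntologyRAG additionally grounds candidates in an ontology knowledge graph with unrefined mappings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder target-ontology concepts and a precomputed embedding index.
target_concepts = ["Myocardial infarction", "Cardiac arrest", "Angina pectoris"]
index = rng.normal(size=(3, 128))

def embed(text):
    # Placeholder for a real sentence-embedding model.
    return rng.normal(size=128)

q = embed("Heart attack")
scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
top = np.argsort(scores)[::-1][:2]
context = "\n".join(f"- candidate: {target_concepts[i]}" for i in top)
prompt = f"Map 'Heart attack' to the target ontology.\nCandidates:\n{context}"
print(prompt)   # the LLM would rank candidates and explain its choice
```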
[IR-4] OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
链接: https://arxiv.org/abs/2502.18965
作者: Jiaxin Deng,Shiyao Wang,Kuo Cai,Lejian Ren,Qigen Hu,Weifeng Ding,Qiang Luo,Guorui Zhou
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, where the generative model functions only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user’s historical behavior sequences and gradually decodes the videos that the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) a session-wise generation approach. In contrast to traditional next-item prediction, we propose a session-wise generation, which is more elegant and contextually coherent than point-by-point generation that relies on hand-crafted rules to properly combine the generated results. 3) an Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user’s browsing request, making it impossible to obtain positive and negative samples simultaneously. To address this limitation, we design a reward model to simulate user generation and customize the sampling strategy. Extensive experiments have demonstrated that a limited number of DPO samples can align user interest preferences and significantly improve the quality of generated results. We deployed OneRec in the main scene of Kuaishou, achieving a 1.6% increase in watch-time, which is a substantial improvement.
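The preference-alignment component builds on the standard DPO objective, which only needs per-sequence log-probabilities under the policy and a frozen reference. A minimal sketch of that loss follows; OneRec's contribution is how the (chosen, rejected) pairs are constructed via a reward model, which this sketch does not reproduce.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (chosen, rejected) log-probabilities.

    OneRec cannot observe both outcomes per request, so the paper pairs
    samples via a reward model; this sketch shows only the loss itself.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy log-probabilities for a batch of 4 preference pairs.
lw = torch.tensor([-2.0, -1.5, -3.0, -2.2])
ll = torch.tensor([-2.5, -2.0, -2.9, -3.1])
rw = torch.tensor([-2.1, -1.6, -3.1, -2.4])
rl = torch.tensor([-2.4, -1.9, -3.0, -3.0])
print(dpo_loss(lw, ll, rw, rl).item())
```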
[IR-5] A Multifacet Hierarchical Sentiment-Topic Model with Application to Multi-Brand Online Review Analysis
链接: https://arxiv.org/abs/2502.18927
作者: Qiao Liang,Xinwei Deng
类目: Information Retrieval (cs.IR); Methodology (stat.ME)
*备注: 21 pages, 6 figures, 4 tables
点击查看摘要
Abstract:Multi-brand analysis based on review comments and ratings is a commonly used strategy to compare different brands in marketing. It can help consumers make more informed decisions and help marketers understand their brand’s position in the market. In this work, we propose a multifacet hierarchical sentiment-topic model (MH-STM) to detect brand-associated sentiment polarities towards multiple comparative aspects from online customer reviews. The proposed method is built on a unified generative framework that explains review words with a hierarchical brand-associated topic model and the overall polarity score with a regression model on the empirical topic distribution. Moreover, a novel hierarchical Polya urn (HPU) scheme is proposed to enhance the topic-word association among topic hierarchy, such that the general topics shared by all brands are separated effectively from the unique topics specific to individual brands. The performance of the proposed method is evaluated on both synthetic data and two real-world review corpora. Experimental studies demonstrate that the proposed method can be effective in detecting reasonable topic hierarchy and deriving accurate brand-associated rankings on multi-aspects.
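For context, the classical Polya urn scheme that the paper's hierarchical variant (HPU) generalizes works by reinforcement: each draw adds another ball of the drawn color, producing rich-get-richer topic-word counts. A minimal sketch of the vanilla urn follows; the HPU additionally propagates counts along the topic hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.ones(5)                    # one "ball" per topic-word to start

for _ in range(1000):
    p = counts / counts.sum()          # draw proportional to current counts
    i = rng.choice(len(counts), p=p)
    counts[i] += 1.0                   # rich-get-richer reinforcement

print(counts / counts.sum())           # a few colors dominate after many draws
```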
[IR-6] Hierarchical corpus encoder: Fusing generative retrieval and dense indices
链接: https://arxiv.org/abs/2502.18877
作者: Tongfei Chen,Ankita Sharma,Adam Pauls,Benjamin Van Durme
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Generative retrieval employs sequence models for conditional generation of document IDs based on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, it is a challenge to support documents not seen during training. We identify that the performance of generative retrieval lies in contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE outperforms generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing the easy addition and removal of documents to the index.
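The diagnosis, that the gains come from contrastive training between sibling nodes, suggests a simple objective: cross-entropy that pulls a query toward the correct child of each internal node against its siblings. A minimal sketch under assumed shapes follows; HCE's full training couples this with dense encoders over the document hierarchy.

```python
import torch
import torch.nn.functional as F

def sibling_contrastive_loss(query, siblings, positive_idx, tau=0.05):
    """query: (D,); siblings: (S, D) embeddings of sibling nodes;
    the positive is the node on the path to the target document."""
    q = F.normalize(query, dim=0)
    s = F.normalize(siblings, dim=1)
    logits = (s @ q) / tau                          # (S,) similarity scores
    target = torch.tensor(positive_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

loss = sibling_contrastive_loss(torch.randn(64), torch.randn(8, 64), 3)
print(loss.item())
```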
[IR-7] On Aggregation Queries over Predicted Nearest Neighbors
链接: https://arxiv.org/abs/2502.18803
作者: Carrie Wang,Sihem Amer-Yahia,Laks V. S. Lakshmanan,Reynold Cheng
类目: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Information Retrieval (cs.IR)
*备注: 14 pages, 11 figures, 9 tables
点击查看摘要
Abstract:We introduce Aggregation Queries over Nearest Neighbors (AQNNs), a novel type of aggregation queries over the predicted neighborhood of a designated object. AQNNs are prevalent in modern applications where, for instance, a medical professional may want to compute “the average systolic blood pressure of patients whose predicted condition is similar to a given insomnia patient”. Since prediction typically involves an expensive deep learning model or a human expert, we formulate query processing as the problem of returning an approximate aggregate by combining an expensive oracle and a cheaper model (e.g., a simple ML model) to compute the predictions. We design the Sampler with Precision-Recall in Target (SPRinT) framework for answering AQNNs. SPRinT consists of sampling, nearest neighbor refinement, and aggregation, and is tailored for various aggregation functions. It enjoys provable theoretical guarantees, including bounds on sample size and on error in approximate aggregates. Our extensive experiments on medical, e-commerce, and video datasets demonstrate that SPRinT consistently achieves the lowest aggregation error with minimal computation cost compared to its baselines. Scalability results show that SPRinT’s execution time and aggregation error remain stable as the dataset size increases, confirming its suitability for large-scale applications.
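The oracle-plus-proxy idea can be sketched in a few lines: a cheap model filters a candidate neighborhood, an expensive oracle verifies a budgeted sample, and the aggregate is computed over verified neighbors. The code below is a toy illustration of that pattern, not SPRinT itself or its theoretical guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
values = rng.normal(120.0, 15.0, size=n)       # e.g., systolic blood pressure

true_neighbor = rng.random(n) < 0.1            # oracle's (hidden) labels
proxy_score = true_neighbor + rng.normal(0, 0.4, size=n)  # noisy cheap model

def oracle(idx):
    """Expensive predictor (deep model or human expert) on one item."""
    return true_neighbor[idx]

# 1) the proxy filters a candidate pool; 2) the oracle verifies a
# budgeted sample; 3) aggregate over verified neighbors only.
candidates = np.argsort(proxy_score)[-1500:]
sample = rng.choice(candidates, size=300, replace=False)
verified = [i for i in sample if oracle(i)]

approx = values[verified].mean()
exact = values[true_neighbor].mean()
print(round(approx, 2), round(exact, 2))
```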
[IR-8] Training Large Recommendation Models via Graph-Language Token Alignment WWW'25
链接: https://arxiv.org/abs/2502.18757
作者: Mingdai Yang,Zhiwei Liu,Liangwei Yang,Xiaolong Liu,Chen Wang,Hao Peng,Philip S. Yu
类目: Information Retrieval (cs.IR)
*备注: 5 pages. Accepted by www’25 as short paper
点击查看摘要
Abstract:Recommender systems (RS) have become essential tools for helping users efficiently navigate the overwhelming amount of information on e-commerce and social platforms. However, traditional RS relying on Collaborative Filtering (CF) struggle to integrate the rich semantic information from textual data. Meanwhile, large language models (LLMs) have shown promising results in natural language processing, but directly using LLMs for recommendation introduces challenges, such as ambiguity in generating item predictions and inefficiencies in scalability. In this paper, we propose GLTA, a novel framework to train Large Recommendation models via Graph-Language Token Alignment. By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction, eliminating the ambiguity of free-form text as recommendation results. Extensive experiments on three benchmark datasets demonstrate the effectiveness of GLTA, with ablation studies validating each component.
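A minimal sketch of the alignment idea under assumed shapes: project pretrained graph node embeddings into the LLM's embedding space and score items by inner product with the LLM's hidden state, so item prediction becomes a softmax over item tokens rather than free-form text. This is an illustration, not GLTA's exact architecture or GLLM objective.

```python
import torch
import torch.nn.functional as F

num_items, graph_dim, llm_dim = 1000, 64, 512
node_embs = torch.randn(num_items, graph_dim)   # from a pretrained GNN (assumed)
proj = torch.nn.Linear(graph_dim, llm_dim)      # graph-to-LLM alignment layer

item_tokens = proj(node_embs)                   # (num_items, llm_dim)
hidden = torch.randn(8, llm_dim)                # LLM final state per user (B, D)
logits = hidden @ item_tokens.T                 # (B, num_items) item scores
target = torch.randint(num_items, (8,))         # next-item labels
loss = F.cross_entropy(logits, target)          # end-to-end item prediction
print(loss.item())
```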
[IR-9] Disrupt Your Research Using Generative AI Powered ScienceSage AAAI2025
链接: https://arxiv.org/abs/2502.18479
作者: Yong Zhang,Eric Herrison Gyamfi,Kelly Anderson,Sasha Roberts,Matt Barker
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted by Workshop of Deployable AI at AAAI 2025
点击查看摘要
Abstract:Large Language Models (LLM) are disrupting science and research in different subjects and industries. Here we report a minimum-viable-product (MVP) web application called ScienceSage. It leverages generative artificial intelligence (GenAI) to help researchers disrupt the speed, magnitude and scope of product innovation. ScienceSage enables researchers to build, store, update and query a knowledge base (KB). A KB codifies user’s knowledge/information of a given domain in both vector index and knowledge graph (KG) index for efficient information retrieval and query. The knowledge/information can be extracted from user’s textual documents, images, videos, audios and/or the research reports generated based on a research question and the latest relevant information on internet. The same set of KBs interconnect three functions on ScienceSage: ‘Generate Research Report’, ‘Chat With Your Documents’ and ‘Chat With Anything’. We share our learning to encourage discussion and improvement of GenAI’s role in scientific research.
[IR-10] Using LLM-Based Approaches to Enhance and Automate Topic Labeling
链接: https://arxiv.org/abs/2502.18469
作者: Trishia Khandelwal
类目: Information Retrieval (cs.IR)
*备注: 7 pages, 2 tables
点击查看摘要
Abstract:Topic modeling has become a crucial method for analyzing text data, particularly for extracting meaningful insights from large collections of documents. However, the output of these models typically consists of lists of keywords that require manual interpretation for precise labeling. This study explores the use of Large Language Models (LLMs) to automate and enhance topic labeling by generating more meaningful and contextually appropriate labels. After applying BERTopic for topic modeling, we explore different approaches to select keywords and document summaries within each topic, which are then fed into an LLM to generate labels. Each approach prioritizes different aspects, such as dominant themes or diversity, to assess their impact on label quality. Additionally, recognizing the lack of quantitative methods for evaluating topic labels, we propose a novel metric that measures how semantically representative a label is of all documents within a topic.
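A minimal sketch of the labeling step follows, assuming a BERTopic-style list of weighted keywords per topic and a hypothetical `ask_llm` wrapper for whichever chat-completion API is available; the prompt wording and the helper are illustrative, not the paper's.

```python
# Sketch of LLM-assisted topic labeling. Both `ask_llm` and the prompt
# template are assumptions for illustration, not the paper's pipeline.

def build_label_prompt(keywords, doc_summaries):
    kw = ", ".join(w for w, _ in sorted(keywords, key=lambda p: -p[1])[:10])
    docs = "\n".join(f"- {s}" for s in doc_summaries[:3])
    return (
        "Propose a short, specific label for a topic with these keywords: "
        f"{kw}\nRepresentative documents:\n{docs}\nLabel:"
    )

def ask_llm(prompt):
    # Placeholder: wire this to an actual LLM client of your choice.
    raise NotImplementedError

topic_keywords = [("battery", 0.9), ("charging", 0.7), ("range", 0.5)]
summaries = ["Review of EV charging speeds", "Battery degradation over time"]
print(build_label_prompt(topic_keywords, summaries))
```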
附件下载
点击下载今日全部论文列表