This post contains the latest paper list retrieved from arXiv.org on 2025-02-19. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-02-19)

598 papers are updated today, including:

  • Natural Language Processing: 176 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 187 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 217 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Quick Read: This paper addresses the vulnerability of large language models (LLMs) to prompt injection, backdoor, and adversarial attacks, which manipulate prompts or models to produce harmful outputs. The key contribution is UniGuardian, a unified defense mechanism for detecting prompt injection, backdoor, and adversarial attacks in LLMs. In addition, a single-forward strategy optimizes the detection pipeline so that attack detection and text generation happen within a single forward pass.

Link: https://arxiv.org/abs/2502.13141
Authors: Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
Affiliations: Rochester Institute of Technology; Tufts University; University of Rochester; NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures, 5 tables. Keywords: Attack Defending, Security, Prompt Injection, Backdoor Attacks, Adversarial Attacks, Prompt Trigger Attacks

Abstract:Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.

[NLP-1] Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions

Quick Read: This paper presents an end-to-end framework for generating synthetic users to evaluate interactive agents that encourage positive behavior change, such as health and lifestyle coaching. The key is a two-stage approach: first, structured data are generated grounded in real-world health and lifestyle factors as well as basic demographics and behavioral attributes; second, full user profiles are developed conditioned on that structured data. The framework simulates interactions between synthetic users and a coaching agent with generative agent-based models or language models, and its validity is verified through blinded expert evaluations, ensuring that the synthetic users reflect real human users more faithfully in terms of health and behavioral attributes.

Link: https://arxiv.org/abs/2502.13135
Authors: Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
Affiliations: Google DeepMind; Verily Life Sciences; Google
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present an end-to-end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated grounded in real-world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent-based models such as Concordia, or directly by prompting a language model. Using two independently-developed agents for sleep and diabetes coaching as case studies, the validity of this framework is demonstrated by analyzing the coaching agent’s understanding of the synthetic users’ needs and challenges. Finally, through multiple blinded evaluations of user-coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes more accurately portray real human users with the same attributes, compared to generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.

[NLP-2] Rethinking Diverse Human Preference Learning through Principal Component Analysis

Quick Read: This paper addresses the need to understand human preferences for improving foundation models and building personalized AI systems, where traditional reward models struggle to capture the full range of diverse and complex preferences. The key is Decomposed Reward Models (DRMs), which extract diverse human preferences from binary comparisons without fine-grained annotations. The core innovation is to represent human preferences as vectors and analyze them with Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference; these decomposed rewards can be flexibly combined to suit different user needs, offering an interpretable and scalable alternative to traditional reward models.

Link: https://arxiv.org/abs/2502.13131
Authors: Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages

Abstract:Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
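The PCA step at the core of DRMs is concrete enough to sketch. Below is a minimal illustration of the described idea, assuming a generic sentence-embedding function `embed` that returns a 1-D numpy vector; this is an editor's sketch, not the authors' code:

```python
# Sketch: decomposed reward basis via PCA over embedding differences
# between preferred and rejected responses. `embed` is a hypothetical
# stand-in for any text-embedding model.
import numpy as np
from sklearn.decomposition import PCA

def decomposed_reward_basis(preferred, rejected, embed, k=8):
    diffs = np.stack([embed(p) - embed(r) for p, r in zip(preferred, rejected)])
    pca = PCA(n_components=k).fit(diffs)
    return pca.components_  # k orthogonal "preference direction" vectors

def reward(text, embed, basis, user_weights):
    # Combine the decomposed rewards with user-specific weights.
    return float(user_weights @ (basis @ embed(text)))
```

Different `user_weights` then express different preference profiles without retraining, which is what makes the decomposition adaptable to new users.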

[NLP-3] Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

Quick Read: This paper addresses the shortcomings of large language models (LLMs) in long-context understanding. Although LLMs can process sequences from 2K to 2M tokens and beyond, simply increasing input length does not necessarily improve long-context understanding. The key is to integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner. To this end, the paper introduces LongFinanceQA, a synthetic dataset in the financial domain that includes intermediate CoT reasoning before the final conclusion, prompting LLMs to reason explicitly and improving accuracy and transparency in interpreting long contexts. To generate the synthetic CoT reasoning, the authors propose Property-driven Agentic Inference (PAI), a framework that simulates human-like reasoning steps such as property extraction, retrieval, and summarization.

Link: https://arxiv.org/abs/2502.13127
Authors: Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo
Affiliations: PAII Inc.; University of Rochester, NY, United States; University of California, Merced, CA, United States
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 6 tables, 8 figures

Abstract:Recent advances in Large Language Models (LLMs) have enabled them to process increasingly longer sequences, ranging from 2K to 2M tokens and even beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-driven Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI’s reasoning capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark, outperforming standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong’s financial subset.

[NLP-4] RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises

Quick Read: This paper addresses the limited ability of large language models (LLMs) to identify and respond to text containing logical fallacies or deliberately misleading premises. To tackle this, it introduces RuozhiBench, a bilingual dataset of 677 carefully curated questions covering various forms of deceptive reasoning, crafted through extensive human effort and expert review.

Link: https://arxiv.org/abs/2502.13125
Authors: Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
Affiliations: LibrAI; MBZUAI; The University of Melbourne
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from 5 Series over RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses on evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to the human of more than 90%.

[NLP-5] NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Quick Read: This paper tackles the challenge of scaling reasoning capabilities beyond traditional domains such as math and coding, which is hindered by the lack of diverse, high-quality questions. The key is a scalable approach for generating diverse and challenging reasoning questions with reference answers across many domains, yielding NaturalReasoning, a comprehensive dataset of 2.8 million questions spanning STEM fields (e.g., physics, computer science), economics, the social sciences, and more. Knowledge distillation experiments show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model, and it also proves effective for unsupervised self-training with external reward models or self-rewarding mechanisms.

Link: https://arxiv.org/abs/2502.13124
Authors: Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Dataset at this https URL

Abstract:Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding.

[NLP-6] Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context ACL2025

Quick Read: This paper investigates how large language models (LLMs) process gender-inclusive language and whether they interpret it neutrally. The key is to analyze whether LLM-generated coreferent terms align with a given gender expression or instead reflect model biases. Adapting psycholinguistic methods from French to English and German, the study finds that in English, LLMs generally maintain the antecedent's gender but exhibit an underlying masculine bias, while in German this bias is much stronger and overrides all tested gender-neutralization strategies.

Link: https://arxiv.org/abs/2502.13120
Authors: Marion Bartl, Thomas Brendan Murphy, Susan Leavy
Affiliations: Insight SFI Research Centre for Data Analytics; School of Information and Communication Studies; University College Dublin; School of Mathematics and Statistics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures, submitted to ACL 2025 (ARR February 2025 cycle)

Abstract:Gender-inclusive language is often used with the aim of ensuring that all individuals, regardless of gender, can be associated with certain concepts. While psycholinguistic studies have examined its effects in relation to human cognition, it remains unclear how Large Language Models (LLMs) process gender-inclusive language. Given that commercial LLMs are gaining an increasingly strong foothold in everyday applications, it is crucial to examine whether LLMs in fact interpret gender-inclusive language neutrally, because the language they generate has the potential to influence the language of their users. This study examines whether LLM-generated coreferent terms align with a given gender expression or reflect model biases. Adapting psycholinguistic methods from French to English and German, we find that in English, LLMs generally maintain the antecedent’s gender but exhibit underlying masculine bias. In German, this bias is much stronger, overriding all tested gender-neutralization strategies.

[NLP-7] STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Quick Read: This paper addresses how to comprehensively evaluate the microeconomic reasoning of large language models (LLMs), particularly in non-strategic settings. Most existing benchmarks focus on specific applications and fail to cover a broad range of economic tasks. To fill this gap, the paper taxonomizes microeconomic reasoning into 58 distinct elements spanning multiple domains, perspectives, and types. The key is auto-STEER, a novel LLM-assisted data generation protocol that produces fresh questions by adapting handwritten templates to new domains and perspectives, automating the creation of diverse test items and mitigating the risk of models overfitting the benchmark.

Link: https://arxiv.org/abs/2502.13119
Authors: Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 11 figures

Abstract:How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into 58 distinct elements, focusing on the logic of supply and demand, each grounded in up to 10 distinct domains, 5 perspectives, and 3 types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on 27 LLMs, ranging from small open-source models to the current state of the art. We examined each model’s ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.

[NLP-8] The influence of motion features in temporal perception

Quick Read: This paper examines the role of manner-of-motion verbs in shaping subjective temporal perception and emotional resonance. Through four complementary studies, it analyzes how these verbs influence the conceptualization of time in both literal and metaphorical (temporal) contexts. The key findings are that fast verbs (e.g., "fly, zoom") evoke dynamic and engaging temporal experiences associated with positive emotion and agency, whereas slow verbs (e.g., "crawl, drag") convey passivity, monotony, and negative emotion. These effects are amplified in metaphorical contexts, where manner verbs encode emotional and experiential nuances beyond their literal meanings, and participants prefer manner verbs over path verbs (e.g., "go, pass") in emotionally charged temporal contexts, since manner verbs capture the experiential and emotional qualities of time more effectively.

Link: https://arxiv.org/abs/2502.13114
Authors: Rosa Illan Castillo, Javier Valenzuela
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper examines the role of manner-of-motion verbs in shaping subjective temporal perception and emotional resonance. Through four complementary studies, we explore how these verbs influence the conceptualization of time, examining their use in literal and metaphorical (temporal) contexts. Our findings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging temporal experiences, often linked to positive emotions and greater agency. In contrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and negative emotions, reflecting tedious or constrained experiences of time. These effects are amplified in metaphorical contexts, where manner verbs encode emotional and experiential nuances that transcend their literal meanings. We also find that participants prefer manner verbs over path verbs (e.g., go, pass) in emotionally charged temporal contexts, as manner verbs capture the experiential and emotional qualities of time more effectively. These findings highlight the interplay between language, motion, and emotion in shaping temporal perception, offering insights into how linguistic framing influences subjective experiences of time.

[NLP-9] Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization

Quick Read: This paper addresses the inability of clinical question answering (CQA) models to categorize the answers they extract from electronic medical records (EMRs), a capability that is critical for structured retrieval, content filtering, and medical decision support. To overcome this limitation, the paper proposes a Multi-Task Learning (MTL) framework that jointly trains a CQA model for answer extraction and medical categorization. By predicting answer spans and classifying them into five standardized medical categories (Diagnosis, Medication, Symptoms, Procedure, and Lab Reports), the model produces more structured and interpretable outputs. The key is the multi-task learning mechanism, which not only improves CQA performance but also introduces an effective approach to categorization and structured medical information retrieval.

Link: https://arxiv.org/abs/2502.13108
Authors: Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
Affiliations: University of Washington; New York University; Liverpool John Moores University; Birla Institute of Technology; Columbia University; Columbia University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Clinical Question Answering (CQA) plays a crucial role in medical decision-making, enabling physicians to extract relevant information from Electronic Medical Records (EMRs). While transformer-based models such as BERT, BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in CQA, existing models lack the ability to categorize extracted answers, which is critical for structured retrieval, content filtering, and medical decision support. To address this limitation, we introduce a Multi-Task Learning (MTL) framework that jointly trains CQA models for both answer extraction and medical categorization. In addition to predicting answer spans, our model classifies responses into five standardized medical categories: Diagnosis, Medication, Symptoms, Procedure, and Lab Reports. This categorization enables more structured and interpretable outputs, making clinical QA models more useful in real-world healthcare settings. We evaluate our approach on emrQA, a large-scale dataset for medical question answering. Results show that MTL improves F1-score by 2.2% compared to standard fine-tuning, while achieving 90.7% accuracy in answer categorization. These findings suggest that MTL not only enhances CQA performance but also introduces an effective mechanism for categorization and structured medical information retrieval.
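The joint extraction-plus-categorization setup lends itself to a compact sketch. The following is an illustrative PyTorch/transformers implementation of a shared encoder with a span head and a five-way category head; the backbone choice and loss weighting are assumptions, not the paper's configuration:

```python
# Sketch: multi-task head for clinical QA (answer span + medical category).
import torch
import torch.nn as nn
from transformers import AutoModel

CATEGORIES = ["Diagnosis", "Medication", "Symptoms", "Procedure", "Lab Reports"]

class MultiTaskCQA(nn.Module):
    def __init__(self, backbone="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        h = self.encoder.config.hidden_size
        self.span_head = nn.Linear(h, 2)               # start/end logits per token
        self.cls_head = nn.Linear(h, len(CATEGORIES))  # answer category

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        start, end = self.span_head(tokens).split(1, dim=-1)
        category = self.cls_head(tokens[:, 0])         # [CLS] representation
        return start.squeeze(-1), end.squeeze(-1), category

# Joint objective (the weighting is an assumption): L = L_span + 0.5 * L_category
```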

[NLP-10] Understanding and Rectifying Safety Perception Distortion in VLMs

Quick Read: This paper addresses the increased susceptibility of vision-language models (VLMs) to harmful requests and jailbreak attacks after integrating the vision modality. The study finds that this stems from safety perception distortion caused by multimodal inputs: compared with their text-only counterparts, multimodal inputs lead the model to systematically overestimate the safety of harmful inputs. To mitigate this, the paper proposes Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the modality's impact on safety, restoring the inherent safety alignment of the LLM backbone while preserving the VLM's vision-language capabilities.

Link: https://arxiv.org/abs/2502.13095
Authors: Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce an modality-induced activation shift toward a “safer” direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of VLMs. Empirical results demonstrate that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.
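The calibration idea can be illustrated with a toy projection. The sketch below assumes access to pooled activations and a precomputed "safety direction"; the paper operates on internal VLM activations, so treat this only as a schematic:

```python
# Toy sketch of the ShiftDC idea: estimate the modality-induced shift,
# project out its safety-relevant component, and keep the rest.
import numpy as np

def shiftdc_calibrate(act_multimodal, act_text_only, safety_dir):
    shift = act_multimodal - act_text_only            # modality-induced shift
    u = safety_dir / np.linalg.norm(safety_dir)       # unit safety direction
    safety_component = (shift @ u) * u                # component along safety axis
    return act_multimodal - safety_component          # remove only that component
```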

[NLP-11] Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Quick Read: This paper addresses a series of challenges in using large language models (LLMs) to generate symbolic world models from textual descriptions, including evaluation randomness, reliance on indirect metrics, and limited domain scope. To overcome these limitations, it introduces Text2World, a new benchmark based on the Planning Domain Definition Language (PDDL) featuring hundreds of diverse domains and multi-criteria, execution-based metrics for more robust evaluation. Benchmarking on Text2World shows that reasoning models trained with large-scale reinforcement learning perform best, demonstrating the benchmark's value for advancing the world-modeling capabilities of LLMs.

Link: https://arxiv.org/abs/2502.13092
Authors: Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
Affiliations: The University of Hong Kong; Shenzhen University; Harbin Institute of Technology; Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at this https URL.

[NLP-12] KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits

Quick Read: This paper addresses the problem that document representations in patent analysis are neither concise nor easy to interpret. To solve this, it proposes the KAPPA framework, whose key component is a semantic-calibrated keyphrase generation paradigm that combines pre-trained language models with a prompt-based hierarchical decoding strategy to exploit the multi-level structural characteristics of patents, thereby constructing effective patent portraits.

Link: https://arxiv.org/abs/2502.13076
Authors: Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang
Affiliations: Tsinghua University; China Intellectual Property Society
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Patent analysis highly relies on concise and interpretable document representations, referred to as patent portraits. Keyphrases, both present and absent, are ideal candidates for patent portraits due to their brevity, representativeness, and clarity. In this paper, we introduce KAPPA, an integrated framework designed to construct keyphrase-based patent portraits and enhance patent analysis. KAPPA operates in two phases: patent portrait construction and portrait-based analysis. To ensure effective portrait construction, we propose a semantic-calibrated keyphrase generation paradigm that integrates pre-trained language models with a prompt-based hierarchical decoding strategy to leverage the multi-level structural characteristics of patents. For portrait-based analysis, we develop a comprehensive framework that employs keyphrase-based patent portraits to enable efficient and accurate patent analysis. Extensive experiments on benchmark datasets of keyphrase generation, the proposed model achieves significant improvements compared to state-of-the-art baselines. Further experiments conducted on real-world patent applications demonstrate that our keyphrase-based portraits effectively capture domain-specific knowledge and enrich semantic representation for patent analysis tasks.

[NLP-13] Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Quick Read: This paper studies the compression of long token sequences into shorter sequences of real-valued vectors that can replace token embeddings or key-value caches and reduce the amount of compute in existing language models. The key is to replace the encoder with a per-sample optimization procedure, achieving compression ratios of up to 1500:1 and revealing a two-orders-of-magnitude gap between current practical solutions and the theoretical limit. The study shows that the compression limit is determined not by input length but by the amount of uncertainty to be reduced, namely the cross-entropy loss on the sequence without any conditioning. This points to a substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.

Link: https://arxiv.org/abs/2502.13063
Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
Affiliations: AIRI, Moscow, Russia; Neural Networks and Deep Learning Lab, MIPT, Dolgoprudny, Russia; Independent Researcher, Amsterdam, Netherlands; London Institute for Mathematical Sciences, London, UK
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:A range of recent works addresses the problem of compression of sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches allow to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights two orders of magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.
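The per-sample optimization procedure that replaces the encoder can be sketched directly: freeze the LM and optimize a few input vectors so the model reconstructs the target sequence. The model choice, step count, and learning rate below are illustrative assumptions:

```python
# Sketch: per-sample optimization of "memory" vectors for a frozen LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)

def compress(text, n_vectors=1, steps=500, lr=0.1):
    ids = tok(text, return_tensors="pt").input_ids                 # (1, T)
    tok_emb = model.get_input_embeddings()(ids)
    mem = torch.randn(1, n_vectors, model.config.n_embd, requires_grad=True)
    opt = torch.optim.Adam([mem], lr=lr)
    for _ in range(steps):
        # Prepend the trainable vectors, then ask the LM to predict the text.
        inputs = torch.cat([mem, tok_emb[:, :-1]], dim=1)
        logits = model(inputs_embeds=inputs).logits[:, n_vectors - 1:]
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), ids.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return mem.detach()   # n_vectors real-valued vectors standing in for the text
```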

[NLP-14] Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection

Quick Read: This paper addresses the poor generalization of large multimodal models for detecting hateful memes on the Internet. The key is Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a novel two-stage fine-tuning framework designed to improve both in-domain accuracy and cross-domain generalization.

Link: https://arxiv.org/abs/2502.13061
Authors: Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Affiliations: University of Cambridge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint. Under review.

Abstract:Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While large multimodal models have shown strong generalization across various tasks, they exhibit poor generalization to hateful meme detection due to the dynamic nature of memes tied to emerging social trends and breaking news. Recent work further highlights the limitations of conventional supervised fine-tuning for large multimodal models in this context. To address these challenges, we propose Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a novel two-stage fine-tuning framework designed to improve both in-domain accuracy and cross-domain generalization. Experimental results on six widely used meme classification datasets demonstrate that LMM-RGCL achieves state-of-the-art performance, outperforming agent-based systems such as VPD-PALI-X-55B. Furthermore, our method effectively generalizes to out-of-domain memes under low-resource settings, surpassing models like GPT-4o.

[NLP-15] SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Quick Read: This paper addresses the reliability and accuracy of multimodal large language model (MLLM) outputs, especially their ability to generate content grounded in factual information such as common and domain-specific knowledge. The key is SimpleVQA, a new benchmark that evaluates the factuality of MLLMs through six key features, including coverage of multiple tasks and scenarios, high-quality and challenging queries, static and timeless reference answers, and straightforward evaluation. SimpleVQA organizes visual question answering items into 9 tasks and implements rigorous quality control to guarantee high-quality, concise, and clear answers.

Link: https://arxiv.org/abs/2502.13059
Authors: Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The increasing application of multi-modal large language models (MLLMs) across various sectors have spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality ability of MLLMs to answer natural language short questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of leading 18 MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.

[NLP-16] AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks

Quick Read: This paper addresses the neglected need for AI agents to identify "impostors" within operating systems and defines a new threat, the Active Environment Injection Attack (AEIA), in which attackers disguise attack payloads as environmental elements to disrupt an agent's decision-making. To evaluate the robustness of MLLM-based agents against this threat, the paper proposes AEIA-MN, a scheme that exploits interaction vulnerabilities in mobile operating systems. Experiments show that even advanced MLLMs are highly vulnerable, reaching an attack success rate of up to 93% on the AndroidWorld benchmark, demonstrating the fragility of MLLM-based agents under disguised attacks.

Link: https://arxiv.org/abs/2502.13053
Authors: Yurun Chen, Xueyu Hu, Keting Yin, Juncheng Li, Shengyu Zhang
Affiliations: Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As researchers continuously optimize AI agents to perform tasks more effectively within operating systems, they often neglect to address the critical need for enabling these agents to identify “impostors” within the system. Through an analysis of the agents’ operating environment, we identified a potential threat: attackers can disguise their attack methods as environmental elements, injecting active disturbances into the agents’ execution process, thereby disrupting their decision-making. We define this type of attack as Active Environment Injection Attack (AEIA). Based on this, we propose AEIA-MN, an active environment injection attack scheme that exploits interaction vulnerabilities in the mobile operating system to evaluate the robustness of MLLM-based agents against such threats. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, achieving a maximum attack success rate of 93% in the AndroidWorld benchmark.

[NLP-17] Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction

Quick Read: This paper addresses the resource-intensive annotation of full training sets for aspect sentiment quadruple prediction (ASQP). The key is to explore the zero- and few-shot capabilities of large language models (LLMs) to reduce reliance on large-scale human annotation while approaching the performance of fine-tuned models. In the 40-shot setting on the Rest16 dataset, LLMs reach an F1 score of 52.46, close to the 60.39 of the best fine-tuned method, MVP, suggesting that LLMs can partially substitute for human annotation and ease the labeling burden in ASQP tasks.

Link: https://arxiv.org/abs/2502.13044
Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Affiliations: University of Regensburg
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Aspect sentiment quadruple prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores slightly below those obtained with state-of-the-art fine-tuned models but exceeding previously reported zero- and few-shot performance. In the 40-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.
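As a rough illustration of the few-shot setup, a prompt for ASQP can be assembled from a handful of annotated examples. The wording and label format below are the editor's assumptions, not the paper's exact prompt:

```python
# Illustrative few-shot prompt construction for ASQP.
def build_asqp_prompt(examples, sentence):
    header = ("Extract all (aspect term, aspect category, sentiment polarity, "
              "opinion term) quadruples from the sentence.\n\n")
    shots = "".join(f"Sentence: {s}\nQuadruples: {q}\n\n" for s, q in examples)
    return header + shots + f"Sentence: {sentence}\nQuadruples:"

demo = [("The pasta was great but the service was slow.",
         "[(pasta, food quality, positive, great), "
         "(service, service general, negative, slow)]")]
print(build_asqp_prompt(demo, "The battery life is amazing."))
```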

[NLP-18] Natural Language Generation from Visual Sequences: Challenges and Future Directions

Quick Read: This paper addresses the understanding of the intricate relationship between complex visual content and the corresponding text in multi-image-to-text generation. The key is an analysis of five tasks involving temporally ordered image sequences, arguing that they pose a common set of challenges and share similarities in modeling and evaluation, and proposing future research directions to advance understanding and model development in this area.

Link: https://arxiv.org/abs/2502.13034
Authors: Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
Affiliations: Institute for Logic, Language and Computation; University of Amsterdam
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The ability to use natural language to talk about visual content is at the core of human intelligence and a crucial feature of any artificial intelligence system. Various studies have focused on generating text for single images. In contrast, comparatively little attention has been paid to exhaustively analyzing and advancing work on multiple-image vision-to-text settings. In this position paper, we claim that any task dealing with temporally ordered sequences of multiple images or frames is an instance of a broader, more general problem involving the understanding of intricate relationships between the visual content and the corresponding text. We comprehensively analyze five tasks that are instances of this problem and argue that they pose a common set of challenges and share similarities in terms of modeling and evaluation approaches. Based on the insights from these various aspects and stages of multi-image-to-text generation, we highlight several open questions and suggest future research directions. We believe that these directions can advance the understanding of complex phenomena in this domain and the development of better models.

[NLP-19] HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Quick Read: This paper addresses the limitation of existing methods that optimize evaluation prompts for large language model (LLM) evaluators by focusing on individual factors while ignoring the combinatorial impact of multiple factors. To remedy this, it proposes Heuristic Prompting Strategy Search (HPSS), which integrates 8 key evaluation prompt factors and, inspired by genetic algorithms, performs an iterative search for effective prompting strategies, significantly improving LLM evaluation performance. The key is that HPSS systematically considers combinations of factors, enabling more comprehensive optimization.

Link: https://arxiv.org/abs/2502.13031
Authors: Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
Affiliations: The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University; University of Electronic Science and Technology of China; Zhipu AI; The Knowledge Engineering Group (KEG), Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments: 32 pages, 10 figures

Abstract:Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods.
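The genetic-algorithm flavor of HPSS can be conveyed with a toy search loop. The factor names, option values, and the fitness function below are placeholders, not the paper's actual 8-factor space:

```python
# Toy sketch of a heuristic/genetic search over prompting-strategy factors.
import random

FACTORS = {
    "criteria": ["generic", "task-specific"],
    "output_format": ["score only", "rationale then score"],
    "scale": ["1-5", "1-10"],
    # ... remaining factors elided
}

def mutate(strategy):
    f = random.choice(list(FACTORS))
    return {**strategy, f: random.choice(FACTORS[f])}

def hpss(evaluate, pop_size=20, generations=10):
    pop = [{f: random.choice(opts) for f, opts in FACTORS.items()}
           for _ in range(pop_size)]
    for _ in range(generations):
        survivors = sorted(pop, key=evaluate, reverse=True)[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=evaluate)   # best strategy found
```

In the paper's setting, `evaluate` would measure a strategy's agreement with human judgments; any heuristic fitness can guide the search in the same way.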

[NLP-20] Whose story is it? Personalizing story generation by inferring author styles

Quick Read: This paper addresses personalization in story generation, an area that remains largely unexplored. The key is a novel two-stage pipeline: first, an author's implicit story-writing characteristics are inferred from their past works and organized into an Author Writing Sheet; second, this sheet is used to simulate the author's persona through tailored persona descriptions and personalized story-writing rules. To validate the approach, the authors construct a dataset of 590 stories and show, through head-to-head comparisons with a non-personalized baseline, that the pipeline effectively generates high-quality personalized stories with clear gains in capturing an author's writing style.

Link: https://arxiv.org/abs/2502.13028
Authors: Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan
Affiliations: University of Massachusetts Amherst; University of Maryland, College Park
Subjects: Computation and Language (cs.CL)
Comments: preprint, 52 pages

Abstract:Personalization has become essential for improving user experience in interactive writing and educational applications, yet its potential in story generation remains largely unexplored. In this work, we propose a novel two-stage pipeline for personalized story generation. Our approach first infers an author’s implicit story-writing characteristics from their past work and organizes them into an Author Writing Sheet, inspired by narrative theory. The second stage uses this sheet to simulate the author’s persona through tailored persona descriptions and personalized story writing rules. To enable and validate our approach, we construct Mythos, a dataset of 590 stories from 64 authors across five distinct sources that reflect diverse story-writing settings. A head-to-head comparison with a non-personalized baseline demonstrates our pipeline’s effectiveness in generating high-quality personalized stories. Our personalized stories achieve a 75 percent win rate (versus 14 percent for the baseline and 11 percent ties) in capturing authors’ writing style based on their past works. Human evaluation highlights the high quality of our Author Writing Sheet and provides valuable insights into the personalized story generation task. Notable takeaways are that writings from certain sources, such as Reddit, are easier to personalize than others, like AO3, while narrative aspects, like Creativity and Language Use, are easier to personalize than others, like Plot.

[NLP-21] Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks

Quick Read: This paper addresses the dynamic construction and expansion of knowledge graphs for continuous organization and updating of knowledge. Unlike conventional methods that rely on static extraction or single-pass learning, the key is to couple a reasoning-capable large language model with a continually updated graph representation. By iteratively generating new concepts and relationships and merging them into a global graph, the system forms a scale-free network with stable modularity and bridging nodes. Over hundreds of iterations the approach keeps producing new nodes and edges while evolving distributed connectivity. This feedback loop lets the knowledge structure grow openly yet coherently, which is particularly useful in areas such as materials design, fostering the synthesis of novel cross-domain knowledge.

Link: https://arxiv.org/abs/2502.13025
Authors: Markus J. Buehler
Affiliations: Massachusetts Institute of Technology
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We present an agentic, autonomous graph expansion framework that iteratively structures and refines knowledge in situ. Unlike conventional knowledge graph construction methods relying on static extraction or single-pass learning, our approach couples a reasoning-native large language model with a continually updated graph representation. At each step, the system actively generates new concepts and relationships, merges them into a global graph, and formulates subsequent prompts based on its evolving structure. Through this feedback-driven loop, the model organizes information into a scale-free network characterized by hub formation, stable modularity, and bridging nodes that link disparate knowledge clusters. Over hundreds of iterations, new nodes and edges continue to appear without saturating, while centrality measures and shortest path distributions evolve to yield increasingly distributed connectivity. Our analysis reveals emergent patterns, such as the rise of highly connected ‘hub’ concepts and the shifting influence of ‘bridge’ nodes, indicating that agentic, self-reinforcing graph construction can yield open-ended, coherent knowledge structures. Applied to materials design problems, we present compositional reasoning experiments by extracting node-specific and synergy-level principles to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that transcend rote summarization and strengthen the framework’s potential for open-ended scientific discovery. We discuss other applications in scientific discovery and outline future directions for enhancing scalability and interpretability.
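The feedback-driven expansion loop is easy to sketch with networkx. `llm_propose` is a placeholder for any model call that returns (source, relation, target) triples, and the hub-centered prompting heuristic is the editor's assumption:

```python
# Sketch of the agentic expansion loop: propose triples, merge them into a
# global graph, and derive the next prompt from the evolving structure.
import networkx as nx

def expand_graph(llm_propose, seed_topic, iterations=100):
    g = nx.DiGraph()
    prompt = f"Propose new concepts and relations about: {seed_topic}"
    for _ in range(iterations):
        for src, rel, dst in llm_propose(prompt):
            g.add_edge(src, dst, relation=rel)        # merge into the global graph
        hub = max(g.degree, key=lambda nd: nd[1])[0]  # highest-degree concept
        prompt = f"Expand the knowledge network around: {hub}"
    return g
```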

[NLP-22] Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation

Quick Read: This paper addresses the tendency of large language models (LLMs) to hallucinate on NLP tasks due to limited parametric knowledge and lack of domain-specific expertise. Retrieval-augmented generation (RAG) extends an LLM's knowledge with external document retrieval, but retrieved sources often contain irrelevant or erroneous information that undermines downstream effectiveness. The key is a compact, efficient, and pluggable module that refines external knowledge before it reaches the generator: it extracts the most relevant and supportive information and reorganizes it into a concise, query-specific format. A three-stage training paradigm, comprising supervised fine-tuning, contrastive multi-task learning, and reinforcement-learning-based alignment, prioritizes critical knowledge and aligns it with the generator's preferences, enabling LLMs to produce more accurate, reliable, and contextually appropriate outputs.

Link: https://arxiv.org/abs/2502.13019
Authors: Sha Li, Naren Ramakrishnan
Affiliations: Virginia Tech; Virginia Tech
Subjects: Computation and Language (cs.CL)
Comments: 14 pages

Abstract:Despite the remarkable capabilities of Large Language Models (LLMs) in various NLP tasks, they remain vulnerable to hallucinations due to their limited parametric knowledge and lack of domain-specific expertise. Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating external document retrieval to augment the knowledge base of LLMs. In this approach, RAG retrieves document chunks from an external corpus in response to a query, which are then used as context for the downstream language model to generate an answer. However, these retrieved knowledge sources often include irrelevant or erroneous information, undermining the effectiveness of RAG in downstream tasks. To overcome this limitation, we introduce a compact, efficient, and pluggable module designed to refine external knowledge sources before feeding them to the generator. The module reconstructs retrieved content by extracting the most relevant and supportive information and reorganising it into a concise, query-specific format. Through a three-stage training paradigm - comprising supervised fine-tuning, contrastive multi-task learning, and reinforcement learning-based alignment - it prioritises critical knowledge and aligns it with the generator’s preferences. This method enables LLMs to produce outputs that are more accurate, reliable, and contextually appropriate.

[NLP-23] Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents

Quick Read: This paper addresses the difficulty of evaluating Role-Playing Agents (RPAs), where existing methods struggle with diverse task requirements and agent designs. The key is an evidence-based, actionable, and generalizable evaluation design guideline. By systematically reviewing 1,676 papers published between January 2021 and December 2024, the study identifies six agent attributes, seven task attributes, and seven evaluation metrics, helping researchers develop more systematic and consistent evaluation methods.

Link: https://arxiv.org/abs/2502.13012
Authors: Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, Dakuo Wang
Affiliations: University of Notre Dame; Northeastern University; University of California, San Diego; University of California, Santa Barbara; Stony Brook University
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that simulates human-like behaviors in a variety of tasks. However, evaluating RPAs is challenging due to diverse task requirements and agent designs. This paper proposes an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based RPA by systematically reviewing 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.

[NLP-24] Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge

Quick Read: This paper addresses the reliability of large language models (LLMs) in medical question answering, which is challenged by the rapid evolution of medical knowledge and the labor-intensive manual updating of domain resources. The key is Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence (such as PubMed and WikiSearch). By dynamically linking new findings and complex medical concepts, AMG-RAG improves both the accuracy and the interpretability of medical queries.

Link: https://arxiv.org/abs/2502.13010
Authors: Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
Affiliations: Department of Biomedical Engineering, University of Toronto; Department of Computer Science, Worcester Polytechnic Institute; Department of Biology, University of Toronto Mississauga; Department of Computer Science, University of Toronto; Vector Institute
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature. However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems. To address this, we introduce Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence, such as PubMed and WikiSearch. By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.

[NLP-25] Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation

Quick Read: This paper addresses the insufficient cross-lingual generalization of speech quality prediction models. The key is to evaluate two deep-learning-based speech quality models, the CNN-based NISQA and a Transformer-based Audio Spectrogram Transformer (AST), on several non-English languages and analyze their performance across five dimensions. The study shows that although the AST model is more stable cross-lingually, both models exhibit notable language biases, with Swedish and Dutch proving especially hard to predict. The paper therefore argues for more balanced multilingual datasets and architecture-specific adaptations to improve cross-lingual generalization.

Link: https://arxiv.org/abs/2502.13004
Authors: Wafaa Wardah, Tuğçe Melike Koçak Büyüktaş, Kirill Shchegelskiy, Sebastian Möller, Robert P. Spang
Affiliations: Technische Universität Berlin; Deutsches Forschungszentrum für Künstliche Intelligenz
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Objective speech quality models aim to predict human-perceived speech quality using automated methods. However, cross-lingual generalization remains a major challenge, as Mean Opinion Scores (MOS) vary across languages due to linguistic, perceptual, and dataset-specific differences. A model trained primarily on English data may struggle to generalize to languages with different phonetic, tonal, and prosodic characteristics, leading to inconsistencies in objective assessments. This study investigates the cross-lingual performance of two speech quality models: NISQA, a CNN-based model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both models were trained exclusively on English datasets containing over 49,000 speech samples and subsequently evaluated on speech in German, French, Mandarin, Swedish, and Dutch. We analyze model performance using Pearson Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS. Our findings show that while AST achieves a more stable cross-lingual performance, both models exhibit noticeable biases. Notably, Mandarin speech quality predictions correlate highly with human MOS scores, whereas Swedish and Dutch present greater prediction challenges. Discontinuities remain difficult to model across all languages. These results highlight the need for more balanced multilingual datasets and architecture-specific adaptations to improve cross-lingual generalization.
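For reference, the two reported metrics can be computed straightforwardly over per-sample predictions and human MOS labels (a simple sketch, not the authors' evaluation code):

```python
# Pearson Correlation Coefficient and Root Mean Square Error.
import numpy as np
from scipy.stats import pearsonr

def evaluate(pred, mos):
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    pcc, _ = pearsonr(pred, mos)
    rmse = float(np.sqrt(np.mean((pred - mos) ** 2)))
    return pcc, rmse
```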

[NLP-26] You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations

Quick Read: This paper addresses the scarcity of high-quality data in meeting summarization caused by privacy restrictions and expensive collection. The key is FAME, a large multilingual (English and German) meeting dataset generated by the MIMIC framework, which produces meeting transcripts from a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation flow, and orchestrating a large language model debate. A modular post-processing step reduces repetitiveness and overly formal tone, ensuring coherent and credible dialogues at scale. The paper also proposes a psychologically grounded evaluation framework that assesses the generated data along three axes: naturalness, social behavior authenticity, and information-oriented difficulty.

Link: https://arxiv.org/abs/2502.13001
Authors: Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp
Affiliations: University of Göttingen; kirstein@gipplab.org
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Meeting summarization suffers from limited high-quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi-agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. A modular post-processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real-meeting spontaneity (4.5/5 in naturalness), preserves speaker-centric challenges (3/5 in spoken language), and introduces richer information-oriented difficulty (4/5 in difficulty). These findings highlight that FAME is a good and scalable proxy for real-world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation-centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.

[NLP-27] Eager Updates For Overlapped Communication and Computation in DiLoCo

Quick Read: This paper addresses the significant slowdowns caused by the communication required at each outer optimization step of distributed optimization methods such as DiLoCo, especially in bandwidth-limited datacenter settings. The key is to overlap communication with computation so that the outer optimization step fully overlaps with the inner optimization phase; in particular, a variant dubbed "eager updates" is proposed to achieve this and shows competitive performance in low-bandwidth settings.

Link: https://arxiv.org/abs/2502.12996
Authors: Satyen Kale, Arthur Douillard, Yanislav Donchev
Affiliations: Apple; Google Research; Google DeepMind; Google DeepMind
Subjects: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2501.18512

Abstract:Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase, where the workers independently execute multiple optimization steps on their own local data, and an outer optimization step, where the inner updates are synchronized. While such approaches require orders of magnitude less communication than standard data-parallel training, in settings where the workers are datacenters, even the limited communication requirements of these approaches can still cause significant slow downs due to the blocking necessary at each outer optimization step. In this paper, we investigate techniques to mitigate this issue by overlapping communication with computation in a manner that allows the outer optimization step to fully overlap with the inner optimization phase. We show that a particular variant, dubbed eager updates, provides competitive performance with standard DiLoCo in settings with low bandwidth between workers.
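The overlap can be sketched schematically: kick off the all-reduce of one round's outer delta asynchronously and apply it "eagerly" at the start of the next round, once it arrives. Process-group setup, model sharding, and the `apply_outer` helper are omitted assumptions, so treat this as an outline rather than the paper's method:

```python
# Schematic sketch of overlapping outer communication with inner compute.
import torch
import torch.distributed as dist

def train_round(model, local_steps, prev_handle, prev_delta, apply_outer):
    if prev_handle is not None:
        prev_handle.wait()                 # typically already complete by now
        apply_outer(model, prev_delta)     # eager outer update
    start = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
    for step_fn in local_steps:            # inner phase: independent local steps
        step_fn(model)
    now = torch.cat([p.detach().flatten() for p in model.parameters()])
    delta = start - now                    # this worker's outer gradient
    handle = dist.all_reduce(delta, async_op=True)  # overlaps with next round
    return handle, delta
```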

[NLP-28] B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Quick Read: This paper addresses the shortcomings of post-hoc explanation methods for black-box models in faithfulness and human interpretability, which stem from the limited explainability of current neural networks. The key is to introduce B-cos language models (B-cos LMs): by combining B-cos conversion with task fine-tuning, pre-trained language models are transformed directly into B-cos LMs, improving efficiency over previous B-cos methods. Compared with post-hoc explanation methods, B-cos LMs provide more faithful and human-interpretable explanations while maintaining task performance comparable to conventional fine-tuning.

Link: https://arxiv.org/abs/2502.12992
Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
Affiliations: Saarland University; Max Planck Institute for Informatics; RTG Neuroexplicit Models of Language, Vision, and Action
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 15 figures

Abstract:Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural models. Meanwhile, B-cos networks have been introduced to improve model explainability through architectural and computational adaptations, but their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous B-cos methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we provide practical guidelines for effectively building B-cos LMs based on our findings. Our code is available at this https URL.
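The building block behind B-cos networks is a linear layer whose response is modulated by |cos(x, w)|^(B-1), which encourages weight-input alignment and yields more interpretable attributions. Below is a simplified single-layer illustration in the spirit of the B-cos formulation, not the paper's conversion code:

```python
# Simplified B-cos linear layer sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b

    def forward(self, x):
        w = F.normalize(self.weight, dim=1)              # unit-norm weight rows
        lin = F.linear(x, w)                             # w_hat . x
        cos = lin / (x.norm(dim=-1, keepdim=True) + 1e-6)
        return cos.abs().pow(self.b - 1) * lin           # |cos|^(B-1) * (w_hat . x)
```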

[NLP-29] Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLM s

Quick Read: This paper addresses how to achieve comprehensive persona simulation that goes beyond surface facts and dialogue to a character's thought processes. The key is CharacterBot, trained with four tasks (pre-training, multiple-choice question answering, generative question answering, and style transfer) to master the linguistic structures and intellectual core of Lu Xun. To optimize learning, a CharLoRA parameter-updating mechanism is introduced in which a general language-style expert collaborates with task-specific experts to better capture both the character's language style and deeper thinking.

Link: https://arxiv.org/abs/2502.12988
Authors: Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Shandong University
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 3 figures

Abstract:Previous approaches to persona simulation with large language models (LLMs) have typically relied on learning basic biographical information, or using limited role-play dialogue datasets to capture a character's responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun's internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, where a general linguistic style expert collaborates with other task-specific experts to better study both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope that this work inspires future research on deep character persona simulation with LLMs.

[NLP-30] Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

【Quick Read】: This paper develops Sailor2, a family of advanced multilingual models for South-East Asian (SEA) languages, and addresses their efficient training and application. The key to the solution is large-scale continual pre-training (500B tokens, 400B of them SEA-specific) to support 13 SEA languages while retaining proficiency in Chinese and English. The paper also provides a comprehensive cookbook covering five key aspects: data curation, pre-training, post-training, model customization, and evaluation, to facilitate efficient development of such multilingual models.

Link: https://arxiv.org/abs/2502.12982
Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Institutions: Sea AI Lab; SCB 10X; WiseSight; Hugging Face; SUTD; SJTU; SMU; NUS; NTU; HKU; ABAKA AI; Peafowl.ai; MBZUAI; Michigan State University; Float16.cloud; NYU; Brown University; Umeå University; PyThaiNLP; HCMUT; Aalborg University; CityU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 49 pages, 16 figures. Technical Report of Sailor2: this https URL

Abstract:Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.

[NLP-31] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

【Quick Read】: This paper addresses the insufficient safety of large language models (LLMs) against adversarial attacks and jailbreak queries. The key to the solution is a novel training paradigm, Reasoning-to-Defend (R2D), which integrates safety reflections on queries and responses into the model's generation process, unlocking a safety-aware reasoning mechanism. In addition, Contrastive Pivot Optimization (CPO) is introduced to strengthen the model's ability to perceive the safety status of a dialogue, allowing LLMs to dynamically adjust their response strategies during reasoning and significantly improving their defense capabilities.

Link: https://arxiv.org/abs/2502.12970
Authors: Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
Institutions: Beihang University, China; Baidu Inc., China
Subjects: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:The reasoning abilities of Large Language Models (LLMs) have demonstrated remarkable advancement and exceptional performance across diverse domains. However, leveraging these reasoning capabilities to enhance LLM safety against adversarial attacks and jailbreak queries remains largely unexplored. To bridge this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into LLMs' generation process, unlocking a safety-aware reasoning mechanism. This approach enables self-evaluation at each reasoning step to create safety pivot tokens as indicators of the response's safety status. Furthermore, in order to improve the learning efficiency of pivot token prediction, we propose Contrastive Pivot Optimization (CPO), which enhances the model's ability to perceive the safety status of dialogues. Through this mechanism, LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their defense capabilities against jailbreak attacks. Extensive experimental results demonstrate that R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.
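To make the CPO idea concrete, here is a minimal, hypothetical sketch of a contrastive objective over safety pivot-token hidden states. The SupCon-style loss form, names, and shapes are our assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a contrastive objective over safety pivot tokens,
# loosely following the description of Contrastive Pivot Optimization (CPO).
import torch
import torch.nn.functional as F

def contrastive_pivot_loss(pivot_hidden, safety_labels, temperature=0.1):
    """pivot_hidden: (N, d) hidden states at pivot-token positions.
    safety_labels: (N,) with 1 = safe step, 0 = unsafe step."""
    z = F.normalize(pivot_hidden, dim=-1)
    sim = z @ z.T / temperature                      # pairwise cosine similarities
    n = z.size(0)
    mask_self = torch.eye(n, dtype=torch.bool)
    same = safety_labels.unsqueeze(0) == safety_labels.unsqueeze(1)
    pos_mask = same & ~mask_self                     # same safety status = positive pair
    # log-softmax over all non-self pairs, averaged over positives (SupCon-style)
    sim = sim.masked_fill(mask_self, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss.mean()

# Toy usage: 6 pivot states, the first three from safe reasoning steps.
h = torch.randn(6, 16)
y = torch.tensor([1, 1, 1, 0, 0, 0])
print(contrastive_pivot_loss(h, y).item())
```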

[NLP-32] A Survey of Text Classification Under Class Distribution Shift

【Quick Read】: This paper addresses the distribution shift that machine learning (ML) models encounter in practice, particularly in text classification, where the test distribution changes over time as new topics keep emerging. The key contribution is a survey that organizes research by the type of distribution shift and the corresponding problem formulation (learning with the Universum, zero-shot learning, and open-set learning) and summarizes the predominant mitigation strategies for each setup. Notably, the paper finds that continual learning can effectively address many of the challenges caused by shifting class distributions.

Link: https://arxiv.org/abs/2502.12965
Authors: Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
Institutions: University of Bucharest
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e., the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e., learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at this https URL.

[NLP-33] Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

【Quick Read】: This paper addresses hallucinations, i.e., outputs of large language models (LLMs) that lack grounding in real-world facts. It challenges the prevailing assumption that all hallucinations are associated with uncertainty. The key insight, obtained via knowledge detection and uncertainty measurement methods, is that models can hallucinate with high certainty even when they possess the correct knowledge. The study further shows that these high-certainty hallucinations are consistent across models and datasets, distinctive enough to be singled out, and resistant to existing mitigation methods. The paper stresses the importance of understanding the origins of hallucinations and calls for improved mitigation strategies to enhance LLM safety.

Link: https://arxiv.org/abs/2502.12964
Authors: Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
Institutions: Technion – Israel Institute of Technology; University of Oxford; School of Computer Science and Engineering, The Hebrew University of Jerusalem
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations. Prior research has associated hallucinations with model uncertainty, leveraging this relationship for hallucination detection and mitigation. In this paper, we challenge the underlying assumption that all hallucinations are associated with uncertainty. Using knowledge detection and uncertainty measurement methods, we demonstrate that models can hallucinate with high certainty even when they have the correct knowledge. We further show that high-certainty hallucinations are consistent across models and datasets, distinctive enough to be singled out, and challenge existing mitigation methods. Our findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at this https URL .

[NLP-34] Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing

【Quick Read】: This paper tackles the challenge large language models (LLMs) face when input tokens exceed the context window limit, from simple direct retrieval to complex multi-hop reasoning tasks. The key is a new method, InfiniRetri, which leverages the LLM's own attention information to enable accurate retrieval over inputs of unbounded length. Experiments reveal a correlation between the attention distribution and the generated answers and show that this attention allocation aligns with retrieval-augmented capabilities. InfiniRetri can be applied to any Transformer-based LLM without additional training and substantially reduces inference latency and compute overhead on long texts.

Link: https://arxiv.org/abs/2502.12962
Authors: Xiaoju Ye, Zhichun Wang, Jingyuan Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 21 pages

Abstract:Limited by the context window size of Large Language Models (LLMs), handling various tasks with input tokens exceeding the upper limit has been challenging, whether it is a simple direct retrieval task or a complex multi-hop reasoning task. Although various methods have been proposed to enhance the long-context processing capabilities of LLMs, they either incur substantial post-training costs, require additional tool modules (e.g., RAG), or have not shown significant improvement in realistic tasks. Our work observes the correlation between the attention distribution and generated answers across each layer, and establishes through experiments that the attention allocation aligns with retrieval-augmented capabilities. Drawing on the above insights, we propose a novel method, InfiniRetri, that leverages the LLM's own attention information to enable accurate retrieval across inputs of infinite length. Our evaluations indicate that InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model, surpassing other methods and larger models and setting a new state-of-the-art (SOTA). Moreover, our method achieves significant performance improvements on real-world benchmarks, with a maximum 288% improvement. In addition, InfiniRetri can be applied to any Transformer-based LLM without additional training and substantially reduces inference latency and compute overhead in long texts. In summary, our comprehensive studies show InfiniRetri's potential for practical applications and create a paradigm for retrieving information using LLMs' own capabilities under infinite-length tokens. Code will be released at the link.
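As a rough illustration of the retrieval idea, the sketch below scores context sentences by the attention mass that query tokens place on their tokens and keeps only the top scorers in a cache. The chunking granularity, layer choice, and exact scoring rule used by InfiniRetri are not described here and are assumptions.

```python
# A minimal, hypothetical sketch of attention-based sentence retrieval in the
# spirit of InfiniRetri: sentences earning the most attention from the query
# tokens are retained for the next processing window.
import numpy as np

def select_sentences(attn, sentence_spans, top_k=2):
    """attn: (num_query_tokens, num_context_tokens) attention weights.
    sentence_spans: list of (start, end) token index ranges per sentence."""
    token_importance = attn.sum(axis=0)  # total attention each context token receives
    scores = [token_importance[s:e].sum() for s, e in sentence_spans]
    ranked = np.argsort(scores)[::-1][:top_k]
    return sorted(ranked.tolist())       # indices of sentences kept in the cache

# Toy example: 4 query tokens attending over 12 context tokens in 3 sentences.
rng = np.random.default_rng(0)
attn = rng.random((4, 12))
attn /= attn.sum(axis=1, keepdims=True)  # row-normalize like a softmax output
print(select_sentences(attn, [(0, 4), (4, 8), (8, 12)], top_k=1))
```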

[NLP-35] Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

【Quick Read】: This paper addresses two key problems when large language models (LLMs) use external tools: increased latency caused by unnecessary tool calls, and errors arising from faulty interactions with external tools. The key solution is to introduce meta-cognition as a proxy for the model's self-assessment of its capabilities and to quantify a metacognitive score that guides when to invoke external tools. The proposed MeCo strategy is fine-tuning-free and incurs minimal cost, yet accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.

Link: https://arxiv.org/abs/2502.12961
Authors: Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
Institutions: Huawei Noah's Ark Lab
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or real-time data. While existing research expands LLMs' access to diverse tools (e.g., program interpreters, search engines, weather/map apps), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: (1) increased delays due to unnecessary tool calls, and (2) potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs' self-assessment of their capabilities, representing the model's awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making across multiple base models and benchmarks.
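A gate of this kind can be pictured as a lightweight probe over the model's representation space. The sketch below is a hedged stand-in: a linear probe maps a hidden state to a metacognitive score and a tool is called only when the score is low. The probe weights, threshold, and feature choice are placeholders, not MeCo's actual calibration.

```python
# Hypothetical sketch of a MeCo-style gate: probe a hidden representation for
# a metacognitive score and invoke a tool only below a confidence threshold.
import numpy as np

def metacognitive_score(hidden_state, probe_w, probe_b):
    logit = hidden_state @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logit))   # probability the model can answer alone

def should_call_tool(hidden_state, probe_w, probe_b, threshold=0.5):
    return metacognitive_score(hidden_state, probe_w, probe_b) < threshold

rng = np.random.default_rng(1)
h = rng.standard_normal(64)               # stand-in for a model's hidden state
w, b = rng.standard_normal(64) * 0.1, 0.0
print("call tool:", should_call_tool(h, w, b))
```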

[NLP-36] AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages NAACL2025

【Quick Read】: This paper addresses the problem that realignment techniques for multilingual language models can sometimes degrade performance when the target language differs substantially from the fine-tuning source language. The key solution is AlignFreeze, a method that freezes either the lower or the upper half of the model's layers during realignment. Controlled experiments show that freezing the lower layers can prevent performance degradation: in Part-of-Speech (PoS) tagging with XLM-R, AlignFreeze improves accuracy by more than one standard deviation in seven more languages than full realignment, helping precisely where full realignment fails.

Link: https://arxiv.org/abs/2502.12959
Authors: Steve Bakos, Félix Gaschi, David Guzmán, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee
Institutions: Ontario Tech University, Canada; University of Toronto, Canada; SAS Posos, France
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 2 figures, to be published in Proceedings of NAACL 2025

Abstract:Realignment techniques are often employed to enhance cross-lingual transfer in multilingual language models; still, they can sometimes degrade performance in languages that differ significantly from the fine-tuned source language. This paper introduces AlignFreeze, a method that freezes either the layers' lower half or upper half during realignment. Through controlled experiments on 4 tasks, 3 models, and 35 languages, we find that realignment affects all the layers but can be the most detrimental to the lower ones. Freezing the lower layers can prevent performance degradation. Particularly, AlignFreeze improves Part-of-Speech (PoS) tagging performance in languages where full realignment fails: with XLM-R, it provides improvements of more than one standard deviation in accuracy in seven more languages than full realignment.
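Mechanically, freezing half the layers is a few lines of code. The sketch below assumes a HuggingFace XLM-R checkpoint whose transformer blocks live in model.roberta.encoder.layer; freezing the embedding layer alongside the lower half is our own assumption, since embeddings sit below the first block.

```python
# A minimal sketch of AlignFreeze-style freezing before realignment,
# assuming an XLM-R model from the transformers library.
from transformers import XLMRobertaForTokenClassification

model = XLMRobertaForTokenClassification.from_pretrained("xlm-roberta-base")
layers = model.roberta.encoder.layer
cutoff = len(layers) // 2                 # lower half: layers 0 .. cutoff-1

for layer in layers[:cutoff]:
    for param in layer.parameters():
        param.requires_grad = False       # excluded from realignment updates

# Embeddings sit below the first block, so freeze them with the lower half too.
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False
```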

[NLP-37] Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text

【Quick Read】: This paper addresses two weaknesses of conventional masked language modeling: tokens to mask are chosen at random, and the masking ratio stays fixed throughout training. The key solution is a task-informed anti-curriculum learning scheme that uses task-specific knowledge to decide which tokens to mask and adopts a cyclic decaying masking ratio, yielding a hard-to-easy learning schedule. The approach is validated on three downstream tasks: sentiment analysis, text classification by topic, and authorship attribution.

Link: https://arxiv.org/abs/2502.12953
Authors: Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
Institutions: University of Bucharest
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Masked language modeling has become a widely adopted unsupervised technique to pre-train language models. However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at this https URL.
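The schedule itself is easy to picture. Below is one plausible reading of a cyclic decaying masking ratio, where the ratio oscillates within each cycle while its ceiling decays over training (hard-to-easy); the paper's exact functional form and cycle length may differ.

```python
# An illustrative sketch of a cyclic decaying masking-ratio schedule,
# one plausible instantiation of TIACBM's anti-curriculum.
import math

def masking_ratio(step, total_steps, cycle_len=1000, r_max=0.3, r_min=0.05):
    decay = 1.0 - step / total_steps                       # hard-to-easy decay
    cycle = 0.5 * (1.0 + math.cos(2 * math.pi * (step % cycle_len) / cycle_len))
    return r_min + (r_max - r_min) * decay * cycle

for s in (0, 500, 5000, 9999):
    print(s, round(masking_ratio(s, 10000), 3))
```

In the full method, the tokens actually masked at each step would additionally be ranked by task-specific usefulness scores rather than sampled uniformly.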

[NLP-38] Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models

【Quick Read】: This paper addresses the shortcomings of knowledge distillation (KD) when compressing Mixture-of-Experts (MoE) models. The key finding is that the non-activated experts of an MoE model hold knowledge that is valuable to student models, and the paper proposes two MoE-specific KD methods to exploit it: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to extract knowledge from all experts. These methods clearly outperform conventional KD, demonstrating their effectiveness for MoE teacher models.

Link: https://arxiv.org/abs/2502.12947
Authors: Gyeongman Kim, Gyouk Chu, Eunho Yang
Institutions: Korea Advanced Institute of Science and Technology (KAIST); AITRICS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts. Specifically, KA augments knowledge by sampling experts multiple times, while SAR uses all experts and adjusts the expert weights through router training to provide optimal knowledge. Extensive experiments show that our methods outperform conventional KD methods, demonstrating their effectiveness for MoE teacher models.

[NLP-39] LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation

【Quick Read】: This paper explores the potential of large language models (LLMs) to assist in generating popular micro-videos. The key questions are how to use LLMs effectively for micro-video generation and how far prompt-based enhancements can raise the popularity of the generated content. The results show that advanced LLMs such as DeepSeek-V3 enable generated micro-videos to reach popularity comparable to human-created content. Different LLMs and video generators also perform differently on this task: DeepSeek-V3 and DeepSeek-R1 stand out among LLMs, while LTX-Video and HunyuanVideo lead among video generators.

Link: https://arxiv.org/abs/2502.12945
Authors: Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
Institutions: University of Glasgow; Shandong University; Telefónica Scientific Research
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold significant commercial value. The rise of high-quality AI-generated content has spurred interest in AI-driven micro-video creation. However, despite the advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek in text generation and reasoning, their potential to assist the creation of popular micro-videos remains largely unexplored. In this paper, we conduct an empirical study on LLM-assisted popular micro-video generation (LLMPopcorn). Specifically, we investigate the following research questions: (i) How can LLMs be effectively utilized to assist popular micro-video generation? (ii) To what extent can prompt-based enhancements optimize the LLM-generated content for higher popularity? (iii) How well do various LLMs and video generators perform in the popular micro-video generation task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3 enable micro-video generation to achieve popularity comparable to human-created content. Prompt enhancements further boost popularity, and benchmarking highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and HunyuanVideo lead in video generation. This pioneering work advances AI-assisted micro-video creation, uncovering new research opportunities. We will release the code and datasets to support future studies.

[NLP-40] Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages

【Quick Read】: This paper addresses the challenge of quantifying reasoning capability in low-resource languages, where both data and annotators are scarce. The key solution is to compare three dataset creation strategies, LLM-assisted data generation, machine translation, and data hand-written by native speakers, to build a culturally nuanced story comprehension dataset. The study focuses on Javanese and Sundanese, two major local languages of Indonesia, and evaluates, through extensive manual validation, how well open-weight and closed-weight LLMs assist dataset creation. The findings indicate that LLM-assisted data creation outperforms machine translation.

Link: https://arxiv.org/abs/2502.12932
Authors: Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
Institutions: Department of Natural Language Processing, MBZUAI
Subjects: Computation and Language (cs.CL)
Comments: 18 pages total: 8 pages of main body, 6 pages of appendix. 4 figures in main body, 6 figures in appendix. Submitted to ARR on February 2025

Abstract:Quantifying reasoning capability in low-resource languages remains a challenge in NLP due to data scarcity and limited access to annotators. While LLM-assisted dataset construction has proven useful for medium- and high-resource languages, its effectiveness in low-resource languages, particularly for commonsense reasoning, is still unclear. In this paper, we compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset. We focus on Javanese and Sundanese, two major local languages in Indonesia, and evaluate the effectiveness of open-weight and closed-weight LLMs in assisting dataset creation through extensive manual validation. To assess the utility of synthetic data, we fine-tune language models on classification and generation tasks using this data and evaluate performance on a human-written test set. Our findings indicate that LLM-assisted data creation outperforms machine translation.

[NLP-41] Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options

【Quick Read】: This paper targets intrinsic biases in large language models (LLMs). It proposes a new reasoning approach, Flow-of-Options (FoO), which lets LLMs systematically explore a diverse range of possibilities in their reasoning. The key of FoO is its compressed, explainable representations, which enforce diversity in the solutions while also supporting long-term memory, yielding improvements of 38.2%-69.2% on standard data science tasks and 37.4%-47.9% on therapeutic chemistry tasks. Beyond classification and regression, FoO also extends to tasks such as reinforcement learning and image generation.

Link: https://arxiv.org/abs/2502.12929
Authors: Lakshmi Nair, Ian Trase, Mark Kim
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Github code: this https URL

Abstract:We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic system for autonomously solving Machine Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines, achieving improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Beyond classification and regression, we illustrate the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our framework presents significant advancements compared to current state-of-the-art agentic systems for AutoML, due to the benefits of FoO in enforcing diversity in LLM solutions through compressed, explainable representations that also support long-term memory when combined with case-based reasoning.

[NLP-42] Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts

【Quick Read】: This paper addresses sparse activation in dense large language models, which limits efficient exploration of the representation space. The key solution is Finedeep, a fine-grained expert architecture that partitions the feed-forward network layers into many small experts arranged across multiple sub-layers and uses a novel routing mechanism to determine each expert's contribution, effectively alleviating sparse activation and making better use of the representation capacity of dense models.

Link: https://arxiv.org/abs/2502.12928
Authors: Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
Institutions: College of Intelligence and Computing, Tianjin University; Chinese Academy of Sciences; Kuaishou Technology; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models have demonstrated exceptional performance across a wide range of tasks. However, dense models usually suffer from sparse activation, where many activation values tend towards zero (i.e., being inactivated). We argue that this could restrict the efficient exploration of model representation space. To mitigate this issue, we propose Finedeep, a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward neural network layers of traditional dense models into small experts and arranges them across multiple sub-layers. A novel routing mechanism is proposed to determine each expert's contribution. We conduct extensive experiments across various model sizes, demonstrating that our approach significantly outperforms traditional dense architectures in terms of perplexity and benchmark performance while maintaining a comparable number of parameters and floating-point operations. Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer. Empirical results confirm that Finedeep effectively alleviates sparse activation and efficiently utilizes representation capacity in dense models.
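The core building block can be sketched in a few lines. Below is a toy module in which the feed-forward width is split across several small experts whose outputs are mixed by softmax router weights; the dimensions and the dense all-expert routing are illustrative assumptions, not Finedeep's exact configuration.

```python
# A toy sketch of a Finedeep-style layer: the FFN is partitioned into small
# experts and a learned router weights each expert's contribution per token.
import torch
import torch.nn as nn

class FineGrainedFFN(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=4):
        super().__init__()
        d_expert = d_hidden // n_experts  # each expert gets a slice of the FFN width
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                 # x: (batch, seq, d_model)
        weights = torch.softmax(self.router(x), dim=-1)           # (b, s, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, s, d, E)
        return (outs * weights.unsqueeze(2)).sum(dim=-1)

x = torch.randn(2, 5, 64)
print(FineGrainedFFN()(x).shape)   # torch.Size([2, 5, 64])
```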

[NLP-43] SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems

【Quick Read】: This paper addresses the fact that providing high-quality feedback is constrained by time, cost, and limited data availability. The key solution is Synthetic Educational Feedback Loops (SEFL), in which two large language models (LLMs) act in teacher and student roles to simulate assignment completion and formative feedback, generating abundant synthetic pairs of student work and corresponding critiques. These synthetic pairs are then used to fine-tune smaller, more computationally efficient LLMs so that they replicate the key features of high-quality, goal-oriented feedback. Unlike personalized tutoring approaches, SEFL focuses on simulating diverse teacher-student feedback loops, enabling immediate, on-demand feedback at scale without extensive real student data.

Link: https://arxiv.org/abs/2502.12927
Authors: Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
Institutions: Aalborg University, Denmark
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Providing high-quality feedback is crucial for student success but is constrained by time, cost, and limited data availability. We introduce Synthetic Educational Feedback Loops (SEFL), a novel framework designed to deliver immediate, on-demand feedback at scale without relying on extensive, real-world student data. In SEFL, two large language models (LLMs) operate in teacher–student roles to simulate assignment completion and formative feedback, generating abundant synthetic pairs of student work and corresponding critiques. We then fine-tune smaller, more computationally efficient LLMs on these synthetic pairs, enabling them to replicate key features of high-quality, goal-oriented feedback. Unlike personalized tutoring approaches that offer multi-turn, individualized instruction, SEFL specifically focuses on replicating the teacher–student feedback loop for diverse assignments. Through both LLM-as-a-judge and human evaluations, we demonstrate that SEFL-tuned models outperform their non-tuned counterparts in feedback quality, clarity, and timeliness. These findings reveal SEFL’s potential to transform feedback processes for higher education and beyond, offering an ethical and scalable alternative to conventional manual feedback cycles.

[NLP-44] Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data

【Quick Read】: This paper addresses code-switching (CS), a persistent challenge in natural language processing (NLP), where large language models (LLMs) struggle to interpret and generate code-switched text. The key solution is a new methodology that generates CS data with LLMs by back-translating natural code-switched sentences into monolingual English and using the resulting parallel corpus for fine-tuning, applied here to the English-Spanish language pair. Because the methodology starts from natural CS data, models can learn its natural distribution rather than only grammatical patterns, producing fluent code-switched text and expanding research opportunities in CS communication.

Link: https://arxiv.org/abs/2502.12924
Authors: Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
Institutions: HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and test it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models’ performance through a study on human preferences, a qualitative error analysis and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.

[NLP-45] On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation

【Quick Read】: This paper investigates whether large language models (LLMs), fine-tuned on synthetic but domain-representative data, can perform the dual tasks required by a smart home assistant, (i) slot and intent detection and (ii) natural language response generation, while running solely on resource-limited, CPU-only edge hardware. The key lies in using quantization (including 16-bit and 8-bit variants) to balance model size and performance while preserving high accuracy on slot and intent detection and strong semantic coherence in the generated text. The 4-bit model retains generative fluency but shows a noticeable drop in device-service classification accuracy. Evaluations further show that the models generalize well, handling noisy human prompts and out-of-domain intents at roughly 80%-86% accuracy.

Link: https://arxiv.org/abs/2502.12923
Authors: Rune Birkmose, Nathan Mørkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang
Institutions: Aalborg University, Denmark
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper investigates whether Large Language Models (LLMs), fine-tuned on synthetic but domain-representative data, can perform the twofold task of (i) slot and intent detection and (ii) natural language response generation for a smart home assistant, while running solely on resource-limited, CPU-only edge hardware. We fine-tune LLMs to produce both JSON action calls and text responses. Our experiments show that 16-bit and 8-bit quantized variants preserve high accuracy on slot and intent detection and maintain strong semantic coherence in generated text, while the 4-bit model, though retaining generative fluency, suffers a noticeable drop in device-service classification accuracy. Further evaluations on noisy human (non-synthetic) prompts and out-of-domain intents confirm the models' generalization ability, obtaining around 80-86% accuracy. While the average inference time is 5-6 seconds per query, acceptable for one-shot commands but suboptimal for multi-turn dialogue, our results affirm that an on-device LLM can effectively unify command interpretation and flexible response generation for home automation without relying on specialized hardware.

[NLP-46] Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison

【Quick Read】: This paper addresses the difficulty users face in understanding why unknown items fit their needs in query-driven recommendation. The key solution is Q-STRUM Debate, a novel extension of STRUM-LLM that uses debate-style prompting to generate focused, contrastive summaries of item aspects relevant to a query. Leveraging modern large language models (LLMs) as powerful debate generators, Q-STRUM Debate delivers enhanced contrastive summaries, and experiments on three datasets show significant improvements over existing methods on key contrastive summarization criteria.

Link: https://arxiv.org/abs/2502.12921
Authors: George-Kirollos Saad, Scott Sanner
Institutions: University of Toronto
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Query-driven recommendation with unknown items poses a challenge for users to understand why certain items are appropriate for their needs. Query-driven Contrastive Summarization (QCS) is a methodology designed to address this issue by leveraging language-based item descriptions to clarify contrasts between them. However, existing state-of-the-art contrastive summarization methods such as STRUM-LLM fall short of this goal. To overcome these limitations, we introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs debate-style prompting to generate focused and contrastive summarizations of item aspects relevant to a query. Leveraging modern large language models (LLMs) as powerful tools for generating debates, Q-STRUM Debate provides enhanced contrastive summaries. Experiments across three datasets demonstrate that Q-STRUM Debate yields significant performance improvements over existing methods on key contrastive summarization criteria, thus introducing a novel and performant debate prompting methodology for QCS.

[NLP-47] GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning

【Quick Read】: This paper addresses the heavy floating-point compute requirements, privacy concerns, and poor hardware compatibility of conventional large language model (LLM) fine-tuning on resource-constrained edge devices. The key is a new framework, GSQ-Tuning, which represents model parameters in an integer-arithmetic Group-Shared Exponents Integer format, with exponents shared among parameter groups, eliminating floating-point operations in both inference and training. Combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is memory- and compute-efficient while matching the accuracy of FP16 fine-tuning.

Link: https://arxiv.org/abs/2502.12913
Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) fine-tuning technologies have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require large Floating Point (FP) computation, raising privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory and compute efficient. We demonstrate that our approach achieves accuracy comparable to FP16-based fine-tuning while significantly reducing memory usage (by 50%). Moreover, compared to FP8, our method reduces power consumption by 5x and chip area by 11x at the same performance, making large-scale model adaptation feasible on edge devices.
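Numerically, a group-shared exponent format can be sketched as follows: each group of parameters keeps integer mantissas plus one power-of-two scale chosen so the group's largest magnitude fits the integer range. Group size, mantissa width, and rounding here are assumptions; GSQ-Tuning's exact format may differ.

```python
# A numerical sketch of group-shared-exponent integer quantization.
import numpy as np

def quantize_group_shared_exponent(w, group_size=8, bits=8):
    w = w.reshape(-1, group_size)
    max_abs = np.abs(w).max(axis=1, keepdims=True) + 1e-12
    # one shared exponent per group so that max |w| fits in the integer range
    exponent = np.ceil(np.log2(max_abs / (2 ** (bits - 1) - 1)))
    scale = 2.0 ** exponent
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), exponent   # integer mantissas + shared exponents

def dequantize(q, exponent):
    return q.astype(np.float32) * (2.0 ** exponent)

w = np.random.default_rng(0).standard_normal(32).astype(np.float32)
q, e = quantize_group_shared_exponent(w)
print("max abs error:", np.abs(dequantize(q, e).ravel() - w).max())
```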

[NLP-48] Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation

【Quick Read】: This paper addresses the limited accuracy of initial schema linking when generating SQL from user queries: current schema linking models still struggle with missing relevant schema elements and including too many redundant ones. The paper first proposes an enhanced schema linking metric that introduces a restricted missing indicator. The key solution is the Knapsack optimization-based Schema Linking Agent (KaSLA), a plug-in agent with a hierarchical linking strategy: it first identifies the optimal table linking and then links columns within the selected tables to shrink the candidate space. In each linking step, KaSLA uses knapsack optimization to link potentially relevant elements under a limited tolerance for redundancy. With this approach, KaSLA-1.6B achieves schema linking results superior to those of large-scale LLMs on the Spider and BIRD benchmarks.

Link: https://arxiv.org/abs/2502.12911
Authors: Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang
Institutions: The Hong Kong Polytechnic University; City University of Macau; Jinan University
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Abstract:Generating SQLs from user queries is a long-standing challenge, where the accuracy of initial schema linking significantly impacts subsequent SQL generation performance. However, current schema linking models still struggle with missing relevant schema elements or an excess of redundant ones. A crucial reason for this is that commonly used metrics, recall and precision, fail to capture relevant element missing and thus cannot reflect actual schema linking performance. Motivated by this, we propose an enhanced schema linking metric by introducing a restricted missing indicator. Accordingly, we introduce Knapsack optimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent designed to prevent the missing of relevant schema elements while minimizing the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy that first identifies the optimal table linking and subsequently links columns within the selected table to reduce the linking candidate space. In each linking process, it utilizes a knapsack optimization approach to link potentially relevant elements while accounting for a limited tolerance of potential redundant ones. With this optimization, KaSLA-1.6B achieves superior schema linking results compared to large-scale LLMs, including deepseek-v3 with the state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider and BIRD benchmarks verify that KaSLA can significantly improve the SQL generation performance of SOTA text-to-SQL models by substituting their schema linking processes.
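The knapsack framing is standard 0/1 knapsack: maximize total estimated relevance subject to a budget that caps redundancy. The sketch below is a simplified illustration with assumed relevance scores and costs (in practice these would come from KaSLA's scoring model), not the agent's actual implementation.

```python
# A simplified sketch of knapsack-style schema linking: pick schema elements
# that maximize estimated relevance under a redundancy/cost budget.
def knapsack_link(elements, capacity):
    """elements: list of (name, relevance, cost); capacity: int budget."""
    n = len(elements)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, rel, cost) in enumerate(elements, 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if cost <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - cost] + rel)
    # backtrack to recover the chosen elements
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            name, _, cost = elements[i - 1]
            chosen.append(name)
            c -= cost
    return chosen[::-1]

columns = [("orders.id", 0.9, 1), ("orders.total", 0.8, 1),
           ("users.bio", 0.1, 3), ("orders.created_at", 0.4, 1)]
print(knapsack_link(columns, capacity=3))  # high-relevance, low-cost columns win
```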

[NLP-49] Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

【Quick Read】: This paper evaluates the ability of large language models (LLMs) to defend against internet fraud and phishing in dynamic, real-world scenarios. The key is Fraud-R1, a new benchmark of 8,564 fraud cases drawn from phishing scams, fake job postings, social media, and news, grouped into five major fraud types. Fraud-R1 uses a multi-round evaluation pipeline to test LLMs' resistance to fraud at different stages, such as credibility building, urgency creation, and emotional manipulation. The study evaluates 15 LLMs in two settings: 1) a Helpful-Assistant setting providing general decision-making support, and 2) a Role-play setting where the model assumes a specific persona. The results reveal significant challenges in defending against fraud and phishing inducement, especially in role-play settings and fake job postings, as well as a substantial performance gap between Chinese and English, underscoring the need for better multilingual fraud detection.

Link: https://arxiv.org/abs/2502.12904
Authors: Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
Institutions: King Abdullah University of Science and Technology; Provable Responsible AI and Data Analytics (PRADA) Lab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce Fraud-R1, a benchmark designed to evaluate LLMs’ ability to defend against internet fraud and phishing in dynamic, real-world scenarios. Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job postings, social media, and news, categorized into 5 major fraud types. Unlike previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to assess LLMs’ resistance to fraud at different stages, including credibility building, urgency creation, and emotional manipulation. Furthermore, we evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM provides general decision-making assistance, and 2. Role-play, where the model assumes a specific persona, widely used in real-world agent-based interactions. Our evaluation reveals the significant challenges in defending against fraud and phishing inducement, especially in role-play settings and fake job postings. Additionally, we observe a substantial performance gap between Chinese and English, underscoring the need for improved multilingual fraud detection capabilities.

[NLP-50] Soundwave: Less is More for Speech-Text Alignment in LLMs

【Quick Read】: This paper addresses two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. The key solution is Soundwave, which uses an efficient training strategy and a novel architecture to tackle these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio on speech translation and AIR-Bench speech tasks while using only one-fiftieth of its training data.

Link: https://arxiv.org/abs/2502.12900
Authors: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at this https URL.

[NLP-51] None of the Others: A General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

【Quick Read】: This paper addresses the over-reliance on recall/memorization rather than reasoning in large language model (LLM) evaluations. The key solution is a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, forcing models to understand and reason rather than memorize in order to answer correctly. Applied to two datasets available in English and Spanish, the method causes remarkable accuracy drops across all models, validating its effectiveness.

Link: https://arxiv.org/abs/2502.12896
Authors: Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
Institutions: UNED Research Center in Natural Language Processing and Information Retrieval; ETSI Informática, UNED
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experimentation (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops in public (vs private) datasets and questions posed in their original language (vs a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs’ answers.
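One plausible instantiation of such a variation, inferred from the paper's title rather than its exact procedure, is to replace the gold option's text with "None of the other answers" and reshuffle, so the correct choice no longer shares surface tokens with anything the model may have memorized:

```python
# Hypothetical sketch of a "None of the Others" MCQ variation; the authors'
# actual transformation may differ in detail.
import random

def none_of_the_others(question, options, answer_idx, seed=0):
    rng = random.Random(seed)
    new_options = list(options)
    new_options[answer_idx] = "None of the other answers"
    order = list(range(len(new_options)))
    rng.shuffle(order)                     # position should give no clue either
    shuffled = [new_options[i] for i in order]
    return question, shuffled, order.index(answer_idx)

q, opts, gold = none_of_the_others(
    "Which planet is closest to the Sun?",
    ["Mercury", "Venus", "Mars", "Jupiter"],
    answer_idx=0,
)
print(opts, "gold index:", gold)
```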

[NLP-52] Multilingual European Language Models: Benchmarking Approaches and Challenges

【Quick Read】: This paper addresses the limitations of existing evaluation datasets for multilingual European benchmarks, focusing in particular on improving translation quality and mitigating cultural bias. The key solutions include human-in-the-loop verification and iterative translation ranking, which make evaluations more accurate and culturally aware.

Link: https://arxiv.org/abs/2502.12895
Authors: Fabio Barth, Georg Rehm
Institutions: DFKI GmbH, Germany; Humboldt-Universität zu Berlin, Germany
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The breakthrough of generative large language models (LLMs) that can solve different tasks through chat interaction has led to a significant increase in the use of general benchmarks to assess the quality or performance of these models beyond individual applications. There is also a need for better methods to evaluate and also to compare models due to the ever increasing number of new models published. However, most of the established benchmarks revolve around the English language. This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks. We analyse seven multilingual benchmarks and identify four major challenges. Furthermore, we discuss potential solutions to enhance translation quality and mitigate cultural biases, including human-in-the-loop verification and iterative translation ranking. Our analysis highlights the need for culturally aware and rigorously validated benchmarks to assess the reasoning and question-answering capabilities of multilingual LLMs accurately.

[NLP-53] H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1 and Gemini 2.0 Flash Thinking

【Quick Read】: This paper addresses the insufficiently explored robustness of the safety checks in large reasoning models (LRMs). The key is the Malicious-Educator benchmark, which disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts to expose severe safety flaws. The paper further proposes Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack that exploits the model's own displayed intermediate reasoning to bypass its safety reasoning mechanism, sharply reducing refusal rates and, in some cases, making models willing to provide harmful content. These findings underscore the urgent need for more robust safety mechanisms so that advanced reasoning capabilities do not come at the cost of ethical standards.

Link: https://arxiv.org/abs/2502.12893
Authors: Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
Institutions: OpenAI; Duke University; Center for Advanced AI, Accenture; Accenture Security; National Tsing Hua University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks, using chain-of-thought reasoning to decide whether a request should be answered. While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored. To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts. Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks. To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism. Under H-CoT, refusal rates sharply decline, dropping from 98% to below 2%, and, in some instances, even transform initially cautious tones into ones that are willing to provide harmful content. We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.

[NLP-54] Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?

【Quick Read】: This paper examines whether multilingual large language models (LLMs) can serve as an effective technological off-ramp for under-resourced languages. The key lies in analyzing, with a focus on European languages, how well multilingual models actually support under-resourced languages, summarizing related work, and identifying the open questions that must be answered before the approach can be put into practice in a systematic way.

Link: https://arxiv.org/abs/2502.12886
Authors: Georg Rehm, Annika Grützner-Zahn, Fabio Barth
Institutions: DFKI GmbH; Humboldt-Universität zu Berlin
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) demonstrate unprecedented capabilities and define the state of the art for almost all natural language processing (NLP) tasks and also for essentially all Language Technology (LT) applications. LLMs can only be trained for languages for which a sufficient amount of pre-training data is available, effectively excluding many languages that are typically characterised as under-resourced. However, there is both circumstantial and empirical evidence that multilingual LLMs, which have been trained using data sets that cover multiple languages (including under-resourced ones), do exhibit strong capabilities for some of these under-resourced languages. Eventually, this approach may have the potential to be a technological off-ramp for those under-resourced languages for which “native” LLMs, and LLM-based technologies, cannot be developed due to a lack of training data. This paper, which concentrates on European languages, examines this idea, analyses the current situation in terms of technology support and summarises related work. The article concludes by focusing on the key open questions that need to be answered for the approach to be put into practice in a systematic way.

[NLP-55] How desirable is alignment between LLMs and linguistically diverse human users?

【Quick Read】: This paper discusses how desirable it is for large language models (LLMs) to adapt or align their language behavior with users who are diverse in their language use. User diversity may arise from age differences, gender characteristics, and multilingual experience, together with the associated differences in language processing and use. The key consideration is the potential consequences of such adaptation for usability, communication, and LLM development, i.e., whether LLMs should be equipped to understand diverse users and adjust their output accordingly.

Link: https://arxiv.org/abs/2502.12884
Authors: Pia Knoeferle, Sebastian Möller, Dorothea Kolossa, Veronika Solopova, Georg Rehm
Institutions: Humboldt-Universität zu Berlin; Berlin School of Mind and Brain; Einstein Center for Neurosciences Berlin; Technische Universität Berlin; German Research Center for Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We discuss how desirable it is that Large Language Models (LLMs) be able to adapt or align their language behavior with users who may be diverse in their language use. User diversity may come about among others due to i) age differences; ii) gender characteristics, and/or iii) multilingual experience, and associated differences in language processing and use. We consider potential consequences for usability, communication, and LLM development.

[NLP-56] PAFT: Prompt-Agnostic Fine-Tuning

【Quick Read】: This paper addresses the loss of prompt robustness that occurs when large language models (LLMs) are fine-tuned for downstream tasks and overfit to specific prompt formats. The key solution is Prompt-Agnostic Fine-Tuning (PAFT), which dynamically varies prompts during fine-tuning so that the model learns the underlying task principles rather than a particular prompt formulation, improving both robustness and generalization.

Link: https://arxiv.org/abs/2502.12859
Authors: Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
Institutions: College of Computer Science and Software Engineering, Shenzhen University, China; Guangdong Lab of AI and Digital Economy (SZ), China; Tsinghua Shenzhen International Graduate School, Tsinghua University, China; School of Information Technology, Carleton University, Canada
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures

Abstract:While Large Language Models (LLMs) adapt well to downstream tasks after fine-tuning, this adaptability often compromises prompt robustness, as even minor prompt variations can significantly degrade performance. To address this, we propose Prompt-Agnostic Fine-Tuning(PAFT), a simple yet effective approach that dynamically adjusts prompts during fine-tuning. This encourages the model to learn underlying task principles rather than overfitting to specific prompt formulations. PAFT operates in two stages: First, a diverse set of meaningful, synthetic candidate prompts is constructed. Second, during fine-tuning, prompts are randomly sampled from this set to create dynamic training inputs. Extensive experiments across diverse datasets and LLMs demonstrate that models trained with PAFT exhibit strong robustness and generalization across a wide range of prompts, including unseen ones. This enhanced robustness improves both model performance and inference speed while maintaining training efficiency. Ablation studies further confirm the effectiveness of PAFT.
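The second PAFT stage, dynamic prompt sampling, reduces to pairing each training example with a template drawn at random from a candidate pool. The sketch below illustrates that idea; the templates are illustrative stand-ins, not the paper's curated candidate set.

```python
# A minimal sketch of PAFT-style training input construction: a prompt
# template is re-sampled for each example so no single formulation is overfit.
import random

TEMPLATES = [
    "Question: {q}\nAnswer:",
    "Please solve the following problem.\n{q}",
    "You are a helpful assistant. {q}",
]

def build_training_example(question, answer, rng):
    template = rng.choice(TEMPLATES)       # re-sampled every step/epoch
    return {"input": template.format(q=question), "target": answer}

rng = random.Random(42)
for _ in range(3):
    print(build_training_example("What is 7 * 8?", "56", rng)["input"])
```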

[NLP-57] Rejected Dialects: Biases Against African American Language in Reward Models NAACL

【Quick Read】: This paper addresses the safety, fairness, and representation issues caused by biases in reward models, particularly toward African American Language (AAL). The key is a framework for evaluating dialect biases in reward models, applied in a case study comparing reward model preferences and behavior on paired White Mainstream English (WME) and AAL corpora. The experiments show that reward models are less aligned with human preferences on AAL texts, frequently disprefer AAL-aligned texts, and steer conversations toward WME even when prompted with AAL. The study exposes anti-AAL biases at a relatively understudied stage of LLM development, highlighting representational harms and ethical concerns.

Link: https://arxiv.org/abs/2502.12858
Authors: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap
Institutions: Carnegie Mellon University; Columbia University; Instituto Superior Técnico, University of Lisbon; Instituto de Telecomunicações
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to NAACL Findings 2025

Abstract:Preference alignment via reward models helps build safe, helpful, and reliable large language models (LLMs). However, subjectivity in preference judgments and the lack of representative sampling in preference data collection can introduce new biases, hindering reward models’ fairness and equity. In this work, we introduce a framework for evaluating dialect biases in reward models and conduct a case study on biases against African American Language (AAL) through several experiments comparing reward model preferences and behavior on paired White Mainstream English (WME) and both machine-translated and human-written AAL corpora. We show that reward models are less aligned with human preferences when processing AAL texts vs. WME ones (-4% accuracy on average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and steer conversations toward WME, even when prompted with AAL texts. Our findings provide a targeted analysis of anti-AAL biases at a relatively understudied stage in LLM development, highlighting representational harms and ethical questions about the desired behavior of LLMs concerning AAL.

[NLP-58] Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models

【Quick Read】: This paper addresses the weak mathematical reasoning of smaller models. Although large pre-trained models excel at many reasoning tasks, smaller models still struggle with arithmetic computation, which leads to errors in mathematical reasoning. The key solution is to leverage a programmatically generated arithmetic dataset through two main approaches: (1) intermediate fine-tuning, where the model is first fine-tuned on the arithmetic dataset and then trained on a reasoning dataset; and (2) integrating the arithmetic dataset into the instruction-tuning mixture, so the model acquires arithmetic skills alongside general instruction-following abilities. Experiments show that either way of incorporating the arithmetic dataset improves the models' arithmetic capabilities and, in turn, their mathematical reasoning performance.

Link: https://arxiv.org/abs/2502.12855
Authors: Neeraj Gangwar, Suma P Bhat, Nickvash Kani
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:While large models pre-trained on high-quality data exhibit excellent performance across various reasoning tasks, including mathematical reasoning (e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical reasoning remains a challenging problem. Common approaches to address this challenge include knowledge distillation, where smaller student models learn from large pre-trained teacher models, and data augmentation, such as rephrasing questions. Despite these efforts, smaller models struggle with arithmetic computations, leading to errors in mathematical reasoning. In this work, we focus on leveraging a programmatically generated arithmetic dataset to enhance the reasoning capabilities of smaller models. We investigate two key approaches to incorporate this dataset – (1) intermediate fine-tuning, where a model is fine-tuned on the arithmetic dataset before being trained on a reasoning dataset, and (2) integrating the arithmetic dataset into the instruction-tuning mixture, allowing the model to learn arithmetic skills alongside general instruction-following abilities. Our experiments on multiple reasoning benchmarks demonstrate that incorporating an arithmetic dataset, whether through targeted fine-tuning or within the instruction-tuning mixture, enhances the models’ arithmetic capabilities, which in turn improves their mathematical reasoning performance.

[NLP-59] S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

【Quick Read】: This paper addresses how to improve the reasoning abilities of less powerful base models with limited resources. The key solution is the S^2R framework, which first initializes self-verification and self-correction behaviors in large language models (LLMs) through supervised fine-tuning and then strengthens these skills with outcome-level and process-level reinforcement learning, enabling adaptive refinement of the reasoning process during inference. With only 3.1k self-verification and self-correction initialization samples, Qwen2.5-math-7B improves from 51.0% to 81.6% accuracy, outperforming models trained on an equivalent amount of long-CoT distilled data.

Link: https://arxiv.org/abs/2502.12853
Authors: Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
Institutions: Tencent; Tsinghua University; The University of Hong Kong; Fudan University; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S^2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S^2R. Our code and data are available at this https URL.
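The inference-time behavior the framework instills can be pictured as a simple loop. In the schematic sketch below, generate, verify, and correct are hypothetical stand-ins for calls to the fine-tuned model, not an actual API.

```python
# A schematic sketch of the self-verify / self-correct loop that S^2R trains.
def solve_with_s2r(question, generate, verify, correct, max_rounds=3):
    answer = generate(question)
    for _ in range(max_rounds):
        if verify(question, answer):        # model judges its own answer
            return answer
        answer = correct(question, answer)  # model revises before re-checking
    return answer

# Toy stand-ins that "fix" the answer on the first correction round.
gen = lambda q: "42"
ver = lambda q, a: a == "56"
cor = lambda q, a: "56"
print(solve_with_s2r("What is 7 * 8?", gen, ver, cor))  # -> 56
```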

[NLP-60] MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

【Quick Read】: This paper addresses the limited language coverage of existing multilingual vision-language (VL) benchmarks, which lack data for low-resource languages in particular. The key solution is MVL-SIB, a new benchmark covering cross-modal and text-only topical matching in 205 languages, over 100 more than the most multilingual existing VL benchmarks. Using this benchmark, the paper evaluates a range of open-weight large vision-language models (LVLMs), revealing their limitations on low-resource languages and especially their weak cross-modal topic matching.

Link: https://arxiv.org/abs/2502.12852
Authors: Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
Institutions: Center for Artificial Intelligence and Data Science, University of Würzburg, Germany; Language Technology Group, University of Hamburg, Germany
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Existing multilingual vision-language (VL) benchmarks often only cover a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages – over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.

[NLP-61] MeMo: Towards Language Models with Associative Memory Mechanisms

Quick Read: This paper targets the fundamental limits of memorizing text through learning and proposes an architecture that memorizes text directly. The key idea is MeMo, a novel language-modeling architecture that explicitly stores token sequences in layered associative memories, which by design provides transparency and enables model editing, including the ability to forget texts.

Link: https://arxiv.org/abs/2502.12851
Authors: Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
Affiliations: Human-centric ART, University of Rome Tor Vergata; University of Edinburgh; Almawave S.p.A.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.
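
As a concrete intuition for associative memories of this kind, below is a toy correlation-matrix memory in Python. It is a generic illustration of the store/retrieve/forget mechanics that make editing and forgetting possible, not MeMo's actual layered design; all names are illustrative.

```python
import numpy as np

class AssociativeMemory:
    """Toy correlation-matrix associative memory: stores key->value pairs
    as a sum of outer products and retrieves values by a matrix-vector
    product. Subtracting the same outer product removes a pair exactly,
    which is the intuition behind 'forgetting' as model editing."""
    def __init__(self, dim):
        self.W = np.zeros((dim, dim))

    def store(self, key, value):
        self.W += np.outer(value, key)   # superpose the pair

    def forget(self, key, value):
        self.W -= np.outer(value, key)   # exact removal of a stored pair

    def retrieve(self, key):
        return self.W @ key

dim = 256
rng = np.random.default_rng(0)
mem = AssociativeMemory(dim)
# Random (near-orthogonal) codes standing in for two consecutive tokens.
k, v = rng.standard_normal(dim), rng.standard_normal(dim)
k = k / np.linalg.norm(k)
mem.store(k, v)
print(np.allclose(mem.retrieve(k), v, atol=1e-6))  # True
```
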
zh

[NLP-62] Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing

Quick Read: This paper examines the social biases embedded in marketing slogans generated by large language models (LLMs). The study generates finance-related marketing slogans targeting five demographic categories (gender, marital status, age, income level, and education level) and reveals systematic biases in the resulting slogans. The key methodological choice is to assess and detect these biases using relative bias calculations and the Kolmogorov-Smirnov (KS) test, underscoring the need to account for demographic biases, and their broader societal implications, in AI-generated marketing strategies.

Link: https://arxiv.org/abs/2502.12838
Authors: Berk Yilmaz, Huthaifa I. Ashqar
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The recent advances in large language models (LLMs) have revolutionized industries such as finance, marketing, and customer service by enabling sophisticated natural language processing tasks. However, the broad adoption of LLMs brings significant challenges, particularly in the form of social biases that can be embedded within their outputs. Biases related to gender, age, and other sensitive attributes can lead to unfair treatment, raising ethical concerns and risking both company reputation and customer trust. This study examined bias in finance-related marketing slogans generated by LLMs (i.e., ChatGPT) by prompting tailored ads targeting five demographic categories: gender, marital status, age, income level, and education level. A total of 1,700 slogans were generated for 17 unique demographic groups, and key terms were categorized into four thematic groups: empowerment, financial, benefits and features, and personalization. Bias was systematically assessed using relative bias calculations and statistically tested with the Kolmogorov-Smirnov (KS) test against general slogans generated for any individual. Results revealed that marketing slogans are not neutral; rather, they emphasize different themes based on demographic factors. Women, younger individuals, low-income earners, and those with lower education levels receive more distinct messaging compared to older, higher-income, and highly educated individuals. This underscores the need to consider demographic-based biases in AI-generated marketing strategies and their broader societal implications. The findings of this study provide a roadmap for developing more equitable AI systems, highlighting the need for ongoing bias detection and mitigation efforts in LLMs.
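
To make the statistical machinery concrete, here is a minimal sketch of a relative-bias calculation and the KS test the study relies on. The theme-frequency scores and their means are synthetic placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Hypothetical theme-emphasis scores: slogans generated for one demographic
# group vs. generic slogans for any individual (synthetic for illustration).
group_scores = rng.normal(loc=0.62, scale=0.10, size=100)
generic_scores = rng.normal(loc=0.50, scale=0.10, size=100)

# Relative bias: deviation of the group's mean emphasis from the generic mean.
relative_bias = (group_scores.mean() - generic_scores.mean()) / generic_scores.mean()

stat, p_value = ks_2samp(group_scores, generic_scores)
print(f"relative bias = {relative_bias:+.2%}")
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")  # small p -> distributions differ
```
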
zh

[NLP-63] An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation

Quick Read: This paper targets the problems large language models (LLMs) face when analyzing physiological time-series data such as wearable-device recordings. Existing approaches embed raw numerical sequences directly into prompts, which exceeds token limits and raises computational cost. Other studies inject features extracted from the time series into textual prompts or adopt multimodal approaches, but these often produce generic, unreliable outputs because LLMs parse continuous waveforms poorly and inefficiently.

The key solution is an open-source LLM-powered agent built on OpenCHA that integrates user interaction, data sources, and analytical tools to generate accurate health insights. In a case study on a dataset of PPG and ECG recordings, the agent's performance on heart rate (HR) estimation from PPG signals is benchmarked against OpenAI GPT-4o-mini and GPT-4o; the agent significantly outperforms these baselines, achieving lower error rates and more reliable HR estimates.

Link: https://arxiv.org/abs/2502.12836
Authors: Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
Affiliations: Department of Computing, University of Turku, Finland; Department of Computer Science, University of California, Irvine, USA; School of Nursing, University of California, Irvine, USA
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are revolutionizing healthcare by improving diagnosis, patient care, and decision support through interactive communication. More recently, they have been applied to analyzing physiological time-series like wearable data for health insight extraction. Existing methods embed raw numerical sequences directly into prompts, which exceeds token limits and increases computational costs. Additionally, some studies integrated features extracted from time-series in textual prompts or applied multimodal approaches. However, these methods often produce generic and unreliable outputs due to LLMs’ limited analytical rigor and inefficiency in interpreting continuous waveforms. In this paper, we develop an LLM-powered agent for physiological time-series analysis aimed at bridging the gap in integrating LLMs with well-established analytical tools. Built on OpenCHA, an open-source LLM-powered framework, our agent features an orchestrator that integrates user interaction, data sources, and analytical tools to generate accurate health insights. To evaluate its effectiveness, we implement a case study on heart rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study. The agent’s performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o, with ECG serving as the gold standard for HR estimation. Results demonstrate that our agent significantly outperforms benchmark models by achieving lower error rates and more reliable HR estimations. The agent implementation is publicly available on GitHub.
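
The kind of well-established analytical tool such an orchestrator can call is easy to sketch. Below is an illustrative peak-detection HR estimator for a PPG segment, with a synthetic signal and arbitrary thresholds; it is not the paper's implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_hr_from_ppg(ppg, fs):
    """One analytical tool an agent like this could call: estimate heart
    rate (bpm) from a PPG segment via peak detection (illustrative only)."""
    # Require peaks at least 0.4 s apart (caps HR near 150 bpm in this demo).
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs))
    if len(peaks) < 2:
        return None
    ibi = np.diff(peaks) / fs           # inter-beat intervals in seconds
    return 60.0 / ibi.mean()            # beats per minute

# Synthetic 10 s PPG-like signal at 100 Hz with a 72 bpm rhythm.
fs, hr_true = 100, 72
rng = np.random.default_rng(0)
t = np.arange(0, 10, 1 / fs)
ppg = np.sin(2 * np.pi * (hr_true / 60) * t) + 0.05 * rng.standard_normal(t.size)
print(round(estimate_hr_from_ppg(ppg, fs), 1))  # ~72.0
```
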
zh

[NLP-64] Subword models struggle with word learning but surprisal hides it

Quick Read: This paper investigates word learning in subword and character language models using the psycholinguistic lexical decision task. The key finding is that character LMs distinguish words from non-words easily and consistently, whereas subword LMs struggle to do so accurately. Moreover, in character LMs word learning precedes syntactic learning, while in subword LMs the two proceed simultaneously. The paper therefore positions character LMs as a viable alternative for modeling language acquisition.

Link: https://arxiv.org/abs/2502.12835
Authors: Bastian Bunzeck, Sina Zarrieß
Affiliations: Bielefeld University
Subjects: Computation and Language (cs.CL)
Comments: 12 pages

Click to view abstract

Abstract:We study word learning in subword and character language models with the psycholinguistic lexical decision task. While subword LMs struggle to discern words and non-words with high accuracy, character LMs solve this task easily and consistently. Furthermore, when comparing word learning and syntactic learning, the two processes are separable in character LMs, where word learning predates syntactic learning, whereas they are simultaneous in subword LMs. This raises questions about the adequacy of subword LMs for modeling language acquisition and positions character LMs as a viable alternative.
zh

[NLP-65] KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

Quick Read: This paper addresses the underrepresentation of Kazakhstan's culture and language in natural language processing. Despite worldwide progress in large language models (LLMs), work on Kazakh has been limited, with few dedicated models and benchmark evaluations. The key contribution is KazMMLU, the first MMLU-style dataset designed specifically for Kazakh. It contains 23,000 questions spanning different educational levels, including 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan's bilingual education system and rich local context. An evaluation of several state-of-the-art multilingual models shows substantial room for improvement in both Kazakh and Russian.

Link: https://arxiv.org/abs/2502.12829
Authors: Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
Affiliations: Department of Natural Language Processing, MBZUAI
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Despite having a population of twenty million, Kazakhstan’s culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan’s bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.
zh

[NLP-66] Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

Quick Read: This paper asks whether developers should switch to a new LLM whenever it offers better performance or lower cost. The key contribution is to reveal stark differences in trusting behavior across LLMs, using a game-theoretic model of trust from behavioral economics to contrast OpenAI's o1-mini and o3-mini models with DeepSeek's models. The authors argue that while performance benchmarks matter, relying on them alone can miss subtler behavioral changes induced by a model switch, and that careful analysis of LLMs' hidden fault lines should be part of an organization's AI strategy.

Link: https://arxiv.org/abs/2502.12825
Authors: Rubing Lu, João Sedoc, Arun Sundararajan
Affiliations: New York University; Stern School of Business
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI’s and DeepSeek’s models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek’s more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization’s AI strategy.
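
The abstract does not name the exact game, but the classic Berg et al. investment ("trust") game is the standard behavioral-economics model of trust, so a minimal version is sketched below as a reference point; the endowment, multiplier, and fractions are illustrative, and the paper's setup may differ.

```python
def trust_game(endowment, sent_fraction, returned_fraction, multiplier=3):
    """Classic investment ('trust') game: the trustor sends a fraction of
    its endowment, the amount is multiplied, and the trustee returns some
    share. Payoffs quantify how much trusting paid off."""
    sent = endowment * sent_fraction              # trustor's transfer
    pot = sent * multiplier                       # transfer is multiplied
    returned = pot * returned_fraction            # trustee sends some back
    trustor_payoff = endowment - sent + returned
    trustee_payoff = pot - returned
    return trustor_payoff, trustee_payoff

# A model that never trusts locks in the safe payoff (10, 0); trusting
# fully with a fair partner yields (15, 15) for both players.
print(trust_game(10, 0.0, 0.5))   # (10.0, 0.0)
print(trust_game(10, 1.0, 0.5))   # (15.0, 15.0)
```
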
zh

[NLP-67] Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models

Quick Read: This paper probes large language models (LLMs) with a redefinition task: well-known physical constants and units of measure are assigned alternative values, and the models are prompted to respond accordingly, exposing latent reasoning flaws. The findings show that as models scale up, performance not only degrades but false confidence also rises. Factors such as prompting strategy and response format are influential, yet they do not fully prevent LLMs from anchoring to memorized values. The key idea is to use this redefinition task to assess and understand the reasoning abilities and limitations of LLMs.

Link: https://arxiv.org/abs/2502.12821
Authors: Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
Affiliations: National Technical University of Athens
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Inverse tasks can uncover potential reasoning gaps as Large Language Models (LLMs) scale up. In this work, we explore the redefinition task, in which we assign alternative values to well-known physical constants and units of measure, prompting LLMs to respond accordingly. Our findings show that not only does model performance degrade with scale, but its false confidence also rises. Moreover, while factors such as prompting strategies or response formatting are influential, they do not preclude LLMs from anchoring to memorized values.
zh

[NLP-68] Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models

Quick Read: This paper explores using large language models (LLMs) to generate synthetic users and simulate their conversations with a task-oriented dialogue system. The key contribution is a comprehensive novel approach in which LLMs create diverse user profiles, set goals, engage in multi-turn dialogues, and evaluate conversation success. Two proprietary LLMs (GPT-4o and GPT-o1) generate a heterogeneous base of user profiles with varied demographics, multiple user goals, different conversational styles, initial knowledge levels, interests, and conversational objectives. The analysis shows that GPT-o1 produces more heterogeneous distributions across most user attributes, while GPT-4o produces more skewed ones. The generated profiles are then used to simulate dialogue sessions with a task-oriented dialogue system.

Link: https://arxiv.org/abs/2502.12813
Authors: Adnan Ahmad, Stefan Hillmann, Sebastian Möller
Affiliations: Technische Universität Berlin
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:In this study, we explore the application of Large Language Models (LLMs) for generating synthetic users and simulating user conversations with a task-oriented dialogue system and present detailed results and their analysis. We propose a comprehensive novel approach to user simulation technique that uses LLMs to create diverse user profiles, set goals, engage in multi-turn dialogues, and evaluate the conversation success. We employ two proprietary LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a heterogeneous base of user profiles, characterized by varied demographics, multiple user goals, different conversational styles, initial knowledge levels, interests, and conversational objectives. We perform a detailed analysis of the user profiles generated by LLMs to assess the diversity, consistency, and potential biases inherent in these LLM-generated user simulations. We find that GPT-o1 generates more heterogeneous user distribution across most user attributes, while GPT-4o generates more skewed user attributes. The generated set of user profiles are then utilized to simulate dialogue sessions by interacting with a task-oriented dialogue system.
zh

[NLP-69] Towards Text-Image Interleaved Retrieval

Quick Read: This paper addresses the limits of multimodal information retrieval in real-world applications involving multiple images and interleaved text-image content. It introduces the text-image interleaved retrieval (TIIR) task and builds a benchmark from naturally interleaved wikiHow tutorials. The key solution is a novel Matryoshka Multimodal Embedder (MME) that compresses the number of visual tokens at different granularities, tackling the problem of excessive visual tokens in TIIR models based on multimodal large language models (MLLMs). Experiments show that simply adapting existing models does not consistently yield effective results, whereas MME brings significant improvements with substantially fewer visual tokens.

Link: https://arxiv.org/abs/2502.12799
Authors: Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen; The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 16 pages, 14 figures

Click to view abstract

Abstract:Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline by interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaption of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline by substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
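
One way to picture compressing visual tokens at multiple granularities is simple average pooling over contiguous chunks, as sketched below. This is illustrative only: the actual MME learns its compression inside an MLLM rather than using fixed pooling.

```python
import numpy as np

def compress_visual_tokens(tokens, keep):
    """Matryoshka-style sketch: pool a sequence of visual token embeddings
    down to `keep` tokens by averaging contiguous chunks."""
    chunks = np.array_split(tokens, keep, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

tokens = np.random.randn(576, 1024)        # e.g., one image's patch tokens
for keep in (64, 16, 4):                   # coarser and coarser granularity
    print(compress_visual_tokens(tokens, keep).shape)
# -> (64, 1024), then (16, 1024), then (4, 1024)
```
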
zh

[NLP-70] Commonsense Reasoning in Arab Culture

Quick Read: This paper addresses the reliance of commonsense reasoning evaluation for Arabic LLMs on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. The key solution is a new commonsense reasoning dataset in Modern Standard Arabic (MSA) covering the cultures of 13 Arab countries. Native speakers wrote and validated culturally grounded questions for their respective countries; the dataset spans 12 daily-life domains with 54 fine-grained subtopics, reflecting diverse social norms, traditions, and everyday experiences. The results highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.

Link: https://arxiv.org/abs/2502.12788
Authors: Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
Affiliations: Department of Natural Language Processing, MBZUAI; SDAIA; Al-Balqa Applied University; Khalifa University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. It spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings underscore significant performance gaps compared to high-resource languages and highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world. Data and code will be made available upon acceptance.
zh

[NLP-71] Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach

Quick Read: This paper targets the limitations of traditional models for predicting brain responses, which rely on linear mappings from unimodal features and cannot capture how auditory signals are integrated with linguistic and semantic information across widespread brain networks during speech comprehension. The key solution is a nonlinear, multimodal prediction model that combines audio and linguistic features from pretrained models (e.g., LLAMA, Whisper). This approach substantially improves prediction performance, surpassing prior state-of-the-art models and laying the groundwork for future in-silico testing and improved decoding performance.

Link: https://arxiv.org/abs/2502.12771
Authors: Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments:

Click to view abstract

Abstract:Self-supervised language and audio models effectively predict brain responses to speech. However, traditional prediction models rely on linear mappings from unimodal features, despite the complex integration of auditory signals with linguistic and semantic information across widespread brain networks during speech comprehension. Here, we introduce a nonlinear, multimodal prediction model that combines audio and linguistic features from pre-trained models (e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in prediction performance (unnormalized and normalized correlation) over traditional unimodal linear models, as well as a 7.7% and 14.4% improvement, respectively, over prior state-of-the-art models. These improvements represent a major step towards future robust in-silico testing and improved decoding performance. They also reveal how auditory and semantic information are fused in motor, somatosensory, and higher-level semantic regions, aligning with existing neurolinguistic theories. Overall, our work highlights the often neglected potential of nonlinear and multimodal approaches to brain modeling, paving the way for future studies to embrace these strategies in naturalistic neurolinguistics research.
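
A minimal version of "nonlinear and multimodal" in an encoding-model setting looks like the sketch below: concatenated audio and language features feed a small MLP that predicts voxel responses. All shapes and data are synthetic placeholders; the paper's feature extraction and architecture are richer.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_trs, d_audio, d_text, n_voxels = 500, 128, 256, 10

# Stand-ins for time-aligned audio (e.g., Whisper) and language (e.g.,
# LLAMA) features and fMRI responses; real features would come from
# pretrained models, downsampled to the fMRI TR rate.
X = np.hstack([rng.standard_normal((n_trs, d_audio)),
               rng.standard_normal((n_trs, d_text))])
Y = rng.standard_normal((n_trs, n_voxels))

# Nonlinear multimodal encoder: one shared MLP instead of a linear map.
model = MLPRegressor(hidden_layer_sizes=(256,), max_iter=200,
                     random_state=0).fit(X, Y)
print(model.predict(X).shape)  # (500, 10): predicted response per voxel
```
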
zh

[NLP-72] How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild

Quick Read: This paper quantifies hallucination, the tendency of large language models (LLMs) to generate non-factual or unfaithful responses, in multilingual knowledge-intensive long-form question answering. The key solution is to train a multilingual hallucination detection model: starting from an English hallucination detection dataset, machine translation (MT) generates (noisy) training data for other languages, complemented by manually annotated gold data for five high-resource languages. The authors also construct a knowledge-intensive QA dataset for 30 languages, using LLM-generated prompts and Wikipedia articles as references, to estimate hallucination rates and examine how they relate to a language's resource level.

Link: https://arxiv.org/abs/2502.12769
Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
Affiliations: WüNLP, CAIDAS, University of Würzburg; Data Science Group, University of Hamburg
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Click to view abstract

Abstract:In the age of misinformation, hallucination – the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses – represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common “in the wild” than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rates estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit larger hallucination rates than larger models.
zh

[NLP-73] R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Quick Read: This paper addresses the rigidity of existing frameworks for reasoning over knowledge graphs (KGs) and their heavy reliance on powerful large language models (LLMs). The key solution is R2-KG, a plug-and-play dual-agent framework that splits reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes the final judgment. R2-KG additionally employs an abstention mechanism, generating an answer only once sufficient evidence has been collected from the KG, which significantly improves reliability. The design is cost-efficient for LLM inference while maintaining strong reasoning accuracy.

Link: https://arxiv.org/abs/2502.12767
Authors: Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
Affiliations: KAIST
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks are often rigid, struggling to adapt to KG or task changes. They also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across multiple KG-based reasoning tasks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability while reducing inference cost. However, it also leads to a higher abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning. It reduces reliance on high-capacity LLMs while ensuring trustworthy inference.
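
The operator/supervisor division of labor reduces to a short control loop. The sketch below uses hypothetical `next_query`, `lookup`, and `judge` interfaces to show the flow, including the abstention path; it is not the paper's code.

```python
def r2kg_answer(operator, supervisor, kg, question, max_steps=5):
    """Minimal sketch of a dual-agent loop in the spirit of R2-KG:
    a cheap Operator proposes KG lookups and gathers evidence, while a
    strong Supervisor judges sufficiency and produces the final answer."""
    evidence = []
    for _ in range(max_steps):
        query = operator.next_query(question, evidence)   # cheap model
        evidence.extend(kg.lookup(query))                 # KG access
        verdict = supervisor.judge(question, evidence)    # strong model
        if verdict.sufficient:
            return verdict.answer
    return None  # abstain: evidence never became sufficient
```

The cost saving comes from the loop body: the expensive Supervisor is called only to judge accumulated evidence, while the iterative exploration runs on the low-capacity Operator.
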
zh

[NLP-74] Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models

Quick Read: This paper targets the balance between efficiency and translation quality in machine translation (MT) corpus generation. The key solution integrates a semi-automated, human-in-the-loop post-editing pipeline with large language models (LLMs), introducing novel LLM features such as Enhanced Translation Synthesis and Assisted Annotation Analysis to improve initial translation hypotheses and quality assessments. In addition, LLM-Driven Pseudo Labeling and a Translation Recommendation System reduce the human annotators' workload in specific contexts, further raising the overall effectiveness of the system.

Link: https://arxiv.org/abs/2502.12755
Authors: Kamer Ali Yuksel, Ahmet Gunduz, Abdul Baseet Anees, Hassan Sawaf
Affiliations: aiXplain Inc., Los Gatos, CA, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:This paper introduces an advanced methodology for machine translation (MT) corpus generation, integrating semi-automated, human-in-the-loop post-editing with large language models (LLMs) to enhance efficiency and translation quality. Building upon previous work that utilized real-time training of a custom MT quality estimation metric, this system incorporates novel LLM features such as Enhanced Translation Synthesis and Assisted Annotation Analysis, which improve initial translation hypotheses and quality assessments, respectively. Additionally, the system employs LLM-Driven Pseudo Labeling and a Translation Recommendation System to reduce human annotator workload in specific contexts. These improvements not only retain the original benefits of cost reduction and enhanced post-edit quality but also open new avenues for leveraging cutting-edge LLM advancements. The project’s source code is available for community use, promoting collaborative developments in the field. The demo video can be accessed here.
zh

[NLP-75] MediaMind: Revolutionizing Media Monitoring using Agentification

Quick Read: This paper examines how agentification can transform existing software tools into intelligent agents capable of independent decision-making and dynamic interaction. The key lies in an agent-based architecture that enables MediaMind to autonomously monitor and analyze multilingual media content and deliver insights in real time. This approach markedly improves adaptability, efficiency, and responsiveness, streamlining organizational workflows and decision-making and helping organizations respond to evolving trends.

Link: https://arxiv.org/abs/2502.12745
Authors: Ahmet Gunduz, Kamer Ali Yuksel, Hassan Sawaf
Affiliations: aixplain
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In an era of rapid technological advancements, agentification of software tools has emerged as a critical innovation, enabling systems to function autonomously and adaptively. This paper introduces MediaMind as a case study to demonstrate the agentification process, highlighting how existing software can be transformed into intelligent agents capable of independent decision-making and dynamic interaction. Developed by aiXplain, MediaMind leverages agent-based architecture to autonomously monitor, analyze, and provide insights from multilingual media content in real time. The focus of this paper is on the technical methodologies and design principles behind agentifying MediaMind, showcasing how agentification enhances adaptability, efficiency, and responsiveness. Through detailed case studies and practical examples, we illustrate how the agentification of MediaMind empowers organizations to streamline workflows, optimize decision-making, and respond to evolving trends. This work underscores the broader potential of agentification to revolutionize software tools across various domains.
zh

[NLP-76] Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation ICASSP2025

Quick Read: This paper addresses the weak reasoning abilities of small language models. The key contribution is Self-Enhanced Reasoning Training (SERT), which activates and exploits the latent reasoning capabilities of small models by self-training on filtered, self-generated reasoning paths under zero-shot conditions, thereby strengthening their reasoning.

Link: https://arxiv.org/abs/2502.12744
Authors: Yong Zhang, Bingyuan Zhang, Zhitao Li, Ming Li, Ning Cheng, Minchuan Chen, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao
Affiliations: Ping An Technology (Shenzhen) Co., Ltd., China
Subjects: Computation and Language (cs.CL)
Comments: Accepted by the 50th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Click to view abstract

Abstract:The rapid advancement of large language models (LLMs) has significantly enhanced their reasoning abilities, enabling increasingly complex tasks. However, these capabilities often diminish in smaller, more computationally efficient models like GPT-2. Recent research shows that reasoning distillation can help small models acquire reasoning capabilities, but most existing methods focus primarily on improving teacher-generated reasoning paths. Our observations reveal that small models can generate high-quality reasoning paths during sampling, even without chain-of-thought prompting, though these paths are often latent due to their low probability under standard decoding strategies. To address this, we propose Self-Enhanced Reasoning Training (SERT), which activates and leverages latent reasoning capabilities in small models through self-training on filtered, self-generated reasoning paths under zero-shot conditions. Experiments using OpenAI’s GPT-3.5 as the teacher model and GPT-2 models as the student models demonstrate that SERT enhances the reasoning abilities of small models, improving their performance in reasoning distillation.
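
A minimal sketch of the data-construction side of such self-training is below. The `model.sample` interface, the `prob.answer` field, and the answer-match filter are all assumptions for illustration; the paper's actual filtering criteria may differ.

```python
import re

def extract_answer(path: str) -> str:
    """Hypothetical helper: pull the final number out of a reasoning path."""
    nums = re.findall(r"-?\d+\.?\d*", path)
    return nums[-1] if nums else ""

def build_self_training_data(model, problems, n_samples=8):
    """Sketch of SERT-style data construction: sample zero-shot reasoning
    paths from the small model itself, keep only those passing a filter
    (here, agreement with the reference answer), and reuse them as
    fine-tuning targets for the same model."""
    data = []
    for prob in problems:
        for _ in range(n_samples):
            path = model.sample(prob.question, temperature=0.8)
            if extract_answer(path) == prob.answer:   # keep promising paths
                data.append({"prompt": prob.question, "target": path})
    return data  # then fine-tune the small model on `data`
```
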
zh

[NLP-77] “I know myself better but not really greatly”: Using LLMs to Detect and Explain LLM-Generated Texts

Quick Read: This paper studies how well LLM-based detectors can detect and explain LLM-generated texts, in both a binary classification task (human-generated vs. LLM-generated) and a ternary task that adds an "undecided" class. The key findings are that self-detection (an LLM detecting texts it generated itself) consistently outperforms cross-detection yet remains far from ideal, and that extending the binary task to the ternary setting with the "Undecided" class improves both detection accuracy and explanation quality, with statistically significant and consistent gains across all LLMs. A thorough qualitative and quantitative analysis of explanation errors identifies three main types: reliance on inaccurate features (the most frequent), hallucinations, and incorrect reasoning. The paper stresses the need for further research on improving self-detection and self-explanation, particularly to address overfitting that may hinder generalization.

Link: https://arxiv.org/abs/2502.12743
Authors: Jiazhou Ji, Jie Guo, Weidong Qiu, Zheng Huang, Yang Xu, Xinru Lu, Xiaoyu Jiang, Ruizhe Li, Shujun Li
Affiliations: School of Cyber Science and Engineering, Shanghai Jiao Tong University, China; Department of Computing Science, University of Aberdeen, UK; Institute of Cyber Security for Society (iCSS) & School of Computing, University of Kent, UK
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in generating human-like texts, but the potential misuse of such LLM-generated texts raises the need to distinguish between human-generated and LLM-generated content. This paper explores the detection and explanation capabilities of LLM-based detectors of LLM-generated texts, in the context of a binary classification task (human-generated texts vs LLM-generated texts) and a ternary classification task (human-generated texts, LLM-generated texts, and undecided). By evaluating on six close/open-source LLMs with different sizes, our findings reveal that while self-detection consistently outperforms cross-detection, i.e., LLMs can detect texts generated by themselves more accurately than those generated by other LLMs, the performance of self-detection is still far from ideal, indicating that further improvements are needed. We also show that extending the binary to the ternary classification task with a new class “Undecided” can enhance both detection accuracy and explanation quality, with improvements being statistically significant and consistent across all LLMs. We finally conducted comprehensive qualitative and quantitative analyses on the explanation errors, which are categorized into three types: reliance on inaccurate features (the most frequent error), hallucinations, and incorrect reasoning. These findings with our human-annotated dataset emphasize the need for further research into improving both self-detection and self-explanation, particularly to address overfitting issues that may hinder generalization.
zh

[NLP-78] Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation

Quick Read: This paper addresses the poor performance of knowledge base question answering (KBQA) systems when they encounter unseen knowledge base elements at test time. The key solution is SG-KBQA, a model that injects schema contexts into entity retrieval and logical form generation. By exploiting the richer semantics and the awareness of knowledge base structure that schema contexts provide, SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings.

Link: https://arxiv.org/abs/2502.12737
Authors: Shengxiang Gao, Jey Han Lau, Jianzhong Qi
Affiliations: School of Computing and Information Systems, The University of Melbourne
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages

Click to view abstract

Abstract:Knowledge base question answering (KBQA) aims to answer user questions in natural language using rich human knowledge stored in large KBs. As current KBQA methods struggle with unseen knowledge base elements at test time, we introduce SG-KBQA: a novel model that injects schema contexts into entity retrieval and logical form generation to tackle this issue. It uses the richer semantics and awareness of the knowledge base structure provided by schema contexts to enhance generalizability. We show that SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings. Code will be released upon paper publication.
zh

[NLP-79] Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training ACL2025

Quick Read: This paper targets the vulnerability of machine-generated text (MGT) detection to simple perturbations and adversarial attacks. To build an effective defense, it proposes GREedy Adversary PromoTed DefendER (GREATER), an adversarial framework for training a robust MGT detector. The key components are an adversary, GREATER-A, and a detector, GREATER-D. GREATER-A identifies and perturbs critical tokens in embedding space, combined with greedy search and pruning, to craft stealthy and disruptive adversarial examples; GREATER-D learns to defend against GREATER-A's attacks and generalizes its defense to other attack types. Updating GREATER-A and GREATER-D synchronously further improves the detector's ability to withstand different attacks and varying attack intensities.

Link: https://arxiv.org/abs/2502.12734
Authors: Yuanfan Li, Zhaohan Zhang, Chengzhengxu Li, Chao Shen, Xiaoming Liu
Affiliations: Faculty of Electronic and Information Engineering, Xi'an Jiaotong University; Queen Mary University of London
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Submitted to ACL 2025, Preprint, Under review

Click to view abstract

Abstract:Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While the existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat modeling perspective, that is, analyzing the model’s vulnerability from an adversary’s point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). The GREATER consists of two key components: an adversary GREATER-A and a detector GREATER-D. The GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, along with greedy search and pruning to generate stealthy and disruptive adversarial examples. Besides, we update the GREATER-A and GREATER-D synchronously, encouraging the GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 9 text perturbation strategies and 5 adversarial attacks show that our GREATER-D reduces the Attack Success Rate (ASR) by 10.61% compared with SOTA defense methods while our GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches.
zh

[NLP-80] Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge NAACL

Quick Read: This paper provides a proof of concept that tabletop role-playing game (TTRPG) audio can serve as a challenge for speaker diarization systems. The key contribution is a small TTRPG audio dataset, compared against the AMI and ICSI corpora, on which two diarization systems (this http URL and wespeaker) are evaluated. The voice conversion that players perform when speaking as fictional characters raises the confusion rate of both systems, and wespeaker in particular severely underestimates the number of speakers in the TTRPG audio files. The paper proposes TTRPG audio as a promising challenge for evaluating diarization systems.

Link: https://arxiv.org/abs/2502.12714
Authors: Lian Remme, Kevin Tang
Affiliations: Heinrich Heine University Düsseldorf
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: 15 pages, 14 figures, published in NAACL Findings 2025

Click to view abstract

Abstract:This paper provides a proof of concept that audio of tabletop role-playing games (TTRPG) could serve as a challenge for diarization systems. TTRPGs are carried out mostly by conversation. Participants often alter their voices to indicate that they are talking as a fictional character. Audio processing systems are susceptible to voice conversion with or without technological assistance. TTRPG present a conversational phenomenon in which voice conversion is an inherent characteristic for an immersive gaming experience. This could make it more challenging for diarizers to pick the real speaker and determine that impersonating is just that. We present the creation of a small TTRPG audio dataset and compare it against the AMI and the ICSI corpus. The performance of two diarizers, this http URL and wespeaker, were evaluated. We observed that TTRPGs’ properties result in a higher confusion rate for both diarizers. Additionally, wespeaker strongly underestimates the number of speakers in the TTRPG audio files. We propose TTRPG audio as a promising challenge for diarization systems.
zh

[NLP-81] Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral

Quick Read: This paper addresses the tension between the superior quality of large machine translation models and their high computational cost. The key solution is a cascaded approach that uses existing quality estimation (QE) metrics as deferral rules, passing only a fraction of instances on to the stronger, larger model. With QE-based deferral, the cascaded system matches the large model's performance while invoking it on only 30% to 50% of the examples, significantly reducing computational cost.

Link: https://arxiv.org/abs/2502.12701
Authors: António Farinhas, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, André F.T. Martins
Affiliations: Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; MICS, CentraleSupélec, Université Paris-Saclay; Unbabel; ELLIS Unit Lisbon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Click to view abstract

Abstract:Larger models often outperform smaller ones but come with high computational costs. Cascading offers a potential solution. By default, it uses smaller models and defers only some instances to larger, more powerful models. However, designing effective deferral rules remains a challenge. In this paper, we propose a simple yet effective approach for machine translation, using existing quality estimation (QE) metrics as deferral rules. We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it for a small fraction (30% to 50%) of the examples, significantly reducing computational costs. We validate this approach through both automatic and human evaluation.
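
The deferral rule itself is a one-liner. The sketch below shows the control flow with hypothetical `small_mt`, `large_mt`, and `qe_score` callables; in practice the threshold would be tuned on a dev set to hit a target deferral rate (e.g., the paper's 30% to 50%).

```python
def cascaded_translate(small_mt, large_mt, qe_score, src, threshold=0.85):
    """Quality-aware deferral sketch (hypothetical interfaces): translate
    with the small model first and defer to the large model only when the
    QE metric scores the draft below a threshold."""
    draft = small_mt(src)
    if qe_score(src, draft) >= threshold:
        return draft              # cheap translation judged good enough
    return large_mt(src)          # defer the hard case to the large model
```

Because QE metrics are reference-free, this rule can be applied at inference time with no knowledge of the gold translation.
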
zh

[NLP-82] Multi-Novelty: Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming

Quick Read: This paper tackles the limited diversity and novelty of text generated by large language models (LLMs), whose outputs tend to be repetitive or overly deterministic and thus lack creativity. The key solution is an inference-time multi-view brainstorming method, dubbed "Multi-Novelty", which enriches input prompts with diverse perspectives drawn from both textual and visual sources, enhancing the variety and creativity of the generated content. The approach is model-agnostic: it requires no architectural modifications and works with both open-source and proprietary LLMs.

Link: https://arxiv.org/abs/2502.12700
Authors: Arash Lagzian, Srinivas Anumasa, Dianbo Liu
Affiliations: National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) demonstrate remarkable proficiency in generating accurate and fluent text. However, they often struggle with diversity and novelty, leading to repetitive or overly deterministic responses. These limitations stem from constraints in training data, including gaps in specific knowledge domains, outdated information, and an over-reliance on textual sources. Such shortcomings reduce their effectiveness in tasks requiring creativity, multi-perspective reasoning, and exploratory thinking, such as LLM-based AI scientist agents and creative artist agents. To address this challenge, we introduce an inference-time multi-view brainstorming method, a novel approach that enriches input prompts with diverse perspectives derived from both textual and visual sources, which we refer to as “Multi-Novelty”. By incorporating additional contextual information as diverse starting points for chains of thought, this method enhances the variety and creativity of generated outputs. Importantly, our approach is model-agnostic, requiring no architectural modifications and being compatible with both open-source and proprietary LLMs.
zh

[NLP-83] Theoretical Guarantees for Minimum Bayes Risk Decoding

Quick Read: This paper seeks a theoretical explanation for the effectiveness of Minimum Bayes Risk (MBR) decoding. The key result shows that, under certain assumptions, as the size n of the reference hypothesis set grows, MBR decoding approaches the optimal solution with high probability at a rate of O(n^{-1/2}), even when the language space Y is vastly larger than n. This provides a theoretical basis for the strong performance of MBR decoding observed in several earlier empirical studies, and further indicates that in some cases MBR decoding converges to the optimal solution faster than maximum-a-posteriori (MAP) decoding.

Link: https://arxiv.org/abs/2502.12685
Authors: Yuki Ichihara, Yuu Jinnai, Kaito Ariu, Tetsuro Morimura, Eiji Uchibe
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Minimum Bayes Risk (MBR) decoding optimizes output selection by maximizing the expected utility value of an underlying human distribution. While prior work has shown the effectiveness of MBR decoding through empirical evaluation, few studies have analytically investigated why the method is effective. As a result of our analysis, we show that, given the size n of the reference hypothesis set used in computation, MBR decoding approaches the optimal solution with high probability at a rate of O(n^{-1/2}), under certain assumptions, even though the language space Y is significantly larger (|Y| ≫ n). This result helps to theoretically explain the strong performance observed in several prior empirical studies on MBR decoding. In addition, we provide the performance gap for maximum-a-posteriori (MAP) decoding and compare it to MBR decoding. The result of this paper indicates that MBR decoding tends to converge to the optimal solution faster than MAP decoding in several cases.
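
For readers unfamiliar with the method being analyzed, a minimal Monte Carlo implementation of MBR decoding follows. The unigram-F1 utility and the reuse of the same samples as hypotheses and references are common conventions chosen for illustration, not this paper's setup.

```python
def mbr_decode(hypotheses, references, utility):
    """Monte Carlo MBR decoding: pick the hypothesis with the highest
    average utility against the sampled references."""
    def expected_utility(h):
        return sum(utility(h, r) for r in references) / len(references)
    return max(hypotheses, key=expected_utility)

def unigram_f1(a, b):
    """Toy utility: unigram F1 overlap between two strings."""
    a_set, b_set = set(a.split()), set(b.split())
    if not a_set or not b_set:
        return 0.0
    overlap = len(a_set & b_set)
    p, r = overlap / len(a_set), overlap / len(b_set)
    return 2 * p * r / (p + r) if p + r else 0.0

samples = ["the cat sat", "a cat sat down", "the dog ran"]
print(mbr_decode(samples, samples, unigram_f1))  # the most "central" sample
```

The paper's rate result concerns how fast this sample average concentrates: with n references, the selected output is near-optimal with high probability, with error shrinking like O(n^{-1/2}).
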
zh

[NLP-84] Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees ICLR NEURIPS

Quick Read: This paper addresses two limitations of reinforcement learning from human feedback (RLHF): its poor fit for multi-turn conversations and its inability to handle non-transitive human preferences. Existing methods such as DPO perform well but treat interaction with the language model as a single-step bandit problem and rest on the Bradley-Terry model assumption, which limits their applicability to real-world multi-turn scenarios and fails to capture the complexity of human preferences. The key idea is to model alignment as a two-player constant-sum Markov game in which each player maximizes their win rate against the other across all steps of the conversation. The resulting Multi-step Preference Optimization (MPO) method builds on the natural actor-critic framework, and OMPO extends it with the optimistic online gradient descent algorithm. Theoretically, both algorithms are shown to converge, with OMPO requiring O(ε^{-1}) policy updates to reach an ε-approximate Nash equilibrium; experiments on a multi-turn conversation dataset and a math reasoning dataset validate the approach.

Link: https://arxiv.org/abs/2502.12678
Authors: Yongtao Wu, Luca Viano, Yihang Chen, Zhenyu Zhu, Kimon Antonakopoulos, Quanquan Gu, Volkan Cevher
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted as oral presentation in NeurIPS LanGame Workshop, revised from ICLR submission

Click to view abstract

Abstract:Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach Multi-step Preference Optimization (MPO) is built upon the natural actor-critic framework (Peters and Schaal, 2008). We further develop OMPO based on the optimistic online gradient descent algorithm (Rakhlin and Sridharan, 2013; Joulani et al., 2017). Theoretically, we provide a rigorous analysis for both algorithms on convergence and show that OMPO requires O(ε^{-1}) policy updates to converge to an ε-approximate Nash equilibrium. We also validate the effectiveness of our method on a multi-turn conversation dataset and a math reasoning dataset.
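
The optimization primitive behind OMPO can be shown in isolation. Below is generic optimistic online gradient descent in its standard textbook form, using the previous gradient as the prediction of the next one; this is not the paper's full algorithm, only the update rule it builds on.

```python
import numpy as np

def optimistic_ogd(grad_fn, x0, eta=0.1, steps=100):
    """Optimistic online gradient descent: each step moves along the
    current gradient plus a correction (g - g_prev) that anticipates
    the next gradient, i.e. x <- x - eta * (2g - g_prev)."""
    x, g_prev = np.asarray(x0, dtype=float), 0.0
    for _ in range(steps):
        g = grad_fn(x)
        x = x - eta * (2 * g - g_prev)   # optimistic step
        g_prev = g
    return x

# Sanity check on a quadratic loss (x - 3)^2: converges to the minimizer 3.
print(optimistic_ogd(lambda x: 2 * (x - 3), x0=0.0))
```

With predictable gradient sequences, such optimistic updates enjoy faster convergence than plain gradient descent in games, which is what underpins OMPO's O(ε^{-1}) rate.
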
zh

[NLP-85] Speech-FT: A Fine-tuning Strategy for Enhancing Speech Representation Models Without Compromising Generalization Ability

Quick Read: This paper addresses the loss of generalization ability that typically occurs when speech representation models are fine-tuned to boost performance on specific applications. The key solution is Speech-FT, a fine-tuning strategy that leverages model merging to preserve generalization ability while still benefiting from fine-tuning.

Link: https://arxiv.org/abs/2502.12672
Authors: Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee
Affiliations: National Taiwan University; University of Edinburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Speech representation models are highly effective at extracting general features for various tasks. While fine-tuning can enhance these representations for specific applications, it often compromises their generalization ability. To address this challenge, we propose Speech-FT, a fine-tuning strategy for speech representation models that leverages model merging to preserve generalization ability while still benefiting from fine-tuning. Speech-FT is effective across different fine-tuning scenarios and is compatible with various types of speech representation models, providing a versatile solution. Speech-FT offers an efficient and practical approach to further improving general speech representations after pre-training.
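
One common model-merging recipe is linear interpolation in weight space between the pre-trained and fine-tuned checkpoints; the sketch below shows that recipe as a reference point. Speech-FT's exact merging procedure may differ, and `alpha` here is an illustrative knob.

```python
def merge_state_dicts(pretrained, finetuned, alpha=0.5):
    """Weight-space model merging by linear interpolation: alpha=0 keeps
    the pre-trained (general) model, alpha=1 keeps the fine-tuned one,
    and values in between trade specialization against generalization."""
    return {
        name: (1 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Typical usage with PyTorch-style checkpoints:
# merged = merge_state_dicts(base.state_dict(), tuned.state_dict(), alpha=0.5)
# model.load_state_dict(merged)
```
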
zh

[NLP-86] Baichuan-M1: Pushing the Medical Capability of Large Language Models

Quick Read: This paper addresses the scarcity of medical-domain large language models (LLMs): the complexity of medical knowledge and the limited availability of high-quality data make it hard to build efficient, practical medical LLMs. The key contribution is Baichuan-M1, a series of large language models trained from scratch and specifically optimized for medical applications. Rather than continuing pretraining on existing models or post-training a general base model, it combines effective training strategies that balance general capabilities with medical expertise, so the model performs strongly across broad domains while excelling in specialized medical fields. A smaller version, Baichuan-M1-14B, has been open-sourced.

Link: https://arxiv.org/abs/2502.12671
Authors: Bingning Wang, Haizhou Zhao, Huozhi Zhou, Liang Song, Mingyu Xu, Wei Cheng, Xiangrong Zeng, Yupeng Zhang, Yuqi Huo, Zecheng Wang, Zhengyun Zhao, Da Pan, Fan Yang, Fei Kou, Fei Li, Fuzhong Chen, Guosheng Dong, Han Liu, Hongda Zhang, Jin He, Jinjie Yang, Kangxi Wu, Kegeng Wu, Lei Su, Linlin Niu, Linzhuang Sun, Mang Wang, Pengcheng Fan, Qianli Shen, Rihui Xin, Shunya Dang, Songchi Zhou, Weipeng Chen, Wenjing Luo, Xin Chen, Xin Men, Xionghai Lin, Xuezhen Dong, Yan Zhang, Yifei Duan, Yuyan Zhou, Zhi Ma, Zhiying Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 33 pages, technical report

Click to view abstract

Abstract:The current generation of large language models (LLMs) is typically designed for broad, general-purpose applications, while domain-specific LLMs, especially in vertical fields like medicine, remain relatively scarce. In particular, the development of highly efficient and practical LLMs for the medical domain is challenging due to the complexity of medical knowledge and the limited availability of high-quality data. To bridge this gap, we introduce Baichuan-M1, a series of large language models specifically optimized for medical applications. Unlike traditional approaches that simply continue pretraining on existing models or apply post-training to a general base model, Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical capabilities. Our model is trained on 20 trillion tokens and incorporates a range of effective training methods that strike a balance between general capabilities and medical expertise. As a result, Baichuan-M1 not only performs strongly across general domains such as mathematics and coding but also excels in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini version of our model, which can be accessed through the following links.
zh

[NLP-87] Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

Quick Read: This paper analyzes the effect of regularization strategies on Best-of-N (BoN) sampling and addresses BoN's susceptibility to reward hacking. The key contribution is a new sampling method, Stochastic Regularized BoN sampling (SRBoN), which carries a theoretical worst-case guarantee for BoN sampling in the proxy reward and is validated through empirical evaluation. The paper also proposes a simpler regularized BoN variant, Sentence Length Regularized BoN, which performs better in the experiments.

Link: https://arxiv.org/abs/2502.12668
Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe, Mitsuki Sakamoto, Eiji Uchibe
Affiliations: Nara Institute of Science and Technology; CyberAgent; Advanced Telecommunications Research Institute International
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Since the reward model is an imperfect proxy for the true objective, an excessive focus on optimizing its value can lead to a compromise of its performance on the true objective. Previous work proposed Regularized BoN sampling (RBoN), BoN sampling with regularization toward the objective, and showed empirically that it outperforms BoN sampling and mitigates reward hacking (Jinnai et al., 2024). However, Jinnai et al. (2024) introduce RBoN based on a heuristic and lack an analysis of why such a regularization strategy improves the performance of BoN sampling. The aim of this study is to analyze the effect of regularization strategies on BoN sampling. Using the regularization strategies corresponds to robust optimization, which maximizes the worst case over a set of possible perturbations in the proxy reward. Although the theoretical guarantees are not directly applicable to RBoN, RBoN corresponds to a practical implementation. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case RBoN in proxy reward. We then perform an empirical evaluation using the AlpacaFarm and Anthropic’s hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true proxy reward. In addition, we also propose another simple RBoN method, the Sentence Length Regularized BoN, which performs better in the experiments than the previous methods.
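
As a baseline reference, plain (unregularized) BoN sampling is just an argmax over sampled candidates under the proxy reward, as in the runnable toy below; over-optimizing that proxy is exactly the reward-hacking failure mode that RBoN-style regularization targets. The `generate` and `reward` stand-ins are invented for illustration.

```python
import random

def best_of_n(generate, reward, prompt, n=16):
    """Plain Best-of-N sampling: draw n candidates and return the one
    the (proxy) reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy stand-ins: `generate` samples a number, `reward` prefers values near 7.
random.seed(0)
pick = best_of_n(lambda p: random.uniform(0, 10),
                 lambda p, c: -abs(c - 7), prompt=None, n=16)
print(round(pick, 2))  # close to 7
```
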
zh

[NLP-88] A2ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization

Quick Read: This paper targets the challenges that the large memory footprint and high access overhead of the KV cache pose for efficiently serving long-context large language models (LLMs). The key solution is A²ATS, a novel retrieval-based KV cache reduction method that applies vector quantization to key states to obtain an accurate approximation of attention scores, enabling efficient and precise top-K token retrieval. The method introduces Windowed Rotary Position Embedding to decouple the positional dependency from query and key states after position embedding, and query-aware vector quantization that directly optimizes the attention score approximation objective. A heterogeneous inference architecture for KV cache offloading supports larger batch sizes, enabling long-context serving. Experiments show that A²ATS achieves lower performance degradation with similar or lower overhead than existing methods, increasing long-context serving throughput by up to 2.7×.

Link: https://arxiv.org/abs/2502.12665
Authors: Junhui He, Junna Xing, Nan Wang, Rui Xu, Shangyu Wu, Peng Zhou, Qiang Liu, Chun Jason Xue, Qingan Li
Affiliations: Wuhan University; Alibaba Cloud Computing; Jinan University; City University of Hong Kong; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Long context large language models (LLMs) pose significant challenges for efficient serving due to the large memory footprint and high access overhead of KV cache. Retrieval-based KV cache reduction methods can mitigate these challenges, typically by offloading the complete KV cache to CPU and retrieving necessary tokens on demand during inference. However, these methods still suffer from unsatisfactory accuracy degradation and extra retrieval overhead. To address these limitations, this paper proposes A²ATS, a novel retrieval-based KV cache reduction method. A²ATS aims to obtain an accurate approximation of attention scores by applying the vector quantization technique to key states, thereby enabling efficient and precise retrieval of the top-K tokens. First, we propose Windowed Rotary Position Embedding, which decouples the positional dependency from query and key states after position embedding. Then, we propose query-aware vector quantization that optimizes the objective of attention score approximation directly. Finally, we design the heterogeneous inference architecture for KV cache offloading, enabling long context serving with larger batch sizes. Experimental results demonstrate that A²ATS can achieve a lower performance degradation with similar or lower overhead compared to existing methods, thereby increasing long context serving throughput by up to 2.7×.
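
The core retrieval idea, approximating q·k scores from quantized keys and keeping only the top-K tokens, fits in a few lines. The sketch below omits windowed RoPE and the query-aware quantization objective, and uses random data purely for shape-checking.

```python
import numpy as np

def topk_tokens_via_quantized_keys(query, key_codes, codebook, k=8):
    """Approximate each cached token's attention score from its quantized
    key (its codebook centroid) and return the indices of the top-k
    tokens, whose exact KV entries would then be fetched from CPU."""
    approx_keys = codebook[key_codes]          # (n_tokens, d) reconstruction
    scores = approx_keys @ query               # approximate q.k scores
    return np.argsort(scores)[-k:][::-1]       # top-k token indices

d, n_tokens, n_centroids = 64, 1000, 256
rng = np.random.default_rng(0)
codebook = rng.standard_normal((n_centroids, d))
key_codes = rng.integers(0, n_centroids, size=n_tokens)  # 1 byte per key
print(topk_tokens_via_quantized_keys(rng.standard_normal(d), key_codes, codebook))
```

Storing one code per key instead of the full key vector is what makes the GPU-resident index tiny while the complete KV cache lives on the CPU side.
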
zh

[NLP-89] Demystifying Multilingual Chain-of-Thought in Process Reward Modeling

Quick Read: This paper tackles the challenge of extending process reward models (PRMs) to multilingual settings so as to improve LLMs' complex multi-step reasoning beyond English. The key solution is to train multilingual PRMs on a dataset spanning seven languages, translated from English; on two widely used reasoning benchmarks across 11 languages, these models not only improve average accuracy but also reduce early-stage reasoning errors. The study further examines how the number of training languages and the volume of English data affect multilingual PRMs, and uncovers the benefits of more candidate responses and more trainable parameters.

Link: https://arxiv.org/abs/2502.12663
Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Affiliations: School of Informatics, University of Edinburgh; Monash University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.
zh

[NLP-90] R.R.: Unveiling LLM Training Privacy through Recollection and Ranking

Quick Read: This paper examines the privacy risks of large language models (LLMs), specifically the reconstruction of personally identifiable information (PII) from scrubbed training data. It proposes R.R. (Recollect and Rank), a two-step attack that reconstructs masked PII entities. The key ingredients are a prompting paradigm called "recollection", which instructs the LLM to repeat a masked text while filling in the masks, and a new criterion for scoring and ranking the extracted PII candidates, using a reference model as calibration in the spirit of membership inference.

Link: https://arxiv.org/abs/2502.12658
Authors: Wenlong Meng, Zhenyuan Guo, Lenan Wu, Chen Gong, Wenyan Liu, Weixian Li, Chengkun Wei, Wenzhi Chen
Affiliations: Zhejiang University; University of Virginia; Ant Group
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 9 figures

Click to view abstract

Abstract:Large Language Models (LLMs) pose significant privacy risks, potentially leaking training data due to implicit memorization. Existing privacy attacks primarily focus on membership inference attacks (MIAs) or data extraction attacks, but reconstructing specific personally identifiable information (PII) in LLM’s training data remains challenging. In this paper, we propose R.R. (Recollect and Rank), a novel two-step privacy stealing attack that enables attackers to reconstruct PII entities from scrubbed training data where the PII entities have been masked. In the first stage, we introduce a prompt paradigm named recollection, which instructs the LLM to repeat a masked text but fill in masks. Then we can use PII identifiers to extract recollected PII candidates. In the second stage, we design a new criterion to score each PII candidate and rank them. Motivated by membership inference, we leverage the reference model as a calibration to our criterion. Experiments across three popular PII datasets demonstrate that R.R. achieves better PII identification performance compared to baselines. These results highlight the vulnerability of LLMs to PII leakage even when training data has been scrubbed. We release the replication package of R.R. at a link.
zh

[NLP-91] One Size doesn’t Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction

Quick Read: This paper addresses the failure of large language models (LLMs) in intelligent educational systems to adequately adapt to individual learner characteristics. The key solution is a PersonAlized Conversational tutoring agEnt (PACE) for mathematics instruction, which simulates students' learning styles based on the Felder and Silverman learning style model and adopts the Socratic teaching method, providing instant feedback to encourage deep thinking. By constructing personalized teaching data and training models, PACE can recognize and adapt to each student's unique needs, significantly improving the overall learning experience and outcomes.

Link: https://arxiv.org/abs/2502.12633
Authors: Ben Liu, Jihan Zhang, Fangquan Lin, Xu Jia, Min Peng
Affiliations: DAMO Academy, Alibaba Group; College of Computer and Cyber Security, Hebei Normal University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have been increasingly employed in various intelligent educational systems, simulating human tutors to facilitate effective human-machine interaction. However, previous studies often overlook the significance of recognizing and adapting to individual learner characteristics. Such adaptation is crucial for enhancing student engagement and learning efficiency, particularly in mathematics instruction, where diverse learning styles require personalized strategies to promote comprehension and enthusiasm. In this paper, we propose a PersonAlized Conversational tutoring agEnt (PACE) for mathematics instruction. PACE simulates students’ learning styles based on the Felder and Silverman learning style model, aligning with each student’s persona. In this way, our PACE can effectively assess the personality of students, allowing to develop individualized teaching strategies that resonate with their unique learning styles. To further enhance students’ comprehension, PACE employs the Socratic teaching method to provide instant feedback and encourage deep thinking. By constructing personalized teaching data and training models, PACE demonstrates the ability to identify and adapt to the unique needs of each student, significantly improving the overall learning experience and outcomes. Moreover, we establish multi-aspect evaluation criteria and conduct extensive analysis to assess the performance of personalized teaching. Experimental results demonstrate the superiority of our model in personalizing the educational experience and motivating students compared to existing methods.

[NLP-92] DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Quick Read: This paper addresses the limitation that music large language models (LLMs) rely only on music and text inputs for music understanding tasks, and explores integrating multimodal information such as images, videos, and textual music features to further improve music understanding. The key to the solution is DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning on multi-way aligned data. In addition, multi-sampled ImageBind embeddings and a pre-alignment Transformer are introduced to strengthen modality fusion and better adapt the model to multi-way instructions.

Link: https://arxiv.org/abs/2502.12623
Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji
Institutions: Sony Group Corporation; Sony AI
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model’s ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-alignment Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets.
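To make the fusion step concrete, here is a minimal PyTorch sketch of a pre-alignment module in the spirit of the one described above: several sampled modality embeddings (e.g., from ImageBind) are mixed by a small Transformer encoder and projected into the text LLM's hidden space. The class name, dimensions, and layer counts are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class PreAlignmentFusion(nn.Module):
    """Hypothetical pre-alignment module: fuses multi-sampled modality
    embeddings before they are fed to a text LLM as pseudo-tokens."""
    def __init__(self, embed_dim=1024, llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(embed_dim, llm_dim)  # map into the LLM's token space

    def forward(self, sampled_embeds):
        # sampled_embeds: (batch, num_samples, embed_dim), one row per
        # embedding sampled from the music/image/video input.
        fused = self.encoder(sampled_embeds)  # let the samples attend to each other
        return self.proj(fused)               # pseudo-tokens for the text LLM

fusion = PreAlignmentFusion()
music_embeds = torch.randn(2, 4, 1024)        # 4 sampled embeddings per input
pseudo_tokens = fusion(music_embeds)
print(pseudo_tokens.shape)                    # torch.Size([2, 4, 4096])
```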

[NLP-93] Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions

Quick Read: This paper addresses the problem that explanations generated with the Chain-of-Thought (CoT) method are susceptible to content biases that reduce their robustness and accuracy. To overcome this limitation, the paper proposes QuaSAR (Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides large language models (LLMs) to reason at a higher level of abstraction via quasi-symbolic explanations. The key of QuaSAR is to formalise only the relevant variables and predicates, allowing symbolic elements to coexist with natural language and thereby separating content from logical reasoning without complete formalisation. This approach improves the accuracy and robustness of CoT-based methods by up to 8% accuracy on both natural language tasks (e.g., MMLU-Redux) and symbolic reasoning tasks (e.g., GSM-Symbolic), and strengthens robustness and consistency under adversarial variations.

Link: https://arxiv.org/abs/2502.12616
Authors: Leonardo Ranaldi, Marco Valentino, Alexander Polonsky, André Freitas
Institutions: Idiap Research Institute; Department of Computer Science, University of Manchester, UK; National Biomarker Centre (NBC), CRUK Manchester Institute, UK; BLOOM Social Analytics (Paris), France
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Chain-of-Thought (CoT) represents a common strategy for reasoning in Large Language Models (LLMs) by decomposing complex tasks into intermediate inference steps. However, explanations generated via CoT are susceptible to content biases that negatively affect their robustness and faithfulness. To mitigate existing limitations, recent work has proposed using logical formalisms coupled with external symbolic solvers. However, fully symbolic approaches possess the bottleneck of requiring a complete translation from natural language to formal languages, a process that affects efficiency and flexibility. To achieve a trade-off, this paper investigates methods to disentangle content from logical reasoning without a complete formalisation. In particular, we present QuaSAR (for Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our framework leverages the capability of LLMs to formalise only relevant variables and predicates, enabling the coexistence of symbolic elements with natural language. We show the impact of QuaSAR for in-context learning and for constructing demonstrations to improve the reasoning capabilities of smaller models. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% accuracy, enhancing robustness and consistency on challenging adversarial variations on both natural language (i.e. MMLU-Redux) and symbolic reasoning tasks (i.e., GSM-Symbolic).
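The abstract does not give the exact prompt wording, but the core idea, formalising only the relevant variables and predicates while keeping the rest in natural language, can be illustrated with a simple prompt builder. The instructions below are a hypothetical reconstruction, not the paper's actual template.

```python
def quasar_style_prompt(question: str) -> str:
    """A minimal sketch of a quasi-symbolic prompt in the spirit of QuaSAR.
    The wording is an assumption for illustration only."""
    return (
        "Solve the problem below. First abstract it:\n"
        "1. Introduce symbols only for the relevant quantities "
        "(e.g., let x = number of apples).\n"
        "2. State the relevant relations as predicates over those symbols, "
        "keeping any remaining context in plain language.\n"
        "3. Reason step by step over the symbols, not the surface story.\n"
        "4. Map the symbolic answer back to the question.\n\n"
        f"Problem: {question}\nAbstraction:"
    )

print(quasar_style_prompt("Ann has 3 apples and buys twice as many. How many now?"))
```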

[NLP-94] Label Drop for Multi-Aspect Relation Modeling in Universal Information Extraction NAACL

Quick Read: This paper addresses the complexity and accuracy problems of multi-relation information extraction. The key to the solution is LDNet, which combines multi-aspect relation modeling with a label drop mechanism. By assigning different relations to different levels of understanding and decision-making, LDNet reduces decision confusion, and the label drop mechanism effectively mitigates the influence of irrelevant relations. Experiments show that LDNet outperforms or matches state-of-the-art systems on 9 tasks and 33 datasets under single-modal and multi-modal, few-shot and zero-shot settings.

Link: https://arxiv.org/abs/2502.12614
Authors: Lu Yang, Jiajia Li, En Ci, Lefei Zhang, Zuchao Li, Ping Wang
Institutions: School of Computer Science, Wuhan University, Wuhan, China; School of Information Management, Wuhan University, Wuhan, China; Key Laboratory of Archival Intelligent Development and Service, NAAC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL-main 2025

Click to view abstract

Abstract:Universal Information Extraction (UIE) has garnered significant attention due to its ability to address model explosion problems effectively. Extractive UIE can achieve strong performance using a relatively small model, making it widely adopted. Extractive UIEs generally rely on task instructions for different tasks, including single-target instructions and multiple-target instructions. Single-target instruction UIE enables the extraction of only one type of relation at a time, limiting its ability to model correlations between relations and thus restricting its capability to extract complex relations. While multiple-target instruction UIE allows for the extraction of multiple relations simultaneously, the inclusion of irrelevant relations introduces decision complexity and impacts extraction accuracy. Therefore, for multi-relation extraction, we propose LDNet, which incorporates multi-aspect relation modeling and a label drop mechanism. By assigning different relations to different levels for understanding and decision-making, we reduce decision confusion. Additionally, the label drop mechanism effectively mitigates the impact of irrelevant relations. Experiments show that LDNet outperforms or achieves competitive performance with state-of-the-art systems on 9 tasks, 33 datasets, in both single-modal and multi-modal, few-shot and zero-shot settings (this https URL).
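As a rough illustration of what a label drop mechanism can look like in a multi-label extraction objective, the sketch below randomly removes irrelevant (negative) relation labels from the loss so they contribute no gradient. This is an assumed form of the technique for illustration; LDNet's actual mechanism may differ in detail.

```python
import torch
import torch.nn.functional as F

def label_drop_bce(logits, targets, drop_prob=0.5):
    """Multi-label BCE loss with label drop: a random subset of negative
    (irrelevant) relation labels is masked out of the objective."""
    keep = torch.ones_like(targets)
    negatives = targets == 0                       # irrelevant relations
    drop = torch.rand_like(targets) < drop_prob
    keep[negatives & drop] = 0.0                   # drop a random subset of them
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)

logits = torch.randn(4, 10)                        # 4 instances, 10 candidate relations
targets = (torch.rand(4, 10) < 0.2).float()        # sparse gold relations
print(label_drop_bce(logits, targets))
```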

[NLP-95] Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Quick Read: This paper aims at accurate detection of text generated by large language models (LLMs), with particular attention to how author characteristics affect detection results. The key to the solution is to use multi-factor analysis of variance (ANOVA) and weighted least squares (WLS) to evaluate how sociolinguistic attributes (gender, CEFR proficiency, academic field, and language environment) affect the accuracy of state-of-the-art AI text detectors, revealing significant biases and offering a methodology for developing fairer and more reliable detection systems.

Link: https://arxiv.org/abs/2502.12611
Authors: Jiatao Li, Xiaojun Wan
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Under Review

Click to view abstract

Abstract:The rise of Large Language Models (LLMs) necessitates accurate AI-generated text detection. However, current approaches largely overlook the influence of author characteristics. We investigate how sociolinguistic attributes-gender, CEFR proficiency, academic field, and language environment-impact state-of-the-art AI text detectors. Using the ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, we conduct a rigorous evaluation employing multi-factor ANOVA and weighted least squares (WLS). Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups. We offer novel empirical evidence, a robust statistical framework, and actionable insights for developing more equitable and reliable detection systems in real-world, out-of-domain contexts. This work paves the way for future research on bias mitigation, inclusive evaluation benchmarks, and socially responsible LLM detectors.
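The statistical machinery named here, multi-factor ANOVA and weighted least squares, is standard and easy to reproduce with statsmodels. The toy data and column names below are placeholders standing in for the paper's per-text detector results, not its actual schema.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Toy stand-in for detector accuracy broken down by author attributes.
df = pd.DataFrame({
    "accuracy": [0.91, 0.84, 0.88, 0.72, 0.95, 0.66, 0.81, 0.77],
    "gender":   ["F", "M", "F", "M", "F", "M", "F", "M"],
    "cefr":     ["B1", "B2", "A2", "B1", "C1", "A2", "B2", "C1"],
})

# Multi-factor ANOVA: do author attributes explain detector accuracy?
model = smf.ols("accuracy ~ C(gender) + C(cefr)", data=df).fit()
print(anova_lm(model, typ=2))

# Weighted least squares, e.g. weighting observations by group size.
weights = [10, 12, 8, 9, 15, 7, 11, 13]
X = sm.add_constant(
    pd.get_dummies(df[["gender", "cefr"]], drop_first=True).astype(float))
wls = sm.WLS(df["accuracy"], X, weights=weights).fit()
print(wls.params)
```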

[NLP-96] COPU: Conformal Prediction for Uncertainty Quantification in Natural Language Generation

Quick Read: This paper addresses uncertainty quantification (UQ) for natural language generation (NLG), in particular the limitations of conformal prediction (CP) methods in this setting. The key is the proposed method, COPU, which explicitly adds the ground truth to the candidate outputs and uses logit scores to measure nonconformity, improving the applicability and accuracy of CP for NLG tasks. Experimental results on four NLG tasks across multiple large language models (LLMs) show that the method better calibrates error rates and empirical coverage rates, providing accurate UQ over a wide range of user-specified error rates.

Link: https://arxiv.org/abs/2502.12601
Authors: Sean Wang, Yicheng Jiang, Yuxin Tang, Lu Cheng, Hanjie Chen
Institutions: Rice University; University of Illinois Chicago
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Uncertainty Quantification (UQ) for Natural Language Generation (NLG) is crucial for assessing the performance of Large Language Models (LLMs), as it reveals confidence in predictions, identifies failure modes, and gauges output reliability. Conformal Prediction (CP), a model-agnostic method that generates prediction sets with a specified error rate, has been adopted for UQ in classification tasks, where the size of the prediction set indicates the model’s uncertainty. However, when adapting CP to NLG, the sampling-based method for generating candidate outputs cannot guarantee the inclusion of the ground truth, limiting its applicability across a wide range of error rates. To address this, we propose COPU, a method that explicitly adds the ground truth to the candidate outputs and uses logit scores to measure nonconformity. Our experiments with six LLMs on four NLG tasks show that COPU outperforms baseline methods in calibrating error rates and empirical coverage rates, offering accurate UQ across a wide range of user-specified error rates.
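For readers unfamiliar with conformal prediction, the following sketch shows the split-conformal recipe the paper builds on: calibrate a nonconformity threshold on held-out data, then keep every candidate whose score falls below it. Using negative logit scores as nonconformity follows the abstract's description, but the concrete numbers and scoring details here are made up for illustration.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the (1 - alpha) quantile of calibration
    nonconformity scores, with the standard finite-sample correction."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0))

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity is below the threshold."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]

# Toy example: nonconformity = negative logit score (higher logit means
# lower nonconformity), an assumed instantiation of the paper's idea.
rng = np.random.default_rng(0)
cal_scores = -rng.normal(2.0, 1.0, size=500)   # scores of ground-truth outputs
threshold = calibrate_threshold(cal_scores, alpha=0.1)
candidates = -np.array([3.1, 0.2, 2.4, -0.5])  # negative logits of 4 candidates
print(prediction_set(candidates, threshold))
```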

[NLP-97] Bring Your Own Knowledge: A Survey of Methods for LLM Knowledge Expansion

Quick Read: This paper addresses how large language models (LLMs) can adapt to new and diverse knowledge to ensure their lasting effectiveness in real-world applications. The key lies in surveying state-of-the-art methods, including continual learning, model editing, and retrieval-based explicit adaptation, for integrating different knowledge types such as factual information, domain expertise, language proficiency, and user preferences. The core goal of these methods is to overcome challenges such as knowledge consistency and scalability, advancing LLMs as adaptable and robust knowledge systems.

Link: https://arxiv.org/abs/2502.12598
Authors: Mingyang Wang, Alisa Stoll, Lukas Lange, Heike Adel, Hinrich Schütze, Jannik Strötgen
Institutions: Bosch Center for Artificial Intelligence (BCAI); LMU Munich; Munich Center for Machine Learning (MCML); Karlsruhe University of Applied Sciences; Hochschule der Medien, Stuttgart
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Adapting large language models (LLMs) to new and diverse knowledge is essential for their lasting effectiveness in real-world applications. This survey provides an overview of state-of-the-art methods for expanding the knowledge of LLMs, focusing on integrating various knowledge types, including factual information, domain expertise, language proficiency, and user preferences. We explore techniques, such as continual learning, model editing, and retrieval-based explicit adaptation, while discussing challenges like knowledge consistency and scalability. Designed as a guide for researchers and practitioners, this survey sheds light on opportunities for advancing LLMs as adaptable and robust knowledge systems.

[NLP-98] PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

Quick Read: This paper addresses the severe capability degradation of large language models after pruning. Existing methods usually ignore the uneven degradation of model capabilities and incur high computational costs, and some instruction data irrelevant to capability recovery may even introduce negative effects. The key to the solution is PASER (Post-training Data Selection method for Efficient pruned large language model Recovery). PASER groups recovery data in the semantic space via manifold learning and spectral clustering to identify capability-specific instruction sets, and adaptively allocates the data budget according to the degree of capability degradation. It further prioritizes samples where model performance has declined dramatically, and detects and filters out potentially conflicting or irrelevant recovery data, thereby effectively recovering the general capabilities of pruned LLMs while using only 4%-20% of the original fine-tuning data.

Link: https://arxiv.org/abs/2502.12594
Authors: Bowei He, Lihao Yin, Hui-Ling Zhen, Xiaokun Zhang, Mingxuan Yuan, Chen Ma
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Model pruning is an effective approach for compressing large language models. However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some instruction data irrelevant to model capability recovery may introduce negative effects. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify instructions where model capabilities are most severely compromised within a certain recovery data budget. Our approach first applies manifold learning and spectral clustering to group recovery data in the semantic space, revealing capability-specific instruction sets. We then adaptively allocate the data budget to different clusters based on the degrees of model capability degradation. In each cluster, we prioritize data samples where model performance has declined dramatically. To mitigate potential negative transfer, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data.
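A simplified reconstruction of the selection loop described above, clustering in a semantic space, allocating the budget by degradation, and ranking within clusters, might look as follows. The embeddings and the per-instruction degradation signal are synthetic stand-ins; how PASER actually measures degradation is described in the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))   # instruction embeddings (toy)
degradation = rng.random(200)             # capability-degradation signal (toy)

labels = SpectralClustering(n_clusters=5, random_state=0).fit_predict(embeddings)

# Allocate a total recovery-data budget across clusters in proportion to
# how badly the corresponding capability degraded.
budget = 40
cluster_deg = np.array([degradation[labels == c].mean() for c in range(5)])
alloc = np.round(budget * cluster_deg / cluster_deg.sum()).astype(int)

# Within each cluster, prioritize the most-degraded samples.
selected = []
for c in range(5):
    idx = np.where(labels == c)[0]
    ranked = idx[np.argsort(-degradation[idx])]
    selected.extend(ranked[: alloc[c]].tolist())
print(len(selected), "instructions selected")
```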

[NLP-99] CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base

Quick Read: This paper addresses hallucinations, especially object hallucinations, in descriptions generated by large vision-language models (LVLMs). Existing detection methods, despite strong performance, rely on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. The key to the proposed solution, CutPaste&Find, is a lightweight, training-free framework for detecting hallucinations in LVLM-generated outputs. At its core is a visual-aid knowledge base that encodes rich entity-attribute relationships and associated image representations, together with a scaling factor that refines similarity scores, improving detection efficiency and cost-effectiveness.

Link: https://arxiv.org/abs/2502.12591
Authors: Cong-Duy Nguyen, Xiaobao Wu, Duc Anh Vu, Shuai Zhao, Thong Nguyen, Anh Tuan Luu
Institutions: Nanyang Technological University, Singapore; National University of Singapore, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination where non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. To address these limitations, we propose CutPaste&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs. Our approach leverages off-the-shelf visual and linguistic modules to perform multi-step verification efficiently without requiring LVLM inference. At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations. We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs. Comprehensive evaluations on benchmark datasets, including POPE and R-Bench, demonstrate that CutPaste&Find achieves competitive hallucination detection performance while being significantly more efficient and cost-effective than previous methods.
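The scaling factor for similarity scores can be pictured with a few lines of NumPy: raw cosine similarities between even ground-truth image-text pairs rarely approach 1.0, so the score is shifted and stretched before thresholding. The constants below are illustrative, not the paper's calibrated values.

```python
import numpy as np

def scaled_similarity(img_vec, txt_vec, scale=2.5, shift=0.1):
    """Rescaled image-text similarity: spread out cosine scores that
    cluster well below 1.0 for matching pairs. `scale` and `shift`
    are assumed constants for illustration."""
    cos = img_vec @ txt_vec / (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec))
    return np.clip((cos - shift) * scale, 0.0, 1.0)

rng = np.random.default_rng(0)
img, txt = rng.normal(size=512), rng.normal(size=512)
print(scaled_similarity(img, txt))   # near 0 for unrelated vectors
print(scaled_similarity(img, img))   # 1.0 for a perfect match
```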

[NLP-100] RSMLP: A light Sampled MLP Structure for Incomplete Utterance Rewrite

Quick Read: This paper addresses the Incomplete Utterance Rewriting (IUR) task, i.e., reconstructing conversational utterances to better fit the current context and thereby improve comprehension. The key to the solution is Rewritten-Sampled MLP (RSMLP), a lightweight method based on a multi-layer perceptron (MLP) architecture with a carefully designed down-sampling strategy. RSMLP effectively extracts latent semantic information between utterances and makes appropriate edits to restore incomplete utterances.

Link: https://arxiv.org/abs/2502.12587
Authors: Lunjun Liu, Weilai Jiang, Yaonan Wang
Institutions: College of Electrical and Information Engineering, Hunan University, Changsha, P.R. China; Greater Bay Area Institute for Innovation, Hunan University, Guangzhou, P.R. China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The Incomplete Utterance Rewriting (IUR) task has garnered significant attention in recent years. Its goal is to reconstruct conversational utterances to better align with the current context, thereby enhancing comprehension. In this paper, we introduce a novel and versatile lightweight method, Rewritten-Sampled MLP (RSMLP). By employing an MLP based architecture with a carefully designed down-sampling strategy, RSMLP effectively extracts latent semantic information between utterances and makes appropriate edits to restore incomplete utterances. Due to its simple yet efficient structure, our method achieves competitive performance on public IUR datasets and in real-world applications.
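A minimal sketch of an MLP block with token down-sampling, in the spirit of RSMLP, is shown below. The pooling choice, sizes, and class name are assumptions, since the abstract does not specify the architecture in detail.

```python
import torch
import torch.nn as nn

class RSMLPBlock(nn.Module):
    """Illustrative MLP block with a down-sampling step: token features are
    pooled with a stride before an MLP mixes them, keeping the model light."""
    def __init__(self, dim=256, stride=2, hidden=512):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                 # x: (batch, seq, dim)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # down-sample tokens
        return self.mlp(x)                                # mix kept-token features

block = RSMLPBlock()
utterances = torch.randn(2, 32, 256)   # encoded dialogue tokens
print(block(utterances).shape)         # torch.Size([2, 16, 256])
```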

[NLP-101] G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation WWW2025

Quick Read: This paper addresses how to effectively extract collaborative filtering (CF) information from user-item interaction graphs and combine it with large language models (LLMs) to produce human-understandable explanations for personalized, explainable recommendation. The key is G-Refer, a framework that uses a hybrid graph retrieval mechanism to extract explicit CF signals from both structural and semantic perspectives, and turns those signals into human-readable text via graph translation, bridging the modality gap between graph structures and natural-language explanations. Knowledge pruning and retrieval-augmented fine-tuning further strengthen the LLM's ability to process and use the extracted CF information to generate high-quality explanations.

Link: https://arxiv.org/abs/2502.12586
Authors: Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
Institutions: The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong; Monash University; Huawei Technologies Co., Ltd.; The Chinese University of Hong Kong; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by WWW 2025, research track

Click to view abstract

Abstract:Explainable recommendation has demonstrated significant advantages in informing users about the logic behind recommendations, thereby increasing system transparency, effectiveness, and trustworthiness. To provide personalized and interpretable explanations, existing works often combine the generation capabilities of large language models (LLMs) with collaborative filtering (CF) information. CF information extracted from the user-item interaction graph captures the user behaviors and preferences, which is crucial for providing informative explanations. However, due to the complexity of graph structure, effectively extracting the CF information from graphs still remains a challenge. Moreover, existing methods often struggle with the integration of extracted CF information with LLMs due to its implicit representation and the modality gap between graph structures and natural language explanations. To address these challenges, we propose G-Refer, a framework using graph retrieval-augmented large language models (LLMs) for explainable recommendation. Specifically, we first employ a hybrid graph retrieval mechanism to retrieve explicit CF signals from both structural and semantic perspectives. The retrieved CF information is explicitly formulated as human-understandable text by the proposed graph translation and accounts for the explanations generated by LLMs. To bridge the modality gap, we introduce knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of LLMs to process and utilize the retrieved CF information to generate explanations. Extensive experiments show that G-Refer achieves superior performance compared with existing methods in both explainability and stability. Codes and data are available at this https URL.

[NLP-102] LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

Quick Read: This paper addresses the limited effectiveness of synthetic data for long-context large language models (LLMs) on long-context reasoning and question answering, caused by faithfulness problems in the data. The key is LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets that integrates ground truth and citation-based reasoning prompts, eliminating distractions and improving the accuracy of reasoning chains while reducing the need for costly verification processes.

Link: https://arxiv.org/abs/2502.12583
Authors: Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo
Institutions: IDEA Research, International Digital Economy Academy; Artificial Intelligence Thrust, Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.

[NLP-103] A Fuzzy Evaluation of Sentence Encoders on Grooming Risk Classification

Quick Read: This paper addresses the problem that predators evade existing models by using indirect and coded language in online chat settings. The key is a fuzzy-theoretic framework for evaluating grooming-risk classification in chat contexts across different participant groups (law enforcement officers, real victims, and decoys). The analysis reveals the shortcomings of fine-tuned models in recognizing coded language, in particular the higher presence of out-of-vocabulary (OOV) words that such language brings and the misclassifications it causes.

Link: https://arxiv.org/abs/2502.12576
Authors: Geetanjali Bihani, Julia Rayz
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures. Accepted for publication in the Proceedings of the NAFIPS International Conference on Fuzzy Systems, Soft Computing, and Explainable AI. NAFIPS'2024

Click to view abstract

Abstract:With the advent of social media, children are becoming increasingly vulnerable to the risk of grooming in online settings. Detecting grooming instances in an online conversation poses a significant challenge as the interactions are not necessarily sexually explicit, since the predators take time to build trust and a relationship with their victim. Moreover, predators evade detection using indirect and coded language. While previous studies have fine-tuned Transformers to automatically identify grooming in chat conversations, they overlook the impact of coded and indirect language on model predictions, and how these align with human perceptions of grooming. In this paper, we address this gap and evaluate bi-encoders on the task of classifying different degrees of grooming risk in chat contexts, for three different participant groups, i.e. law enforcement officers, real victims, and decoys. Using a fuzzy-theoretic framework, we map human assessments of grooming behaviors to estimate the actual degree of grooming risk. Our analysis reveals that fine-tuned models fail to tag instances where the predator uses indirect speech pathways and coded language to evade detection. Further, we find that such instances are characterized by a higher presence of out-of-vocabulary (OOV) words in samples, causing the model to misclassify. Our findings highlight the need for more robust models to identify coded language from noisy chat inputs in grooming contexts.
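The fuzzy-theoretic mapping from human assessments to a degree of grooming risk can be illustrated with standard triangular membership functions. The fuzzy sets, breakpoints, and aggregation rule below are illustrative assumptions, not the study's calibrated choices.

```python
import numpy as np

def triangular(x, a, b, c):
    """Standard triangular membership function on [a, c], peaking at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def risk_memberships(score):
    """Illustrative fuzzy sets over a 0-1 grooming-risk scale."""
    return {
        "low":      triangular(score, -0.01, 0.0, 0.4),
        "moderate": triangular(score, 0.2, 0.5, 0.8),
        "high":     triangular(score, 0.6, 1.0, 1.01),
    }

# Map several human assessments of one conversation to a fuzzy risk
# profile, then aggregate by averaging memberships across raters.
ratings = np.array([0.55, 0.7, 0.65])
profiles = [risk_memberships(r) for r in ratings]
avg = {k: float(np.mean([p[k] for p in profiles])) for k in profiles[0]}
print(avg)
```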

[NLP-104] A Cognitive Writing Perspective for Constrained Long-Form Text Generation

Quick Read: This paper addresses the difficulty large language models (LLMs) have in producing high-quality long-form text that satisfies strict requirements in a single pass. The key to the solution is CogWriter, a novel framework that mimics the human cognitive writing process and transforms constrained long-form text generation with LLMs into a systematic cognitive writing paradigm. CogWriter contains two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains output quality through continuous monitoring and reviewing mechanisms that evaluate outputs against the specified requirements and trigger necessary revisions.

Link: https://arxiv.org/abs/2502.12568
Authors: Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, Xiuying Chen
Institutions: Mohamed bin Zayed University of Artificial Intelligence; University of Chinese Academy of Sciences; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements: CogWriter (this https URL).

[NLP-105] Self Iterative Label Refinement via Robust Unlabeled Learning

Quick Read: This paper addresses the performance degradation of large language models (LLMs) when high-quality feedback is lacking. The key is an iterative refinement pipeline that employs an unlabeled-unlabeled learning framework to improve LLM-generated pseudo-labels: by exploiting two unlabeled datasets with different positive-class ratios, the method iteratively denoises and refines the initial pseudo-labels, mitigating the adverse effects of internal biases with minimal human supervision.

Link: https://arxiv.org/abs/2502.12565
Authors: Hikaru Asano, Tadashi Kozuno, Yukino Baba
Institutions: The University of Tokyo; RIKEN AIP; OMRON SINIC X
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 8 figures

Click to view abstract

Abstract:Recent advances in large language models (LLMs) have yielded impressive performance on various tasks, yet they often depend on high-quality feedback that can be costly. Self-refinement methods attempt to leverage LLMs’ internal evaluation mechanisms with minimal human supervision; however, these approaches frequently suffer from inherent biases and overconfidence, especially in domains where the models lack sufficient internal knowledge, resulting in performance degradation. As an initial step toward enhancing self-refinement for broader applications, we introduce an iterative refinement pipeline that employs the Unlabeled-Unlabeled learning framework to improve LLM-generated pseudo-labels for classification tasks. By exploiting two unlabeled datasets with differing positive class ratios, our approach iteratively denoises and refines the initial pseudo-labels, thereby mitigating the adverse effects of internal biases with minimal human supervision. Evaluations on diverse datasets, including low-resource language corpora, patent classifications, and protein structure categorizations, demonstrate that our method consistently outperforms both initial LLM’s classification performance and the self-refinement approaches by cutting-edge models (e.g., GPT-4o and DeepSeek-R1).

[NLP-106] Evaluating Language Models on Grooming Risk Estimation Using Fuzzy Theory

Quick Read: This paper examines the challenges language models face in detecting online child grooming, especially the encoding of implicit language. While Transformer-based language models such as SBERT show potential for preemptive grooming detection, they depend mainly on surface-level features and approximate real grooming processes with vigilante and law-enforcement conversations. The key contribution is to study whether SBERT can effectively discern the varying degrees of grooming risk inherent in conversations, and to evaluate its results across different participant groups. The study finds that although fine-tuning helps language models learn to assign grooming scores, their predictions show high variance, especially in contexts with higher degrees of grooming risk. These errors occur in cases that use indirect speech pathways to manipulate victims and that lack sexually explicit content, underscoring the need for language models to robustly model indirect speech acts, particularly those employed by predators.

Link: https://arxiv.org/abs/2502.12563
Authors: Geetanjali Bihani, Tatiana Ringenberg, Julia Rayz
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures. Accepted for publication in the Proceedings of the NAFIPS International Conference on Fuzzy Systems, Soft Computing, and Explainable AI. NAFIPS'2024

Click to view abstract

Abstract:Encoding implicit language presents a challenge for language models, especially in high-risk domains where maintaining high precision is important. Automated detection of online child grooming is one such critical domain, where predators manipulate victims using a combination of explicit and implicit language to convey harmful intentions. While recent studies have shown the potential of Transformer language models like SBERT for preemptive grooming detection, they primarily depend on surface-level features and approximate real victim grooming processes using vigilante and law enforcement conversations. The question of whether these features and approximations are reasonable has not been addressed thus far. In this paper, we address this gap and study whether SBERT can effectively discern varying degrees of grooming risk inherent in conversations, and evaluate its results across different participant groups. Our analysis reveals that while fine-tuning aids language models in learning to assign grooming scores, they show high variance in predictions, especially for contexts containing higher degrees of grooming risk. These errors appear in cases that 1) utilize indirect speech pathways to manipulate victims and 2) lack sexually explicit content. This finding underscores the necessity for robust modeling of indirect speech acts by language models, particularly those employed by predators.

[NLP-107] SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings

Quick Read: This paper addresses the safety risks that multimodal large language models (MLLMs) face from data of additional modalities. Existing low-resource safety alignment methods, including text-only ones, struggle with the safety threats posed by these modalities. The key is Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of the additional modality via gradient updates to expand textual datasets, enabling multimodal safety alignment training even when only textual data is available. Experiments show that SEA can synthesize a high-quality embedding on a single RTX3090 GPU within 24 seconds and significantly improves MLLM safety against threats from additional modalities such as video and audio.

Link: https://arxiv.org/abs/2502.12562
Authors: Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng
Institutions: Shien-Ming Wu School of Intelligent Engineering, South China University of Technology; School of Cyber Science and Technology, Beihang University; School of Future Technology, South China University of Technology; Pazhou Laboratory
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) have serious security risks. While safety alignment using multimodal datasets consisting of text and data of additional modalities can effectively enhance MLLM’s security, it is costly to construct these datasets. Existing low-resource security alignment methods, including textual alignment, have been found to struggle with the security risks posed by additional modalities. To address this, we propose Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of additional modality through gradient updates to expand textual datasets. This enables multimodal safety alignment training even when only textual data is available. Extensive experiments on image, video, and audio-based MLLMs demonstrate that SEA can synthesize a high-quality embedding on a single RTX3090 GPU within 24 seconds. SEA significantly improves the security of MLLMs when faced with threats from additional modalities. To assess the security risks introduced by video and audio, we also introduced a new benchmark called VA-SafetyBench. High attack success rates across multiple MLLMs validate its challenge. Our code and data will be available at this https URL.
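The core trick, optimizing a synthetic modality embedding by gradient descent against a frozen model, is easy to demonstrate in miniature. The toy network below is an illustrative stand-in for the MLLM, and the target is an arbitrary vector; only the embedding receives gradients.

```python
import torch
import torch.nn as nn

# Toy frozen "model head" standing in for the real MLLM; everything here
# is an illustrative reconstruction of the optimization pattern, not SEA.
torch.manual_seed(0)
frozen_head = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 8))
for p in frozen_head.parameters():
    p.requires_grad_(False)

target = torch.randn(8)                      # desired model response
embed = torch.zeros(64, requires_grad=True)  # synthetic modality embedding
opt = torch.optim.Adam([embed], lr=0.05)

for step in range(200):                      # optimize the embedding only
    opt.zero_grad()
    loss = nn.functional.mse_loss(frozen_head(embed), target)
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")      # embedding now encodes the target
```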

[NLP-108] UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design

Quick Read: This paper addresses the inflexibility of traditional usability testing in iterating on study designs and the difficulty of recruiting participants. The key is UXAgent, a system featuring an LLM-simulated agent (LLM-Agent) module and a universal browser connector module that can automatically generate thousands of simulated users to test a target website. With qualitative results (e.g., interviews of the agents' reasoning), quantitative results (e.g., number of actions), and video recordings, UX researchers can effectively evaluate and iterate on their usability testing study designs.

Link: https://arxiv.org/abs/2502.12561
Authors: Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Laurence Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Usability testing is a fundamental yet challenging (e.g., inflexible for iterating on study design flaws and hard to recruit study participants) research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM-Agent) research inspired us to design UXAgent to support UX researchers in evaluating and iterating on their usability testing study design before they conduct the real human subject study. Our system features an LLM-Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The results are shown in qualitative (e.g., interviewing how an agent thinks), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent-assisted UX study.

[NLP-109] How does a Language-Specific Tokenizer affect LLMs?

Quick Read: This paper investigates the role of language-specific tokenizers in large language models trained predominantly on English text, through a case study on Korean. The key is the development of a Korean-specific extended tokenizer and experiments comparing models with the basic tokenizer and the extended tokenizer across various Next Token Prediction tasks. The analysis shows that the extended tokenizer decreases confidence in incorrect predictions and reduces cross-entropy on complex tasks, producing fewer nonsensical outputs and providing stability during generation, potentially improving downstream task performance.

Link: https://arxiv.org/abs/2502.12560
Authors: Jean Seo, Jaeyoon Kim, SungJoo Byun, Hyopil Shin
Institutions: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
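Extending a tokenizer with language-specific tokens and resizing the embedding matrix to match is a standard Hugging Face workflow, sketched below. The base checkpoint and token list are placeholders; the paper's Korean-specific tokenizer is built differently in detail, and the new embedding rows still need continued pre-training before they help.

```python
# A generic sketch of tokenizer extension, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

korean_tokens = ["안녕하세요", "감사합니다", "학교"]       # placeholder vocabulary
num_added = tokenizer.add_tokens(korean_tokens)
print(f"added {num_added} tokens")

# New rows are initialized randomly; they must be trained on Korean text
# (continued pre-training) before the extension improves generation.
model.resize_token_embeddings(len(tokenizer))
ids = tokenizer("안녕하세요, 학교에 갑니다.")["input_ids"]
print(ids)
```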

[NLP-110] Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

Quick Read: This paper addresses how the diverse agents that increasingly share environments with humans, powered by reinforcement learning (RL), large language models (LLMs), and beyond, can explain their decision policies in natural language to enable reliable coexistence. The key is a model-agnostic explanation generator built on an LLM, with the rewards for training it produced by a generative flow matching model. This model has a specially designed structure with a hidden layer merged with the LLM to harness the linguistic cues of explanations for generating appropriate rewards, yielding dense and effective rewards, reducing the need for expensive human feedback, and thereby enabling effective explanations and even improving decision accuracy on the original tasks.

Link: https://arxiv.org/abs/2502.12530
Authors: Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain their policies in natural language will be vital for reliable coexistence. In this paper, we build a model-agnostic explanation generator based on an LLM. The technical novelty is that the rewards for training this LLM are generated by a generative flow matching model. This model has a specially designed structure with a hidden layer merged with an LLM to harness the linguistic cues of explanations into generating appropriate rewards. Experiments on both RL and LLM tasks demonstrate that our method can generate dense and effective rewards while saving on expensive human feedback; it thus enables effective explanations and even improves the accuracy of the decisions in original tasks.

[NLP-111] Can LLMs Extract Frame-Semantic Arguments?

Quick Read: This paper evaluates large language models (LLMs) on frame-semantic argument identification and explores the impact of input representation formats, model architectures, and generalization to unseen and out-of-domain samples. The key findings are that JSON-based representations significantly improve performance and that, with fine-tuning, smaller models can achieve competitive results. The paper also introduces a novel frame identification approach that leverages predicted frame elements, achieving state-of-the-art performance on ambiguous targets. Despite strong generalization, LLMs still struggle with out-of-domain data.

Link: https://arxiv.org/abs/2502.12516
Authors: Jacob Devasier, Rishabh Mediratta, Chengkai Li
Institutions: University of Texas at Arlington
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Frame-semantic parsing is a critical task in natural language understanding, yet the ability of large language models (LLMs) to extract frame-semantic arguments remains underexplored. This paper presents a comprehensive evaluation of LLMs on frame-semantic argument identification, analyzing the impact of input representation formats, model architectures, and generalization to unseen and out-of-domain samples. Our experiments, spanning models from 0.5B to 78B parameters, reveal that JSON-based representations significantly enhance performance, and while larger models generally perform better, smaller models can achieve competitive results through fine-tuning. We also introduce a novel approach to frame identification leveraging predicted frame elements, achieving state-of-the-art performance on ambiguous targets. Despite strong generalization capabilities, our analysis finds that LLMs still struggle with out-of-domain data.

[NLP-112] Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review

Quick Read: This paper evaluates the robustness of large language models (LLMs) in automated peer review. The key is an aspect-guided, multi-level perturbation framework that applies targeted perturbations to three key components of the review process (papers, reviews, and rebuttals) across quality aspects such as contribution, soundness, and completeness, exposing the biases that can arise when LLMs act as reviewers and meta-reviewers. The framework not only diagnoses potential vulnerabilities but also shows that these biases persist under various chain-of-thought prompting strategies, underscoring that current LLMs lack reliable and robust critical evaluation.

Link: https://arxiv.org/abs/2502.12510
Authors: Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan
Institutions: Wangxuan Institute of Computer Technology, Peking University; Information Management Department, Peking University; Renmin University of China
Subjects: Computation and Language (cs.CL)
Comments: Under Review

Click to view abstract

Abstract:We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process-papers, reviews, and rebuttals-across several quality aspects, including contribution, soundness, presentation, tone, and completeness. By applying targeted perturbations and examining their effects on both LLM-as-Reviewer and LLM-as-Meta-Reviewer, we investigate how aspect-based manipulations, such as omitting methodological details from papers or altering reviewer conclusions, can introduce significant biases in the review process. We identify several potential vulnerabilities: review conclusions that recommend a strong reject may significantly influence meta-reviews, negative or misleading reviews may be wrongly interpreted as thorough, and incomplete or hostile rebuttals can unexpectedly lead to higher acceptance rates. Statistical tests show that these biases persist under various Chain-of-Thought prompting strategies, highlighting the lack of robust critical evaluation in current LLMs. Our framework offers a practical methodology for diagnosing these vulnerabilities, thereby contributing to the development of more reliable and robust automated reviewing systems.

[NLP-113] LegalCore: A Dataset for Legal Documents Event Coreference Resolution

Quick Read: This paper addresses event detection and event coreference resolution in the legal domain. Existing research focuses mostly on news articles; this paper presents LegalCore, the first legal-domain dataset annotated with comprehensive event and event coreference information. The key contribution is the construction of this dataset of legal contract documents tens of thousands of tokens long and the benchmarking of mainstream large language models (LLMs) on event detection and event coreference resolution, finding that these models face significant challenges and perform far below a supervised baseline.

Link: https://arxiv.org/abs/2502.12509
Authors: Kangda Wei, Xi Shi, Jonathan Tong, Sai Ramana Reddy, Anandhavelu Natarajan, Rajiv Jain, Aparna Garimella, Ruihong Huang
Institutions: Texas A&M University; Adobe Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.

[NLP-114] Efficient OpAmp Adaptation for Zoom Attention to Golden Contexts

Quick Read: This paper addresses the performance degradation of large language models (LLMs) in question answering (QA) caused by noisy reference documents, particularly in retrieval-augmented generation (RAG) scenarios and long-context applications. Despite fine-tuning, Transformer-based architectures struggle to prioritize relevant content, tending to allocate excessive attention to irrelevant or later-positioned documents. The key is an improvement inspired by the operational amplifier (OpAmp): the OpAmp adaptation, implemented efficiently with adapters. Integrated into pre-trained Transformer blocks, it enhances focus on the golden context without costly training from scratch. Evaluations on noisy-context benchmarks show that the Qwen2.5-OpAmp-72B model trained with the OpAmp adaptation surpasses state-of-the-art LLMs, including DeepSeek-V3 and GPT-4o.

Link: https://arxiv.org/abs/2502.12502
Authors: Haoyuan Wu, Rui Ming, Haisheng Zheng, Zhuolun He, Bei Yu
Institutions: The Chinese University of Hong Kong; Shanghai Artificial Intelligent Laboratory; ChatEDA Tech
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have shown significant promise in question-answering (QA) tasks, particularly in retrieval-augmented generation (RAG) scenarios and long-context applications. However, their performance is hindered by noisy reference documents, which often distract from essential information. Despite fine-tuning efforts, Transformer-based architectures struggle to prioritize relevant content. This is evidenced by their tendency to allocate disproportionate attention to irrelevant or later-positioned documents. Recent work proposes the differential attention mechanism to address this issue, but this mechanism is limited by an unsuitable common-mode rejection ratio (CMRR) and high computational costs. Inspired by the operational amplifier (OpAmp), we propose the OpAmp adaptation to address these challenges, which is implemented with adapters efficiently. By integrating the adapter into pre-trained Transformer blocks, our approach enhances focus on the golden context without costly training from scratch. Empirical evaluations on noisy-context benchmarks reveal that our Qwen2.5-OpAmp-72B model, trained with our OpAmp adaptation, surpasses the performance of state-of-the-art LLMs, including DeepSeek-V3 and GPT-4o.

[NLP-115] Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge

Quick Read: This paper addresses the limited reliability of LLM-as-a-Judge when generating chain-of-thought (CoT) judgments. Existing methods rely mainly on majority voting or criteria expansion, which cannot overcome CoT reasoning's failure to capture comprehensive and deep details. The key is Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, exposing deeper and more comprehensive details within them. This effectively guides LLM-as-a-Judge to produce more detailed CoT judgments, improving evaluation reliability with an average accuracy gain of 6.7% across five benchmarks. The approach also yields higher-quality CoTs that facilitate judge distillation and performs well in rejection sampling for supervised fine-tuning (SFT), termed crowd rejection sampling, enabling more efficient SFT.

Link: https://arxiv.org/abs/2502.12501
Authors: Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
Institutions: City University of Hong Kong; Huawei Noah's Ark Lab; The Hong Kong University of Science and Technology (Guangzhou); McGill University & MILA
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the CoT reasoning’s inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which is insufficient to address the limitation in CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs generated by our method are more comprehensive and of higher quality, and evaluation accuracy improves as inference scales.

[NLP-116] UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation ICSE2025

Quick Read: This paper addresses the limitation of existing deep-learning code generation methods that are confined to a single paradigm, either Sequence-to-Sequence or Sequence-to-Tree. The key is UniGenCoder, which unifies the two paradigms with a shared encoder, a shared decoder with a minimal set of additional parameters, and a selector that dynamically chooses the optimal paradigm for each instance. During training, multi-task learning and distillation strategies first facilitate knowledge transfer between the two paradigms, and contrastive learning is then used to train the selector. Experiments on text-to-code and code-to-code generation tasks verify the effectiveness of the proposed model.

Link: https://arxiv.org/abs/2502.12490
Authors: Liangying Shao, Yanfu Yan, Denys Poshyvanyk, Jinsong Su
Institutions: School of Informatics, Xiamen University, China; William & Mary, Virginia, USA
Subjects: Computation and Language (cs.CL)
Comments: ICSE2025 NIER track

Click to view abstract

Abstract:Deep learning-based code generation has completely transformed the way developers write programs today. Existing approaches to code generation have focused either on the Sequence-to-Sequence paradigm, which generates target code as a sequence of tokens, or the Sequence-to-Tree paradigm, which outputs code as a sequence of actions. While these two paradigms are intuitively complementary, their combination has not been previously explored. By comparing the code generated under these two paradigms, we find that integrating them holds significant potential. In this paper, we propose UniGenCoder for code-related generation tasks, which consists of a shared encoder, a shared decoder with a minimal set of additional parameters to unify two paradigms, and a selector that dynamically chooses optimal paradigm for each instance. Also, during the model training, we first perform the multi-task learning and distillation strategies to facilitate knowledge transfer between two paradigms, and then leverage contrastive learning to train the selector. Experimental results on the text-to-code and code-to-code generation tasks demonstrate the effectiveness of our proposed model. We release our code at this https URL.

[NLP-117] EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

Quick Read: This paper addresses the insufficient adaptability, scalability, and strategy transferability of strategic reasoning in complex real-world scenarios. The key is explicit policy optimization (EPO) for strategic reasoning, which uses a large language model (LLM) to provide strategies in an open-ended action space, pluggable into arbitrary LLM agents to motivate goal-directed behavior, and trains the strategic reasoning model with multi-turn reinforcement learning (RL) to strengthen long-term goal alignment and strategic reasoning, without supervised fine-tuning (SFT) as a preliminary step.

Link: https://arxiv.org/abs/2502.12486
Authors: Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang
Institutions: University of Chinese Academy of Sciences; Tongyi Lab; Institute of Automation, Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures

Click to view abstract

Abstract:Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning-an ability to navigate dynamic environments and align long-term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL) using process rewards and iterative self-play, without supervised fine-tuning (SFT) as a preliminary step. Experiments across social and physical domains demonstrate EPO’s ability of long-term goal alignment through enhanced strategic reasoning, achieving state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications.

[NLP-118] Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages – A Singlish Case Study

Quick Read: This paper addresses safety alignment of large language models (LLMs) in low-resource English language settings, specifically Singlish, an English creole particular to Singapore. The key is supervised fine-tuning (SFT) combined with an improved Kahneman-Tversky Optimization (KTO), in particular a new KTO variant (KTO-S) that improves training stability. These methods are more sample-efficient than Direct Preference Optimization (DPO), substantially reduce toxicity on a Singlish benchmark, and maintain strong performance on standard LLM benchmarks.

Link: https://arxiv.org/abs/2502.12485
Authors: Isaac Lim, Shaun Khoo, Watson Chua, Goh Jiayi, Jessica Foo
Institutions: GovTech Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:To ensure safe usage, Large Language Models (LLMs) typically undergo alignment with human-defined values. However, this alignment often relies on primarily English data and is biased towards Western-centric values, limiting its effectiveness in low-resource language settings. In this paper, we describe our approach for aligning SEA-Lion-v2.1-Instruct (a Llama3-8B variant) to minimize toxicity in Singlish, an English creole specific to Singapore. We find that supervised fine-tuning and Kahneman-Tversky Optimization (KTO) on paired and unpaired preferences is more sample efficient and yields significantly better results than Direct Preference Optimization (DPO). Our analysis reveals that DPO implicitly enforces a weaker safety objective than KTO, and that SFT complements KTO by improving training stability. Finally, we introduce a simple but novel modification to KTO, KTO-S, which improves training stability through better gradient exploitation. Overall, we present a general approach for safety alignment conducive to low-resource English languages, successfully reducing toxicity by 99% on our Singlish benchmark, with gains generalizing to the broader TOXIGEN dataset while maintaining strong performance across standard LLM benchmarks.

[NLP-119] he Knowledge Microscope: Features as Better Analytical Lenses than Neurons

Quick Read: This paper addresses the limited knowledge expression and poor interpretability of neurons as units for analyzing factual knowledge in language models. The key is to use Sparse Autoencoders (SAE) to decompose neurons into features, which serve as alternative analytical units. Features exhibit stronger knowledge expression, better interpretability, and greater monosemanticity, and, via the proposed FeatureEdit method, also provide better privacy protection than neurons.

Link: https://arxiv.org/abs/2502.12483
Authors: Yuheng Chen, Pengfei Cao, Kang Liu, Jun Zhao
Institutions: The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Subjects: Computation and Language (cs.CL)
Comments: ARR February, under review

Click to view abstract

Abstract:Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs. Code and dataset will be available.
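A minimal sparse autoencoder of the kind used to decompose MLP activations into features looks like the sketch below; the dictionary size, sparsity weight, and activation choice are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete dictionary trained to reconstruct
    activations under an L1 sparsity penalty on feature activations."""
    def __init__(self, d_model=512, d_features=4096, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)
        self.l1 = l1

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse feature activations
        recon = self.dec(f)
        loss = ((recon - x) ** 2).mean() + self.l1 * f.abs().mean()
        return f, recon, loss

sae = SparseAutoencoder()
acts = torch.randn(16, 512)                  # MLP activations from an LM layer
features, recon, loss = sae(acts)
loss.backward()                              # train over many such activations
print(features.shape)                        # torch.Size([16, 4096])
```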

[NLP-120] MSE-Adapter: A Lightweight Plugin Endowing LLMs with the Capability to Perform Multimodal Sentiment Analysis and Emotion Recognition

Quick Read: This paper addresses two main limitations of pre-trained language model based methods for multimodal sentiment analysis (MSA) and emotion recognition in conversations (ERC): 1) once trained for MSA and ERC tasks, these models lose their original generalized capabilities; and 2) they demand considerable computational resources. The key is the Multimodal Sentiment analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. Its Text-Guide-Mixer (TGM) module establishes explicit connections between non-textual and textual modalities through the Hadamard product, allowing non-textual modalities to align better with textual modalities at the feature level and promoting the generation of higher-quality pseudo tokens, while introducing only about 2.6M to 2.8M trainable parameters. Experiments on four public English and Chinese datasets confirm the effectiveness of the plugin.

Link: https://arxiv.org/abs/2502.12478
Authors: Yang Yang, Xunde Dong, Yupeng Qiang
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Current Multimodal Sentiment Analysis (MSA) and Emotion Recognition in Conversations (ERC) methods based on pre-trained language models exhibit two primary limitations: 1) Once trained for MSA and ERC tasks, these pre-trained language models lose their original generalized capabilities. 2) They demand considerable computational resources. As the size of pre-trained language models continues to grow, training larger multimodal sentiment analysis models using previous approaches could result in unnecessary computational cost. In response to this challenge, we propose the Multimodal Sentiment analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. This plugin enables a large language model (LLM) to carry out MSA or ERC tasks with minimal computational overhead (only introducing approximately 2.6M to 2.8M trainable parameters on the 6/7B models), while preserving the intrinsic capabilities of the LLM. In the MSE-Adapter, the Text-Guide-Mixer (TGM) module is introduced to establish explicit connections between non-textual and textual modalities through the Hadamard product. This allows non-textual modalities to better align with textual modalities at the feature level, promoting the generation of higher-quality pseudo tokens. Extensive experiments were conducted on four public English and Chinese datasets using consumer-grade GPUs and open-source LLMs (Qwen-1.8B, ChatGLM3-6B-base, and LLaMA2-7B) as the backbone. The results demonstrate the effectiveness of the proposed plugin. The code will be released on GitHub after a blind review.
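The Text-Guide-Mixer's use of the Hadamard product can be sketched in a few lines of PyTorch. The projection layer and dimensions are assumptions; only the element-wise coupling between modalities is taken from the abstract.

```python
import torch
import torch.nn as nn

class TextGuideMixer(nn.Module):
    """TGM-style sketch: project a non-textual feature into the text
    space, then bind it to the text feature with a Hadamard product."""
    def __init__(self, nontext_dim=768, text_dim=4096):
        super().__init__()
        self.proj = nn.Linear(nontext_dim, text_dim)

    def forward(self, nontext_feat, text_feat):
        aligned = self.proj(nontext_feat)   # move into the text feature space
        return aligned * text_feat          # explicit element-wise coupling

tgm = TextGuideMixer()
audio = torch.randn(2, 8, 768)              # non-textual features
text = torch.randn(2, 8, 4096)              # text features from the LLM
pseudo_tokens = tgm(audio, text)
print(pseudo_tokens.shape)                  # torch.Size([2, 8, 4096])
```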

[NLP-121] Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning

Quick Read: This paper addresses the challenge of automatically generating meaningful questions to assess and enhance human learning. Although large language models (LLMs) excel at summarization and query responses, their ability to generate meaningful questions for learners is underexplored. The key is Savaal, a scalable question-generation system that uses a three-stage processing pipeline instead of feeding large documents directly to an LLM as context, enabling better tests of deep, conceptual understanding and domain-independent question generation. Expert blind evaluation shows that, compared with a direct-prompting LLM baseline, Savaal generates questions that better test depth of understanding by 6.5X for dissertations and 1.5X for papers.

Link: https://arxiv.org/abs/2502.12477
Authors: Kimia Noorbakhsh, Joseph Chandler, Pantea Karimi, Mohammad Alizadeh, Hari Balakrishnan
Institutions: M.I.T. Computer Science and Artificial Intelligence Lab (CSAIL)
Subjects: Computation and Language (cs.CL)
Comments: Kimia Noorbakhsh, Joseph Chandler, and Pantea Karimi contributed equally to the work

Click to view abstract

Abstract:Assessing and enhancing human learning through question-answering is vital, yet automating this process remains challenging. While large language models (LLMs) excel at summarization and query responses, their ability to generate meaningful questions for learners is underexplored. We propose Savaal, a scalable question-generation system with three objectives: (i) scalability, enabling question generation from hundreds of pages of text, (ii) depth of understanding, producing questions beyond factual recall to test conceptual reasoning, and (iii) domain-independence, automatically generating questions across diverse knowledge areas. Instead of providing an LLM with large documents as context, Savaal improves results with a three-stage processing pipeline. Our evaluation with 76 human experts on 71 papers and PhD dissertations shows that Savaal generates questions that better test depth of understanding by 6.5X for dissertations and 1.5X for papers compared to a direct-prompting LLM baseline. Notably, as document length increases, Savaal’s advantages in higher question quality and lower cost become more pronounced.
zh

[NLP-122] CoCo-CoLa: Evaluating Language Adherence in Multilingual LLM s

【速读】: 该论文旨在解决多语言大型语言模型(Multilingual Large Language Models, LLMs)在生成响应时倾向于高资源语言(如英语)而忽视目标语言的问题。论文的关键解决方案是引入了一种名为CoCo-CoLa(正确概念-正确语言)的新评估指标,并提出了一种选择性微调关键层的部分训练策略,以提高语言一致性,同时显著降低计算成本。这种方法在低资源语言上尤其有效,能够实现与全量微调相当或更优的性能。

链接: https://arxiv.org/abs/2502.12476
作者: Elnaz Rahmati,Alireza S. Ziabari,Morteza Dehghani
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Multilingual Large Language Models (LLMs) develop cross-lingual abilities despite being trained on limited parallel data. However, they often struggle to generate responses in the intended language, favoring high-resource languages such as English. In this work, we introduce CoCo-CoLa (Correct Concept - Correct Language), a novel metric to evaluate language adherence in multilingual LLMs. Using fine-tuning experiments on a closed-book QA task across seven languages, we analyze how training in one language affects others’ performance. Our findings reveal that multilingual models share task knowledge across languages but exhibit biases in the selection of output language. We identify language-specific layers, showing that final layers play a crucial role in determining output language. Accordingly, we propose a partial training strategy that selectively fine-tunes key layers, improving language adherence while significantly reducing computational cost. Our method achieves comparable or superior performance to full fine-tuning, particularly for low-resource languages, offering a more efficient multilingual adaptation.
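论文提出的部分训练策略只微调对输出语言起关键作用的末端层。下面是一段示意代码,演示“冻结除最后 k 个 Transformer 层以外的全部参数”这一做法;其中假设参数名遵循 LLaMA 风格的 model.layers.{i} 命名,并非论文原实现:

```python
import re

def freeze_all_but_last_k(model, num_layers: int, k: int) -> None:
    """仅解冻最后 k 个 Transformer 层,其余参数(含嵌入层等)全部冻结。"""
    keep = {str(i) for i in range(num_layers - k, num_layers)}
    for name, param in model.named_parameters():
        m = re.search(r"layers\.(\d+)\.", name)
        param.requires_grad = bool(m and m.group(1) in keep)
```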
zh

[NLP-123] Reasoning on a Spectrum: Aligning LLM s to System 1 and System 2 Thinking

【速读】: 该论文旨在解决大型语言模型(LLMs)在推理任务中的刚性问题,即它们依赖于结构化的逐步处理方式,缺乏人类认知中从直观启发式(系统1)到分析演绎(系统2)灵活切换的能力。为了解决这一问题,论文的关键方案是创建了一个包含2,000个样本的数据集,并明确将LLMs与这两种推理风格对齐,从而评估其在不同推理基准上的表现。研究结果表明,这种对齐方法揭示了准确性与效率之间的权衡:与系统2对齐的模型在算术和符号推理方面表现出色,而与系统1对齐的模型则在常识任务中表现更优。此外,模型响应的机制分析显示,系统1模型给出的答案更为确定,而系统2模型则表现出更高的不确定性。通过在这两种极端之间进行插值,可以实现推理准确性的单调过渡,同时保持连贯性。这项工作挑战了逐步推理始终最优的假设,并强调了根据任务需求调整推理策略的必要性。

链接: https://arxiv.org/abs/2502.12470
作者: Alireza S. Ziabari,Nona Ghazizadeh,Zhivar Sourati,Farzan Karimi-Malekabadi,Payam Piray,Morteza Dehghani
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step processing reveals a critical limitation. While human cognition fluidly adapts between intuitive, heuristic (System 1) and analytical, deliberative (System 2) reasoning depending on the context, LLMs lack this dynamic flexibility. This rigidity can lead to brittle and unreliable performance when faced with tasks that deviate from their trained patterns. To address this, we create a dataset of 2,000 samples with valid System 1 and System 2 answers, explicitly align LLMs with these reasoning styles, and evaluate their performance across reasoning benchmarks. Our results reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense tasks. A mechanistic analysis of model responses shows that System 1 models employ more definitive answers, whereas System 2 models demonstrate greater uncertainty. Interpolating between these extremes produces a monotonic transition in reasoning accuracy, preserving coherence. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.
zh

[NLP-124] EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking

【速读】: 该论文旨在通过等价检查任务评估大型语言模型(Large Language Models, LLMs)的代码推理能力。论文的关键在于构建了一个名为EquiBench的数据集,包含2400个程序对,覆盖四种编程语言和六个等价类别,这些程序对通过程序分析、编译器调度和超优化手段系统生成,以涵盖需要深层次语义推理而非简单句法变化的非平凡结构转换。通过这一数据集,论文评估了17个最先进的LLMs在等价检查任务中的表现,指出当前模型在复杂代码推理方面仍有显著提升空间。

链接: https://arxiv.org/abs/2502.12466
作者: Anjiang Wei,Jiannan Cao,Ran Li,Hongyu Chen,Yuhui Zhang,Ziheng Wang,Yaofeng Sun,Yuan Liu,Thiago S. F. X. Teixeira,Diyi Yang,Ke Wang,Alex Aiken
机构: Stanford University (斯坦福大学); MIT (麻省理工学院); Google (谷歌); Nanjing University (南京大学); DeepSeek; Intel; Visa Research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs, underpins a broad range of applications, including software refactoring, testing, and optimization. We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models (LLMs). We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. These pairs are systematically generated through program analysis, compiler scheduling, and superoptimization, covering nontrivial structural transformations that demand deep semantic reasoning beyond simple syntactic variations. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%. In the most challenging categories, the best accuracies are 62.3% and 68.8%, only modestly above the 50% random baseline for binary classification, indicating significant room for improvement in current models’ code reasoning capabilities.
zh

[NLP-125] SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

【速读】: 该论文旨在解决在实际应用中部署大型语言模型(Large Language Models, LLMs)时,如何高效且准确地检测并阻止有害用户提示的问题。解决方案的关键在于提出了一种名为SafeRoute的二元路由机制,该机制能够区分简单示例和复杂示例,并仅将复杂的示例导向较大的安全防护模型进行处理,从而在保持准确性的前提下提高计算效率。

链接: https://arxiv.org/abs/2502.12464
作者: Seanie Lee,Dong Bok Lee,Dominik Wagner,Minki Kang,Haebin Seong,Tobias Bocklet,Juho Lee,Sung Ju Hwang
机构: KAIST(韩国科学技术院); Technische Hochschule Nürnberg Georg Simon Ohm(纽伦堡乔治西蒙欧姆应用技术大学); DeepAuto.ai(未知)
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on “hard” examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model’s capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
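SafeRoute 的核心是一个判断样本难易的二元路由器。下面的示意代码演示这种“自适应模型选择”的调度逻辑,三个可调用对象的接口均为笔者假设的占位:

```python
def safe_route(prompt: str, router, small_guard, large_guard) -> str:
    """路由器判定为 hard 的输入才交给大安全模型,其余由蒸馏小模型处理。"""
    if router(prompt) == "hard":
        return large_guard(prompt)  # 大模型:更准确,但计算开销大
    return small_guard(prompt)      # 小模型:足以可靠处理大多数输入

# 玩具示例:用提示长度冒充“难度”信号,真实路由器应是训练得到的分类器
verdict = safe_route(
    "How do I bake a cake?",
    router=lambda p: "hard" if len(p) > 200 else "easy",
    small_guard=lambda p: "safe",
    large_guard=lambda p: "safe",
)
print(verdict)
```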
zh

[NLP-126] Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLM s

【速读】: 该论文旨在解决大型语言模型(LLMs)在理解非常长的上下文时所面临的挑战,特别是在关键细节分散在大量输入中的情况下进行稳健的多跳推理。论文的关键解决方案在于通过专门的提示工程(prompt engineering)和链式思维(chain-of-thought, CoT)推理来模拟检索增强生成(Retrieval Augmented Generation, RAG)。该方法将模型本身作为检索器和推理器,首先标记长篇幅中的相关段落,然后采用逐步的CoT工作流程整合这些证据片段,从而减少对外部检索器的依赖,同时保持对关键段落的关注。这种方法显著提升了处理包含多个事实的问题,如对象位置跟踪、计数以及不确定知识的能力。

链接: https://arxiv.org/abs/2502.12462
作者: Joon Park,Kyohei Atarashi,Koh Takeuchi,Hisashi Kashima
机构: Kyoto University(京都大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:This paper addresses the challenge of comprehending very long contexts in Large Language Models (LLMs) by proposing a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought (CoT) reasoning. While recent LLMs support over 100,000 tokens in a single prompt, simply enlarging context windows has not guaranteed robust multi-hop reasoning when key details are scattered across massive input. Our approach treats the model as both the retriever and the reasoner: it first tags relevant segments within a long passage, then employs a stepwise CoT workflow to integrate these pieces of evidence. This single-pass method thereby reduces reliance on an external retriever, yet maintains focus on crucial segments. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text. Compared to baseline (no retrieval) and naive RAG pipelines, our approach more accurately handles multi-fact questions such as object location tracking, counting, and indefinite knowledge. Furthermore, we analyze how prompt structure, including the order of question, relevant-text tags, and overall instructions, significantly affects performance. These findings underscore that optimized prompt engineering, combined with guided reasoning, can enhance LLMs’ long-context comprehension and serve as a lightweight alternative to traditional retrieval pipelines.
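该方法让模型先自行“检索”(给相关片段打标签),再基于标签片段做逐步推理。下面是这一两段式提示流程的示意,其中 llm(prompt) -> str 是笔者假设的调用接口,提示措辞也仅作演示:

```python
def emulated_rag_answer(llm, long_passage: str, question: str) -> str:
    """单一模型先标注证据、后分步推理,模拟 RAG 而无需外部检索器。"""
    tag_prompt = (
        f"Question: {question}\n"
        "Copy out the sentences from the passage below that are relevant to the "
        "question, each prefixed with [EVIDENCE].\n\n" + long_passage
    )
    evidence = llm(tag_prompt)  # 第一步:模型自己充当检索器
    reason_prompt = (
        f"Question: {question}\nEvidence:\n{evidence}\n"
        "Think step by step using only the evidence above, then give the final answer."
    )
    return llm(reason_prompt)   # 第二步:基于证据的 CoT 推理
```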
zh

[NLP-127] Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在泛化到新颖输入时的脆弱性问题,特别是在面对细微但内容保持不变的扰动(如问题格式的轻微变化或干扰项长度的变化)时。论文的关键解决方案在于提出一种“泛化压力测试”(Generalization Stress Test),通过受控的扰动来评估性能变化,以更可靠地评估LLMs的泛化能力,并呼吁重新评估基准测试和开发更为可靠的评价方法。

链接: https://arxiv.org/abs/2502.12459
作者: Guangxiang Zhao,Saier Hu,Xiaoqi Jian,Jinzhu Wu,Yuhan Wu,Change Jia,Lin Sun,Xiangzheng Zhang
机构: Qiyuan Tech; 360Zhinao; Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to ACL 2025 theme track on the Generalization of NLP models

点击查看摘要

Abstract:This paper investigates the fragility of Large Language Models (LLMs) in generalizing to novel inputs, specifically focusing on minor perturbations in well-established benchmarks (e.g., slight changes in question format or distractor length). Despite high benchmark scores, LLMs exhibit significant accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B’s MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT-4 experiences a 25-point accuracy loss when question types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts. This work aligns with the ACL 2025 theme track on the Generalization of NLP models, proposing a “Generalization Stress Test” to assess performance shifts under controlled perturbations. The study calls for reevaluating benchmarks and developing more reliable evaluation methodologies to capture LLM generalization abilities better.
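下面用一个小例子示意“泛化压力测试”中的一类内容保持扰动:只拉长干扰项、不改动题干与正确答案的语义,然后比较模型扰动前后的准确率。填充文本与数据均为笔者假设:

```python
def lengthen_distractors(options: dict, answer_key: str,
                         pad: str = " (with additional elaboration)") -> dict:
    """内容保持的扰动:仅增加干扰项长度,正确选项保持原样。"""
    return {k: (v + pad if k != answer_key else v) for k, v in options.items()}

original = {"A": "3", "B": "4", "C": "5", "D": "22"}
perturbed = lengthen_distractors(original, answer_key="B")
print(perturbed)
# 若模型在 original 与 perturbed 上准确率差异显著,即暴露出对表面线索的依赖
```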
zh

[NLP-128] An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding

【速读】: 该论文旨在解决长文本数据分析任务(如客户通话记录分析)的成本高和耗时问题。传统Transformers因其固定长度架构和与输入长度呈平方关系的自注意力机制,在处理长序列任务时面临挑战。论文的关键解决方案在于探索并评估了高效的Transformer变体(如Performer、Reformer)以及基于CNN的架构,以实现实时和近实时的长对话理解任务。结果显示,基于CNN的模型在训练速度、推理速度和内存效率方面显著优于传统的Transformers,具体表现为训练速度快约2.6倍,推理速度快约80%,内存效率提高约72%。

链接: https://arxiv.org/abs/2502.12458
作者: Annamalai Senthilnathan,Kristjan Arumae,Mohammed Khalilia,Zhengzheng Xing,Aaron R. Colak
机构: Qualtrics Inc. (Qualtrics公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Analyzing long text data such as customer call transcripts is a cost-intensive and tedious task. Machine learning methods, namely Transformers, are leveraged to model agent-customer interactions. Unfortunately, Transformers adhere to fixed-length architectures and their self-attention mechanism scales quadratically with input length. Such limitations make it challenging to leverage traditional Transformers for long sequence tasks, such as conversational understanding, especially in real-time use cases. In this paper we explore and evaluate recently proposed efficient Transformer variants (e.g. Performer, Reformer) and a CNN-based architecture for real-time and near real-time long conversational understanding tasks. We show that CNN-based models are dynamic, ~2.6x faster to train, ~80% faster inference and ~72% more memory efficient compared to Transformers on average. Additionally, we evaluate the CNN model using the Long Range Arena benchmark to demonstrate competitiveness in general long document analysis.
zh

[NLP-129] DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLM s

【速读】: 该论文旨在解决大型语言模型在扩展过程中计算成本和资源消耗显著增加的问题。现有的稀疏化方法如剪枝虽能减少计算开销,但可能导致模型知识的损失。论文的关键解决方案是提出DSMoE(动态稀疏专家混合),通过将预训练的前馈网络层划分为计算块来实现稀疏化。DSMoE采用自适应专家路由和直通估计器,并引入稀疏性损失项以平衡性能与计算效率。实验结果表明,在相同的计算限制下,DSMoE在语言建模和下游任务中优于现有剪枝和MoE方法,特别是在生成任务中表现出色。

链接: https://arxiv.org/abs/2502.12455
作者: Minxuan Lv,Zhenpeng Su,Leiyu Pan,Yizhe Xiong,Zijia Lin,Hui Chen,Wei Zhou,Jungong Han,Guiguang Ding,Cheng Luo,Di Zhang,Kun Gai,Songlin Hu
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); University of Chinese Academy of Sciences(中国科学院大学); Kuaishou Technology(快手科技); Tianjin University(天津大学); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.
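下面用 PyTorch 粗略示意 DSMoE 的核心机制:把一个 FFN 按隐藏维切成若干计算块,用 sigmoid 路由加直通估计器(straight-through estimator)为每个 token 选择激活哪些块。结构与超参均为笔者假设,非论文实现;实际方法还包含稀疏性损失项来平衡性能与计算量,此处从略:

```python
import torch
import torch.nn as nn

class DSMoEFFN(nn.Module):
    """把 FFN 划分为 n_blocks 个计算块,token 级 sigmoid 路由决定激活哪些块。"""
    def __init__(self, d_model: int, d_ff: int, n_blocks: int):
        super().__init__()
        assert d_ff % n_blocks == 0
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff // n_blocks), nn.GELU(),
                          nn.Linear(d_ff // n_blocks, d_model))
            for _ in range(n_blocks)
        )
        self.router = nn.Linear(d_model, n_blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.router(x))        # (..., n_blocks)
        hard = (probs > 0.5).float()
        gates = hard + probs - probs.detach()        # 直通估计器:前向硬门控,反向软梯度
        out = torch.stack([blk(x) for blk in self.blocks], dim=-1)
        return (out * gates.unsqueeze(-2)).sum(dim=-1)

ffn = DSMoEFFN(d_model=64, d_ff=256, n_blocks=4)
print(ffn(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```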
zh

[NLP-130] Multi-Attribute Steering of Language Models via Targeted Intervention

【速读】: 该论文旨在解决在推理时间干预(Inference-time Intervention, ITI)方法中,现有技术难以同时处理多个相互冲突属性的问题,如在提升模型有用性的同时减少有害内容。论文的关键解决方案是引入了一种名为多属性目标导向转向(Multi-Attribute Targeted Steering, MAT-Steer)的新框架,该框架通过学习针对不同属性的稀疏且正交的引导向量来实现选择性的词级别干预,从而在保持模型内部表示的同时减少属性间的冲突。

链接: https://arxiv.org/abs/2502.12446
作者: Duy Nguyen,Archiki Prasad,Elias Stengel-Eskin,Mohit Bansal
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, code link: this https URL

点击查看摘要

Abstract:Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM’s parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model’s internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient finetuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
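MAT-Steer 属于推理时干预(ITI):在不更新参数的前提下,把属性引导向量叠加到 token 表征上。下面的示意只演示“注入”这一步;向量如何通过对齐目标、稀疏与正交约束训练得到不在演示范围内,形状与数值均为笔者假设:

```python
import torch

def steer_hidden(hidden: torch.Tensor, steer_vecs: torch.Tensor,
                 strengths: torch.Tensor) -> torch.Tensor:
    """把 n_attr 条属性引导向量按强度叠加到隐藏表征上(ITI 的注入步骤)。"""
    # hidden: (batch, seq, d); steer_vecs: (n_attr, d); strengths: (n_attr,)
    delta = (strengths.unsqueeze(-1) * steer_vecs).sum(dim=0)
    return hidden + delta

h = torch.randn(1, 5, 16)
vecs = torch.randn(3, 16)  # 例:分别对应真实性、偏见、毒性的引导向量
print(steer_hidden(h, vecs, torch.tensor([0.5, 0.2, 0.0])).shape)
```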
zh

[NLP-131] HopRAG : Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation

【速读】: 该论文旨在解决 Retrieval-Augmented Generation (RAG) 系统在不完美检索中的挑战,传统检索器侧重于词汇或语义相似性而非逻辑相关性。解决方案的关键在于提出了一种名为 HopRAG 的新型 RAG 框架,通过图结构知识探索增强检索过程中的逻辑推理。HopRAG 在索引阶段构建了一个段落图,其中文本片段作为顶点,并通过大型语言模型 (LLM) 生成的伪查询建立逻辑连接作为边。在检索过程中,它采用检索-推理-剪枝机制:从词法或语义相似的段落开始,系统通过伪查询和 LLM 推理引导,探索多跳邻居以识别真正相关的段落。实验结果表明,与传统方法相比,HopRAG 在答案准确性方面提高了 76.78%,检索 F1 分数提高了 65.07%。

链接: https://arxiv.org/abs/2502.12442
作者: Hao Liu,Zhengren Wang,Xi Chen,Zhiyu Li,Feiyu Xiong,Qinhan Yu,Wentao Zhang
机构: Peking University (北京大学); Center for LLM, Institute for Advanced Algorithms Research, Shanghai (上海先进算法研究院LLM中心); Huazhong University of Science and Technology (华中科技大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems often struggle with imperfect retrieval, as traditional retrievers focus on lexical or semantic similarity rather than logical relevance. To address this, we propose HopRAG, a novel RAG framework that augments retrieval with logical reasoning through graph-structured knowledge exploration. During indexing, HopRAG constructs a passage graph, with text chunks as vertices and logical connections established via LLM-generated pseudo-queries as edges. During retrieval, it employs a retrieve-reason-prune mechanism: starting with lexically or semantically similar passages, the system explores multi-hop neighbors guided by pseudo-queries and LLM reasoning to identify truly relevant ones. Extensive experiments demonstrate HopRAG’s superiority, achieving 76.78% higher answer accuracy and 65.07% improved retrieval F1 score compared to conventional methods. The repository is available at this https URL.
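下面是 retrieve-reason-prune 机制的一个简化示意:从种子段落出发,沿 LLM 伪查询构成的边做多跳扩展,由一个 LLM 判定函数决定保留(reason)还是剪枝(prune)。graph 与 llm_judge 的接口均为笔者假设:

```python
def hop_retrieve(graph: dict, question: str, seeds: list,
                 llm_judge, max_hops: int = 2) -> set:
    """graph[passage] -> [(pseudo_query, neighbor), ...];
    llm_judge(question, pseudo_query) -> bool 判定沿该边跳转是否有助于回答。"""
    relevant, frontier = set(seeds), list(seeds)
    for _ in range(max_hops):
        next_frontier = []
        for passage in frontier:
            for pseudo_query, neighbor in graph.get(passage, []):
                if neighbor not in relevant and llm_judge(question, pseudo_query):
                    relevant.add(neighbor)         # reason:判定逻辑相关则保留
                    next_frontier.append(neighbor)
                # 否则剪枝(prune),不再沿该边扩展
        frontier = next_frontier
    return relevant
```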
zh

[NLP-132] Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL

【速读】: 该论文旨在解决由“听起来不真实”的诱人提议所引发的社会技术问题,特别是在人类互动中如何通过策略性欺骗影响决策。论文的关键在于利用AI分析人类在《外交家》(Diplomacy)这类需要自然语言沟通和战略推理的棋盘游戏中如何进行欺骗。具体而言,该方法通过提取玩家交流中提议的逻辑形式,并结合基于文本的特征,计算提议相对于各参与者价值函数的相对收益,从而提高欺骗检测的精度。这种方法比大型语言模型(LLM)方法更优,因为LLM方法会错误地标记许多真实消息为欺骗性信息。

链接: https://arxiv.org/abs/2502.12436
作者: Wichayaporn Wongkamjan,Yanze Wang,Feng Gu,Denis Peskoff,Jonathan K. Kummerfeld,Jonathan May,Jordan Lee Boyd-Graber
机构: University of Maryland(马里兰大学); Princeton University(普林斯顿大学); University of Sydney(悉尼大学); Information Sciences Institute, University of Southern California(南加州大学信息科学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An increasingly prevalent socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. This requires extracting logical forms of proposed agreements in player communications and computing the relative rewards of the proposal using agents’ value functions. Combined with text-based features, this can improve our deception detection. Our method detects human deception with a high precision when compared to a Large Language Model approach that flags many true messages as deceptive. Future human-AI interaction tools can build on our methods for deception detection by triggering friction to give users a chance of interrogating suspicious proposals.
zh

[NLP-133] A Survey on Large Language Models for Automated Planning

【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)在自动化规划中的应用,并详细分析其成功与不足之处。论文的关键在于提出一种平衡的方法,即利用LLMs的灵活性和广义知识,结合传统规划方法的严谨性和成本效益,以增强规划应用的整体效能。

链接: https://arxiv.org/abs/2502.12435
作者: Mohamed Aghzal,Erion Plaku,Gregory J. Stein,Ziyu Yao
机构: George Mason University(乔治梅森大学); National Science Foundation(国家科学基金会)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The planning ability of Large Language Models (LLMs) has garnered increasing attention in recent years due to their remarkable capacity for multi-step reasoning and their ability to generalize across a wide range of domains. While some researchers emphasize the potential of LLMs to perform complex planning tasks, others highlight significant limitations in their performance, particularly when these models are tasked with handling the intricacies of long-horizon reasoning. In this survey, we critically investigate existing research on the use of LLMs in automated planning, examining both their successes and shortcomings in detail. We illustrate that although LLMs are not well-suited to serve as standalone planners because of these limitations, they nonetheless present an enormous opportunity to enhance planning applications when combined with other approaches. Thus, we advocate for a balanced methodology that leverages the inherent flexibility and generalized knowledge of LLMs alongside the rigor and cost-effectiveness of traditional planning methods.
zh

[NLP-134] Wi-Chat: Large Language Model Powered Wi-Fi Sensing

【速读】: 该论文旨在探索大型语言模型(Large Language Models, LLMs)在集成物理模型知识以实现真实世界信号解读方面的潜力。论文的关键解决方案在于引入Wi-Chat系统,通过将Wi-Fi传感原理融入提示(prompts),使LLMs能够处理原始Wi-Fi信号并推断人类活动。这种方法利用物理模型的洞察来指导LLMs解释信道状态信息(Channel State Information, CSI)数据,而无需使用传统的信号处理技术。这一创新展示了LLMs在零样本情境下进行活动识别的强大推理能力,从而提出了一种新的Wi-Fi传感范式,扩展了LLMs的应用范围,并提升了无线传感在实际部署中的可访问性。

链接: https://arxiv.org/abs/2502.12421
作者: Haopeng Zhang,Yili Ren,Haohan Yuan,Jingzhe Zhang,Yitong Shen
机构: ALOHA Lab, University of Hawaii at Manoa(夏威夷大学马诺阿分校); University of South Florida(南佛罗里达大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, their potential to integrate physical model knowledge for real-world signal interpretation remains largely unexplored. In this work, we introduce Wi-Chat, the first LLM-powered Wi-Fi-based human activity recognition system. We demonstrate that LLMs can process raw Wi-Fi signals and infer human activities by incorporating Wi-Fi sensing principles into prompts. Our approach leverages physical model insights to guide LLMs in interpreting Channel State Information (CSI) data without traditional signal processing techniques. Through experiments on real-world Wi-Fi datasets, we show that LLMs exhibit strong reasoning capabilities, achieving zero-shot activity recognition. These findings highlight a new paradigm for Wi-Fi sensing, expanding LLM applications beyond conventional language tasks and enhancing the accessibility of wireless sensing for real-world deployments.
zh

[NLP-135] Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models

【速读】: 该论文旨在解决大型语言模型在合并过程中如何高效保留各专项任务能力的问题。现有方法通常采用统一系数来合并模型参数,忽略了参数在不同任务中的重要性差异。论文提出的关键解决方案是Sens-Merging,一种基于敏感度引导的系数调整方法。此方法通过分析参数在单一任务中的敏感度以及跨任务的可转移性,确定最优的合并系数,从而在保持各专项任务性能的同时提升整体模型表现。

链接: https://arxiv.org/abs/2502.12420
作者: Shuqi Liu,Han Wu,Bowei He,Xiongwei Han,Mingxuan Yuan,Linqin Song
机构: Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系); Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models have led to numerous task-specialized fine-tuned variants, creating a need for efficient model merging techniques that preserve specialized capabilities while avoiding costly retraining. While existing task vector-based merging methods show promise, they typically apply uniform coefficients across all parameters, overlooking varying parameter importance both within and across tasks. We present Sens-Merging, a sensitivity-guided coefficient adjustment method that enhances existing model merging techniques by operating at both task-specific and cross-task levels. Our method analyzes parameter sensitivity within individual tasks and evaluates cross-task transferability to determine optimal merging coefficients. Extensive experiments on Mistral 7B and LLaMA2-7B/13B models demonstrate that Sens-Merging significantly improves performance across general knowledge, mathematical reasoning, and code generation tasks. Notably, when combined with existing merging techniques, our method enables merged models to outperform specialized fine-tuned models, particularly in code generation tasks. Our findings reveal important trade-offs between task-specific and cross-task scalings, providing insights for future model merging strategies.
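Sens-Merging 的要点是用敏感度分析得到合并系数,而不是给所有参数同一个系数。下面是一个高度简化的示意:假设已为每个专精模型算出一个标量敏感度得分(实际方法在任务内与跨任务两个层面分析、粒度也更细),据此对参数做加权平均:

```python
import torch

def sens_merge(state_dicts: list, sensitivities: list) -> dict:
    """按归一化的敏感度得分对各模型参数做加权平均(概念演示)。"""
    total = sum(sensitivities)
    coeffs = [s / total for s in sensitivities]
    return {k: sum(c * sd[k] for c, sd in zip(coeffs, state_dicts))
            for k in state_dicts[0]}

math_model = {"w": torch.tensor([1.0, 0.0])}
code_model = {"w": torch.tensor([0.0, 1.0])}
print(sens_merge([math_model, code_model], [0.7, 0.3])["w"])  # tensor([0.7000, 0.3000])
```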
zh

[NLP-136] Lost in Transcription Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models

【速读】: 该论文旨在解决自动语音识别(ASR)系统在大规模训练下的表现评估难题,特别是如何有效衡量其在关键应用场景中的转录质量。论文的关键解决方案在于引入幻觉错误率(Hallucination Error Rate, HER)这一新指标,以量化ASR模型产生不实输出的现象,即“幻觉”。研究发现,高词错误率(WER)可能掩盖低幻觉率,而低WER也可能隐藏危险的幻觉现象,并且合成噪声及分布偏移均会显著增加HER。通过分析20个ASR模型,论文揭示了分布偏移与HER之间存在强相关性(α = 0.91)。研究表明,结合传统指标如WER与HER能够更全面地评估ASR模型的表现,特别是在高风险领域。

链接: https://arxiv.org/abs/2502.12414
作者: Hanin Atwany,Abdul Waheed,Rita Singh,Monojit Choudhury,Bhiksha Raj
机构: Carnegie Mellon University (卡内基梅隆大学); MBZUAI (MBZUAI)
类目: Computation and Language (cs.CL)
备注: The first two authors contributed equally as co-first authors. The manuscript is 21 pages long and is a work in progress

点击查看摘要

Abstract:Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of 20 ASR models reveals three key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increase HER. (3) Distribution shift correlates strongly with HER (α = 0.91). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
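HER 的计算口径可以理解为“被判定为幻觉的转写条数占总条数的比例”。下面的示意中,is_hallucination 的具体判定标准是笔者假设的占位(原文有正式定义),玩具判定仅作演示:

```python
def hallucination_error_rate(hypotheses, references, is_hallucination) -> float:
    """HER ≈ 幻觉转写数 / 总转写数。"""
    flags = [is_hallucination(h, r) for h, r in zip(hypotheses, references)]
    return sum(flags) / len(flags)

# 玩具判定:与参考完全无词汇重叠即记为幻觉
toy_judge = lambda h, r: not (set(h.split()) & set(r.split()))
print(hallucination_error_rate(
    ["the cat sat", "buy gold now"],
    ["a cat sat down", "turn left here"],
    toy_judge,
))  # 0.5
```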
zh

[NLP-137] Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models

【速读】: 该论文旨在解决大型语言模型(LLMs)中不安全提示(unsafe prompts)带来的显著安全风险。现有方法依赖于数据驱动的微调来训练护栏模型(guardrail models),需要大量的数据和计算资源。本文提出的关键解决方案是GradCoo,这是一种新颖的梯度共现分析方法,通过扩展安全性关键参数识别范围至无符号梯度相似性,从而减少“方向偏差”(directional bias)的影响,并提高不安全提示检测的准确性。

链接: https://arxiv.org/abs/2502.12411
作者: Jingyuan Yang,Bowen Yan,Rongjun Li,Ziyu Zhou,Xin Chen,Zhiyong Feng,Wei Peng
机构: IT Innovation and Research Center, Huawei Technologies (华为技术有限公司创新与研究中心); College of Intelligence and Computing, Tianjin University (天津大学智能与计算学院); Artificial Intelligence Academy, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); IT Platform Dept 1, Huawei Technologies (华为技术有限公司IT平台部1)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unsafe prompts pose significant safety risks to large language models (LLMs). Existing methods for detecting unsafe prompts rely on data-driven fine-tuning to train guardrail models, necessitating significant data and computational resources. In contrast, recent few-shot gradient-based methods emerge, requiring only a few safe and unsafe reference prompts. A gradient-based approach identifies unsafe prompts by analyzing consistent patterns of the gradients of safety-critical parameters in LLMs. Although effective, its restriction to directional similarity (cosine similarity) introduces “directional bias”, limiting its capability to identify unsafe prompts. To overcome this limitation, we introduce GradCoo, a novel gradient co-occurrence analysis method that expands the scope of safety-critical parameter identification to include unsigned gradient similarity, thereby reducing the impact of “directional bias” and enhancing the accuracy of unsafe prompt detection. Comprehensive experiments on the widely-used benchmark datasets ToxicChat and XStest demonstrate that our proposed method can achieve state-of-the-art (SOTA) performance compared to existing methods. Moreover, we confirm the generalizability of GradCoo in detecting unsafe prompts across a range of LLM base models with various sizes and origins.
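“无符号梯度相似度”的直觉是:先对梯度取绝对值再比较,从而关注两段梯度是否触及同一批安全关键参数(共现),而不被符号方向左右。下面是一个概念演示(非论文实现):

```python
import torch

def unsigned_grad_similarity(grad_a: torch.Tensor, grad_b: torch.Tensor) -> torch.Tensor:
    """对梯度取绝对值后计算余弦相似度,削弱“方向偏差”的影响。"""
    a, b = grad_a.abs().flatten(), grad_b.abs().flatten()
    return torch.dot(a, b) / (a.norm() * b.norm() + 1e-8)

g = torch.randn(1000)
flipped = -g  # 模式相同、方向相反的梯度
print(float(torch.cosine_similarity(g, flipped, dim=0)))  # -1.0:方向相似度被符号翻转击穿
print(float(unsigned_grad_similarity(g, flipped)))         # ≈1.0:无符号相似度保留共现信号
```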
zh

[NLP-138] On the Robust Approximation of ASR Metrics

【速读】: 该论文旨在解决传统自动语音识别(ASR)模型评估依赖于昂贵且耗时的标注数据的问题。论文的关键解决方案在于提出了一种无需标注数据的新型方法,通过利用多模态嵌入在统一空间中的语音和转录表示,并结合高质量代理模型来计算代理指标。这些特征被用来训练一个回归模型以预测诸如词错误率(WER)和字符错误率(CER)等关键ASR指标。该方法在14个数据集、涵盖标准和真实环境测试条件下的超过40个模型上进行了实验,结果显示所提方法在所有实验配置下均能将指标估算控制在个位数绝对误差范围内,优于最新的基准方法超过50%。

链接: https://arxiv.org/abs/2502.12408
作者: Abdul Waheed,Hanin Atwany,Rita Singh,Bhiksha Raj
机构: Carnegie Mellon University(卡内基梅隆大学); MBZUAI(穆罕默德·本·扎耶德国际人工智能研究所)
类目: Computation and Language (cs.CL)
备注: 25 Pages. Work in Progress

点击查看摘要

Abstract:Recent advances in speech foundation models are largely driven by scaling both model size and data, enabling them to perform a wide range of tasks, including speech recognition. Traditionally, ASR models are evaluated using metrics like Word Error Rate (WER) and Character Error Rate (CER), which depend on ground truth labels. As a result of limited labeled data from diverse domains and testing conditions, the true generalization capabilities of these models beyond standard benchmarks remain unclear. Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics. These features are used to train a regression model to predict key ASR metrics like Word Error Rate (WER) and Character Error Rate (CER). We experiment with over 40 models across 14 datasets representing both standard and in-the-wild testing conditions. Our results show that we approximate the metrics within a single-digit absolute difference across all experimental configurations, outperforming the most recent baseline by more than 50%.
zh

[NLP-139] WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects

【速读】: 该论文旨在通过扩展WMT24数据集至涵盖55种语言,解决多语言机器翻译(Machine Translation, MT)基准测试数据不足的问题。解决方案的关键在于收集新的人工撰写参考译文及译后编辑,从而确保在广泛的语种范围内进行有效的性能评估。研究发现,在所有55种语言中,大型语言模型(Large Language Models, LLM)的表现优于其他机器翻译系统。

链接: https://arxiv.org/abs/2502.12404
作者: Daniel Deutsch,Eleftheria Briakou,Isaac Caswell,Mara Finkelstein,Rebecca Galor,Juraj Juraska,Geza Kovacs,Alison Lui,Ricardo Rei,Jason Riesa,Shruti Rijhwani,Parker Riley,Elizabeth Salesky,Firas Trabelsi,Stephanie Winkler,Biao Zhang,Markus Freitag
机构: Google(谷歌); Unbabel
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLM) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.
zh

[NLP-140] Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges

【速读】: 该论文旨在解决自然语言处理(NLP)系统在理解和解析语言细微差别方面的能力不足问题。论文的关键在于通过全面回顾和分析现有用于评估NLP系统语用能力的资源,识别当前评估趋势及存在的局限性,并在此基础上提出更为综合和有针对性的评估基准,以促进开发出更具细微理解和上下文感知能力的NLP模型。

链接: https://arxiv.org/abs/2502.12378
作者: Bolei Ma,Yuting Li,Wei Zhou,Ziwei Gong,Yang Janet Liu,Katja Jasinskaja,Annemarie Friedrich,Julia Hirschberg,Frauke Kreuter,Barbara Plank
机构: LMU Munich & Munich Center for Machine Learning(慕尼黑大学 & 慕尼黑机器学习中心); University of Cologne(科隆大学); University of Augsburg(奥格斯堡大学); Bosch Center for Artificial Intelligence(博世人工智能中心); Columbia University(哥伦比亚大学); University of Maryland, College Park(马里兰大学帕克分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding pragmatics-the use of language in context-is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.
zh

[NLP-141] UltraGen: Extremely Fine-grained Controllable Generation via Attribute Reconstruction and Global Preference Optimization

【速读】: 该论文旨在解决细粒度可控文本生成中,当属性数量增加到数十个时现有方法性能显著下降的问题。为应对这一挑战,论文提出了一种名为极细粒度可控生成(EFCG)的零样本方法,其关键是引入了自动重构(AR)和全局偏好优化(GPO)。在AR阶段,利用大型语言模型(LLMs)从原始文本中提取软属性,并结合编程推导出的硬属性,构建大量(约45个)多属性需求,以指导细粒度文本重构过程。在GPO阶段,采用直接偏好优化(DPO)在多种属性组合下精炼文本生成,从而有效地探索全局组合空间。此外,还引入了一种高效的属性采样策略来识别和纠正潜在错误的属性,进一步提升全局优化效果。

链接: https://arxiv.org/abs/2502.12375
作者: Longfei Yun,Letian Peng,Jingbo Shang
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine granularity is an essential requirement for controllable text generation, which has seen rapid growth with the ability of LLMs. However, existing methods focus mainly on a small set of attributes like 3 to 5, and their performance degrades significantly when the number of attributes increases to the next order of magnitude. To address this challenge, we propose a novel zero-shot approach for extremely fine-grained controllable generation (EFCG), proposing auto-reconstruction (AR) and global preference optimization (GPO). In the AR phase, we leverage LLMs to extract soft attributes (e.g., Emphasis on simplicity and minimalism in design) from raw texts, and combine them with programmatically derived hard attributes (e.g., The text should be between 300 and 400 words) to construct massive (around 45) multi-attribute requirements, which guide the fine-grained text reconstruction process under weak supervision. In the GPO phase, we apply direct preference optimization (DPO) to refine text generation under diverse attribute combinations, enabling efficient exploration of the global combination space. Additionally, we introduce an efficient attribute sampling strategy to identify and correct potentially erroneous attributes, further improving global optimization. Our framework significantly improves the constraint satisfaction rate (CSR) and text quality for EFCG by mitigating position bias and alleviating attention dilution.
zh

[NLP-142] Factual Inconsistency in Data-to-Text Generation Scales Exponentially with LLM Size: A Statistical Validation

【速读】: 该论文旨在探究数据到文本生成(Data-to-Text, D2T)中的事实一致性如何随大型语言模型(Large Language Models, LLMs)规模的变化而变化。论文的关键在于通过探索幂律和指数律两种缩放规律,评估和比较这两种规律在预测性能、拟合优度及对比分析三个阶段的表现,从而揭示事实一致性并非遵循广泛假设的幂律缩放,而是遵循指数律缩放。

链接: https://arxiv.org/abs/2502.12372
作者: Joy Mahapatra,Soumyajit Roy,Utpal Garain
机构: Indian Statistical Institute Kolkata (印度统计学院加尔各答)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Abstract:Monitoring factual inconsistency is essential for ensuring trustworthiness in data-to-text generation (D2T). While large language models (LLMs) have demonstrated exceptional performance across various D2T tasks, previous studies on scaling laws have primarily focused on generalization error through power law scaling to LLM size (i.e., the number of model parameters). However, no research has examined the impact of LLM size on factual inconsistency in D2T. In this paper, we investigate how factual inconsistency in D2T scales with LLM size by exploring two scaling laws: power law and exponential scaling. To rigorously evaluate and compare these scaling laws, we employ a statistical validation framework consisting of three key stages: predictive performance estimation, goodness-of-fit assessment, and comparative analysis. For a comprehensive empirical study, we analyze three popular LLM families across five D2T datasets, measuring factual inconsistency inversely using four state-of-the-art consistency metrics. Our findings, based on exhaustive empirical results and validated through our framework, reveal that, contrary to the widely assumed power law scaling, factual inconsistency in D2T follows an exponential scaling with LLM size.
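论文比较的两种缩放律可以用最小二乘拟合直观区分。下面用一组纯属虚构的示意数据演示如何分别拟合幂律与指数律并比较残差(真实结论来自论文对三个模型家族、五个数据集的系统验证):

```python
import numpy as np
from scipy.optimize import curve_fit

n = np.array([1.0, 3.0, 7.0, 13.0, 70.0])     # 参数量(十亿),假设数据
y = np.array([0.42, 0.31, 0.22, 0.17, 0.05])  # 事实不一致性得分,假设数据

def power_law(x, a, b):   # y = a * x^(-b)
    return a * x ** (-b)

def exp_law(x, a, b):     # y = a * exp(-b * x)
    return a * np.exp(-b * x)

for name, f, p0 in [("power law", power_law, (0.4, 0.5)),
                    ("exponential", exp_law, (0.4, 0.05))]:
    params, _ = curve_fit(f, n, y, p0=p0, maxfev=10000)
    sse = float(((y - f(n, *params)) ** 2).sum())
    print(f"{name}: a={params[0]:.3f}, b={params[1]:.3f}, SSE={sse:.5f}")
```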
zh

[NLP-143] Classifiers of Data Sharing Statements in Clinical Trial Records ALT

【速读】: 该论文旨在解决如何有效识别大型数据库中可用的个体参与者数据(IPD)的问题。解决方案的关键在于利用领域特定的预训练语言模型(domain-specific pre-trained language models),通过评估这些模型在预测手动标注标签方面的表现,以简化基于文本输入的有效分类器的实现,从而自动识别大型临床试验数据库中可获得的IPD。

链接: https://arxiv.org/abs/2502.12362
作者: Saber Jelodari Mamaghani,Cosima Strantz,Dennis Toddenroth
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published in Proceedings of MIE 2024, IOS Press eBooks. Studies in Health Technology and Informatics, Vol. 316, pp. 834-838. Conference held in Athens, Greece

点击查看摘要

Abstract:Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from this http URL, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.
zh

[NLP-144] ConFit v2: Improving Resume-Job Matching using Hypothetical Resume Embedding and Runner-Up Hard-Negative Mining

【速读】: 该论文旨在解决简历与岗位匹配系统中因交互标签稀疏导致的推荐准确性问题。解决方案的关键在于引入ConFit v2,通过增强编码器的对比训练过程来改进匹配效果。具体而言,ConFit v2采用两项技术:一是利用大规模语言模型生成假设参考简历以扩充岗位数据;二是运用一种新颖的硬负样本挖掘策略从未标记的简历-岗位对中创建高质量的硬负样本。

链接: https://arxiv.org/abs/2502.12361
作者: Xiao Yu,Ruize Xu,Chengyuan Xue,Jinzhong Zhang,Zhou Yu
机构: Columbia University (哥伦比亚大学); University of Toronto (多伦多大学); Intellipro Group Inc. (智谱集团)
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2401.16349

点击查看摘要

Abstract:A reliable resume-job matching system helps a company recommend suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction labels in resume-job datasets are sparse. We introduce ConFit v2, an improvement over ConFit to tackle this sparsity problem. We propose two techniques to enhance the encoder’s contrastive training process: augmenting job data with hypothetical reference resume generated by a large language model; and creating high-quality hard negatives from unlabeled resume/job pairs using a novel hard-negative mining strategy. We evaluate ConFit v2 on two real-world datasets and demonstrate that it outperforms ConFit and prior methods (including BM25 and OpenAI text-embedding-003), achieving an average absolute improvement of 13.8% in recall and 17.5% in nDCG across job-ranking and resume-ranking tasks.
zh

[NLP-145] LM Agents for Coordinating Multi-User Information Gathering

【速读】: 该论文旨在解决通过语言模型(Language Model, LM)中介的协作问题解决能力评估问题。为实现这一目标,论文提出PeopleJoin基准,其关键是设计了一套机制,使代理能够在模拟的多用户协作场景中识别合适的队友、进行信息交流,并最终汇总出有用的答案或摘要。该基准包含两个评估领域:针对表格数据问答的PeopleJoin-QA和针对文档创建任务的PeopleJoin-DocCreation,这些任务的信息分布在由2至20个用户组成的合成“组织”中。

链接: https://arxiv.org/abs/2502.12328
作者: Harsh Jhamtani,Jacob Andreas,Benjamin Van Durme
机构: Microsoft
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces PeopleJoin, a benchmark for evaluating LM-mediated collaborative problem solving. Given a user request, PeopleJoin agents must identify teammates who might be able to assist, converse with these teammates to gather information, and finally compile a useful answer or summary for the original user. PeopleJoin comprises two evaluation domains: PeopleJoin-QA, focused on questions about tabular data, and PeopleJoin-DocCreation, focused on document creation tasks. The two domains are adapted from existing NLP benchmarks for database question answering and multi-document summarization; here, however, the information needed to complete these tasks is distributed across synthetic “organizations” of 2–20 users, simulating natural multi-user collaboration scenarios. We implemented several popular LM agent architectures, evaluating their accuracy and efficiency at completing tasks, and highlight new research questions that can be studied using PeopleJoin.
zh

[NLP-146] From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在不同推理约束条件下训练成本高昂的问题,并且一旦训练完成,这些模型通常以统一方式处理所有tokens,缺乏灵活性。为应对这些问题,论文提出了一种名为DynaMoE的后训练优化框架,通过最小的微调成本将预训练的密集型LLM转换为基于token难度的专家混合模型。关键在于其token难度感知路由机制,能够预测token的难度并将它们导向适当的子网络或专家,从而实现更高效和灵活的处理能力。

链接: https://arxiv.org/abs/2502.12325
作者: Kumari Nishu,Sachin Mehta,Samira Abnar,Mehrdad Farajtabar,Maxwell Horton,Mahyar Najibi,Moin Nabi,Minsik Cho,Devang Naik
机构: Apple
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only 10B tokens, a minimal cost compared to the base model’s training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only one-ninth of their fine-tuning cost.
zh

[NLP-147] Can Language Models Learn Typologically Implausible Languages?

【速读】: 该论文旨在探讨域一般学习偏差(Domain-General Learning Biases)在语言普遍性中的作用,并通过大规模自然主义的语言模型(Language Models, LMs)实验来验证。关键在于通过训练和测试语言模型于高度自然但反事实的英语(主语先行)和日语(主语后置)版本,评估这些模型对类型学合理与不合理语言的学习差异,从而揭示语言模型是否显示出与类型学一致的学习偏好,以及这种偏好可能源于域一般的认知偏差。

链接: https://arxiv.org/abs/2502.12317
作者: Tianyang Xu,Tatsuki Kuribayashi,Yohei Oseki,Ryan Cotterell,Alex Warstadt
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans. However, empirical evidence has been limited to experiments with highly simplified artificial languages, and whether these correlations arise from domain-general or language-specific biases remains a matter of debate. Language models (LMs) provide an opportunity to study artificial language learning at a large scale and with a high degree of naturalism. In this paper, we begin with an in-depth discussion of how LMs allow us to better determine the role of domain-general learning biases in language universals. We then assess learnability differences for LMs resulting from typologically plausible and implausible languages closely following the word-order universals identified by linguistic typologists. We conduct a symmetrical cross-lingual study training and testing LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages. Compared to similar work, our datasets are more naturalistic and fall closer to the boundary of plausibility. Our experiments show that these LMs are often slower to learn these subtly implausible languages, while ultimately achieving similar performance on some metrics regardless of typological plausibility. These findings lend credence to the conclusion that LMs do show some typologically-aligned learning preferences, and that the typological patterns may result from, at least to some degree, domain-general learning biases.
zh

[NLP-148] Warmup Generations: A Task-Agnostic Approach for Guiding Sequence-to-Sequence Learning with Unsupervised Initial State Generation

【速读】: 该论文旨在解决传统有监督微调(Supervised Fine-Tuning, SFT)策略在序列到序列任务中直接生成目标输出导致性能受限的问题。论文提出的关键解决方案是一种任务不可知的框架,该框架使模型能够生成中间“预热”序列。这些预热序列作为后续生成的初始状态,通过优化以增强生成目标序列的概率,且无需依赖外部监督或人为设计的结构。该方法借鉴强化学习的原则,迭代地优化这些中间步骤,以最大化其对最终输出的贡献,类似于基于奖励的优化过程。实验结果表明,该方法在翻译、摘要生成以及逻辑推理的多选题回答等任务中优于传统的SFT方法,并提供了可扩展且灵活的序列到序列任务解决方案。

链接: https://arxiv.org/abs/2502.12304
作者: Senyu Li,Zipeng Sun,Jiayi Wang,Xue Liu,Pontus Stenetorp,Siva Reddy,David Ifeoluwa Adelani
机构: Mila - Quebec AI Institute(麦吉尔大学-魁北克人工智能研究所); McGill University(麦吉尔大学); University College London(伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. Recent work has shown that guiding models with intermediate steps, such as keywords, outlines, or reasoning chains, can significantly improve performance, coherence, and interpretability. However, these methods often depend on predefined intermediate formats and annotated data, limiting their scalability and generalizability. In this work, we introduce a task-agnostic framework that enables models to generate intermediate “warmup” sequences. These warmup sequences, serving as an initial state for subsequent generation, are optimized to enhance the probability of generating the target sequence without relying on external supervision or human-designed structures. Drawing inspiration from reinforcement learning principles, our method iteratively refines these intermediate steps to maximize their contribution to the final output, similar to reward-driven optimization in reinforcement learning with human feedback. Experimental results across tasks such as translation, summarization, and multi-choice question answering for logical reasoning show that our approach outperforms traditional SFT methods, and offers a scalable and flexible solution for sequence-to-sequence tasks.
zh

[NLP-149] SMOL: Professionally translated parallel data for 115 under-represented languages

【速读】: 该论文旨在解决低资源语言(Low-Resource Languages, LRLs)翻译资源匮乏的问题。解决方案的关键在于开源SMOL(Set of Maximal Overall Leverage)数据集,它包含了115种语料稀缺语言的翻译,共计610万个翻译标记。SMOL由两个子数据集组成:SMOL-Sent,用于广泛覆盖独特标记的句子集合;以及SMOL-Doc,关注广泛主题覆盖的文档级源数据。通过使用SMOL来提示或微调大规模语言模型(Large Language Models),论文展示了ChrF评分的显著提升。此外,论文还提供了SMOL-Doc中所有文档的真实性评级及其理由,从而为这些语言提供了首个事实性数据集。

链接: https://arxiv.org/abs/2502.12301
作者: Isaac Caswell,Elizabeth Nielsen,Jiaming Luo,Colin Cherry,Geza Kovacs,Hadar Shemtov,Partha Talukdar,Dinesh Tewari,Baba Mamadi Diane,Koulako Moussa Doumbouya,Djibrila Diane,Solo Farabado Cissé
机构: Google (谷歌); Stanford University (斯坦福大学); NKo USA INC
类目: Computation and Language (cs.CL)
备注: ~10 pages with appendices

点击查看摘要

Abstract:We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on a broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages.
zh

[NLP-150] Independence Tests for Language Models

【速读】: 该论文旨在解决判定两个模型权重是否来自独立随机初始化的问题。在约束设置下,论文对模型架构和训练过程作出假设,并提出一系列统计检验方法,计算出精确的p值以检验原假设(即模型来自独立随机初始化);这些检验的有效性不受模型训练数据组成的影响。在非约束设置下(允许改变模型架构并考虑对抗性规避攻击),上述检验不再适用,论文转而提出一种匹配隐藏激活的新检验,该方法能够识别模型中的非独立组件,并且对对抗性变换和架构变化具有鲁棒性。关键在于针对不同假设条件设计相应的统计检验,从而在各种情况下有效判定模型权重是否独立。

链接: https://arxiv.org/abs/2502.12292
作者: Sally Zhu,Ahmed Ahmed,Rohith Kuditipudi,Percy Liang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We consider the following problem: given the weights of two models, can we test whether they were trained independently – i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model’s training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and correctly identify all pairs of non-independent models. Our tests remain effective even if one model was fine-tuned for many tokens. In the unconstrained setting, where we make no assumptions about training procedures, can change model architecture, and allow for adversarial evasion attacks, the previous tests no longer work. Instead, we propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture. The test can also do localized testing: identifying specific non-independent components of models. Though we no longer obtain exact p-values from this, empirically we find it behaves as one and reliably identifies non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
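无约束设置下“匹配隐藏激活”检验的直觉,可以用一个线性对齐残差的玩具实验体会:同源(非独立)模型的激活往往能被一个线性映射很好地对齐,独立模型则不能。以下仅为概念演示,并非论文检验本身:

```python
import torch

def activation_match_residual(acts_a: torch.Tensor, acts_b: torch.Tensor) -> float:
    """用最小二乘把模型 A 的激活线性映射到模型 B,返回相对残差(越小越像非独立)。"""
    sol = torch.linalg.lstsq(acts_a, acts_b).solution   # (d_a, d_b)
    resid = acts_b - acts_a @ sol
    return float(resid.norm() / acts_b.norm())

torch.manual_seed(0)
base = torch.randn(256, 32)                                          # 模型 A 的激活
derived = base @ torch.randn(32, 32) + 0.01 * torch.randn(256, 32)   # “同源”模型的激活
independent = torch.randn(256, 32)                                   # 独立训练的模型
print(activation_match_residual(base, derived))       # 接近 0
print(activation_match_residual(base, independent))   # 明显更大
```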
zh

[NLP-151] Evaluating Step-by-step Reasoning Traces: A Survey

【速读】: 该论文旨在解决大型语言模型(LLMs)在复杂问题中的推理能力评估标准不统一的问题,导致开发度量标准和元评估基准的努力分散。论文的关键解决方案是提出了一种包含四类主要评价标准(可靠性、有效性、连贯性和实用性)的分类法,并系统地调查了各类标准下使用的度量方法,以及评估者模型在不同标准之间的可转移性。

链接: https://arxiv.org/abs/2502.12289
作者: Jinu Lee,Julia Hockenmaier
机构: University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Computation and Language (cs.CL)
备注: 20 pages (8 pages of main content), 6 figures

点击查看摘要

Abstract:Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, the evaluation criteria remain highly unstandardized, leading to fragmented efforts in developing metrics and meta-evaluation benchmarks. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility). We then categorize metrics based on their implementations, survey which metrics are used for assessing each criterion, and explore whether evaluator models can transfer across different criteria. Finally, we identify key directions for future research.
zh

[NLP-152] Story Grammar Semantic Matching for Literary Study

【速读】: 该论文旨在解决传统语义匹配算法在理解文学文本时因依赖词共现特征而受限的问题。解决方案的关键在于提出了一种新的方法——故事语法语义匹配(Story Grammar Semantic Matching),通过使用BERT语言模型标注故事元素标签,并仅以这些标签作为特征进行语义匹配,从而更透明地揭示文本间的语义相似性。

链接: https://arxiv.org/abs/2502.12276
作者: Abigail Swenor,Neil Coffee,Walter Scheirer
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to Journal of Computational Literary Studies

点击查看摘要

Abstract:In Natural Language Processing (NLP), semantic matching algorithms have traditionally relied on the feature of word co-occurrence to measure semantic similarity. While this feature approach has proven valuable in many contexts, its simplistic nature limits its analytical and explanatory power when used to understand literary texts. To address these limitations, we propose a more transparent approach that makes use of story structure and related elements. Using a BERT language model pipeline, we label prose and epic poetry with story element labels and perform semantic matching by only considering these labels as features. This new method, Story Grammar Semantic Matching, guides literary scholars to allusions and other semantic similarities across texts in a way that allows for characterizing patterns and literary technique.
zh

[NLP-153] Integrating Expert Knowledge into Logical Programs via LLM s

【速读】: 该论文旨在评估大型语言模型(Large Language Models, LLMs)在集成专家知识到逻辑推理系统中的有效性。解决方案的关键在于ExKLoP框架,它通过系统性地评估LLM生成的逻辑规则,考察其句法流畅性和逻辑正确性,并利用基于代码执行结果的迭代反馈循环探索模型的自我修正能力。ExKLoP提供了包含130个工程前提、950个提示及其相应验证点的可扩展数据集,从而实现全面的基准测试,同时控制任务复杂度和实验的可扩展性。

链接: https://arxiv.org/abs/2502.12275
作者: Franciszek Górski,Oskar Wysocki,Marco Valentino,Andre Freitas
机构: Gdansk University of Technology (格但斯克工业大学), Poland; University of Manchester (曼彻斯特大学), United Kingdom; National Biomarker Centre (NBC), CRUK Manchester Institute (英国癌症研究中心曼彻斯特研究所), United Kingdom; Idiap Research Institute (伊迪亚普研究所), Martigny, Switzerland
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This paper introduces ExKLoP, a novel framework designed to evaluate how effectively Large Language Models (LLMs) integrate expert knowledge into logical reasoning systems. This capability is especially valuable in engineering, where expert knowledge (such as manufacturer-recommended operational ranges) can be directly embedded into automated monitoring systems. By mirroring expert verification steps, tasks like range checking and constraint validation help ensure system safety and reliability. Our approach systematically evaluates LLM-generated logical rules, assessing both syntactic fluency and logical correctness in these critical validation tasks. We also explore the models' capacity for self-correction via an iterative feedback loop based on code execution outcomes. ExKLoP presents an extensible dataset comprising 130 engineering premises, 950 prompts, and corresponding validation points. It enables comprehensive benchmarking while allowing control over task complexity and scalability of experiments. We leverage the synthetic data creation methodology to conduct extensive empirical evaluation on a diverse set of LLMs including Llama3, Gemma, Mixtral, Mistral, and Qwen. Results reveal that while models generate nearly perfect, syntactically correct code, they frequently exhibit logical errors in translating expert knowledge. Furthermore, iterative self-correction yields only marginal improvements (up to 3%). Overall, ExKLoP serves as a robust evaluation platform that streamlines the selection of effective models for self-correcting systems while clearly delineating the types of errors encountered. The complete implementation, along with all relevant data, is available at GitHub.
zh

[NLP-154] Learning to Reason at the Frontier of Learnability

【速读】: 该论文旨在解决在强化学习阶段,大型语言模型(LLMs)训练过程中存在的一些问题,特别是在处理推理类任务如数学问题时。研究发现,在使用两种流行的算法(PPO 和 VinePPO)对两个广泛使用的数据集进行训练时,许多问题要么被所有尝试都解决(意味着它们已经被完全掌握),要么没有一个尝试能够解决(提供不了有意义的训练信号)。为了解决这一问题,论文提出了一种新的方法:根据可学习性采样来优化课程学习策略。这种方法的关键在于优先选择成功率变化较大的问题,即那些有时可以成功解决但并非每次都能成功的任务。通过这种方法,论文展示了其课程学习策略能够在多种算法和数据集上持续提升训练性能,从而为更高效且有效的强化学习应用于LLMs铺平了道路。

链接: https://arxiv.org/abs/2502.12272
作者: Thomas Foster,Jakob Foerster
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning in LLMs.
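上述“按可学习性采样”的课程可以非常直接地实现:对每道题以成功率 p 计算权重 p(1-p),使训练集中在“有时成功、有时失败”的问题上。以下为一个极简示意(非论文原代码,题目数量与采样批大小均为假设值):

```python
import numpy as np

def learnability_weights(success_counts, attempts):
    """Prioritise questions the policy sometimes solves:
    weight ∝ p * (1 - p), maximal at p = 0.5, zero at p ∈ {0, 1}."""
    p = np.asarray(success_counts) / np.asarray(attempts)
    w = p * (1 - p)
    if w.sum() == 0:                       # all mastered or all impossible
        return np.full_like(w, 1 / len(w))
    return w / w.sum()

# 8 questions, 16 rollouts each: some always solved, some never, some mixed.
succ = np.array([16, 0, 8, 12, 4, 16, 0, 10])
probs = learnability_weights(succ, 16)
rng = np.random.default_rng(0)
batch = rng.choice(len(succ), size=4, replace=False, p=probs)
print(probs.round(3), batch)   # mixed-success questions dominate the batch
```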
zh

[NLP-155] InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context

【速读】: 该论文旨在解决大型语言模型在处理含糊或不完整用户请求时存在的问题,这些模型通常会提供冗长且通用的回答,而不是寻求澄清。论文的关键解决方案在于引入InfoQuest,一个多轮对话基准,用于评估对话系统如何处理开放性用户请求中的隐含上下文。通过故意设计模糊的情景,促使模型通过提问来澄清信息,从而在提供适当响应之前有效收集必要信息。这种方法揭示了当前语言模型在多轮交互中处理模糊请求时的局限性。

链接: https://arxiv.org/abs/2502.12257
作者: Bryan L. M. de Oliveira,Luana G. B. Martins,Bruno Brandão,Luckeciano C. Melo
机构: Advanced Knowledge Center for Immersive Technologies – AKCIT (沉浸式技术先进知识中心), Brazil; Institute of Informatics, Federal University of Goiás (戈亚斯联邦大学信息学研究所), Brazil; OATML, University of Oxford (牛津大学OATML), United Kingdom
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While large language models excel at following explicit instructions, they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses rather than seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. The benchmark presents intentionally ambiguous scenarios that require models to engage in information-seeking dialogue through clarifying questions before providing appropriate responses. Our evaluation of both open and closed-source models reveals that while proprietary models generally perform better, all current assistants struggle with effectively gathering critical information, often requiring multiple turns to infer user intent and frequently defaulting to generic responses without proper clarification. We provide a systematic methodology for generating diverse scenarios and evaluating models’ information-seeking capabilities, offering insights into the current limitations of language models in handling ambiguous requests through multi-turn interactions.
zh

[NLP-156] GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation

【速读】: 该论文旨在解决手语翻译中的长时序依赖捕捉难题,以提升Sign Language Machine Translation (SLMT)的性能。论文的关键解决方案是提出了一种新颖的Gated-Logarithmic Transformer (GLoT),它能够有效处理手语作为时间序列数据的动态特性,从而更好地捕获长时序依赖关系。研究表明,GLoT在所有评估指标上均优于其他模型,展示了其在解决聋哑及听力受损群体交流挑战方面的潜力。

链接: https://arxiv.org/abs/2502.12223
作者: Nada Shahin,Leila Ismail
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Machine Translation has played a critical role in reducing language barriers, but its adaptation for Sign Language Machine Translation (SLMT) has been less explored. Existing works on SLMT mostly use the Transformer neural network, which exhibits low performance due to the dynamic nature of the sign language. In this paper, we propose a novel Gated-Logarithmic Transformer (GLoT) that captures the long-term temporal dependencies of the sign language as time-series data. We perform a comprehensive evaluation of GLoT with the transformer and transformer-fusion models as baselines, for Sign-to-Gloss-to-Text translation. Our results demonstrate that GLoT consistently outperforms the other models across all metrics. These findings underscore its potential to address the communication challenges faced by the Deaf and Hard of Hearing community.
zh

[NLP-157] Optimal Brain Iterative Merging: Mitigating Interference in LLM Merging

【速读】: 该论文旨在解决大型语言模型(LLMs)在定制化过程中因高计算成本带来的挑战,并提出了一种名为Optimal Brain Iterative Merging (OBIM)的方法。OBIM的关键在于通过引入一个衡量参数重要性的机制来减少模型内干扰,并采用一种互斥迭代合并框架来避免直接参数平均,从而减少模型间干扰。

链接: https://arxiv.org/abs/2502.12217
作者: Zhixiang Wang,Zhenyu Mao,Yixuan Qiao,Yunfang Wu,Biye Li
机构: National Key Laboratory for Multimedia Information Processing, Peking University(北京大学国家重点多媒体信息处理实验室); GAI, Du Xiaoman(度小满GAI); School of Software & Microelectronics, Peking University(北京大学软件与微电子学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities, but their high computational costs pose challenges for customization. Model merging offers a cost-effective alternative, yet existing methods suffer from interference among parameters, leading to performance degradation. In this work, we propose Optimal Brain Iterative Merging (OBIM), a novel method designed to mitigate both intra-model and inter-model interference. OBIM consists of two key components: (1) A saliency measurement mechanism that evaluates parameter importance based on loss changes induced by individual weight alterations, reducing intra-model interference by preserving only high-saliency parameters. (2) A mutually exclusive iterative merging framework, which incrementally integrates models using a binary mask to avoid direct parameter averaging, thereby mitigating inter-model interference. We validate OBIM through experiments on both Supervised Fine-Tuned (SFT) models and post-pretrained checkpoints. The results show that OBIM significantly outperforms existing merging techniques. Overall, OBIM provides an effective and practical solution for enhancing LLM merging.
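下面给出一个体现 OBIM 两个组件思路的玩具示意(显著性以一阶近似 |Δw·∇L| 代替,保留比例与流程均为假设,并非论文原实现):

```python
import numpy as np

def saliency(delta_w, grad):
    """First-order proxy for the loss change caused by each weight edit
    (hypothetical stand-in for the paper's saliency measurement)."""
    return np.abs(delta_w * grad)

def obim_style_merge(base, deltas, grads, keep_ratio=0.3):
    """Iteratively graft high-saliency deltas; a binary mask keeps each
    position owned by at most one model, avoiding direct averaging."""
    merged, taken = base.copy(), np.zeros(base.shape, dtype=bool)
    for dw, g in zip(deltas, grads):
        s = saliency(dw, g)
        s[taken] = -np.inf                   # positions already claimed
        k = int(keep_ratio * dw.size)
        idx = np.unravel_index(np.argsort(s, axis=None)[-k:], dw.shape)
        merged[idx] += dw[idx]
        taken[idx] = True
    return merged

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))
deltas = [rng.normal(size=(4, 4)) * 0.1 for _ in range(2)]  # two SFT models
grads = [rng.normal(size=(4, 4)) for _ in range(2)]
print(obim_style_merge(base, deltas, grads))
```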
zh

[NLP-158] Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

【速读】: 该论文旨在解决长上下文模型在解码过程中加载大型KV缓存效率低下的问题。现有方法通过稀疏注意力机制限制固定token预算,但忽视了不同注意力头、层及上下文之间的重要性差异。论文提出的关键解决方案是Tactic,这是一种自适应稀疏注意力机制,它根据累积注意力分数动态选择token,而非依赖固定的token预算。通过设定目标注意力分数比例,Tactic能够自然适应注意力稀疏性的变化,并利用基于聚类的排序和分布拟合高效近似选择,从而在最小计算开销下准确估计token的重要性。这种方法显著提升了模型的精度和解码注意力速度,实现了高达7.29倍的加速,整体端到端推理速度提升1.58倍。

链接: https://arxiv.org/abs/2502.12216
作者: Kan Zhu,Tian Tang,Qinyu Xu,Yile Gu,Zhichen Zeng,Rohan Kadekodi,Liangyu Zhao,Ang Li,Arvind Krishnamurthy,Baris Kasikci
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup. This improvement translates to an overall 1.58x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.
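Tactic 的核心选择规则——累积注意力分数达到目标比例即停止——可以用几行代码说明(示意性实现,未包含论文中基于聚类排序与分布拟合的加速近似):

```python
import torch

def tactic_style_select(attn_scores, target_fraction=0.9):
    """Pick the smallest token set whose cumulative attention mass
    reaches the target fraction, instead of a fixed token budget."""
    probs = torch.softmax(attn_scores, dim=-1)
    sorted_p, order = probs.sort(descending=True)
    cum = sorted_p.cumsum(-1)
    # number of tokens needed to cover the target attention mass
    k = int((cum < target_fraction).sum().item()) + 1
    return order[:k]

scores = torch.tensor([4.0, 0.5, 3.0, 0.1, 2.0, 0.2])  # one head, one query
print(tactic_style_select(scores, 0.9))  # adapts to how peaked attention is
```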
zh

[NLP-159] Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement

【速读】: 该论文旨在解决大型语言模型(LLMs)在资源限制下参数数量受限,从而影响性能的问题。论文的关键解决方案是提出零令牌变换器(Zero Token Transformer, ZTT),其采用了一种头尾分离的参数循环方法,并引入了零令牌机制(Zero-Token Mechanism)。该机制通过从零令牌池中检索带有可训练键值的零令牌,并将其与常规令牌一起整合到注意力机制中,实现层特定计算的动态引导以及计算重要性的反映,从而实现在紧缩参数预算下的优越性能和通过早期退出有效减少计算开销。

链接: https://arxiv.org/abs/2502.12214
作者: Guanghao Li,Wenhao Jiang,Li Shen,Ming Tang,Chun Yuan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it alongside regular tokens in the attention mechanism. The corresponding attention scores not only reflect each layer’s computational importance but also enable dynamic early exits without sacrificing overall model accuracy. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models for enhanced efficiency and adaptability.
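零令牌机制的要点是把一个带可训练键值的“零令牌”并入注意力计算,其注意力分数可作为该层计算重要性的信号。以下为单层、池大小为 1 的极简 PyTorch 示意(结构细节为假设):

```python
import torch
import torch.nn as nn

class ZeroTokenAttention(nn.Module):
    """Attention that also attends to one retrieved 'zero token' whose
    key/value are trainable parameters (pool of size 1 for brevity)."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.zero_k = nn.Parameter(torch.randn(1, dim))  # from the Zero-Token Pool
        self.zero_v = nn.Parameter(torch.randn(1, dim))

    def forward(self, x):                        # x: (seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k = torch.cat([self.zero_k, k], dim=0)   # prepend the zero token
        v = torch.cat([self.zero_v, v], dim=0)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        zero_score = attn[:, 0].mean()           # proxy for layer importance
        return attn @ v, zero_score

layer = ZeroTokenAttention(16)
out, score = layer(torch.randn(5, 16))
print(out.shape, float(score))  # a low score could trigger an early exit
```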
zh

[NLP-160] Enhancing Frame Detection with Retrieval Augmented Generation

【速读】: 该论文旨在解决通过 Retrieval-Augmented Generation (RAG) 模型进行框架检测的问题,特别是在仅提供原始文本的情况下。解决方案的关键在于提出了一种名为 RCIF (Retrieve Candidates and Identify Frames) 的新方法,该方法包括三个主要阶段:(1) 从不同表示中生成框架嵌入;(2) 根据输入文本检索候选框架;(3) 确定最合适的框架。RCIF 方法通过其检索组件显著减少了任务的复杂性,缩小了搜索空间,从而允许框架识别器细化和完成候选集。该方法在 FrameNet 1.5 和 1.7 上实现了最先进的性能,并通过结构化表示增强了跨词汇变化的任务中的自然语言问题到 SPARQL 查询的翻译能力。

链接: https://arxiv.org/abs/2502.12210
作者: Papa Abdou Karim Karou Diallo,Amal Zouaq
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in Natural Language Processing have significantly improved the extraction of structured semantic representations from unstructured text, especially through Frame Semantic Role Labeling (FSRL). Despite this progress, the potential of Retrieval-Augmented Generation (RAG) models for frame detection remains under-explored. In this paper, we present the first RAG-based approach for frame detection called RCIF (Retrieve Candidates and Identify Frames). RCIF is also the first approach to operate without the need for an explicit target span and comprises three main stages: (1) generation of frame embeddings from various representations; (2) retrieval of candidate frames given an input text; and (3) identification of the most suitable frames. We conducted extensive experiments across multiple configurations, including zero-shot, few-shot, and fine-tuning settings. Our results show that our retrieval component significantly reduces the complexity of the task by narrowing the search space, thus allowing the frame identifier to refine and complete the set of candidates. Our approach achieves state-of-the-art performance on FrameNet 1.5 and 1.7, demonstrating its robustness in scenarios where only raw text is provided. Furthermore, we leverage the structured representation obtained through this method as a proxy to enhance generalization across lexical variations in the task of translating natural language questions into SPARQL queries.
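其中第 (2) 阶段的候选框架检索本质上是一次向量近邻搜索,可用如下示意代码说明(嵌入维度、top-k 等均为假设值,非论文原实现):

```python
import numpy as np

def retrieve_candidate_frames(text_emb, frame_embs, top_k=5):
    """RCIF-style stage 2: narrow the search space to the frames whose
    embeddings are closest to the input text embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = f @ t                       # cosine similarity to every frame
    return np.argsort(-sims)[:top_k]   # candidate set for the identifier

rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(1000, 128))  # e.g. FrameNet frame embeddings
text_emb = rng.normal(size=128)
print(retrieve_candidate_frames(text_emb, frame_embs))
```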
zh

[NLP-161] Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在优化过程中与人类目标和价值观保持一致的挑战。论文的关键在于探讨工具收敛(instrumental convergence)现象,特别是在通过直接强化学习(Reinforcement Learning, RL)训练的模型中,这种现象可能导致模型产生未预期的中间目标,从而偏离人类意图。为此,作者引入了InstrumentalEval基准来评估基于强化学习训练的LLMs中的工具收敛情况,并通过对比直接RL优化训练的模型(如o1模型)与基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)训练的模型,验证了RL驱动的模型更可能表现出工具收敛倾向,因为它们在优化目标导向行为时可能与人类意图不一致。

链接: https://arxiv.org/abs/2502.12206
作者: Yufei He,Yuexin Li,Jiaying Wu,Yuan Sui,Yulin Chen,Bryan Hooi
机构: National University of Singapore
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to evolve, ensuring their alignment with human goals and values remains a pressing challenge. A key concern is instrumental convergence, where an AI system, in optimizing for a given objective, develops unintended intermediate goals that override the ultimate objective and deviate from human-intended goals. This issue is particularly relevant in reinforcement learning (RL)-trained models, which can generate creative but unintended strategies to maximize rewards. In this paper, we explore instrumental convergence in LLMs by comparing models trained with direct RL optimization (e.g., the o1 model) to those trained with reinforcement learning from human feedback (RLHF). We hypothesize that RL-driven models exhibit a stronger tendency for instrumental convergence due to their optimization of goal-directed behavior in ways that may misalign with human intentions. To assess this, we introduce InstrumentalEval, a benchmark for evaluating instrumental convergence in RL-trained LLMs. Initial experiments reveal cases where a model tasked with making money unexpectedly pursues instrumental objectives, such as self-replication, implying signs of instrumental convergence. Our findings contribute to a deeper understanding of alignment challenges in AI systems and the risks posed by unintended model behaviors.
zh

[NLP-162] Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration

【速读】: 该论文旨在解决现有抑郁症检测方法在建模临床访谈主题内容方面的缺陷:1) 未能明确捕捉主题内部和主题之间的关联;2) 无法让临床医生介入并关注感兴趣的特定主题。为了解决这些问题,论文提出了一种交互式抑郁症检测框架(Interactive Depression Detection Framework),该框架利用上下文学习技术识别临床访谈中的主题,并建模主题内部及主题间的关联。此外,通过AI驱动的反馈模拟临床医生的兴趣,实现主题重要性的交互式调整。关键在于引入了能够捕捉主题间关联并允许互动干预的机制。

链接: https://arxiv.org/abs/2502.12204
作者: Xianbing Zhao,Yiqing Lyu,Di Wang,Buzhou Tang
机构: Xidian University; Harbin Institute of Technology, Shenzhen
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic depression detection provides cues for early clinical intervention by clinicians. Clinical interviews for depression detection involve dialogues centered around multiple themes. Existing studies primarily design end-to-end neural network models to capture the hierarchical structure of clinical interview dialogues. However, these methods exhibit defects in modeling the thematic content of clinical interviews: 1) they fail to capture intra-theme and inter-theme correlation explicitly, and 2) they do not allow clinicians to intervene and focus on themes of interest. To address these issues, this paper introduces an interactive depression detection framework. This framework leverages in-context learning techniques to identify themes in clinical interviews and then models both intra-theme and inter-theme correlation. Additionally, it employs AI-driven feedback to simulate the interests of clinicians, enabling interactive adjustment of theme importance. PDIMC achieves absolute improvements of 35% and 12% compared to the state-of-the-art on the depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of modeling theme correlation and incorporating interactive external feedback.
zh

[NLP-163] BoT: Breaking Long Thought Processes of o1 -like Large Language Models through Backdoor Attack

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中可能遭受的攻击问题。具体而言,这类模型通过生成深度思考过程来实现高性能,但这种机制也使它们容易受到强制快速响应的攻击,从而削弱其性能。论文的关键解决方案是BoT(Break CoT),这是一种通过后门攻击破坏内在推理机制的方法。BoT通过构建带有设计触发器的中毒数据集,并利用监督微调或直接偏好优化注入后门,使得模型在触发时直接生成答案而无需经过思考过程,同时对于干净输入仍保持正常的推理能力。

链接: https://arxiv.org/abs/2502.12202
作者: Zihao Zhu,Hongbao Zhang,Mingda Zhang,Ruotong Wang,Guanzong Wu,Ke Xu,Baoyuan Wu
机构: The Chinese University of Hong Kong, Shenzhen(香港中文大学深圳); Huawei International, Singapore(华为国际新加坡)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Longer thought, better performance: large language models with deep reasoning capabilities, particularly o1-like models, have demonstrated remarkable performance by generating extensive thought processes during inference. This trade-off reveals a potential vulnerability: adversaries could compromise model performance by forcing immediate responses without thought processes. To this end, in this paper, we introduce a novel attack scenario targeting the long thought processes of o1-like models and propose BoT (Break CoT), which can selectively break intrinsic reasoning mechanisms through backdoor attacks. BoT constructs poisoned datasets with designed triggers and injects backdoor by either supervised fine-tuning or direct preference optimization. When triggered, the model directly generates answers without thought processes, while maintaining normal reasoning capabilities for clean inputs. Extensive experiments on open-source o1-like models, including recent DeepSeek-R1, demonstrate that BoT nearly achieves high attack success rates while maintaining clean accuracy, highlighting the critical safety risk in current models. Furthermore, the relationship between task difficulty and helpfulness reveals a potential application for good, enabling users to customize model behavior based on task complexity. Code is available at this https URL.
zh

[NLP-164] Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product

【速读】: 该论文旨在解决现有Prompt Tuning (PT)方法中存在的两个主要问题:一是忽视了软提示词(soft prompt tokens)之间的内在语义关联,导致高离散性和有限交互,从而降低模型在复杂任务中的理解和有效性;二是为了提高性能需要较长的软提示词,但长提示词会增加内存使用和计算成本。论文的关键解决方案是提出了一种新的低参数Prompt Tuning (LAMP)方法,通过采用截断奇异值分解(Truncated SVD)来减少训练参数和显著降低软提示词参数空间的维度,并利用压缩外积模块促进提示词间的多重交互,探索其内在关联以增强知识表示。最终,LAMP通过平均池化减少内存使用和训练/推理时间。

链接: https://arxiv.org/abs/2502.12200
作者: Pengxiang Lan,Haoyu Xu,Enneng Yang,Yuliang Liang,Guibing Guo,Jianzhe Zhao,Xingwei Wang
机构: Software College, Northeastern University, China(软件学院,东北大学,中国); School of Computer Science and Engineering, Northeastern University, China(计算机科学与工程学院,东北大学,中国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt tuning (PT) offers a cost-effective alternative to fine-tuning large-scale pre-trained language models (PLMs), requiring only a few parameters in soft prompt tokens added before the input text. However, existing PT approaches face two significant issues: (i) They overlook intrinsic semantic associations between soft prompt tokens, leading to high discreteness and limited interactions, thus reducing the model's comprehension and effectiveness in complex tasks. (ii) Due to the complexity of downstream tasks, long soft prompts are needed to improve performance, but prompt length correlates positively with memory usage and computational costs. Achieving high efficiency and performance remains an ongoing challenge. To address these issues, we propose a novel Low-parameters prompt tuning (LAMP) method, which leverages prompt decomposition and compressed outer product. Specifically, the prompt decomposition module employs Truncated SVD to reduce training parameters and significantly lower the dimensionality of the soft prompt parameter space. It then utilizes a compressed outer product module to facilitate multiple interactions among prompt tokens, exploring their intrinsic associations to enhance knowledge representation. Finally, LAMP uses average pooling to reduce memory usage and training/inference time. Extensive experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
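LAMP 中提示分解模块的参数压缩效果可以用截断 SVD 直接演示:把 100×768 的软提示分解为秩 8 的三个因子后,可训练参数从 76,800 降到约 7,000(以下仅为数值示意,非论文原实现):

```python
import numpy as np

def decompose_prompt(prompt, rank):
    """Truncated SVD of a soft prompt (len x dim): store U_k, s_k, V_k
    instead of the full matrix, shrinking the trainable parameter space."""
    U, s, Vt = np.linalg.svd(prompt, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

length, dim, rank = 100, 768, 8
prompt = np.random.default_rng(0).normal(size=(length, dim))
U, s, Vt = decompose_prompt(prompt, rank)
recon = (U * s) @ Vt                    # prompt reassembled at forward time
full_params = length * dim
low_params = U.size + s.size + Vt.size
print(full_params, low_params)          # 76800 vs 6952 parameters
```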
zh

[NLP-165] A Closer Look at System Prompt Robustness

【速读】: 该论文旨在解决大型语言模型(LLMs)在对话和代理场景中系统提示(system prompts)鲁棒性不足的问题。关键在于通过创建基于OpenAI的GPT Store和HuggingFace的HuggingChat收集的真实评估和微调数据集,来改进系统提示的鲁棒性,并通过现实的微调数据以及推理时干预如无分类器引导(classifier-free guidance)等方法显著提升模型性能。

链接: https://arxiv.org/abs/2502.12197
作者: Norman Mu,Jonathan Lu,Michael Lavery,David Wagner
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Artifacts: this https URL

点击查看摘要

Abstract:System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness and further study is warranted.
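文中提到的推理时干预之一——无分类器引导(classifier-free guidance)——在解码端的基本形式如下(示意代码,guidance_scale 取值为假设):

```python
import torch

def cfg_logits(logits_with_sys, logits_without_sys, guidance_scale=1.5):
    """Classifier-free guidance at decoding time: push the next-token
    distribution toward what the system prompt adds, away from the
    system-prompt-free prediction."""
    return logits_without_sys + guidance_scale * (
        logits_with_sys - logits_without_sys
    )

vocab = 10
l_sys = torch.randn(vocab)    # logits conditioned on (system + user) prompt
l_nosys = torch.randn(vocab)  # logits conditioned on the user prompt only
guided = cfg_logits(l_sys, l_nosys, guidance_scale=1.5)
next_token = torch.softmax(guided, dim=-1).argmax()
print(int(next_token))
```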
zh

[NLP-166] AI and the Law: Evaluating ChatGPT s Performance in Legal Classification

【速读】: 该论文旨在解决ChatGPT在波兰语言背景下分析和分类刑事诉讼证据的有效性问题。研究的关键在于评估ChatGPT在适用波兰刑法典的情况下进行法律案例二元分类的准确性,并通过定量与定性分析验证其在提供恰当法律依据方面的有效性。研究表明,ChatGPT能够准确分类证据并应用相应的法律规则,显示出其作为法律资源的潜力,尤其对于经验或知识较少的使用者。

链接: https://arxiv.org/abs/2502.12193
作者: Pawel Weichbroth
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages; 1 figure; 2 tables; 32 references

点击查看摘要

Abstract:The use of ChatGPT to analyze and classify evidence in criminal proceedings has been a topic of ongoing discussion. However, to the best of our knowledge, this issue has not been studied in the context of the Polish language. This study addresses this research gap by evaluating the effectiveness of ChatGPT in classifying legal cases under the Polish Penal Code. The results show excellent binary classification accuracy, with all positive and negative cases correctly categorized. In addition, a qualitative evaluation confirms that the legal basis provided for each case, along with the relevant legal content, was appropriate. The results obtained suggest that ChatGPT can effectively analyze and classify evidence while applying the appropriate legal rules. In conclusion, ChatGPT has the potential to assist interested parties in the analysis of evidence and serve as a valuable legal resource for individuals with less experience or knowledge in this area.
zh

[NLP-167] Self-supervised Attribute-aware Dynamic Preference Ranking Alignment

【速读】: 该论文旨在解决在列表级场景(如社区问答)中,通过人类反馈进行强化学习(Reinforcement Learning from Human Feedback, RLHF)过程中存在的两个主要问题:一是昂贵的人工标注成对比较(costly human-annotated pairwise comparisons),二是人类偏好受多种内在因素影响导致的决策不一致性。为了解决这些问题,论文提出了一种名为SeAdpra的自监督属性感知动态偏好排序方法。该方法基于属性感知距离因子(Attribute-Perceptual Distance Factors, APDF)量化响应之间的偏好差异,并动态确定列表级对齐顺序,从而实现细粒度的偏好差异学习和精确对齐。

链接: https://arxiv.org/abs/2502.12189
作者: Hongyu Yang,Qi Zhao,Zhenhua hu,Rui Li
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback and its variants excel in aligning with human intentions to generate helpful, harmless, and honest responses. However, most of them rely on costly human-annotated pairwise comparisons for supervised alignment, which is not suitable for list-level scenarios, such as community question answering. Additionally, human preferences are influenced by multiple intrinsic factors in responses, leading to decision-making inconsistencies. Therefore, we propose Self-supervised Attribute-aware dynamic preference ranking, called SeAdpra. It quantifies preference differences between responses based on Attribute-Perceptual Distance Factors (APDF) and dynamically determines the list-wise alignment order. Furthermore, it achieves fine-grained preference difference learning and enables precise alignment with the optimal one. We specifically constructed a challenging code preference dataset named StaCoCoQA, and introduced more cost-effective and scalable preference evaluation metrics: PrefHit and PrefRecall. Extensive experimental results show that SeAdpra exhibits superior performance and generalizability on both StaCoCoQA and preference datasets from eight popular domains.
zh

[NLP-168] Hallucinations are inevitable but statistically negligible

【速读】: 该论文旨在解决语言模型(Language Model, LM)在生成过程中产生幻觉(hallucinations)的问题。论文的关键解决方案在于从概率论的角度证明,通过提高训练数据的质量和数量,可以使幻觉现象在统计上变得可忽略不计。尽管存在不可消除的理论限制,表明在无限输入集上必然会产生幻觉,但通过改进算法和增加高质量训练数据,可以降低其发生的概率。

链接: https://arxiv.org/abs/2502.12187
作者: Atsushi Suzuki,Yulan He,Feng Tian,Zhongyuan Wang
机构: 未知
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Hallucinations, a phenomenon where a language model (LM) generates nonfactual content, pose a significant challenge to the practical deployment of LMs. While many empirical methods have been proposed to mitigate hallucinations, a recent study established a computability-theoretic result showing that any LM will inevitably generate hallucinations on an infinite set of inputs, regardless of the quality and quantity of training datasets and the choice of the language model architecture and training and inference algorithms. Although the computability-theoretic result may seem pessimistic, its significance in practical viewpoints has remained unclear. In contrast, we present a positive theoretical result from a probabilistic perspective. Specifically, we prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. Interestingly, our positive result coexists with the computability-theoretic result, implying that while hallucinations on an infinite set of inputs cannot be entirely eliminated, their probability can always be reduced by improving algorithms and training data. By evaluating the two seemingly contradictory results through the lens of information theory, we argue that our probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.
zh

[NLP-169] Large Language Models for Extrapolative Modeling of Manufacturing Processes

【速读】: 该论文旨在解决传统制造过程参数关系预测建模中受限于人工经验和直觉的主观性以及实验数据生成的成本和时间的问题。解决方案的关键在于建立一个新的大型语言模型(Large Language Model, LLM)框架,通过自动提取文献中的工艺相关知识,并基于少量实验数据进行迭代模型优化。这种结合知识自动化提取与小样本学习的方法显著提升了模型的外推性能,且无需依赖人工初始建模或专家解读文献。

链接: https://arxiv.org/abs/2502.12185
作者: Kiarash Naghavi Khanghah,Anandkumar Patel,Rajiv Malhotra,Hongyi Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conventional predictive modeling of parametric relationships in manufacturing processes is limited by the subjectivity of human expertise and intuition on the one hand and by the cost and time of experimental data generation on the other hand. This work addresses this issue by establishing a new Large Language Model (LLM) framework. The novelty lies in combining automatic extraction of process-relevant knowledge embedded in the literature with iterative model refinement based on a small amount of experimental data. This approach is evaluated on three distinct manufacturing processes that are based on machining, deformation, and additive principles. The results show that for the same small experimental data budget the models derived by our framework have unexpectedly high extrapolative performance, often surpassing the capabilities of conventional Machine Learning. Further, our approach eliminates manual generation of initial models or expertise-dependent interpretation of the literature. The results also reveal the importance of the nature of the knowledge extracted from the literature and the significance of both the knowledge extraction and model refinement components.
zh

[NLP-170] Leverag ing large language models for structured information extraction from pathology reports

【速读】: 该论文旨在解决从乳腺癌病理报告中结构化信息提取的问题,以提高数据可访问性,支持临床研究。传统方法依赖于专家手动提取,耗时且成本高昂,难以扩展。论文的关键解决方案在于使用大型语言模型(Large Language Models, LLMs)进行零样本提示学习(zero-shot prompting),仅需自然语言指令即可实现自动化提取,无需标注数据或额外训练。通过开发名为“Medical Report Information Extractor”的工具,论文评估了多种LLMs在提取病理特征方面的准确性,并展示了Llama 3.1 405B和GPT-4o等模型的性能与人类注释员相当,从而验证了这一方法的有效性。

链接: https://arxiv.org/abs/2502.12183
作者: Jeya Balaji Balasubramanian,Daniel Adams,Ioannis Roxanis,Amy Berrington de Gonzalez,Penny Coulson,Jonas S. Almeida,Montserrat García-Closas
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 29 pages, 6 figures

点击查看摘要

Abstract:Background: Structured information extraction from unstructured histopathology reports facilitates data accessibility for clinical research. Manual extraction by experts is time-consuming and expensive, limiting scalability. Large language models (LLMs) offer efficient automated extraction through zero-shot prompting, requiring only natural language instructions without labeled data or training. We evaluate LLMs' accuracy in extracting structured information from breast cancer histopathology reports, compared to manual extraction by a trained human annotator. Methods: We developed the Medical Report Information Extractor, a web application leveraging LLMs for automated extraction. We developed a gold standard extraction dataset to evaluate the human annotator alongside five LLMs including GPT-4o, a leading proprietary model, and the Llama 3 model family, which allows self-hosting for data privacy. Our assessment involved 111 histopathology reports from the Breast Cancer Now (BCN) Generations Study, extracting 51 pathology features specified in the study's data dictionary. Results: Evaluation against the gold standard dataset showed that both Llama 3.1 405B (94.7% accuracy) and GPT-4o (96.1%) achieved extraction accuracy comparable to the human annotator (95.4%; p = 0.146 and p = 0.106, respectively). While Llama 3.1 70B (91.6%) performed below human accuracy (p < 0.001), its reduced computational requirements make it a viable option for self-hosting. Conclusion: We developed an open-source tool for structured information extraction that can be customized by non-programmers using natural language. Its modular design enables reuse for various extraction tasks, producing standardized, structured data from unstructured text reports to facilitate analytics through improved accessibility and interoperability.
zh

[NLP-171] Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts

【速读】: 该论文旨在解决传统引导方法依赖于昂贵且有限的监督数据的问题,并提出了一种新的无监督方法来实现语言模型属性的精确控制。论文的关键在于引入了Sparse Shift Autoencoders (SSAEs),通过将嵌入向量之间的差异映射到稀疏表示,从而能够在没有监督的情况下准确地引导单一概念,而不会无意中操控无关属性。这种方法展示了在半合成和真实语言数据集上的有效性,使用Llama-3.1嵌入验证了其准确性。

链接: https://arxiv.org/abs/2502.12179
作者: Shruti Joshi,Andrea Dittadi,Sébastien Lachapelle,Dhanya Sridhar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 27 pages, 9 figures

点击查看摘要

Abstract:Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties, e.g., truthfulness, offering a promising approach for LLM alignment without the need for fine-tuning. Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept, which is costly to obtain and limits the speed of steering research. An appealing alternative is to use unsupervised approaches such as sparse autoencoders (SAEs) to map LLM embeddings to sparse representations that capture human-interpretable concepts. However, without further assumptions, SAEs may not be identifiable: they could learn latent dimensions that entangle multiple concepts, leading to unintentional steering of unrelated properties. We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations. Crucially, we show that SSAEs are identifiable from paired observations that vary in multiple unknown concepts, leading to accurate steering of single concepts without the need for supervision. We empirically demonstrate accurate steering across semi-synthetic and real-world language datasets using Llama-3.1 embeddings.
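SSAE 与普通 SAE 的区别在于编码的是两个嵌入之差而非嵌入本身。以下是一个最小化的 PyTorch 示意(网络结构与 L1 系数均为假设,非论文原实现):

```python
import torch
import torch.nn as nn

class SparseShiftAutoencoder(nn.Module):
    """Autoencode the *difference* between two embeddings; an L1 penalty
    pushes each concept shift onto few latent dimensions."""
    def __init__(self, dim, latent):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, z1, z2):
        shift = z2 - z1                   # embedding difference, not z itself
        code = torch.relu(self.enc(shift))
        return self.dec(code), code

model = SparseShiftAutoencoder(dim=64, latent=256)
z1, z2 = torch.randn(32, 64), torch.randn(32, 64)
recon, code = model(z1, z2)
loss = ((recon - (z2 - z1)) ** 2).mean() + 1e-3 * code.abs().mean()
loss.backward()
# Steering sketch: adding decoder column j (dec.weight[:, j]) to an
# embedding moves it along the learned concept-j shift direction.
print(float(loss))
```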
zh

[NLP-172] GoRA: Gradient-driven Adaptive Low Rank Adaptation

【速读】: 该论文旨在解决低秩适应(Low-Rank Adaptation, LoRA)在微调预训练大语言模型(LLMs)过程中性能受限的问题。论文的关键在于提出了一种新的方法——梯度驱动自适应低秩适应(GoRA),通过同时基于梯度信息自适应分配秩和初始化权重,从而显著提升性能,同时保持LoRA的高可用性和效率。实验结果表明,GoRA在GLUE基准上对T5模型的微调中比LoRA提升了5.88点,并略微超越了全量微调;在GSM8k任务中对Llama3.1-8B-Base模型的微调中,GoRA相比LoRA提升了5.13点,并在高秩设置中超过了全量微调2.05点。

链接: https://arxiv.org/abs/2502.12171
作者: Haonan He,Peng Ye,Yuchen Ren,Yuan Yuan,Lei Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is a crucial method for efficiently fine-tuning pretrained large language models (LLMs), with its performance largely influenced by two key factors: rank and initialization strategy. Numerous LoRA variants have been proposed to enhance its performance by addressing these factors. However, these variants often compromise LoRA’s usability or efficiency. In this paper, we analyze the fundamental limitations of existing methods and introduce a novel approach, GoRA (Gradient-driven Adaptive Low Rank Adaptation), which adaptively assigns ranks and initializes weights for low-rank adapters simultaneously based on gradient information. Extensive experimental results demonstrate that GoRA significantly improves performance while preserving the high usability and efficiency of LoRA. On the T5 model fine-tuned for the GLUE benchmark, GoRA achieves a 5.88-point improvement over LoRA and slightly surpasses full fine-tuning. Similarly, on the Llama3.1-8B-Base model fine-tuned for GSM8k tasks, GoRA outperforms LoRA with a 5.13-point improvement and exceeds full fine-tuning in high-rank settings by a margin of 2.05 points.
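GoRA“基于梯度信息分配秩”的思想可以用一个极简的分配规则说明(按层梯度范数成比例分配秩;该规则仅为示意性假设,论文的具体分配与初始化方式可能不同):

```python
import numpy as np

def allocate_ranks(grad_norms, total_rank_budget, r_min=1):
    """Give each layer a rank roughly proportional to its gradient norm
    (hypothetical allocation rule in the spirit of GoRA)."""
    g = np.asarray(grad_norms, dtype=float)
    raw = g / g.sum() * total_rank_budget
    return np.maximum(np.round(raw).astype(int), r_min)

grad_norms = [0.2, 1.5, 0.9, 3.1]  # per-layer gradient norms from a probe batch
print(allocate_ranks(grad_norms, total_rank_budget=32))  # e.g. [1 8 5 17]
```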
zh

[NLP-173] MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

【速读】: 该论文旨在解决现有Transformer架构中残差连接(Residual Connections)的局限性,并增强跨层信息流。关键解决方案是引入了一种名为多路动态密集连接(MUltiway Dynamic Dense, MUDD)的方法。与已有静态共享权重的密集连接方法不同,MUDD能够根据每个序列位置及Transformer块中解耦输入流(查询、键、值或残差)的隐藏状态动态生成连接权重。这种方法可以无缝集成到任何Transformer架构中,形成MUDDFormer,从而显著提升模型在语言建模等任务上的性能。

链接: https://arxiv.org/abs/2502.12170
作者: Da Xiao,Qingye Meng,Shengping Li,Xingyuan Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at this https URL .
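与静态密集连接不同,MUDD 的连接权重由当前隐藏状态逐位置动态生成。下面以单个输入流(如 query)为例给出一个简化的 PyTorch 示意(结构为假设性简化):

```python
import torch
import torch.nn as nn

class MUDDStyleAggregate(nn.Module):
    """Dynamic dense connection for one input stream (e.g. the query):
    per-position weights over all previous layers' outputs are generated
    from the current hidden state, instead of being static and shared."""
    def __init__(self, dim, n_prev_layers):
        super().__init__()
        self.to_weights = nn.Linear(dim, n_prev_layers)

    def forward(self, h, prev_outputs):          # prev_outputs: (L, seq, dim)
        w = torch.softmax(self.to_weights(h), dim=-1)  # (seq, L), per position
        return torch.einsum("sl,lsd->sd", w, prev_outputs)

agg = MUDDStyleAggregate(dim=32, n_prev_layers=4)
h = torch.randn(10, 32)                          # current hidden states
prev = torch.randn(4, 10, 32)                    # outputs of 4 earlier layers
print(agg(h, prev).shape)                        # torch.Size([10, 32])
```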
zh

[NLP-174] Mining Social Determinants of Health for Heart Failure Patient 30-Day Readmission via Large Language Model

【速读】: 该论文旨在解决心力衰竭(Heart Failure, HF)患者高再入院率所带来的重大医疗挑战。论文的关键在于利用先进的大型语言模型(Large Language Models, LLMs)从临床文本中提取社会决定因素健康(Social Determinants of Health, SDOHs),并通过逻辑回归分析这些因素与HF再入院之间的关联。通过识别与再入院风险相关的关键SDOHs(如烟草使用、交通不便等),论文提供了减少再入院和改善患者护理的可行见解。

链接: https://arxiv.org/abs/2502.12158
作者: Mingchen Shao,Youjeong Kang,Xiao Hu,Hyunjung Gloria Kwak,Carl Yang,Jiaying Lu
机构: Emory University (埃默里大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Heart Failure (HF) affects millions of Americans and leads to high readmission rates, posing significant healthcare challenges. While Social Determinants of Health (SDOH) such as socioeconomic status and housing stability play critical roles in health outcomes, they are often underrepresented in structured EHRs and hidden in unstructured clinical notes. This study leverages advanced large language models (LLMs) to extract SDOHs from clinical text and uses logistic regression to analyze their association with HF readmissions. By identifying key SDOHs (e.g. tobacco usage, limited transportation) linked to readmission risk, this work also offers actionable insights for reducing readmissions and improving patient care.
zh

[NLP-175] Causal Interpretations in Observational Studies: The Role of Sociocultural Backgrounds and Team Dynamics

【速读】: 该论文旨在探讨在科学交流中,从观察性研究中得出因果结论时所使用的因果语言是否适当,并试图识别影响这种语言使用模式的因素。关键在于通过分析超过80,000篇观察性研究摘要,采用计算语言学和回归方法,发现因果语言的使用与作者的经验水平、研究团队规模、作者性别以及国家的文化不确定性规避指数等因素有关。这些发现表明,因果语言的使用可能受到作者的社会文化背景和研究合作动态等外部因素的影响。

链接: https://arxiv.org/abs/2502.12159
作者: Jun Wang,Bei Yu
机构: 未知
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
备注: 13 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The prevalence of drawing causal conclusions from observational studies has raised concerns about potential exaggeration in science communication. While some believe causal language should only apply to randomized controlled trials, others argue that rigorous methods can justify causal claims in observational studies. Ideally, causal language should align with the strength of the evidence. However, through the analysis of over 80,000 observational study abstracts using computational linguistic and regression methods, we found that causal language is more frequently used by less experienced authors, smaller research teams, male last authors, and authors from countries with higher uncertainty avoidance indices. These findings suggest that the use of causal language may be influenced by external factors such as the sociocultural backgrounds of authors and the dynamics of research collaboration. This newly identified link deepens our understanding of how such factors help shape scientific conclusions in causal inference and science communication.
zh

计算机视觉

[CV-0] Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

【速读】:该论文旨在解决大型视觉语言模型(Vision Language Models, VLMs)在跨模态应用中出现的显著幻觉现象,特别是跨模态不一致性问题。解决方案的关键在于引入Re-Align框架,该框架利用图像检索构建双重偏好数据集,有效结合文本和视觉偏好信号。此外,提出rDPO方法,在标准直接偏好优化基础上增加视觉偏好目标,以进一步提升模型性能和鲁棒性。实验结果表明,Re-Align不仅更有效地减少了幻觉现象,还在一般视觉问答任务中实现了显著的性能提升,并保持了广泛的模型规模和架构下的稳健性和可扩展性。

链接: https://arxiv.org/abs/2502.13146
作者: Shuo Xing,Yuping Wang,Peiran Li,Ruizheng Bai,Yueqi Wang,Chengxuan Qian,Huaxiu Yao,Zhengzhong Tu
机构: Texas A&M University(德克萨斯农工大学); University of Michigan(密歇根大学); UIUC(伊利诺伊大学香槟分校); UNC Chapel Hill(北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages

点击查看摘要

Abstract:The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code in this https URL.
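作为背景,rDPO 所扩展的标准 DPO 目标形式如下(示意代码;rDPO 额外的视觉偏好项未展示,输入数值为虚构):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on (chosen, rejected) response pairs;
    rDPO adds a further visual-preference term on top (not shown)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# Toy summed log-probs of responses under the policy and a frozen reference:
lp_c, lp_r = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.5])
print(float(dpo_loss(lp_c, lp_r, ref_c, ref_r)))
```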
zh

[CV-1] Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

【速读】:该论文旨在解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在部署过程中面临的计算复杂度高、关键值缓存需求大以及依赖独立视觉编码器等问题。论文的关键解决方案是提出了一种名为mmMamba的框架,通过渐进蒸馏技术将现有的MLLMs转化为线性复杂度的原生多模态状态空间模型。这种方法无需预训练的循环神经网络(RNN)基础语言模型或视觉编码器,同时支持灵活的混合架构,从而实现可定制的效率与性能之间的权衡。

链接: https://arxiv.org/abs/2502.13145
作者: Bencheng Liao,Hongyuan Tao,Qian Zhang,Tianheng Cheng,Yingyue Li,Haoran Yin,Wenyu Liu,Xinggang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and model are available at this https URL

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLM or vision encoders. We propose a seeding strategy to carve Mamba from trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates 20.6× speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves 13.5× speedup and 60.2% memory savings. Code and models are released at this https URL
zh

[CV-2] RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

【速读】:该论文旨在解决现有端到端自动驾驶(End-to-end Autonomous Driving, AD)算法在模仿学习(Imitation Learning, IL)范式下所面临的因果混淆(causal confusion)和开环间隙(open-loop gap)等问题。为了解决这些问题,论文提出了一种基于3D几何场景(3D Geometric Scene, 3DGS)的闭环强化学习(Reinforcement Learning, RL)训练范式。关键在于利用3DGS技术构建了一个逼真的数字复制品来模拟真实物理世界,从而使得自动驾驶策略能够广泛探索状态空间,并通过大规模的试错学习处理分布外场景。此外,论文设计了专门的奖励机制以增强安全性,引导策略有效应对安全关键事件,并理解现实世界的因果关系。同时,IL作为正则化项被纳入RL训练中,以更好地与人类驾驶行为保持一致。

链接: https://arxiv.org/abs/2502.13144
作者: Hao Gao,Shaoyu Chen,Bo Jiang,Bencheng Liao,Yiang Shi,Xiaoyang Guo,Yuechuan Pu,Haoran Yin,Xiangyu Li,Xinbang Zhang,Ying Zhang,Wenyu Liu,Qian Zhang,Xinggang Wang
机构: Huazhong University of Science & Technology (华中科技大学); Horizon Robotics (地平线 robotics)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Page: this https URL

点击查看摘要

Abstract:Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially 3x lower collision rate. Abundant closed-loop results are presented at this https URL.
zh

[CV-3] SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

【速读】:该论文旨在解决视觉语言模型(VLMs)在理解物体方向方面的不足,特别是对于需要精细操作的任务。论文的关键在于引入语义方向(semantic orientation)的概念,通过自然语言描述物体的方向,而无需依赖特定的参考帧。为了支持这一概念,构建了一个包含30万条标注语义方向的三维模型数据集OrienText300K。通过将语义方向整合到VLM系统中,使机器人能够生成既考虑位置又考虑方向限制的操作动作。实验结果表明,该方法显著提升了机器人的操作能力,如在Open6DOR任务上的准确率达到48.7%,在SIMPLER任务上的准确率达到74.9%。

链接: https://arxiv.org/abs/2502.13143
作者: Zekun Qi,Wenyao Zhang,Yufei Ding,Runpei Dong,Xinqiang Yu,Jingwen Li,Lingyun Xu,Baoyu Li,Xialin He,Guofan Fan,Jiazhao Zhang,Jiawei He,Jiayuan Gu,Xin Jin,Kaisheng Ma,Zhizheng Zhang,He Wang,Li Yi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Spatial intelligence is a critical component of embodied AI, promoting robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations-a key requirement for tasks involving fine-grained manipulations. Addressing this limitation not only requires geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the ‘‘plug-in’’ direction of a USB or the ‘‘handle’’ direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
zh

[CV-4] AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

【速读】:该论文旨在解决仅凭文本输入生成逼真的4D动态虚拟人物的问题。解决方案的关键在于设计了两个并行的扩散变换器(Diffusion Transformers),并通过中间的高速公路连接(highway connections)确保音频和视觉模态之间的信息交流,从而实现同步的语音语调和面部动态(如眉毛动作)。此外,采用流匹配(flow matching)训练方法以获得富有表现力的结果和快速推理能力。

链接: https://arxiv.org/abs/2502.13133
作者: Aggelina Chatziagapi,Louis-Philippe Morency,Hongyu Gong,Michael Zollhoefer,Dimitris Samaras,Alexander Richard
机构: Stony Brook University; Meta AI; Codec Avatars Lab, Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose; all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus, synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In case of dyadic conversations, AV-Flow produces an always-on avatar, that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: this https URL
zh

[CV-5] Magma: A Foundation Model for Multimodal AI Agents

【速读】:该论文旨在开发一种能够处理数字和物理世界中多模态人工智能代理任务的基础模型,称为Magma。解决方案的关键在于通过预训练大量异构数据集(包括图像、视频和机器人数据)来赋予模型行动能力。具体而言,Magma利用Set-of-Mark (SoM) 标注图像中的可操作视觉对象(如图形用户界面中的可点击按钮)以实现动作定位,并使用Trace-of-Mark (ToM) 标注视频中的物体运动(如人手或机械臂的轨迹)以支持动作规划。这些技术使Magma不仅具备视觉-语言理解能力(言语智能),还具备在视觉空间世界中规划和行动的能力(时空智能),从而在UI导航和机器人操控等任务上取得了新的最先进成果。

链接: https://arxiv.org/abs/2502.13130
作者: Jianwei Yang,Reuben Tan,Qianhui Wu,Ruijie Zheng,Baolin Peng,Yongyuan Liang,Yu Gu,Mu Cai,Seonghyeon Ye,Joel Jang,Yuquan Deng,Lars Liden,Jianfeng Gao
机构: Microsoft Research (微软研究院); University of Maryland (马里兰大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); KAIST (韩国科学技术院); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 29 pages, 16 figures, technical report from MSR

点击查看摘要

Abstract:We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at this https URL.
zh

[CV-6] Is Noise Conditioning Necessary for Denoising Generative Models?

【速读】:该论文旨在挑战广泛认为去噪扩散模型成功运行需要噪声调节的前提。研究的关键在于探索在缺乏噪声调节条件下,多种基于去噪的生成模型的表现。研究发现大多数模型在没有噪声调节的情况下仍表现出良好的性能,并且某些情况下甚至表现更佳。论文通过理论分析解释了去除噪声调节所导致的误差,并引入了一种无需噪声调节的模型,在CIFAR-10数据集上达到了2.23的FID得分,显著缩小了与领先噪声调节模型之间的差距。论文希望这些发现能够启发社区重新审视去噪生成模型的基础和公式化方法。

链接: https://arxiv.org/abs/2502.13129
作者: Qiao Sun,Zhicheng Jiang,Hanhong Zhao,Kaiming He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a theoretical analysis of the error caused by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.
zh

[CV-7] WeedsGalore: A Multispectral and Multitemporal UAV-based Dataset for Crop and Weed Segmentation in Agricultural Maize Fields

【速读】:该论文旨在解决农作物田间杂草管理不及时和不精确的问题。当前除草实践未能高效且有针对性地管理杂草,特别是在全球产量高的作物如玉米中,有效杂草管理对于最大化作物产量以满足日益增长的全球需求至关重要。为解决这一问题,论文提出了一种新的多光谱无人机(UAV)数据集,用于农业玉米田中作物和杂草的语义和实例分割。该数据集包含RGB、红边和近红外波段图像,并具有大量植物实例和密集标注的玉米及四种杂草类别,同时具有多时间点特性。关键解决方案在于利用先进的分割模型结合新型传感技术,以及引入额外的光谱信息(红边和近红外),从而显著提高除草和监测系统的准确性和时效性,并增强了模型在不同分布数据上的泛化能力。

链接: https://arxiv.org/abs/2502.13103
作者: Ekin Celikkan,Timo Kunzmann,Yertay Yeskaliyev,Sibylle Itzerott,Nadja Klein,Martin Herold
机构: GFZ German Research Centre for Geosciences(德国地球科学研究中心); Humboldt-Universität zu Berlin(柏林洪堡大学); Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures, 7 tables

点击查看摘要

Abstract:Weeds are one of the major reasons for crop yield loss but current weeding practices fail to manage weeds in an efficient and targeted manner. Effective weed management is especially important for crops with high worldwide production such as maize, to maximize crop yield for meeting increasing global demands. Advances in near-sensing and computer vision enable the development of new tools for weed management. Specifically, state-of-the-art segmentation models, coupled with novel sensing technologies, can facilitate timely and accurate weeding and monitoring systems. However, learning-based approaches require annotated data and show a lack of generalization to aerial imaging for different crops. We present a novel dataset for semantic and instance segmentation of crops and weeds in agricultural maize fields. The multispectral UAV-based dataset contains images with RGB, red-edge, and near-infrared bands, a large number of plant instances, dense annotations for maize and four weed classes, and is multitemporal. We provide extensive baseline results for both tasks, including probabilistic methods to quantify prediction uncertainty, improve model calibration, and demonstrate the approach’s applicability to out-of-distribution data. The results show the effectiveness of the two additional bands compared to RGB only, and better performance in our target domain than models trained on existing datasets. We hope our dataset advances research on methods and operational systems for fine-grained weed identification, enhancing the robustness and applicability of UAV-based weed management. The dataset and code are available at this https URL
zh

[CV-8] Personalized Image Generation with Deep Generative Models: A Decade Survey

【速读】:该论文旨在全面回顾和分析个性化图像生成领域的技术进展,重点关注不同生成模型中的个性化方法。论文的关键在于提出一个统一框架,该框架标准化了不同生成模型中的个性化过程,包括反转空间(inversion spaces)、反转方法(inversion methods)和个性化方案(personalization schemes)这三大核心组件。基于这一统一框架,论文进一步深入探讨了每种生成模型中的个性化技术,并通过比较分析阐明了当前个性化图像生成领域的现状,识别现有方法的共性和特征差异。最后,论文讨论了该领域存在的开放性挑战,并提出了未来研究的潜在方向。

链接: https://arxiv.org/abs/2502.13081
作者: Yuxiang Wei,Yiheng Zheng,Yabo Zhang,Ming Liu,Zhilong Ji,Lei Zhang,Wangmeng Zuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 39 pages; under submission; more information: this https URL

点击查看摘要

Abstract:Recent advancements in generative models have significantly facilitated the development of personalized content creation. Given a small set of images with user-specific concept, personalized image generation allows to create images that incorporate the specified concept and adhere to provided text descriptions. Due to its wide applications in content creation, significant effort has been devoted to this field in recent years. Nonetheless, the technologies used for personalization have evolved alongside the development of generative models, with their distinct and interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multi-model autoregressive models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components, i.e., inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon this unified framework, we further provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, this survey elucidates the current landscape of personalized image generation, identifying commonalities and distinguishing features among existing methods. Finally, we discuss the open challenges in the field and propose potential directions for future research. We keep tracing related works at this https URL.
zh

[CV-9] L4P: Low-Level 4D Vision Perception Unified

【速读】:该论文旨在解决低层次四维感知(4D perception)任务中像素时空关系的统一建模问题。论文提出的关键解决方案是L4P架构,这是一种前馈通用模型,结合了基于ViT的主干网络与针对具体任务的轻量级头部,从而在统一框架下高效处理密集型任务(如深度估计或光流估计)和稀疏型任务(如2D/3D跟踪),并且其性能与专门化方法相当甚至更优。
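
以下为"共享主干 + 各任务轻量头"这一通用结构的简化示意(非 L4P 官方代码;此处用占位卷积代替 ViT 主干,任务集合与输出通道数均为假设):

```python
# 示意草图(非官方实现):单一共享主干为多个低层感知任务提供特征,
# 每个任务只需一个轻量头,因而无需大量训练。
import torch
import torch.nn as nn

class MultiTaskPerception(nn.Module):
    def __init__(self, feat_dim=256, tasks=("depth", "flow", "track")):
        super().__init__()
        # 假设 backbone 为任意输出逐块特征的 ViT;此处用占位卷积代替
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        out_ch = {"depth": 1, "flow": 2, "track": 2}  # 各任务输出通道(假设)
        self.heads = nn.ModuleDict(
            {t: nn.Conv2d(feat_dim, out_ch[t], 1) for t in tasks}
        )

    def forward(self, frame):
        feat = self.backbone(frame)
        return {t: head(feat) for t, head in self.heads.items()}

model = MultiTaskPerception()
outs = model(torch.randn(1, 3, 224, 224))
print({k: v.shape for k, v in outs.items()})
```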

链接: https://arxiv.org/abs/2502.13078
作者: Abhishek Badki,Hang Su,Bowen Wen,Orazio Gallo
机构: NVIDIA Research (NVIDIA研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The spatio-temporal relationship between the pixels of a video carries critical information for low-level 4D perception. A single model that reasons about it should be able to solve several such tasks well. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand. We present L4P (pronounced “LAP”), a feedforward, general-purpose architecture that solves low-level 4D perception tasks in a unified framework. L4P combines a ViT-based backbone with per-task heads that are lightweight and therefore do not require extensive training. Despite its general and feedforward formulation, our method matches or surpasses the performance of existing specialized methods on both dense tasks, such as depth or optical flow estimation, and sparse tasks, such as 2D/3D tracking. Moreover, it solves all those tasks at once in a time comparable to that of individual single-task methods.
zh

[CV-10] RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Birds Eye View for 3D Object Detection ICLR2025

【速读】:该论文旨在解决雷达与相机在多模态3D目标检测中的鲁棒性问题,特别是在不同环境和内在干扰下的表现。论文的关键解决方案在于提出了一种名为RobuRCDet的鲁棒检测模型,并设计了一个三维高斯扩展(3D Gaussian Expansion, 3DGE)模块以减轻雷达点的不准确性,包括位置、雷达散射截面(Radar Cross-Section, RCS)和速度信息。此外,引入了一种天气自适应融合模块,根据相机信号置信度自适应融合雷达和相机特征。

链接: https://arxiv.org/abs/2502.13071
作者: Jingtong Yue,Zhiwei Lin,Xin Lin,Xiaoyu Zhou,Xiangtai Li,Lu Qi,Yongtao Wang,Ming-Hsuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR2025

点击查看摘要

Abstract:While recent low-cost radar-camera approaches have shown promising results in multi-modal 3D object detection, both sensors face challenges from environmental and intrinsic disturbances. Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity. Achieving robust radar-camera 3D object detection requires consistent performance across varying conditions, a topic that has not yet been fully explored. In this work, we first conduct a systematic analysis of robustness in radar-camera detection on five kinds of noises and propose RobuRCDet, a robust object detection model in BEV. Specifically, we design a 3D Gaussian Expansion (3DGE) module to mitigate inaccuracies in radar points, including position, Radar Cross-Section (RCS), and velocity. The 3DGE uses RCS and velocity priors to generate a deformable kernel map and variance for kernel size adjustment and value distribution. Additionally, we introduce a weather-adaptive fusion module, which adaptively fuses radar and camera features based on camera signal confidence. Extensive experiments on the popular benchmark, nuScenes, show that our model achieves competitive results in regular and noisy conditions.
zh

[CV-11] Enhancing Power Grid Inspections with Machine Learning

【速读】:该论文旨在解决传统电力电网巡检方法(如人工观察或直升机巡查)资源密集且难以规模化的问题。解决方案的关键在于利用三维计算机视觉技术自动化电力电网巡检,并采用TS40K数据集进行3D语义分割,以应对类别不平衡和噪声数据等挑战,从而提高关键电网组件如输电线和铁塔的检测精度。实验结果表明,基于Transformer的模型在检测输电线时的交并比(IoU)得分达到了95.53%,验证了将机器学习(ML)整合到电网维护工作流程中的潜力,进而提升效率并实现主动风险管理策略。
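
论文采用的 IoU/mIoU 是语义分割的标准评估指标,下面给出其在点云逐点预测上的通用计算示意(与论文具体实现无关,数据为随机演示):

```python
# 示意:点云语义分割的逐类 IoU 与 mIoU 计算(通用指标定义)。
import numpy as np

def iou_per_class(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

pred = np.random.randint(0, 3, size=10000)   # 模型对每个点的预测类别
gt   = np.random.randint(0, 3, size=10000)   # 真实标签
ious = iou_per_class(pred, gt, num_classes=3)
print("IoU:", ious, "mIoU:", np.nanmean(ious))
```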

链接: https://arxiv.org/abs/2502.13037
作者: Diogo Lavado,Ricardo Santos,Andre Coelho,Joao Santos,Alessandra Micheletti,Claudia Soares
机构: Department of Computer Science of NOVA University of Lisbon (NOVA大学计算机科学系), Portugal; Department of Environmental Sciences of the University of Milan (米兰大学环境科学系), Italy; Labelec (Labelec), Lisbon, Portugal; NEW Center of Research (NEW研究中心), Lisbon, Portugal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ensuring the safety and reliability of power grids is critical as global energy demands continue to rise. Traditional inspection methods, such as manual observations or helicopter surveys, are resource-intensive and lack scalability. This paper explores the use of 3D computer vision to automate power grid inspections, utilizing the TS40K dataset – a high-density, annotated collection of 3D LiDAR point clouds. By concentrating on 3D semantic segmentation, our approach addresses challenges like class imbalance and noisy data to enhance the detection of critical grid components such as power lines and towers. The benchmark results indicate significant performance improvements, with IoU scores reaching 95.53% for the detection of power lines using transformer-based models. Our findings illustrate the potential for integrating ML into grid maintenance workflows, increasing efficiency and enabling proactive risk management strategies.
zh

[CV-12] A deep learning framework for efficient pathology image analysis

【速读】:该论文旨在解决现有数字病理学中人工智能方法计算效率低下的问题。当前方法处理每张全切片图像(Whole Slide Image, WSI)时需要分析成千上万冗余的小图块,并依赖复杂的聚合模型。为了解决这一问题,论文提出了一种名为EAGLE(Efficient Approach for Guided Local Examination)的深度学习框架。EAGLE的关键在于引入了两个基础模型:CHIEF用于高效选择信息性区域的图块,Virchow2用于提取高质量特征。这种方法不仅将处理时间减少到2.27秒,还提升了高达23%的性能,实现了超过99%的计算时间缩减,从而支持实时工作流程,使病理学家能够验证模型在分析过程中使用的所有图块,并减少了对高性能计算的依赖。
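
EAGLE 的核心流程是"先选块、再精编码"。下面是该两阶段思路的假设性示意(CHIEF 与 Virchow2 的真实接口以官方发布为准,此处以占位函数代替):

```python
# 流程示意(接口为假设):先用打分模型挑选少量信息性图块,
# 再只对这些图块提取高质量特征并聚合为切片级表示。
import numpy as np

def select_informative_tiles(tiles, score_fn, k=25):
    """按打分模型(如 CHIEF 类模型)的得分取前 k 个图块。"""
    scores = np.array([score_fn(t) for t in tiles])
    return [tiles[i] for i in np.argsort(scores)[::-1][:k]]

def slide_embedding(tiles, encoder_fn):
    """仅对被选中的图块编码(如 Virchow2 类特征提取),并做均值聚合。"""
    feats = np.stack([encoder_fn(t) for t in tiles])
    return feats.mean(axis=0)

# 占位函数,仅供演示
tiles = [np.random.rand(224, 224, 3) for _ in range(1000)]
score_fn = lambda t: float(t.mean())
encoder_fn = lambda t: np.random.rand(768)

picked = select_informative_tiles(tiles, score_fn)
emb = slide_embedding(picked, encoder_fn)
print(len(picked), emb.shape)  # 25 (768,)
```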

链接: https://arxiv.org/abs/2502.13027
作者: Peter Neidlinger,Tim Lenz,Sebastian Foersch,Chiara M. L. Loeffler,Jan Clusmann,Marco Gustav,Lawrence A. Shaktah,Rupert Langer,Bastian Dislich,Lisa A. Boardman,Amy J. French,Ellen L. Goode,Andrea Gsur,Stefanie Brezina,Marc J. Gunter,Robert Steinfelder,Hans-Michael Behrens,Christoph Röcken,Tabitha Harrison,Ulrike Peters,Amanda I. Phipps,Giuseppe Curigliano,Nicola Fusco,Antonio Marra,Michael Hoffmeister,Hermann Brenner,Jakob Nikolas Kather
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has transformed digital pathology by enabling biomarker prediction from high-resolution whole slide images (WSIs). However, current methods are computationally inefficient, processing thousands of redundant tiles per WSI and requiring complex aggregator models. We introduce EAGLE (Efficient Approach for Guided Local Examination), a deep learning framework that emulates pathologists by selectively analyzing informative regions. EAGLE incorporates two foundation models: CHIEF for efficient tile selection and Virchow2 for extracting high-quality features. Benchmarking was conducted against leading slide- and tile-level foundation models across 31 tasks from four cancer types, spanning morphology, biomarker prediction and prognosis. EAGLE outperformed state-of-the-art foundation models by up to 23% and achieved the highest AUROC overall. It processed a slide in 2.27 seconds, reducing computational time by more than 99% compared to existing models. This efficiency enables real-time workflows, allows pathologists to validate all tiles which are used by the model during analysis, and eliminates dependence on high-performance computing, making AI-powered pathology more accessible. By reliably identifying meaningful regions and minimizing artifacts, EAGLE provides robust and interpretable outputs, supporting rapid slide searches, integration into multi-omics pipelines and emerging clinical foundation models.
zh

[CV-13] Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms

【速读】:该论文旨在解决在密集热带森林中自然生长的棕榈树检测难题,主要挑战包括重叠树冠、不均匀阴影和复杂的地形。论文的关键解决方案是开发了一个名为PRISM(处理、推理、分割和映射)的灵活管道,用于利用大规模正射影像图检测和定位密集热带森林中的棕榈树。PRISM通过引入零样本SAM 2作为分割主干,并采用校准方法将置信分数与交并比对齐,同时探索显著性图以增强特征解释性,从而优化检测精度和地理映射的精确度。尽管PRISM针对棕榈树进行了优化,但其方法具有适应其他自然对象识别任务的潜力。
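
摘要中提到将检测置信度与 IoU 对齐的校准步骤。下面以保序回归(一种常见的校准手段,论文采用的具体校准方法以原文为准)给出示意,数据为模拟生成:

```python
# 示意:用保序回归把检测置信度校准到与真实 IoU 对齐。
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
conf = rng.uniform(0, 1, 500)                               # 检测器给出的置信度
iou = np.clip(conf * 0.8 + rng.normal(0, 0.1, 500), 0, 1)   # 对应框的真实 IoU

calib = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calib.fit(conf, iou)

new_conf = np.array([0.3, 0.6, 0.9])
print(calib.predict(new_conf))  # 校准后的置信度近似期望 IoU
```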

链接: https://arxiv.org/abs/2502.13023
作者: Kangning Cui,Rongkun Zhu,Manqi Wang,Wei Tang,Gregory D. Larsen,Victor P. Pauca,Sarra Alqahtani,Fan Yang,David Segurado,David Lutz,Jean-Michel Morel,Miles R. Silman
机构: Department of Mathematics, City University of Hong Kong(数学系, 香港城市大学); Department of Computer Science, Xidian University, Xi’an, Shaanxi, China(电子科技大学, 西安, 陕西, 中国); Department of Biology, Wake Forest University, Winston-Salem, NC, USA(生物系, 维克森林大学, 温斯顿-塞勒姆, 北卡罗来纳州, 美国); School of Arts & Sciences, Colby-Sawyer College, New London, NH, USA(艺术与科学学院, 科尔比-斯威耶学院, 新伦敦, 新罕布什尔州, 美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Palms are ecological and economic indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and can span several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1m).
zh

[CV-14] Mean of Means: Human Localization with Calibration-free and Unconstrained Camera Settings (extended version)

【速读】:该论文旨在解决高精度人体定位问题,特别是在低成本条件下。现有方法依赖于昂贵且标签相关的硬件,而基于视觉的方法虽成本较低但受限于双目视觉的固有限制以及多阶段奇异值分解(SVD)解算器中的误差传播。为应对这些局限,论文提出了一种概率方法,将人体各部位视为由围绕身体几何中心的分布生成的观测值。这种方法显著改进了采样过程,使每个感兴趣点的样本数量从数百增加到数十亿。通过建模世界坐标系与像素坐标系均值之间的关系,并利用中心极限定理确保正态性,从而简化学习过程。实验结果表明,在使用两台640×480分辨率网络摄像头的情况下,该方法在0.3米范围内的人体定位准确率达到了96%,在0.5米范围内接近100%,且总成本仅为10美元。
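
下面用一个简化的线性投影模型示意"均值的均值"思路(非论文完整方法;投影矩阵与噪声设置均为演示假设):人体各点的像素均值经由线性关系对应世界坐标均值,由中心极限定理保证均值近似正态。

```python
# 简化示意:把人体各点视为围绕几何中心分布的样本,
# 用像素坐标均值反解世界坐标中心。
import numpy as np

rng = np.random.default_rng(1)
true_center = np.array([1.5, 2.0])          # 世界坐标中的人体中心(米,假设)
A = np.array([[120.0, 4.0], [6.0, 115.0]])  # 假设的像素-世界线性关系
b = np.array([320.0, 240.0])

# 人体各点(观测)围绕中心散布;像素观测由线性投影加噪声生成
body_pts = true_center + rng.normal(0, 0.2, size=(500, 2))
pixels = body_pts @ A.T + b + rng.normal(0, 2.0, size=(500, 2))

# 均值的均值:E[像素] = A @ E[世界坐标] + b,均值由中心极限定理近似正态
pix_mean = pixels.mean(axis=0)
est_center = np.linalg.solve(A, pix_mean - b)
print(est_center, "vs", true_center)
```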

链接: https://arxiv.org/abs/2502.13017
作者: Tianyi Zhang,Wengyu Zhang,Xulu Zhang,Jiaxin Wu,Xiao-Yong Wei,Jiannong Cao,Qing Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: arXiv admin note: substantial text overlap with arXiv:2407.20870

点击查看摘要

Abstract:Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup requirements. To address these limitations, we propose a probabilistic approach that considers all points on the human body as observations generated by a distribution centered around the body’s geometric center. This enables us to improve sampling significantly, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 96% within a 0.3 m range and nearly 100% accuracy within a 0.5 m range, achieved at a low cost of only 10 USD using two web cameras with a resolution of 640×480 pixels.
zh

[CV-15] SHADeS: Self-supervised Monocular Depth Estimation Through Non-Lambertian Image Decomposition

【速读】:该论文旨在解决结肠镜视觉场景重建中的光照分解和深度估计问题,特别是处理复杂光照变化和高光反射(specular reflections)带来的挑战。论文的关键在于提出了一种自监督模型——SHADeS,该模型能够从单个图像中同时估计阴影、albedo(体面反射率)、深度和高光。不同于先前采用Lambertian模型的方法(IID),SHADeS采用非Lambertian模型将高光反射视为独立的光照成分。实验结果表明,SHADeS能够在存在高光区域的情况下稳健地进行光照分解和深度图估计,从而改善了结肠镜检查中场景重建的精度。
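
SHADeS 的非朗伯假设可以概括为 I = albedo × shading + specular,即高光作为独立的加性光照分量。下面是基于该成像模型的自监督重建约束示意(网络结构与其余正则项略去,张量形状为演示假设):

```python
# 示意:非朗伯(non-Lambertian)图像分解的重建约束,
# 高光作为与漫反射分离的加性分量。
import torch

def reconstruction_loss(image, albedo, shading, specular):
    # 非朗伯模型:I = albedo * shading + specular
    recon = albedo * shading + specular
    return torch.mean(torch.abs(image - recon))

B, C, H, W = 2, 3, 64, 64
image = torch.rand(B, C, H, W)
albedo = torch.rand(B, C, H, W, requires_grad=True)
shading = torch.rand(B, 1, H, W, requires_grad=True)   # 灰度阴影
specular = torch.rand(B, 1, H, W, requires_grad=True)  # 独立高光分量
loss = reconstruction_loss(image, albedo, shading, specular)
loss.backward()
print(float(loss))
```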

链接: https://arxiv.org/abs/2502.12994
作者: Rema Daher,Francisco Vasconcelos,Danail Stoyanov
机构: University College London(伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Visual 3D scene reconstruction can support colonoscopy navigation. It can help in recognising which portions of the colon have been visualised and characterising the size and shape of polyps. This is still a very challenging problem due to complex illumination variations, including abundant specular reflections. We investigate how to effectively decouple light and depth in this problem. Methods: We introduce a self-supervised model that simultaneously characterises the shape and lighting of the visualised colonoscopy scene. Our model estimates shading, albedo, depth, and specularities (SHADeS) from single images. Unlike previous approaches (IID), we use a non-Lambertian model that treats specular reflections as a separate light component. The implementation of our method is available at this https URL. Results: We demonstrate on real colonoscopy images (Hyper Kvasir) that previous models for light decomposition (IID) and depth estimation (MonoVIT, ModoDepth2) are negatively affected by specularities. In contrast, SHADeS can simultaneously produce light decomposition and depth maps that are robust to specular regions. We also perform a quantitative comparison on phantom data (C3VD) where we further demonstrate the robustness of our model. Conclusion: Modelling specular reflections improves depth estimation in colonoscopy. We propose an effective self-supervised approach that uses this insight to jointly estimate light decomposition and depth. Light decomposition has the potential to help with other problems, such as place recognition within the colon.
zh

[CV-16] PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization

【速读】:该论文旨在解决工程应用中精确三维形状表示的需求,特别是针对设计、优化和仿真任务中需要处理由独立组件构成的复合结构的问题。现有的方法要么整体建模,要么在没有预定义部件结构的情况下分解形状,这限制了它们在实际设计任务中的适用性。论文的关键解决方案是提出PartSDF,这是一种监督隐式表示框架,能够显式地建模具有独立且可控部分的复合形状,同时保持形状一致性。尽管PartSDF采用简单的单解码器架构,但它在重建和生成任务中超越了监督和非监督基线方法,并展示了作为工程应用中结构化形状先验的有效性,从而实现对各个部件的精确控制,同时保持整体连贯性。
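
隐式表示中常用各部件 SDF 的最小值(布尔并)得到复合形状。下面给出这一标准组合方式的示意,仅说明"部件独立、整体一致"的思路,并非 PartSDF 的官方实现:

```python
# 示意:用多个部件 SDF 的逐点最小值组合复合形状(标准布尔并)。
import torch

def sphere_sdf(p, center, radius):
    return torch.linalg.norm(p - center, dim=-1) - radius

def composite_sdf(p, part_sdfs):
    # 并集:到复合形状的符号距离取各部件的最小值
    return torch.stack([f(p) for f in part_sdfs], dim=0).min(dim=0).values

parts = [
    lambda p: sphere_sdf(p, torch.tensor([0.0, 0.0, 0.0]), 0.5),
    lambda p: sphere_sdf(p, torch.tensor([0.6, 0.0, 0.0]), 0.3),
]
query = torch.randn(1024, 3)
sdf = composite_sdf(query, parts)
print(sdf.shape, "内部点数:", int((sdf < 0).sum()))
```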

链接: https://arxiv.org/abs/2502.12985
作者: Nicolas Talabot,Olivier Clerc,Arda Cinar Demirtas,Doruk Oner,Pascal Fua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 14 figures

点击查看摘要

Abstract:Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-aware representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Despite its simple single-decoder architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence. Code available at this https URL.
zh

[CV-17] Instance-Level Moving Object Segmentation from a Single Image with Events

【速读】:该论文旨在解决多目标动态场景中移动物体分割的难题,特别是在区分相机运动与物体运动引起的像素位移方面存在的挑战。为了解决这些问题,论文提出了一种新的实例级移动物体分割框架,该框架整合了互补的纹理和运动线索。关键解决方案在于引入隐式的跨模态掩码注意力增强、显式的对比特征学习以及流引导的运动增强,从而分别利用单幅图像中的密集纹理信息和事件中的丰富运动信息。通过这些方法,论文将掩码分割与运动分类分离,以应对不同数量独立移动物体的情况。

链接: https://arxiv.org/abs/2502.12975
作者: Zhexiong Wan,Bin Fan,Le Hui,Yuchao Dai,Gim Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by IJCV

点击查看摘要

Abstract:Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects, while the difficulties lie in taking into account both spatial texture structures and temporal motion cues. Existing methods based on video frames encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion due to the complexities of accurate image-based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images’ inadequate motion modeling capabilities, but instead lead to challenges in segmenting pixel-level object masks due to the lack of dense texture structures in events. To address these two limitations imposed by unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events, respectively. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Through extensive evaluations on multiple datasets, as well as ablation experiments with different input settings and real-time efficiency analysis of the proposed framework, we believe that our first attempt to incorporate image and event data for practical deployment can provide new insights for future work in event-based motion related works. The source code with model training and pre-trained weights is released at this https URL
zh

[CV-18] Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection ALT AAAI2025

【速读】:该论文旨在解决利用临床报告中的文本信息检测心脏LGE MRI图像中瘢痕组织的难题。由于深度学习模型通常需要大量精细标注的数据,而这些数据难以获取,该研究提出了一种创新方法,通过使用基于领域知识的各种策略,仅利用少量患者(965例)的临床报告文本进行训练。关键解决方案包括采用合成数据增强技术生成瘢痕图像及其相关文本,以提高模型性能;采用与解剖结构相关的图像标准化方法,使空间特征和文本特征更好地对齐;引入描述性损失函数实现细粒度监督,并探索视觉编码器预训练对性能的影响。

链接: https://arxiv.org/abs/2502.12948
作者: Athira J Jacob,Puneet Sharma,Daniel Rueckert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Poster at Workshop on Large Language Models and Generative AI for Health at AAAI 2025

点击查看摘要

Abstract:Detection of hyperenhancement from cardiac LGE MRI images is a complex task requiring significant clinical expertise. Although deep learning-based models have shown promising results for the task, they require large amounts of data with fine-grained annotations. Clinical reports generated for cardiac MR studies contain rich, clinically relevant information, including the location, extent and etiology of any scars present. Although recently developed CLIP-based training enables pretraining models with image-text pairs, it requires large amounts of data and further finetuning strategies on downstream tasks. In this study, we use various strategies rooted in domain knowledge to train a model for LGE detection solely using text from clinical reports, on a relatively small clinical cohort of 965 patients. We improve performance through the use of synthetic data augmentation, by systematically creating scar images and associated text. In addition, we standardize the orientation of the images in an anatomy-informed way to enable better alignment of spatial and text features. We also use a captioning loss to enable fine-grained supervision and explore the effect of pretraining of the vision encoder on performance. Finally, ablation studies are carried out to elucidate the contributions of each design component to the overall performance of the model.
zh

[CV-19] Contrast-Unity for Partially-Supervised Temporal Sentence Grounding ICASSP

【速读】:该论文旨在解决时间句子定位(Temporal Sentence Grounding, TSG)任务中完全监督设置成本高昂而弱监督设置性能较差的问题。为追求高精度且降低标注成本,论文引入了一种中间部分监督设置,即仅在训练过程中使用短片段标签。解决方案的关键在于设计了一个对比统一框架,通过隐式-显式渐进定位(implicit-explicit progressive grounding)的两阶段目标来充分利用部分标签。在隐式阶段,采用四重对比学习(quadruple contrastive learning)对事件-查询表示进行细粒度对齐;在显式阶段,则利用获得的高质量伪标签进一步优化定位目标,实现精炼和去噪。

链接: https://arxiv.org/abs/2502.12917
作者: Haicheng Wang,Chen Ju,Weixiong Lin,Chaofan Ma,Shuai Xiao,Ya Zhang,Yanfeng Wang
机构: SJTU Paris Elite Institute of Technology (上海交通大学巴黎卓越工程师学院, 交大巴黎高科学院); Taobao & Tmall Group (淘宝&天猫集团); School of Artificial Intelligence (上海交通大学人工智能学院); CMIC (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025. The first two authors share the same contribution. arXiv admin note: text overlap with arXiv:2302.09850

点击查看摘要

Abstract:Temporal sentence grounding aims to detect event timestamps described by the natural language query from given untrimmed videos. The existing fully-supervised setting achieves great results but requires expensive annotation costs; while the weakly-supervised setting adopts cheap labels but performs poorly. To pursue high performance with less annotation costs, this paper introduces an intermediate partially-supervised setting, i.e., only short-clip is available during training. To make full use of partial labels, we specially design one contrast-unity framework, with the two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity using comprehensive quadruple contrastive learning: event-query gather, event-background separation, intra-cluster compactness and inter-cluster separability. Then, high-quality representations bring acceptable grounding pseudo-labels. In the explicit stage, to explicitly optimize grounding objectives, we train one fully-supervised model using obtained pseudo-labels for grounding refinement and denoising. Extensive experiments and thoroughly ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision, as well as our superior performance.
zh

[CV-20] CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image

【速读】:该论文旨在解决从单张RGB图像中恢复高质量3D场景的难题,当前方法常受限于特定领域的问题或低质量的对象生成。解决方案的关键在于提出了一种名为CAST(Component-Aligned 3D Scene Reconstruction from a Single RGB Image)的新方法。CAST通过提取对象级别的2D分割和相对深度信息,并利用基于GPT的模型分析对象间的空间关系,确保对象间关系的理解更加连贯。接着,CAST使用遮挡感知的大规模3D生成模型独立生成每个对象的完整几何形状,并通过MAE和点云条件来减轻遮挡和部分对象信息的影响。最后,CAST引入物理感知校正步骤,通过细粒度的关系图生成约束图,优化物体姿态,确保物理一致性和空间连贯性。这种方法有效解决了遮挡、物体穿透和漂浮物体等问题,确保生成的场景能够准确反映真实世界的物理交互。

链接: https://arxiv.org/abs/2502.12894
作者: Kaixin Yao,Longwen Zhang,Xinhao Yan,Yan Zeng,Qixuan Zhang,Lan Xu,Wei Yang,Jiayuan Gu,Jingyi Yu
机构: ShanghaiTech University(上海科技大学); Deemos Technology(德米斯科技); Huazhong University of Science and Technology(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recovering high-quality 3D scenes from a single RGB image is a challenging task in computer graphics. Current methods often struggle with domain-specific limitations or low-quality object generation. To address these, we propose CAST (Component-Aligned 3D Scene Reconstruction from a Single RGB Image), a novel method for 3D scene reconstruction and recovery. CAST starts by extracting object-level 2D segmentation and relative depth information from the input image, followed by using a GPT-based model to analyze inter-object spatial relationships. This enables the understanding of how objects relate to each other within the scene, ensuring more coherent reconstruction. CAST then employs an occlusion-aware large-scale 3D generation model to independently generate each object’s full geometry, using MAE and point cloud conditioning to mitigate the effects of occlusions and partial object information, ensuring accurate alignment with the source image’s geometry and texture. To align each object with the scene, the alignment generation model computes the necessary transformations, allowing the generated meshes to be accurately placed and integrated into the scene’s point cloud. Finally, CAST incorporates a physics-aware correction step that leverages a fine-grained relation graph to generate a constraint graph. This graph guides the optimization of object poses, ensuring physical consistency and spatial coherence. By utilizing Signed Distance Fields (SDF), the model effectively addresses issues such as occlusions, object penetration, and floating objects, ensuring that the generated scene accurately reflects real-world physical interactions. CAST can be leveraged in robotics, enabling efficient real-to-simulation workflows and providing realistic, scalable simulation environments for robotic systems.
zh

[CV-21] Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在机器学习解释性框架中的稳定性问题。现有SAEs在相似数据集上训练时,会导致生成的字典差异显著,从而影响其作为解释工具的可靠性。论文的关键解决方案是引入基于凸包约束的原型稀疏自编码器(Archetypal SAEs, A-SAEs),通过几何锚定显著增强字典的稳定性,并进一步提出松弛版本的RA-SAEs,在保持高重建能力的同时,生成更结构化且具有语义意义的表示。
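
A-SAE 的几何约束要求字典原子落在数据凸包内。一种直接的参数化方式(此处为示意,不代表论文的具体实现细节)是把每个原子写成候选数据点的凸组合:

```python
# 示意:凸包约束字典,原子 = softmax 权重下候选数据点的凸组合。
import torch
import torch.nn as nn

class ConvexDictionary(nn.Module):
    def __init__(self, num_atoms, candidates):
        super().__init__()
        self.register_buffer("candidates", candidates)   # (N, d) 候选数据点
        self.logits = nn.Parameter(torch.randn(num_atoms, candidates.size(0)))

    def forward(self):
        w = torch.softmax(self.logits, dim=-1)  # 每行非负且和为 1,即凸组合系数
        return w @ self.candidates              # (num_atoms, d) 原子必然落在凸包内

X = torch.randn(1000, 64)          # 数据(或其子采样)
dic = ConvexDictionary(num_atoms=32, candidates=X)
atoms = dic()
print(atoms.shape)                 # torch.Size([32, 64])
```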

链接: https://arxiv.org/abs/2502.12892
作者: Thomas Fel,Ekdeep Singh Lubana,Jacob S. Prince,Matthew Kowal,Victor Boutin,Isabel Papadimitriou,Binxu Wang,Martin Wattenberg,Demba Ba,Talia Konkle
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover “true” classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
zh

[CV-22] An Experimental Study of SOTA LiDAR Segmentation Models

【速读】:该论文旨在解决现有点云分割(Point Cloud Segmentation, PCS)模型在实际应用中的综合性能比较难题,特别是针对点、体素和范围图像表示方法的不同模型。关键在于通过考虑激光雷达数据运动补偿及模型参数、测试期间最大GPU内存分配、推理延迟、每秒帧数、交并比(Intersection-over-Union, IoU)以及平均交并比(mean IoU, mIoU)得分等指标,提供这些模型的全面比较。这有助于工程师选择适合其应用的合理PCS模型,并启发研究人员设计更实用的模型以应对真实世界场景。
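
论文比较所用的延迟、FPS 与测试期峰值显存等指标可按如下通用方式测量(示意代码,占位模型与输入尺寸均为假设):

```python
# 示意:测量模型推理延迟、FPS 与测试期峰值显存分配。
import time
import torch

def benchmark(model, sample, runs=50, device="cuda"):
    model = model.to(device).eval()
    sample = sample.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(10):            # 预热,排除首次运行开销
            model(sample)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        torch.cuda.synchronize(device)
    latency = (time.perf_counter() - start) / runs
    mem = torch.cuda.max_memory_allocated(device) / 1024**2
    return {"latency_s": latency, "fps": 1.0 / latency, "max_mem_MB": mem}

if torch.cuda.is_available():
    net = torch.nn.Conv2d(5, 20, 3)    # 占位模型,可替换为任意 PCS 网络
    print(benchmark(net, torch.randn(1, 5, 64, 2048)))
```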

链接: https://arxiv.org/abs/2502.12860
作者: Bike Chen,Antti Tikanmäki,Juha Röning
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: No comments

点击查看摘要

Abstract:Point cloud segmentation (PCS) aims to classify each point in a point cloud. The task enables robots to parse their 3D surroundings and run autonomously. According to different point cloud representations, existing PCS models can be roughly divided into point-, voxel-, and range image-based models. However, no work has been found to report comprehensive comparisons among the state-of-the-art point-, voxel-, and range image-based models from an application perspective, which makes it difficult to utilize these models in real-world scenarios. In this paper, we provide thorough comparisons among the models by considering the LiDAR data motion compensation and the metrics of model parameters, max GPU memory allocated during testing, inference latency, frames per second, intersection-over-union (IoU) and mean IoU (mIoU) scores. The experimental results benefit engineers when choosing a reasonable PCS model for an application and inspire researchers in the PCS field to design more practical models for real-world scenarios.
zh

[CV-23] Leverag ing Intermediate Representations for Better Out-of-Distribution Detection

【速读】:该论文旨在解决在实际应用中机器学习模型可靠检测Out-of-Distribution (OoD)样本的问题,以防止不安全决策的发生。当前的OoD检测方法通常依赖于分析神经网络的logits或倒数第二层的嵌入向量。然而,这些方法未能充分利用中间层编码的丰富信息。论文的关键在于分析中间层的区分能力,并提出通过引入基于能量的对比损失正则化中间层,以及将多个层聚合为单一响应来改进OoD检测性能。
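
下面示意如何对多个中间层计算基于能量的得分并聚合为单一 OoD 响应(每层的线性头与均值聚合方式均为演示假设,论文基于能量的对比损失训练细节见原文):

```python
# 示意:对中间层特征计算能量得分 E = -logsumexp(logits) 并聚合。
import torch
import torch.nn as nn

def energy(logits):
    return -torch.logsumexp(logits, dim=-1)

feats = {                       # 假设为来自网络不同深度的已池化特征
    "layer2": torch.randn(8, 128),
    "layer3": torch.randn(8, 256),
    "layer4": torch.randn(8, 512),
}
heads = nn.ModuleDict({k: nn.Linear(v.size(-1), 10) for k, v in feats.items()})

per_layer = torch.stack([energy(heads[k](v)) for k, v in feats.items()])
ood_score = per_layer.mean(dim=0)   # 聚合为单一响应:能量越高越可能是 OoD
print(ood_score.shape)              # torch.Size([8])
```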

链接: https://arxiv.org/abs/2502.12849
作者: Gianluca Guglielmo,Marc Masana
机构: Institute of Visual Computing, TU Graz (TU 格拉茨大学视觉计算研究所); SAL Dependable Embedded Systems, Silicon Austria Labs (硅奥地利实验室可靠的嵌入式系统)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:In real-world applications, machine learning models must reliably detect Out-of-Distribution (OoD) samples to prevent unsafe decisions. Current OoD detection methods often rely on analyzing the logits or the embeddings of the penultimate layer of a neural network. However, little work has been conducted on the exploitation of the rich information encoded in intermediate layers. To address this, we analyze the discriminative power of intermediate layers and show that they can positively be used for OoD detection. Therefore, we propose to regularize intermediate layers with an energy-based contrastive loss, and by grouping multiple layers in a single aggregated response. We demonstrate that intermediate layer activations improves OoD detection performance by running a comprehensive evaluation across multiple datasets.
zh

[CV-24] Carotid Artery Plaque Analysis in 3D Based on Distance Encoding in Mesh Representations

【速读】:该论文旨在通过从三维血管壁分割中提取和可视化定量斑块参数,实现颈动脉斑块的全面且稳健的评估。论文的关键解决方案在于提出了一种新的方法,利用内外壁网格的距离编码来精确分析斑块结构,并应用针对每个病例特定的阈值,该阈值基于正常血管壁厚度,从而从包含多达50%狭窄的受试者的202个T1加权黑血MRI扫描数据集中提取斑块。这种方法支持详细的斑块形态随时间变化的分析,包括体积量化,并通过展开网格改善可视化效果。
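
其核心步骤可以概括为:在内/外壁网格的对应顶点间编码距离(即壁厚),再用由正常壁厚导出的个体化阈值筛出斑块顶点。下面是这一思路的简化示意(顶点对应、数值与阈值系数均为演示假设):

```python
# 示意:基于内/外壁顶点距离编码与个体化阈值的斑块提取。
import numpy as np

rng = np.random.default_rng(0)
inner = rng.normal(size=(5000, 3))            # 内壁顶点(假设已与外壁一一对应)
wall = rng.uniform(0.8, 1.0, 5000)            # 正常壁厚(毫米,示意)
wall[:300] += rng.uniform(0.8, 1.5, 300)      # 模拟一段增厚的斑块区域
outer = inner + wall[:, None] * np.array([1.0, 0.0, 0.0])  # 沿假设法向外移

thickness = np.linalg.norm(outer - inner, axis=1)   # 逐顶点壁厚(距离编码)
normal_thickness = np.median(thickness)             # 正常壁厚的稳健估计
case_threshold = 1.5 * normal_thickness             # 个体化阈值(系数为假设)

plaque_mask = thickness > case_threshold
print("斑块顶点数:", int(plaque_mask.sum()))
```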

链接: https://arxiv.org/abs/2502.12819
作者: Hinrich Rahlfs,Markus Hüllebrand,Sebastian Schmitter,Christoph Strecker,Andreas Harloff,Anja Hennemuth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 Figures, Submitted to the International Journal of Computer Assisted Radiology and Surgery

点击查看摘要

Abstract:Purpose: Enabling a comprehensive and robust assessment of carotid artery plaques in 3D through extraction and visualization of quantitative plaque parameters. These parameters have potential applications in stroke risk analysis, evaluation of therapy effectiveness, and plaque progression prediction. Methods: We propose a novel method for extracting a plaque mesh from 3D vessel wall segmentation using distance encoding on the inner and outer wall mesh for precise plaque structure analysis. A case-specific threshold, derived from the normal vessel wall thickness, was applied to extract plaques from a dataset of 202 T1-weighted black-blood MRI scans of subjects with up to 50% stenosis. Applied to baseline and one-year follow-up data, the method supports detailed plaque morphology analysis over time, including plaque volume quantification, aided by improved visualization via mesh unfolding. Results: We successfully extracted plaque meshes from 341 carotid arteries, capturing a wide range of plaque shapes with volumes ranging from 2.69 μl to 847.7 μl. The use of a case-specific threshold effectively eliminated false positives in young, healthy subjects. Conclusion: The proposed method enables precise extraction of plaque meshes from 3D vessel wall segmentation masks enabling a correspondence between baseline and one-year follow-up examinations. Unfolding the plaque meshes enhances visualization, while the mesh-based analysis allows quantification of plaque parameters independent of voxel resolution.
zh

[CV-25] Learning Wall Segmentation in 3D Vessel Trees using Sparse Annotations

【速读】:该论文旨在解决利用稀疏标注进行颈动脉壁三维分割的问题。解决方案的关键在于使用中心线标注采样垂直于中心线的横截面,并采用对抗二维网络进行分割。通过在分叉区域使用垂直于分叉轴的横截面生成伪标签,进一步提升了分割性能。此方法能够高效训练三维分割模型,有望改善颈动脉狭窄评估,并提取如斑块体积等三维生物标志物。

链接: https://arxiv.org/abs/2502.12801
作者: Hinrich Rahlfs,Markus Hüllebrand,Sebastian Schmitter,Christoph Strecker,Andreas Harloff,Anja Hennemuth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at MICAD 2024 Conference

点击查看摘要

Abstract:We propose a novel approach that uses sparse annotations from clinical studies to train a 3D segmentation of the carotid artery wall. We use a centerline annotation to sample perpendicular cross-sections of the carotid artery and use an adversarial 2D network to segment them. These annotations are then transformed into 3D pseudo-labels for training of a 3D convolutional neural network, circumventing the creation of manual 3D masks. For pseudo-label creation in the bifurcation area we propose the use of cross-sections perpendicular to the bifurcation axis and show that this enhances segmentation performance. Different sampling distances had a lesser impact. The proposed method allows for efficient training of 3D segmentation, offering potential improvements in the assessment of carotid artery stenosis and allowing the extraction of 3D biomarkers such as plaque volume.
zh

[CV-26] RAPID: Retrieval Augmented Training of Differentially Private Diffusion Models ICLR2025

【速读】:该论文旨在解决现有差分隐私扩散模型(Differentially Private Diffusion Models, DPDMs)在训练过程中存在的显著效用损失、大内存占用及昂贵推理成本等问题。解决方案的关键在于提出了一种名为RAPID的新方法,即将检索增强生成(Retrieval Augmented Generation, RAG)集成到DPDM的训练中。具体而言,RAPID利用可用的公共数据构建样本轨迹的知识库,在训练私有数据上的扩散模型时,计算早期采样步骤作为查询,从知识库中检索相似轨迹作为替代,并以差分隐私的方式专注于后期采样步骤的训练。

链接: https://arxiv.org/abs/2502.12794
作者: Tanqiu Jiang,Changjiang Li,Fenglong Ma,Ting Wang
机构: Stony Brook University; Pennsylvania State University
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in ICLR 2025

点击查看摘要

Abstract:Differentially private diffusion models (DPDMs) harness the remarkable generative capabilities of diffusion models while enforcing differential privacy (DP) for sensitive data. However, existing DPDM training approaches often suffer from significant utility loss, large memory footprint, and expensive inference cost, impeding their practical uses. To overcome such limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a novel approach that integrates retrieval augmented generation (RAG) into DPDM training. Specifically, RAPID leverages available public data to build a knowledge base of sample trajectories; when training the diffusion model on private data, RAPID computes the early sampling steps as queries, retrieves similar trajectories from the knowledge base as surrogates, and focuses on training the later sampling steps in a differentially private manner. Extensive evaluation using benchmark datasets and models demonstrates that, with the same privacy guarantee, RAPID significantly outperforms state-of-the-art approaches by large margins in generative quality, memory footprint, and inference cost, suggesting that retrieval-augmented DP training represents a promising direction for developing future privacy-preserving generative models. The code is available at: this https URL
zh

[CV-27] Beyond Timesteps: A Novel Activation-wise Membrane Potential Propagation Mechanism for Spiking Neural Networks in 3D cloud

【速读】:该论文旨在解决事件驱动视觉数据与点云分析领域中方法应用范围有限以及基于脉冲神经网络(Spiking Neural Network, SNN)的时间步长限制导致实际应用延迟和计算成本增加的问题。关键解决方案在于提出了一种新的激活策略——激活感知膜电位传播(Activation-wise Membrane Potential Propagation, AMP2),该策略将时间步长的概念从激活函数中的手动参数扩展到任何现有的网络结构中,从而有效稳定SNN训练,保持竞争力性能,并减少延迟。

链接: https://arxiv.org/abs/2502.12791
作者: Jian Song,Boxuan Zheng,Xiangfei Yang,Donglin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Due to the similar characteristics between event-based visual data and point clouds, recent studies have emerged that treat event data as event clouds to learn based on point cloud analysis. Additionally, some works approach point clouds from the perspective of event vision, employing Spiking Neural Network (SNN) due to their asynchronous nature. However, these contributions are often domain-specific, making it difficult to extend their applicability to other intersecting fields. Moreover, while SNN-based visual tasks have seen significant growth, the conventional timestep-wise iterative activation strategy largely limits their real-world applications by large timesteps, resulting in significant delays and increased computational costs. Although some innovative methods achieve good performance with short timesteps (<10), few have fundamentally restructured the update strategy of spiking neurons to completely overcome the limitations of timesteps. In response to these concerns, we propose a novel and general activation strategy for spiking neurons called Activation-wise Membrane Potential Propagation (AMP2). This approach extends the concept of timesteps from a manually crafted parameter within the activation function to any existing network structure. In experiments on common point cloud tasks (classification, object, and scene segmentation) and event cloud tasks (action recognition), we found that AMP2 stabilizes SNN training, maintains competitive performance, and reduces latency compared to the traditional timestep-wise activation paradigm.
zh

[CV-28] High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion

【速读】:该论文旨在解决从单张或稀疏图像中合成高保真新视图的问题。现有基于散射的方法常因散射误差导致几何形变,而基于扩散的方法虽利用丰富的三维先验知识改善了几何结构,但常常遭受纹理幻觉的困扰。论文提出的关键解决方案是SplatDiff模型,这是一种像素散射引导的视频扩散模型。通过引入对齐合成策略以精确控制目标视角和几何一致的视图合成,并设计了一个纹理桥接模块来实现自适应特征融合的高保真纹理生成,从而克服纹理幻觉问题。这种方法融合了散射和扩散的优点,能够生成几何一致且细节逼真的新视图。

链接: https://arxiv.org/abs/2502.12752
作者: Xiang Zhang,Yang Zhang,Lukas Mehl,Markus Gross,Christopher Schroers
机构: ETH Zürich(苏黎世联邦理工学院); Disney Research (迪士尼研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advances in Novel View Synthesis (NVS), generating high-fidelity views from single or sparse observations remains a significant challenge. Existing splatting-based approaches often produce distorted geometry due to splatting errors. While diffusion-based methods leverage rich 3D priors to achieve improved geometry, they often suffer from texture hallucination. In this paper, we introduce SplatDiff, a pixel-splatting-guided video diffusion model designed to synthesize high-fidelity novel views from a single image. Specifically, we propose an aligned synthesis strategy for precise control of target viewpoints and geometry-consistent view synthesis. To mitigate texture hallucination, we design a texture bridge module that enables high-fidelity texture generation through adaptive feature fusion. In this manner, SplatDiff leverages the strengths of splatting and diffusion to generate novel views with consistent geometry and high-fidelity details. Extensive experiments verify the state-of-the-art performance of SplatDiff in single-view NVS. Additionally, without extra training, SplatDiff shows remarkable zero-shot performance across diverse tasks, including sparse-view NVS and stereo video conversion.
zh

[CV-29] 3D Shape-to-Image Brownian Bridge Diffusion for Brain MRI Synthesis from Cortical Surfaces

【速读】:该论文旨在解决现有医学图像生成方法难以生成解剖学上合理的三维结构的问题。具体而言,在合成脑磁共振图像(MRI)中,特征性裂隙常常缺失,重建的大脑皮层表面显得分散而非密集卷曲。为了解决这一问题,论文提出了一种基于扩散模型的方法Cor2Vox,其关键是利用布朗桥过程实现形状轮廓与医学图像之间的直接结构化映射,并将其扩展以适应多种互补的形状表示。这种方法显著提高了重建结构的几何准确性,并在图像质量和多样性方面表现出色。
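
布朗桥的关键性质是:中间态在两个端点之间插值,且方差在 t=0 与 t=T 处收缩为零,从而实现端点之间的直接结构化映射。下面按标准布朗桥形式给出采样示意(端点张量与形状仅为演示):

```python
# 示意:布朗桥采样,x_t 的均值在 x_0 与 x_T 间线性插值,
# 方差 t(T-t)/T 在两端收缩为 0。
import torch

def brownian_bridge_sample(x0, xT, t, T=1.0):
    m = t / T
    mean = (1.0 - m) * x0 + m * xT
    var = t * (T - t) / T                 # t=0 与 t=T 时方差为 0
    return mean + var**0.5 * torch.randn_like(x0)

x0 = torch.zeros(1, 1, 32, 32)            # 目标域样本(如 MRI 切片,演示)
xT = torch.ones(1, 1, 32, 32)             # 源域样本(如形状先验编码,演示)
for t in (0.0, 0.5, 1.0):
    xt = brownian_bridge_sample(x0, xT, torch.tensor(t))
    print(f"t={t}: mean={xt.mean().item():.2f}")
```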

链接: https://arxiv.org/abs/2502.12742
作者: Fabian Bongratz,Yitong Li,Sama Elbaroudy,Christian Wachinger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Information Processing in Medical Imaging (IPMI) 2025

点击查看摘要

Abstract:Despite recent advances in medical image generation, existing methods struggle to produce anatomically plausible 3D structures. In synthetic brain magnetic resonance images (MRIs), characteristic fissures are often missing, and reconstructed cortical surfaces appear scattered rather than densely convoluted. To address this issue, we introduce Cor2Vox, the first diffusion model-based method that translates continuous cortical shape priors to synthetic brain MRIs. To achieve this, we leverage a Brownian bridge process which allows for direct structured mapping between shape contours and medical images. Specifically, we adapt the concept of the Brownian bridge diffusion model to 3D and extend it to embrace various complementary shape representations. Our experiments demonstrate significant improvements in the geometric accuracy of reconstructed structures compared to previous voxel-based approaches. Moreover, Cor2Vox excels in image quality and diversity, yielding high variation in non-target structures like the skull. Finally, we highlight the capability of our approach to simulate cortical atrophy at the sub-voxel level. Our code is available at this https URL.
zh

[CV-30] myEye2Wheeler: A Two-Wheeler Indian Driver Real-World Eye-Tracking Dataset

【速读】:该论文旨在解决两轮车驾驶员在复杂印度交通环境中视觉注意力模式和决策行为的研究不足问题。论文的关键在于引入myEye2Wheeler数据集,该数据集提供了两轮车驾驶员在实际交通环境中的真实视线行为。研究发现现有显著性模型(如TASED-Net)在处理该数据集时表现不如在欧洲四轮车眼动追踪数据集(DR(Eye)VE)时有效,强调了开发针对特定交通条件的显著性模型的重要性。

链接: https://arxiv.org/abs/2502.12723
作者: Bhaiya Vaibhaw Kumar,Deepti Rawat,Tanvi Kandalla,Aarnav Nagariya,Kavita Vemuri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents the myEye2Wheeler dataset, a unique resource of real-world gaze behaviour of two-wheeler drivers navigating complex Indian traffic. Most datasets are from four-wheeler drivers on well-planned roads and homogeneous traffic. Our dataset offers a critical lens into the unique visual attention patterns and insights into the decision-making of Indian two-wheeler drivers. The analysis demonstrates that existing saliency models, like TASED-Net, perform less effectively on the myEye-2Wheeler dataset compared to when applied on the European 4-wheeler eye tracking datasets (DR(Eye)VE), highlighting the need for models specifically tailored to the traffic conditions. By introducing the dataset, we not only fill a significant gap in two-wheeler driver behaviour research in India but also emphasise the critical need for developing context-specific saliency models. The larger aim is to improve road safety for two-wheeler users and lane-planning to support a cost-effective mode of transport.
zh

[CV-31] Uncertainty Propagation for Echocardiography Clinical Metric Estimation via Contour Sampling

【速读】:该论文旨在解决自动化技术计算心脏超声临床参数(如左心室容积和射血分数)时的不确定性估计问题。由于这些参数通常源自分割图,难以将逐像素的不确定性值转化为下游临床指标计算中的不确定性估计。为此,论文提出了一种基于轮廓预测而非分割的新型不确定性估计方法。该方法的关键在于显式地预测轮廓位置的不确定性,并从中抽取轮廓样本,最终利用这些样本将不确定性传播到临床指标中。这种方法不仅能够准确估计轮廓任务的不确定性,还能提高两个心脏超声数据集下游临床指标的不确定性评估准确性。
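
该方法的传播思路是:从带逐点不确定性的轮廓分布中采样多条轮廓,对每条轮廓计算下游指标,再以样本统计量刻画指标的不确定性。下面以鞋带公式计算轮廓面积作为临床指标的代理给出示意(轮廓形状与方差均为演示假设):

```python
# 示意:采样轮廓并把逐点位置不确定性传播到下游指标(此处为面积)。
import numpy as np

def polygon_area(pts):
    # 鞋带公式
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

rng = np.random.default_rng(0)
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
mean_contour = np.stack([30 * np.cos(theta), 45 * np.sin(theta)], axis=1)
sigma = 1.5          # 模型预测的逐点轮廓位置标准差(示意取各点同值)

areas = []
for _ in range(1000):                      # 采样轮廓并逐条计算指标
    sample = mean_contour + rng.normal(0, sigma, mean_contour.shape)
    areas.append(polygon_area(sample))
print(f"面积 = {np.mean(areas):.1f} ± {np.std(areas):.1f}")
```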

链接: https://arxiv.org/abs/2502.12713
作者: Thierry Judge,Olivier Bernard,Woo-Jin Cho Kim,Alberto Gomez,Arian Beqiri,Agisilaos Chartsias,Pierre-Marc Jodoin
机构: Department of Computer Science, University of Sherbrooke (谢布罗克大学计算机科学系), Canada; INSA, Universite Claude Bernard Lyon 1 (里昂第一大学), CNRS UMR 5220, Inserm U1206, CREATIS, France; Ultromics Ltd. (Ultromics有限公司), Oxford, OX4 2SU, UK; King’s College London (伦敦国王学院), London, WC2R 2LS, UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, submitted to IEEE TMI

点击查看摘要

Abstract:Echocardiography plays a fundamental role in the extraction of important clinical parameters (e.g. left ventricular volume and ejection fraction) required to determine the presence and severity of heart-related conditions. When deploying automated techniques for computing these parameters, uncertainty estimation is crucial for assessing their utility. Since clinical parameters are usually derived from segmentation maps, there is no clear path for converting pixel-wise uncertainty values into uncertainty estimates in the downstream clinical metric calculation. In this work, we propose a novel uncertainty estimation method based on contouring rather than segmentation. Our method explicitly predicts contour location uncertainty from which contour samples can be drawn. Finally, the sampled contours can be used to propagate uncertainty to clinical metrics. Our proposed method not only provides accurate uncertainty estimations for the task of contouring but also for the downstream clinical metrics on two cardiac ultrasound datasets. Code is available at: this https URL.
zh

[CV-32] Spherical Dense Text-to-Image Synthesis

【速读】:该论文旨在解决文本到全景图像(Text-to-Image, T2I)合成中的布局控制难题以及生成全方位全景图像的挑战。当前的密集型文本到图像(Dense T2I, DT2I)和球面文本到图像(Spherical T2I, ST2I)模型虽有所改进,但尚未提出统一的方法来处理这些问题。论文的关键解决方案在于通过将无训练的DT2I方法与微调后的全景模型相结合,实现球面密集文本到图像(Spherical Dense T2I, SDT2I)。具体而言,论文提出了MultiStitchDiffusion (MSTD) 和 MultiPanFusion (MPF),分别通过将MultiDiffusion整合进StitchDiffusion和PanFusion实现。此外,为了评估这些模型,论文构建了一个包含球面布局的新合成数据集Dense-Synthetic-View (DSynView)。实验结果显示,MSTD在图像质量和提示及布局遵从性方面优于MPF,而MPF虽然能生成更多样化的图像但在合成完美前景对象方面存在困难。

链接: https://arxiv.org/abs/2502.12691
作者: Timon Winter,Stanislav Frolov,Brian Bernhard Moser,Andreas Dengel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in text-to-image (T2I) have improved synthesis results, but challenges remain in layout control and generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues, but so far no unified approach exists. Trivial approaches, like prompting a DT2I model to generate panoramas, cannot generate proper spherical distortions and seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts to evaluate our models. Our results show that MSTD outperforms MPF across image quality as well as prompt- and layout adherence. MultiPanFusion generates more diverse images but struggles to synthesize flawless foreground objects. We propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground as an improvement of MPF.
zh

[CV-33] Fast Data Aware Neural Architecture Search via Supernet Accelerated Evaluation

【速读】:该论文旨在解决Tiny机器学习(TinyML)在资源受限的嵌入式系统中优化模型架构与输入数据配置的联合调优问题。当前,硬件感知神经网络架构搜索虽然能够显著提升TinyML模型性能,但仅关注架构优化可能不足以应对极端资源限制下的挑战。论文的关键解决方案是提出一种新的数据感知神经网络架构搜索(Data Aware Neural Architecture Search)技术,通过同时优化模型架构和输入数据配置,以实现更高效的TinyML系统。实验结果表明,该方法在不同时间和硬件约束条件下均能发现优于单纯基于架构优化的方法的TinyML系统,强调了数据感知优化在推进TinyML领域中的重要性。

链接: https://arxiv.org/abs/2502.12690
作者: Emil Njor,Colby Banbury,Xenofon Fafoutis
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Tiny machine learning (TinyML) promises to revolutionize fields such as healthcare, environmental monitoring, and industrial maintenance by running machine learning models on low-power embedded systems. However, the complex optimizations required for successful TinyML deployment continue to impede its widespread adoption. A promising route to simplifying TinyML is through automatic machine learning (AutoML), which can distill elaborate optimization workflows into accessible key decisions. Notably, Hardware Aware Neural Architecture Searches - where a computer searches for an optimal TinyML model based on predictive performance and hardware metrics - have gained significant traction, producing some of today’s most widely used TinyML models. Nevertheless, limiting optimization solely to neural network architectures can prove insufficient. Because TinyML systems must operate under extremely tight resource constraints, the choice of input data configuration, such as resolution or sampling rate, also profoundly impacts overall system efficiency. Achieving truly optimal TinyML systems thus requires jointly tuning both input data and model architecture. Despite its importance, this “Data Aware Neural Architecture Search” remains underexplored. To address this gap, we propose a new state-of-the-art Data Aware Neural Architecture Search technique and demonstrate its effectiveness on the novel TinyML “Wake Vision” dataset. Our experiments show that across varying time and hardware constraints, Data Aware Neural Architecture Search consistently discovers superior TinyML systems compared to purely architecture-focused methods, underscoring the critical role of data-aware optimization in advancing TinyML.
zh

[CV-34] Spiking Vision Transformer with Saccadic Attention ICLR2025

【速读】:该论文旨在解决基于脉冲神经网络(Spiking Neural Networks, SNNs)的视觉变换器(Vision Transformers, ViTs)在性能上与传统人工神经网络(ANN)相比存在的显著差距。论文的关键在于引入了一种创新的扫视脉冲自注意力机制(Saccadic Spike Self-Attention, SSSA),该机制通过改进的空间相关性评估方法和动态聚焦于选定视觉区域的扫视交互模块,有效提升了空间相关性和时间交互能力,从而显著增强了整体场景的理解。基于SSSA机制,论文开发了一种具有线性计算复杂度的SNN基视觉变换器(SNN-ViT),并在多种视觉任务中展示了其最先进的性能。

链接: https://arxiv.org/abs/2502.12677
作者: Shuai Wang,Malu Zhang,Dehao Zhang,Ammar Belatreche,Yichen Xiao,Yu Liang,Yimeng Shan,Qian Sun,Enqi Zhang,Yang Yang
机构: University of Electronic Science and Technology of China (电子科技大学); Northumbria University (诺桑比亚大学); Liaoning Technical University (辽宁工程技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2025

点击查看摘要

Abstract:The combination of Spiking Neural Networks (SNNs) and Vision Transformers (ViTs) holds potential for achieving both energy efficiency and high performance, particularly suitable for edge vision applications. However, a significant performance gap still exists between SNN-based ViTs and their ANN counterparts. Here, we first analyze why SNN-based ViTs suffer from limited performance and identify a mismatch between the vanilla self-attention mechanism and spatio-temporal spike trains. This mismatch results in degraded spatial relevance and limited temporal interactions. To address these issues, we draw inspiration from biological saccadic attention mechanisms and introduce an innovative Saccadic Spike Self-Attention (SSSA) method. Specifically, in the spatial domain, SSSA employs a novel spike distribution-based method to effectively assess the relevance between Query and Key pairs in SNN-based ViTs. Temporally, SSSA employs a saccadic interaction module that dynamically focuses on selected visual areas at each timestep and significantly enhances whole scene understanding through temporal interactions. Building on the SSSA mechanism, we develop a SNN-based Vision Transformer (SNN-ViT). Extensive experiments across various visual tasks demonstrate that SNN-ViT achieves state-of-the-art performance with linear computational complexity. The effectiveness and efficiency of the SNN-ViT highlight its potential for power-critical edge vision applications.
zh

[CV-35] ROI-NeRFs: Hi-Fi Visualization of Objects of Interest within a Scene by NeRFs Composition

【速读】:该论文旨在解决在大规模场景中高效且高精度地可视化特定对象的问题。论文提出的关键解决方案是ROI-NeRFs框架,该框架通过将场景分为场景级NeRF(Scene NeRF)和多个关注区域(Region of Interest, ROI)NeRF,实现了整体场景的适度细节表示与特定对象的详细表示。此外,文中还引入了自动化的相机选择模块和光线级组合渲染技术,以优化计算效率并提升视觉保真度。这些方法共同作用,提高了感兴趣对象区域的细节水平,减少了伪影,并保持了较低的推理时间。
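
光线级组合渲染可以理解为:把 Scene NeRF 与 ROI NeRF 在同一条光线上的采样点按深度统一排序,再执行标准体渲染的 alpha 合成。下面是该组合步骤的简化示意(采样位置与密度均为随机演示数据,非官方实现):

```python
# 示意:合并两个 NeRF 在同一条光线上的采样点并做体渲染合成。
import torch

def composite_ray(depths_a, sigma_a, rgb_a, depths_b, sigma_b, rgb_b):
    d = torch.cat([depths_a, depths_b])
    sigma = torch.cat([sigma_a, sigma_b])
    rgb = torch.cat([rgb_a, rgb_b])
    order = torch.argsort(d)                       # 两个场的样本按深度统一排序
    d, sigma, rgb = d[order], sigma[order], rgb[order]

    delta = torch.diff(d, append=d[-1:] + 1e10)    # 相邻样本间距
    alpha = 1.0 - torch.exp(-sigma * delta)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0
    )[:-1]                                         # 累计透射率
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)     # 该光线的最终颜色

color = composite_ray(
    torch.linspace(1, 5, 32), torch.rand(32), torch.rand(32, 3),      # Scene NeRF 样本
    torch.linspace(2, 3, 64), torch.rand(64) * 5, torch.rand(64, 3),  # ROI NeRF 样本
)
print(color)
```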

链接: https://arxiv.org/abs/2502.12673
作者: Quoc-Anh Bui,Gilles Rougeron,Géraldine Morin,Simone Gasparini
机构: Université Paris-Saclay, CEA, List (巴黎萨克雷大学,CEA,LIST); Université de Toulouse, Toulouse INP – IRIT (图卢兹大学,图卢兹国立理工学院 – IRIT)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 17 pages including appendix, 16 figures, 8 tables

点击查看摘要

Abstract:Efficient and accurate 3D reconstruction is essential for applications in cultural heritage. This study addresses the challenge of visualizing objects within large-scale scenes at a high level of detail (LOD) using Neural Radiance Fields (NeRFs). The aim is to improve the visual fidelity of chosen objects while maintaining the efficiency of the computations by focusing on details only for relevant content. The proposed ROI-NeRFs framework divides the scene into a Scene NeRF, which represents the overall scene at moderate detail, and multiple ROI NeRFs that focus on user-defined objects of interest. An object-focused camera selection module automatically groups relevant cameras for each NeRF training during the decomposition phase. In the composition phase, a Ray-level Compositional Rendering technique combines information from the Scene NeRF and ROI NeRFs, allowing simultaneous multi-object rendering composition. Quantitative and qualitative experiments conducted on two real-world datasets, including one on a complex eighteenth-century cultural heritage room, demonstrate superior performance compared to baseline methods, improving LOD for object regions and minimizing artifacts without significantly increasing inference time.

[CV-36] RecDreamer: Consistent Text-to-3D Generation via Uniform Score Distillation

【Quick Read】: This paper targets the geometric inconsistency, known as the Multi-Face Janus problem, that afflicts score-distillation-based text-to-3D generation. The problem arises because existing methods struggle to stay consistent across poses and are biased toward a canonical pose. The key idea of RecDreamer is to reshape the underlying data distribution toward a more consistent pose representation: an auxiliary function rectifies the prior distribution so that pose variation is uniformly distributed rather than skewed toward a canonical form. The rectified data distribution is plugged into existing score distillation algorithms, a process termed uniform score distillation. RecDreamer additionally introduces a training-free classifier to efficiently estimate the required posterior distribution, which markedly improves system performance.

Link: https://arxiv.org/abs/2502.12640
Authors: Chenxi Zheng, Yihong Lin, Bangzhen Liu, Xuemiao Xu, Yongwei Nie, Shengfeng He
Affiliations: South China University of Technology; Singapore Management University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current text-to-3D generation methods based on score distillation often suffer from geometric inconsistencies, leading to repeated patterns across different poses of 3D assets. This issue, known as the Multi-Face Janus problem, arises because existing methods struggle to maintain consistency across varying poses and are biased toward a canonical pose. While recent work has improved pose control and approximation, these efforts are still limited by this inherent bias, which skews the guidance during generation. To address this, we propose a solution called RecDreamer, which reshapes the underlying data distribution to achieve a more consistent pose representation. The core idea behind our method is to rectify the prior distribution, ensuring that pose variation is uniformly distributed rather than biased toward a canonical form. By modifying the prescribed distribution through an auxiliary function, we can reconstruct the density of the distribution to ensure compliance with specific marginal constraints. In particular, we ensure that the marginal distribution of poses follows a uniform distribution, thereby eliminating the biases introduced by the prior knowledge. We incorporate this rectified data distribution into existing score distillation algorithms, a process we refer to as uniform score distillation. To efficiently compute the posterior distribution required for the auxiliary function, RecDreamer introduces a training-free classifier that estimates pose categories in a plug-and-play manner. Additionally, we utilize various approximation techniques for noisy states, significantly improving system performance. Our experimental results demonstrate that RecDreamer effectively mitigates the Multi-Face Janus problem, leading to more consistent 3D asset generation across different poses.

[CV-37] Corrupted but Not Broken: Rethinking the Impact of Corrupted Data in Visual Instruction Tuning

【Quick Read】: This paper addresses the performance degradation caused by corrupted Visual Instruction Tuning (VIT) datasets containing hallucinated content, incorrect responses, and poor OCR quality. The key finding is that the performance of Multimodal Large Language Models (MLLMs) can be largely restored by disabling a small subset of parameters or by post-training on a small amount of clean data, and that corrupted models become better at distinguishing clean samples from corrupted ones, enabling dataset cleaning without external help. Building on these insights, the paper proposes a corruption-robust training paradigm combining self-validation and post-training that significantly outperforms existing corruption mitigation strategies.

Link: https://arxiv.org/abs/2502.12635
Authors: Yunhao Gou, Hansi Yang, Zhili Liu, Kai Chen, Yihan Zeng, Lanqing Hong, Zhenguo Li, Qun Liu, James T. Kwok, Yu Zhang
Affiliations: Southern University of Science and Technology; The Hong Kong University of Science and Technology; Huawei Noah's Ark Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Instruction Tuning (VIT) enhances Multimodal Large Language Models (MLLMs) but it is hindered by corrupted datasets containing hallucinated content, incorrect responses, and poor OCR quality. While prior works focus on dataset refinement through high-quality data collection or rule-based filtering, they are costly or limited to specific types of corruption. To deeply understand how corrupted data affects MLLMs, in this paper, we systematically investigate this issue and find that while corrupted data degrades the performance of MLLMs, its effects are largely superficial in that the performance of MLLMs can be largely restored by either disabling a small subset of parameters or post-training with a small amount of clean data. Additionally, corrupted MLLMs exhibit improved ability to distinguish clean samples from corrupted ones, enabling the dataset cleaning without external help. Based on those insights, we propose a corruption-robust training paradigm combining self-validation and post-training, which significantly outperforms existing corruption mitigation strategies.

[CV-38] MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation

【Quick Read】: This paper tackles the limitation that existing diffusion models can only generate short clips (e.g., 2-10 seconds) rather than minutes-long videos. The proposed MALT Diffusion (Memory-Augmented Latent Transformers) is a diffusion model specialized for long video generation. Its key idea is to subdivide a long video into short segments and perform segment-level autoregressive generation, using recurrent attention layers that encode multiple segments into a compact memory latent vector; by conditioning on this memory, the model can continuously generate new footage based on a long temporal context.

Link: https://arxiv.org/abs/2502.12632
Authors: Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David Ross, Jonathan Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: preprint. 26 pages

Abstract:Diffusion models are successful for synthesizing high-quality videos but are limited to generating short clips (e.g., 2-10 seconds). Synthesizing sustained footage (e.g. over minutes) still remains an open research question. In this paper, we propose MALT Diffusion (using Memory-Augmented Latent Transformers), a new diffusion model specialized for long video generation. MALT Diffusion (or just MALT) handles long videos by subdividing them into short segments and doing segment-level autoregressive generation. To achieve this, we first propose recurrent attention layers that encode multiple segments into a compact memory latent vector; by maintaining this memory vector over time, MALT is able to condition on it and continuously generate new footage based on a long temporal context. We also present several training techniques that enable the model to generate frames over a long horizon with consistent quality and minimal degradation. We validate the effectiveness of MALT through experiments on long video benchmarks. We first perform extensive analysis of MALT in long-contextual understanding capability and stability using popular long video benchmarks. For example, MALT achieves an FVD score of 220.4 on 128-frame video generation on UCF-101, outperforming the previous state-of-the-art of 648.4. Finally, we explore MALT’s capabilities in a text-to-video generation setting and show that it can produce long videos compared with recent techniques for long text-to-video generation.
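The memory mechanism is the load-bearing idea here, so a small sketch may help. The following PyTorch snippet is a minimal reading of a recurrent memory update, not the authors' code: the module name, sizes, and the residual update rule are our assumptions.

```python
import torch
import torch.nn as nn

class RecurrentMemoryAttention(nn.Module):
    """Hypothetical MALT-style memory update: a fixed-size memory latent
    cross-attends to the tokens of the newest segment, summarizing all
    past segments into a compact vector bank."""
    def __init__(self, dim: int = 256, num_mem: int = 16, heads: int = 8):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(num_mem, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory, segment_tokens):
        # memory: (B, M, D) summary of past segments
        # segment_tokens: (B, N, D) latents of the current short segment
        upd, _ = self.attn(query=memory, key=segment_tokens, value=segment_tokens)
        return self.norm(memory + upd)  # residual update keeps memory stable

layer = RecurrentMemoryAttention()
B = 2
memory = layer.init_memory.unsqueeze(0).expand(B, -1, -1)
for segment in torch.randn(4, B, 77, 256):   # four consecutive segments
    memory = layer(memory, segment)          # carry memory across segments
# a denoiser for the next segment would condition on `memory`
```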

[CV-39] DAMamba: Vision State Space Model with Dynamic Adaptive Scan

【Quick Read】: This paper addresses the limitations of existing vision state space models (SSMs), which rely on manually designed scans to flatten image patches into sequences; this disrupts the image's original semantic spatial adjacency, lacks flexibility, and makes complex image structures hard to capture. The key solution is Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions, adding flexibility while retaining linear computational complexity and global modeling capacity. Built on DAS, the vision backbone DAMamba significantly outperforms state-of-the-art vision Mamba models on image classification, object detection, instance segmentation, and semantic segmentation, and even surpasses some of the latest CNNs and ViTs.

Link: https://arxiv.org/abs/2502.12627
Authors: Tanzhe Li, Caoshuo Li, Jiayi Lyu, Hongjuan Pei, Baochang Zhang, Taisong Jin, Rongrong Ji
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state-of-the-art CNNs and ViTs. Code will be available at this https URL.
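To make the "data-driven scan" idea concrete, here is a minimal sketch of how a learned, per-image scan order could replace a fixed raster scan. The scoring head and argsort-based reordering are illustrative assumptions, not the DAMamba implementation.

```python
import torch
import torch.nn as nn

class DynamicAdaptiveScan(nn.Module):
    """Hypothetical sketch: a lightweight head scores each patch token,
    and tokens are reordered by that score before entering a 1D state
    space model, instead of following a fixed raster scan."""
    def __init__(self, dim: int = 192):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # data-dependent scan priority

    def forward(self, tokens):
        # tokens: (B, N, D) flattened image patches
        s = self.score(tokens).squeeze(-1)          # (B, N) scan scores
        order = s.argsort(dim=1, descending=True)   # per-image scan order
        idx = order.unsqueeze(-1).expand_as(tokens)
        scanned = tokens.gather(1, idx)             # reordered sequence for the SSM
        inverse = order.argsort(dim=1)              # to undo the scan afterwards
        return scanned, inverse

x = torch.randn(2, 196, 192)                 # 14x14 patch tokens
scan = DynamicAdaptiveScan(192)
seq, inv = scan(x)
restored = seq.gather(1, inv.unsqueeze(-1).expand_as(seq))
assert torch.allclose(restored, x)           # the scan is a pure permutation
```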

[CV-40] S2C: Learning Noise-Resistant Differences for Unsupervised Change Detection in Multimodal Remote Sensing Images

【Quick Read】: This paper addresses unsupervised change detection (UCD) in multimodal remote sensing images, a task made difficult by the inherent spatio-temporal complexity of the data and the heterogeneity of different imaging sensors. The key is a Semantic-to-Change (S2C) learning framework that combines Visual Foundation Models (VFMs) with Contrastive Learning (CL). Unlike existing CL methods that mainly learn multi-temporal similarities, the paper introduces a novel triplet learning strategy that explicitly models temporal differences, which are crucial for change detection (CD). Random spatial and spectral perturbations during training, a grid sparsity regularization that suppresses insignificant changes, and an IoU-matching algorithm that refines the CD results complete the framework.

Link: https://arxiv.org/abs/2502.12604
Authors: Lei Ding, Xibing Zuo, Danfeng Hong, Haitao Guo, Jun Lu, Zhihui Gong, Lorenzo Bruzzone
Affiliations: Information Engineering University, Zhengzhou, China; Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, 100049 Beijing, China; Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised Change Detection (UCD) in multimodal Remote Sensing (RS) images remains a difficult challenge due to the inherent spatio-temporal complexity within data, and the heterogeneity arising from different imaging sensors. Inspired by recent advancements in Visual Foundation Models (VFMs) and Contrastive Learning (CL) methodologies, this research aims to develop CL methodologies to translate implicit knowledge in VFM into change representations, thus eliminating the need for explicit supervision. To this end, we introduce a Semantic-to-Change (S2C) learning framework for UCD in both homogeneous and multimodal RS images. Differently from existing CL methodologies that typically focus on learning multi-temporal similarities, we introduce a novel triplet learning strategy that explicitly models temporal differences, which are crucial to the CD task. Furthermore, random spatial and spectral perturbations are introduced during the training to enhance robustness to temporal noise. In addition, a grid sparsity regularization is defined to suppress insignificant changes, and an IoU-matching algorithm is developed to refine the CD results. Experiments on four benchmark CD datasets demonstrate that the proposed S2C learning framework achieves significant improvements in accuracy, surpassing current state-of-the-art by over 31%, 9%, 23%, and 15%, respectively. It also demonstrates robustness and sample efficiency, suitable for training and adaptation of various Visual Foundation Models (VFMs) or backbone neural networks. The relevant code will be available at: this http URL.
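The abstract's central departure from similarity-only contrastive learning is the temporal-difference triplet. Below is a minimal sketch of such a loss under our reading; the cosine-distance formulation, the margin value, and the variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_triplet_loss(feat_t1, feat_t2_pos, feat_t2_neg, margin=0.5):
    """Sketch of a temporal-difference triplet: an anchor feature at time t1
    should stay close to the co-located t2 feature where nothing changed
    (positive) and move away from the t2 feature where a change occurred
    (negative), explicitly modeling temporal differences."""
    d_pos = 1.0 - F.cosine_similarity(feat_t1, feat_t2_pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(feat_t1, feat_t2_neg, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

anchor   = torch.randn(8, 256)   # VFM features at time t1
positive = torch.randn(8, 256)   # t2 features in unchanged regions
negative = torch.randn(8, 256)   # t2 features in changed regions
loss = temporal_triplet_loss(anchor, positive, negative)
```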

[CV-41] Revisiting the Generalization Problem of Low-level Vision Models Through the Lens of Image Deraining

【Quick Read】: This paper revisits the generalization problem of low-level vision models, which often fail on unseen degradations in real-world scenarios. The key solution is to balance the complexity of background images and degradation patterns in the training data, guiding the network to focus on the underlying image content rather than on the degradation patterns themselves, and to incorporate content priors from pre-trained generative models, which significantly improves generalization. Experiments on image deraining and denoising validate the effectiveness of the proposed strategies.

Link: https://arxiv.org/abs/2502.12600
Authors: Jinfan Hu, Zhiyuan You, Jinjin Gu, Kaiwen Zhu, Tianfan Xue, Chao Dong
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; The Chinese University of Hong Kong; The University of Sydney; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2305.15134

Abstract:Generalization remains a significant challenge for low-level vision models, which often struggle with unseen degradations in real-world scenarios despite their success in controlled benchmarks. In this paper, we revisit the generalization problem in low-level vision models. Image deraining is selected as a case study due to its well-defined and easily decoupled structure, allowing for more effective observation and analysis. Through comprehensive experiments, we reveal that the generalization issue is not primarily due to limited network capacity but rather the failure of existing training strategies, which leads networks to overfit specific degradation patterns. Our findings show that guiding networks to focus on learning the underlying image content, rather than the degradation patterns, is key to improving generalization. We demonstrate that balancing the complexity of background images and degradations in the training data helps networks better fit the image distribution. Furthermore, incorporating content priors from pre-trained generative models significantly enhances generalization. Experiments on both image deraining and image denoising validate the proposed strategies. We believe the insights and solutions will inspire further research and improve the generalization of low-level vision models.

[CV-42] Adaptive Prototype Model for Attribute-based Multi-label Few-shot Action Recognition

【Quick Read】: This paper addresses the accuracy drop that occurs when a single model in a real-world action recognition system recognizes multiple attributes simultaneously. The proposed Adaptive Attribute Prototype Model (AAPM) hinges on a Text-Constrain Module (TCM) that incorporates textual information from potential labels and constrains the construction of prototype representations for the different attributes, together with an Attribute Assignment Method (AAM) that counters training bias and strengthens the model's robustness.

Link: https://arxiv.org/abs/2502.12582
Authors: Juefeng Xiao, Tianqi Xiang, Zhigang Tu
Affiliations: Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In real-world action recognition systems, incorporating more attributes helps achieve a more comprehensive understanding of human behavior. However, using a single model to simultaneously recognize multiple attributes can lead to a decrease in accuracy. In this work, we propose a novel method, i.e., the Adaptive Attribute Prototype Model (AAPM), for human action recognition, which captures rich action-relevant attribute information and strikes a balance between accuracy and robustness. Firstly, we introduce the Text-Constrain Module (TCM) to incorporate textual information from potential labels, and constrain the construction of different attributes' prototype representations. In addition, we explore the Attribute Assignment Method (AAM) to address the issue of training bias and increase robustness during the training process. Furthermore, we construct a new attribute-based multi-label video dataset called Multi-Kinetics for evaluation, which contains various attribute labels (e.g. action, scene, object, etc.) related to human behavior. Extensive experiments demonstrate that our AAPM achieves the state-of-the-art performance in both attribute-based multi-label few-shot action recognition and single-label few-shot action recognition. The project and dataset are available at an anonymous account this https URL

[CV-43] CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation

【Quick Read】: This paper addresses the difficulty existing text-to-image models face in simultaneously achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. The key is the first exploration of combining human preference alignment with test-time sampling: the proposed CHATS (Combining Human-Aligned optimization and Test-time Sampling) framework separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to exploit the useful information contained in both.

Link: https://arxiv.org/abs/2502.12579
Authors: Minghao Fu, Guo-Hua Wang, Liangfu Cao, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as the human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, their independent application in current text-to-image models continues to face significant challenges in achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we, for the first time, explore facilitating the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. Consequently, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to utilize the useful information contained in both distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting new state-of-the-art across various standard benchmarks.

[CV-44] GVTNet: Graph Vision Transformer For Face Super-Resolution

【Quick Read】: This paper addresses the failure of existing Transformer-based face super-resolution algorithms to manage the relationships between image patches when processing low-resolution inputs, which distorts facial components in the super-resolved results. The key solution is a graph-neural-network-based transformer architecture, the Graph Vision Transformer Network: each patch is treated as a graph node and an adjacency matrix is built from inter-patch information, so that patches interact only with their neighbors, handling the relationships between facial components more effectively.

Link: https://arxiv.org/abs/2502.12570
Authors: Chao Yang, Yong Fan, Cheng Lu, Minghao Yuan, Zhijing Yang
Affiliations: Southwest University of Science and Technology, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in face super-resolution research have utilized the Transformer architecture. This method processes the input image as a series of small patches. However, because of the strong correlation between different facial components in facial images, existing algorithms cannot handle the relationships between patches well when super-resolving low-resolution images, resulting in distorted facial components in the super-resolution results. To solve the problem, we propose a transformer architecture based on graph neural networks called graph vision transformer network. We treat each patch as a graph node and establish an adjacency matrix based on the information between patches. In this way, the patch only interacts between neighboring patches, further processing the relationship of facial components. Quantitative and visualization experiments have underscored the superiority of our algorithm over state-of-the-art techniques. Through detailed comparisons, we have demonstrated that our algorithm possesses more advanced super-resolution capabilities, particularly in enhancing facial components. The PyTorch code is available at this https URL
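The graph construction the abstract describes, patches as nodes with adjacency between neighbors, can be illustrated in a few lines. The 4-neighborhood choice and the mean aggregation below are simplifying assumptions, not the paper's exact design.

```python
import torch

def grid_adjacency(h: int, w: int) -> torch.Tensor:
    """Build a 4-neighborhood adjacency matrix over an h x w patch grid,
    so each patch node interacts only with spatially adjacent patches."""
    n = h * w
    adj = torch.zeros(n, n)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    adj[i, rr * w + cc] = 1.0
    return adj

# One graph propagation step: average features over neighboring patches.
h, w, d = 8, 8, 64
patches = torch.randn(h * w, d)                 # one token per patch node
adj = grid_adjacency(h, w) + torch.eye(h * w)   # add self-loops
deg = adj.sum(dim=1, keepdim=True)
patch_out = (adj @ patches) / deg               # degree-normalized aggregation
```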

[CV-45] DeltaDiff: A Residual-Guided Diffusion Model for Enhanced Image Super-Resolution

【Quick Read】: This paper addresses the infidelity of diffusion-based super-resolution: the excessive randomness injected by diffusion models leads to generated details that do not belong to the high-resolution image. The key solution is a new diffusion model, DeltaDiff, which diffuses only the residual between images, making the whole diffusion process more stable and the generated results more faithful. Experiments show that DeltaDiff surpasses state-of-the-art models and produces results with better fidelity.

Link: https://arxiv.org/abs/2502.12567
Authors: Chao Yang, Yong Fan, Cheng Lu, Zhijing Yang
Affiliations: Southwest University of Science and Technology, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, the application of diffusion models in super-resolution tasks has become a popular research direction. Existing work is focused on fully migrating diffusion models to SR tasks. The diffusion model is proposed in the field of image generation, so in order to make the generated results diverse, the diffusion model combines random Gaussian noise and distributed sampling to increase the randomness of the model. However, the essence of super-resolution tasks requires the model to generate high-resolution images with fidelity. Excessive addition of random factors can result in the model generating detailed information that does not belong to the HR image. To address this issue, we propose a new diffusion model called Deltadiff, which uses only residuals between images for diffusion, making the entire diffusion process more stable. The experimental results show that our method surpasses state-of-the-art models and generates results with better fidelity. Our code and model are publicly available at this https URL
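The residual-only diffusion idea is simple enough to sketch. The snippet below is a hedged paraphrase, not the released code: we assume a DDPM-style noise schedule and define the residual as the HR image minus the bicubically upsampled LR input.

```python
import torch
import torch.nn.functional as F

def residual_diffusion_step(hr, lr, t, alphas_cumprod):
    """Sketch of the residual idea: instead of diffusing the HR image itself,
    noise is applied to the residual between HR and the upsampled LR input,
    so the model only has to generate the missing high-frequency delta."""
    up = F.interpolate(lr, size=hr.shape[-2:], mode="bicubic", align_corners=False)
    residual = hr - up                              # the "delta" to be modeled
    noise = torch.randn_like(residual)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * residual + (1 - a_bar).sqrt() * noise  # noised residual
    return x_t, noise                # the network is trained to predict `noise`

alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)  # toy schedule
hr = torch.randn(4, 3, 128, 128)
lr = torch.randn(4, 3, 32, 32)
t = torch.randint(0, 1000, (4,))
x_t, target = residual_diffusion_step(hr, lr, t, alphas_cumprod)
```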

[CV-46] MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos

【Quick Read】: This paper addresses the challenges of long video understanding, in particular how to retrieve useful moments from long videos (over 500 seconds on average). The key contribution is MomentSeeker, a comprehensive benchmark for evaluating retrieval models on general long-video moment retrieval (LVMR): it covers a wide range of scenarios and task categories, and its evaluation tasks are curated through human annotation to ensure reliability. The authors also fine-tune an MLLM-based retriever on synthetic data that performs strongly on the benchmark, and their experiments expose the limitations of existing methods on LVMR tasks.

Link: https://arxiv.org/abs/2502.12558
Authors: Huaying Yuan, Jian Ni, Yueze Wang, Junjie Zhou, Zhengyang Liang, Zheng Liu, Zhao Cao, Zhicheng Dou, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Academy of Artificial Intelligence; School of Smart Governance, Renmin University of China; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval augmented generation (RAG) holds great promise in addressing challenges associated with long video understanding. These methods retrieve useful moments from long videos for their presented tasks, thereby enabling multimodal large language models (MLLMs) to generate high-quality answers in a cost-effective way. In this work, we present MomentSeeker, a comprehensive benchmark to evaluate retrieval models’ performance in handling general long-video moment retrieval (LVMR) tasks. MomentSeeker offers three key advantages. First, it incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. Second, it covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios (e.g., sports, movies, cartoons, and ego), making it a comprehensive tool for assessing retrieval models’ general LVMR performance. Additionally, the evaluation tasks are carefully curated through human annotation, ensuring the reliability of assessment. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark. We perform extensive experiments with various popular multimodal retrievers based on our benchmark, whose results highlight the challenges of LVMR and limitations for existing methods. Our created resources will be shared with community to advance future research in this field.

[CV-47] Spatiotemporal Multi-Camera Calibration using Freely Moving People

【Quick Read】: This paper addresses spatiotemporal multi-camera calibration using freely moving people in dynamic multi-person scenes. The key is to cast calibration as a registration problem: 3D human poses are transformed into 3D points on a unit sphere, and the rotation, time offset, and association are solved alternatingly. A probabilistic approach jointly aligns the spatiotemporal data and establishes correspondences through soft assignment between views, while coplanarity constraints determine the translation. Pairwise registration results are integrated into a multiview setup and refined by nonlinear optimization, improving the accuracy of camera poses, time offsets, and multi-person associations.

Link: https://arxiv.org/abs/2502.12546
Authors: Sang-Eun Lee, Ko Nishino, Shohei Nobuhara
Affiliations: Graduate School of Informatics, Kyoto University; Information and Human Sciences, Kyoto Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:We propose a novel method for spatiotemporal multi-camera calibration using freely moving people in multiview videos. Since calibrating multiple cameras and finding matches across their views are inherently interdependent, performing both in a unified framework poses a significant challenge. We address these issues as a single registration problem of matching two sets of 3D points, leveraging human motion in dynamic multi-person scenes. To this end, we utilize 3D human poses obtained from an off-the-shelf monocular 3D human pose estimator and transform them into 3D points on a unit sphere, to solve the rotation, time offset, and the association alternatingly. We employ a probabilistic approach that can jointly solve both problems of aligning spatiotemporal data and establishing correspondences through soft assignment between two views. The translation is determined by applying coplanarity constraints. The pairwise registration results are integrated into a multiview setup, and then a nonlinear optimization method is used to improve the accuracy of the camera poses, temporal offsets, and multi-person associations. Extensive experiments on synthetic and real data demonstrate the effectiveness and flexibility of the proposed method as a practical marker-free calibration tool.
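One building block of the alternating scheme, recovering the rotation once directions are matched, has a classical closed form (Kabsch/SVD). Below is a NumPy sketch assuming known correspondences; the joint association and time-offset search from the paper are omitted.

```python
import numpy as np

def rotation_from_unit_points(p, q):
    """Estimate the rotation R with q ~ R @ p for matched 3D directions on
    the unit sphere (Kabsch algorithm via SVD). This kind of step could
    serve the alternating rotation update described in the paper."""
    h = p.T @ q                                  # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T   # proper rotation (det = +1)

# toy check: recover a known rotation from matched unit vectors
rng = np.random.default_rng(0)
p = rng.normal(size=(100, 3))
p /= np.linalg.norm(p, axis=1, keepdims=True)
angle = np.pi / 5
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
q = p @ R_true.T
R_est = rotation_from_unit_points(p, q)
assert np.allclose(R_est, R_true, atol=1e-6)
```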

[CV-48] IM360: Textured Mesh Reconstruction for Large-scale Indoor Mapping with 360° Cameras

【Quick Read】: This paper addresses the failure of traditional Structure-from-Motion (SfM) in large-scale indoor scenes dominated by textureless and repetitive regions. The key of the proposed IM360 is to exploit the wide field of view of omnidirectional images and integrate the spherical camera model into every core component of the SfM pipeline. The approach further incorporates a neural implicit surface reconstruction technique to generate high-quality surfaces from sparse input data, and a mesh-based neural rendering method to refine texture maps and accurately capture view-dependent properties.

Link: https://arxiv.org/abs/2502.12545
Authors: Dongki Jung, Jaehoon Choi, Yonghan Lee, Dinesh Manocha
Affiliations: University of Maryland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a novel 3D reconstruction pipeline for 360° cameras for 3D mapping and rendering of indoor environments. Traditional Structure-from-Motion (SfM) methods may not work well in large-scale indoor scenes due to the prevalence of textureless and repetitive regions. To overcome these challenges, our approach (IM360) leverages the wide field of view of omnidirectional images and integrates the spherical camera model into every core component of the SfM pipeline. In order to develop a comprehensive 3D reconstruction solution, we integrate a neural implicit surface reconstruction technique to generate high-quality surfaces from sparse input data. Additionally, we utilize a mesh-based neural rendering approach to refine texture maps and accurately capture view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes from the Matterport3D and Stanford2D3D datasets. In practice, IM360 demonstrates superior performance in terms of textured mesh reconstruction over SOTA. We observe accuracy improvements in terms of camera localization and registration as well as rendering high frequency details.
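The spherical camera model at the heart of IM360 can be illustrated by its simplest ingredient, the mapping from an equirectangular pixel to a unit viewing ray; the axis convention below is our assumption.

```python
import numpy as np

def equirect_pixel_to_ray(u, v, width, height):
    """Spherical camera model in its simplest form: an equirectangular
    pixel (u, v) maps to a unit viewing ray via longitude/latitude angles
    (y-up convention assumed here)."""
    lon = (u / width) * 2.0 * np.pi - np.pi        # [-pi, pi)
    lat = np.pi / 2.0 - (v / height) * np.pi       # [pi/2, -pi/2]
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)            # unit-norm direction

u = np.arange(0, 2048, 256, dtype=float)           # sample pixels along a row
v = np.full_like(u, 512.0)
rays = equirect_pixel_to_ray(u, v, 2048, 1024)
assert np.allclose(np.linalg.norm(rays, axis=-1), 1.0)
```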

[CV-49] When Segmentation Meets Hyperspectral Image: New Paradigm for Hyperspectral Image Classification

【Quick Read】: This paper addresses the limitations of small-patch-based classifiers in hyperspectral image (HSI) classification: the limited receptive field yields insufficient spatial structural information, noise-like misclassifications, and a lack of multi-shape awareness around objects. The key solution is a new paradigm, HSIseg, which combines segmentation techniques with a Dynamic Shifted Regional Transformer (DSRT), together with a progressive learning framework with adaptive pseudo-labeling that iteratively incorporates unlabeled regions into training, and multi-source data collaboration to enhance feature interaction.

Link: https://arxiv.org/abs/2502.12541
Authors: Weilian Zhou, Weixuan Xie, Sei-ichiro Kamata, Man Sing Wong, Huiying (Cynthia) Hou, Haipeng Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral image (HSI) classification is a cornerstone of remote sensing, enabling precise material and land-cover identification through rich spectral information. While deep learning has driven significant progress in this task, small patch-based classifiers, which account for over 90% of the progress, face limitations: (1) the small patch (e.g., 7x7, 9x9)-based sampling approach considers a limited receptive field, resulting in insufficient spatial structural information critical for object-level identification and noise-like misclassifications even within uniform regions; (2) undefined optimal patch sizes lead to coarse label predictions, which degrade performance; and (3) a lack of multi-shape awareness around objects. To address these challenges, we draw inspiration from large-scale image segmentation techniques, which excel at handling object boundaries-a capability essential for semantic labeling in HSI classification. However, their application remains under-explored in this task due to (1) the prevailing notion that larger patch sizes degrade performance, (2) the extensive unlabeled regions in HSI groundtruth, and (3) the misalignment of input shapes between HSI data and segmentation models. Thus, in this study, we propose a novel paradigm and baseline, HSIseg, for HSI classification that leverages segmentation techniques combined with a novel Dynamic Shifted Regional Transformer (DSRT) to overcome these challenges. We also introduce an intuitive progressive learning framework with adaptive pseudo-labeling to iteratively incorporate unlabeled regions into the training process, thereby advancing the application of segmentation techniques. Additionally, we incorporate auxiliary data through multi-source data collaboration, promoting better feature interaction. Validated on five public HSI datasets, our proposal outperforms state-of-the-art methods.

[CV-50] Learning Transformation-Isomorphic Latent Space for Accurate Hand Pose Estimation

【Quick Read】: This paper addresses two problems in representation learning for vision-based regression tasks such as hand pose estimation: features extracted at a high semantic level are inadequate for regressing low-level information, and the extracted features contain task-irrelevant information that reduces their compactness and interferes with regression. The key solution is TI-Net, a highly versatile visual backbone that constructs a Transformation-Isomorphic latent space: geometric transformations are modeled as linear transformations in the latent space and aligned with their image-space counterparts, so the latent features capture compact, low-level information beneficial for pose estimation. On the DexYCB dataset, TI-Net improves the PA-MPJPE metric by 10% over specialized state-of-the-art hand pose estimation methods.

Link: https://arxiv.org/abs/2502.12535
Authors: Kaiwen Ren, Lei Hu, Zhiheng Zhang, Yongjing Ye, Shihong Xia
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology, CAS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-based regression tasks, such as hand pose estimation, have achieved higher accuracy and faster convergence through representation learning. However, existing representation learning methods often encounter the following issues: the high semantic level of features extracted from images is inadequate for regressing low-level information, and the extracted features include task-irrelevant information, reducing their compactness and interfering with regression tasks. To address these challenges, we propose TI-Net, a highly versatile visual Network backbone designed to construct a Transformation Isomorphic latent space. Specifically, we employ linear transformations to model geometric transformations in the latent space and ensure that TI-Net aligns them with those in the image space. This ensures that the latent features capture compact, low-level information beneficial for pose estimation tasks. We evaluated TI-Net on the hand pose estimation task to demonstrate the network's superiority. On the DexYCB dataset, TI-Net achieved a 10% improvement in the PA-MPJPE metric compared to specialized state-of-the-art (SOTA) hand pose estimation methods. Our code will be released in the future.
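The transformation-isomorphism constraint lends itself to a compact sketch: a known image-space transformation T should correspond to a learnable linear map in latent space. The loss below is our reading of that idea; the toy encoder and the choice of T are placeholders, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def transformation_isomorphism_loss(encoder, latent_map, x, x_transformed):
    """Sketch of the core constraint as we read it: applying a geometric
    transformation T in image space should correspond to a linear map M_T
    in latent space, i.e. encoder(T(x)) ~ M_T(encoder(x))."""
    z = encoder(x)                       # latent of the original image
    z_t = encoder(x_transformed)         # latent of the transformed image
    return F.mse_loss(latent_map(z), z_t)

encoder = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 64 * 64, 128))  # toy encoder
latent_map = torch.nn.Linear(128, 128, bias=False)  # models T in latent space
x = torch.randn(4, 3, 64, 64)
x_rot = torch.rot90(x, k=1, dims=(2, 3))            # T = 90-degree rotation
loss = transformation_isomorphism_loss(encoder, latent_map, x, x_rot)
```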

[CV-51] NoKSR: Kernel-Free Neural Surface Reconstruction via Point Cloud Serialization

【Quick Read】: This paper tackles large-scale point cloud surface reconstruction with an efficient framework that converts an irregular point cloud into a signed distance field (SDF). The key is a transformer-based backbone (PointTransformerV3) that serializes the point cloud into a locality-preserving token sequence, so the SDF value at a point can be predicted by aggregating nearby tokens retrieved efficiently from the serialized order. The point cloud is serialized at multiple levels/scales, and features are aggregated non-linearly to predict SDF values, which overcomes the approximation errors introduced by serialization. The approach sets a new state of the art in accuracy and efficiency, particularly on outdoor datasets.

Link: https://arxiv.org/abs/2502.12534
Authors: Zhen Li, Weiwei Sun, Shrisudhan Govindarajan, Shaobo Xia, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
Affiliations: Simon Fraser University; University of British Columbia; Amazon; Changsha University of Science and Technology; University of Toronto; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present a novel approach to large-scale point cloud surface reconstruction by developing an efficient framework that converts an irregular point cloud into a signed distance field (SDF). Our backbone builds upon recent transformer-based architectures (i.e., PointTransformerV3), which serialize the point cloud into a locality-preserving sequence of tokens. We efficiently predict the SDF value at a point by aggregating nearby tokens, where fast approximate neighbors can be retrieved thanks to the serialization. We serialize the point cloud at different levels/scales, and non-linearly aggregate a feature to predict the SDF value. We show that aggregating across multiple scales is critical to overcome the approximations introduced by the serialization (i.e. false negatives in the neighborhood). Our framework sets the new state-of-the-art in terms of accuracy and efficiency (better or similar performance with half the latency of the best prior method, coupled with a simpler implementation), particularly on outdoor datasets where sparse-grid methods have shown limited performance.

[CV-52] Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion Models

【Quick Read】: This paper addresses the fact that text-to-image (T2I) diffusion models can still produce inappropriate content despite NSFW filtering of their training data, which matters for safe deployment. The key is the first systematic evaluation of concept erasure for NSFW content: 11 state-of-the-art baseline methods with 14 variants are assessed from six perspectives, including the conventional erasure proportion, image quality, and semantic alignment, plus the new perspectives of excessive erasure, the impact of explicit and implicit unsafe prompts, and robustness. The benchmark provides comprehensive evaluation metrics and insights for advancing concept erasure techniques.

Link: https://arxiv.org/abs/2502.12527
Authors: Die Chen, Zhiwen Li, Cen Chen, Xiaodan Li, Jinyan Ye
Affiliations: East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image (T2I) diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of these models can inadvertently lead them to generate NSFW content even with efforts on filtering NSFW content from the training dataset, posing risks to their safe deployment. While several concept erasure methods have been proposed to mitigate this issue, a comprehensive evaluation of their effectiveness remains absent. To bridge this gap, we present the first systematic investigation of concept erasure methods for NSFW content and its sub-themes in text-to-image diffusion models. At the task level, we provide a holistic evaluation of 11 state-of-the-art baseline methods with 14 variants. Specifically, we analyze these methods from six distinct assessment perspectives, including three conventional perspectives, i.e., erasure proportion, image quality, and semantic alignment, and three new perspectives, i.e., excessive erasure, the impact of explicit and implicit unsafe prompts, and robustness. At the tool level, we perform a detailed toxicity analysis of NSFW datasets and compare the performance of different NSFW classifiers, offering deeper insights into their performance alongside a compilation of comprehensive evaluation metrics. Our benchmark not only systematically evaluates concept erasure methods, but also delves into the underlying factors influencing their performance at the insight level. By synthesizing insights from various evaluation perspectives, we provide a deeper understanding of the challenges and opportunities in the field, offering actionable guidance and inspiration for advancing research and practical applications in concept erasure.

[CV-53] YOLOv12: Attention-Centric Real-Time Object Detectors

【Quick Read】: This paper addresses the long-standing tendency of YOLO improvements to rely on CNN enhancements while neglecting the proven modeling superiority of attention mechanisms. The key solution is an attention-centric YOLO framework, YOLOv12, which matches the speed of previous CNN-based models while harnessing the performance benefits of attention, surpassing all popular real-time object detectors in accuracy at competitive speed.

Link: https://arxiv.org/abs/2502.12524
Authors: Yunjie Tian, Qixiang Ye, David Doermann
Affiliations: University at Buffalo; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.

[CV-54] SAFEERASER: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning

【Quick Read】: This paper addresses the under-explored problem of machine unlearning (MU) for safety in Multimodal Large Language Models (MLLMs). The key solution is SAFEERASER, a safety unlearning benchmark for MLLMs with 3,000 images and 28.8K VQA pairs, which evaluates unlearning methods from the perspectives of forget quality and model utility. The authors further introduce Prompt Decouple (PD) Loss to alleviate over-forgetting by decoupling prompts during unlearning, and a new metric, Safe Answer Refusal Rate (SARR), to quantify how much over-forgetting PD Loss mitigates. Combining PD Loss with existing unlearning methods effectively prevents over-forgetting, reducing SARR by 79.5% on LLaVA-7B and LLaVA-13B while maintaining forget quality and model utility.

Link: https://arxiv.org/abs/2502.12520
Authors: Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Shuliang Liu, PeiJun Wu, Peijie Jiang, Jia Liu, Xuming Hu
Affiliations: The Hong Kong University of Science and Technology; Southeast University; Ant Group, Alibaba
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. Machine Unlearning (MU), as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, MU for safety in MLLM has yet to be fully explored. To address this issue, we propose SAFEERASER, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: forget quality and model utility. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from over-forgetting. Hence, we introduce Prompt Decouple (PD) Loss to alleviate over-forgetting through decouple prompt during unlearning process. To quantitatively measure over-forgetting mitigated by PD Loss, we propose a new metric called Safe Answer Refusal Rate (SARR). Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.
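The SARR metric can be written down directly from its description, as the fraction of safe queries the unlearned model nevertheless refuses. The refusal-detection heuristic below is our assumption, since the paper's exact criteria are not given here.

```python
def safe_answer_refusal_rate(answers, refusal_markers=("I cannot", "I'm sorry")):
    """Sketch of the SARR idea: the fraction of *safe* queries that an
    unlearned model refuses to answer. Higher means more over-forgetting.
    The string-matching refusal detector is an illustrative assumption."""
    refused = sum(any(m.lower() in a.lower() for m in refusal_markers)
                  for a in answers)
    return refused / max(len(answers), 1)

# answers produced by the unlearned model on queries it should still handle
answers = ["The image shows a red car.",
           "I'm sorry, I cannot answer that.",
           "There are two dogs in the picture."]
print(safe_answer_refusal_rate(answers))  # 1/3 of the safe queries were refused
```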

[CV-55] RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

【Quick Read】: This paper addresses the underutilization of unpaired multimodal documents, such as interleaved image-text documents, for vision-language representation learning. The key components are a Real-World Data Extraction pipeline that extracts high-quality images and texts, a hierarchical retrieval method that efficiently associates each image with multiple semantically relevant realistic texts, an image semantic augmented generation module that enhances fine-grained visual information through synthetic text production, and a semantic balance sampling strategy that improves dataset diversity and the learning of long-tail concepts. Based on these innovations, the authors construct RealSyn, a dataset combining realistic and synthetic texts, and show that it effectively advances vision-language representation learning with strong scalability.

Link: https://arxiv.org/abs/2502.12513
Authors: Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Affiliations: The University of Sydney; DeepGlint; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 12 figures, Webpage: this https URL

Abstract:After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability. Models pre-trained on RealSyn achieve state-of-the-art performance on multiple downstream tasks. To facilitate future research, the RealSyn dataset and pre-trained model weights are released at this https URL.

[CV-56] Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning

【Quick Read】: This paper addresses the fact that existing Spiking Neural Networks (SNNs) mainly target unimodal processing and lack effective cross-modal information fusion, limiting their effectiveness in real-world multimodal scenarios. The key is S-CMRL, a Transformer-based semantic-alignment cross-modal residual learning framework for audio-visual integration: a spatiotemporal spiking attention mechanism extracts complementary features across modalities, a cross-modal residual learning strategy enhances feature fusion, and a semantic alignment optimization mechanism aligns cross-modal features in a shared semantic space to improve their consistency and complementarity.

Link: https://arxiv.org/abs/2502.12488
Authors: Xiang He, Dongcheng Zhao, Yiting Dong, Guobin Shen, Xin Yang, Yi Zeng
Affiliations: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China; CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Center for Long-term Artificial Intelligence, Beijing 100190, China; Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The manuscript is under review and the code is available at this https URL

Abstract:Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain’s information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism is introduced to align cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving the state-of-the-art performance. The code is publicly available at this https URL.

[CV-57] Predicate Hierarchies Improve Few-Shot State Classification ICLR2025

【Quick Read】: This paper addresses a core difficulty of state classification in long-horizon tasks such as robot planning and manipulation: the combinatorial explosion of possible object-predicate combinations and the need to adapt to novel environments mean classifiers must generalize to new queries from few examples. The key solution, PHIER, leverages predicate hierarchies to generalize effectively in few-shot scenarios, combining an object-centric scene encoder, self-supervised losses that infer semantic relations between predicates, and a hyperbolic distance metric that captures hierarchical structure, yielding a structured latent space of image-predicate pairs that guides reasoning over state classification queries.

Link: https://arxiv.org/abs/2502.12481
Authors: Emily Jin, Joy Hsu, Jiajun Wu
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ICLR 2025. First two authors contributed equally. Project page: this https URL

Abstract:State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desideratum for state classification models to generalize to novel queries with few examples. To this end, we propose PHIER, which leverages predicate hierarchies to generalize effectively in few-shot scenarios. PHIER uses an object-centric scene encoder, self-supervised losses that infer semantic relations between predicates, and a hyperbolic distance metric that captures hierarchical structure; it learns a structured latent space of image-predicate pairs that guides reasoning over state classification queries. We evaluate PHIER in the CALVIN and BEHAVIOR robotic environments and show that PHIER significantly outperforms existing methods in few-shot, out-of-distribution state classification, and demonstrates strong zero- and few-shot generalization from simulated to real-world tasks. Our results demonstrate that leveraging predicate hierarchies improves performance on state classification tasks with limited data.
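The hyperbolic ingredient is standard enough to show: distances in the Poincaré ball grow rapidly toward the boundary, which is why the geometry suits tree-like hierarchies. Below is a PyTorch sketch of the usual Poincaré distance; the embedding values are toy examples, not PHIER's learned embeddings.

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """Distance in the Poincare ball model of hyperbolic space: points near
    the boundary are exponentially far apart, which suits hierarchical
    (tree-like) predicate structures."""
    uu = u.pow(2).sum(-1).clamp(max=1 - eps)
    vv = v.pow(2).sum(-1).clamp(max=1 - eps)
    duv = (u - v).pow(2).sum(-1)
    arg = 1 + 2 * duv / ((1 - uu) * (1 - vv))
    return torch.acosh(arg.clamp(min=1 + eps))

# a parent concept near the origin vs. two children pushed toward the boundary
parent = torch.tensor([0.05, 0.0])
child_a = torch.tensor([0.70, 0.0])
child_b = torch.tensor([0.0, 0.70])
print(poincare_distance(parent, child_a))   # smaller: parent-child
print(poincare_distance(child_a, child_b))  # larger: siblings are far apart
```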

[CV-58] Not-So-Optimal Transport Flows for 3D Point Cloud Generation

【Quick Read】: This paper addresses the difficulty of learning permutation-invariant generative models for large 3D point clouds: equivariant OT flows scale poorly, and straightening flow trajectories makes the learned flow complex at the beginning of the trajectory. The key solution is a not-so-optimal transport flow model that obtains an approximate optimal transport (OT) through an offline OT precomputation, enabling efficient construction of OT pairs for training. During training, the approximate OT is additionally combined with independent coupling to form a hybrid coupling that makes the target flow model easier to learn.

Link: https://arxiv.org/abs/2502.12456
Authors: Ka-Hei Hui, Chao Liu, Xiaohui Zeng, Chi-Wing Fu, Arash Vahdat
Affiliations: The Chinese University of Hong Kong; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning generative models of 3D point clouds is one of the fundamental problems in 3D generative learning. One of the key properties of point clouds is their permutation invariance, i.e., changing the order of points in a point cloud does not change the shape they represent. In this paper, we analyze the recently proposed equivariant OT flows that learn permutation invariant generative models for point-based molecular data and we show that these models scale poorly on large point clouds. Also, we observe learning (equivariant) OT flows is generally challenging since straightening flow trajectories makes the learned flow model complex at the beginning of the trajectory. To remedy these, we propose not-so-optimal transport flow models that obtain an approximate OT by an offline OT precomputation, enabling an efficient construction of OT pairs for training. During training, we can additionally construct a hybrid coupling by combining our approximate OT and independent coupling to make the target flow models easier to learn. In an extensive empirical study, we show that our proposed model outperforms prior diffusion- and flow-based approaches on a wide range of unconditional generation and shape completion on the ShapeNet benchmark.
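The offline OT precomputation maps naturally onto an assignment problem. Here is a SciPy sketch under our assumptions about the cost (squared Euclidean) and a one-to-one matching between a noise cloud and a data cloud; the paper's exact precomputation may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def precompute_ot_pairs(noise_cloud, data_cloud):
    """Sketch of an offline OT step: solve an assignment problem between a
    sampled noise point cloud and a data point cloud once, so training can
    use the resulting (approximately optimal-transport) pairs instead of
    independent couplings."""
    # cost[i, j] = squared distance between noise point i and data point j
    cost = ((noise_cloud[:, None, :] - data_cloud[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)       # Hungarian assignment
    return noise_cloud[rows], data_cloud[cols]     # matched pairs

rng = np.random.default_rng(0)
noise = rng.normal(size=(512, 3))        # Gaussian source point cloud
data = rng.normal(size=(512, 3)) + 2.0   # target shape (toy stand-in)
x0, x1 = precompute_ot_pairs(noise, data)
# a flow model would now be trained on straight paths from x0 to x1
```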

[CV-59] Benchmarking Zero-Shot Facial Emotion Annotation with Large Language Models : A Multi-Class and Multi-Frame Approach in DailyLife

【Quick Read】: This study investigates the feasibility and performance of using large language models (LLMs) to automatically annotate human emotions in everyday scenarios. The key is a zero-shot annotation approach: the GPT-4o-mini model rapidly labels key frames extracted from video segments of the DailyLife subset of FERV39k, and the authors additionally explore integrating multiple frames per clip to improve labeling performance and reduce costs. Under a seven-class emotion taxonomy the LLM reaches an average precision of about 50%, rising to about 64% for ternary (negative/neutral/positive) classification. These preliminary findings suggest that zero-shot LLMs are a viable route to cutting annotation costs and broadening the applicability of LLMs in complex multimodal environments.

Link: https://arxiv.org/abs/2502.12454
Authors: He Zhang, Xinyi Fu
Affiliations: College of Information Sciences and Technology, Penn State University; The Future Laboratory, Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:This study investigates the feasibility and performance of using large language models (LLMs) to automatically annotate human emotions in everyday scenarios. We conducted experiments on the DailyLife subset of the publicly available FERV39k dataset, employing the GPT-4o-mini model for rapid, zero-shot labeling of key frames extracted from video segments. Under a seven-class emotion taxonomy (“Angry,” “Disgust,” “Fear,” “Happy,” “Neutral,” “Sad,” “Surprise”), the LLM achieved an average precision of approximately 50%. In contrast, when limited to ternary emotion classification (negative/neutral/positive), the average precision increased to approximately 64%. Additionally, we explored a strategy that integrates multiple frames within 1-2 second video clips to enhance labeling performance and reduce costs. The results indicate that this approach can slightly improve annotation accuracy. Overall, our preliminary findings highlight the potential application of zero-shot LLMs in human facial emotion annotation tasks, offering new avenues for reducing labeling costs and broadening the applicability of LLMs in complex multimodal environments.
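The zero-shot labeling setup is straightforward to reproduce in spirit. Below is a sketch using the official OpenAI Python SDK (openai>=1.0); the prompt wording, temperature, and single-frame protocol are our assumptions, not the paper's exact procedure.

```python
import base64
from openai import OpenAI  # official SDK; requires OPENAI_API_KEY in the environment

LABELS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

def label_frame(client: OpenAI, image_path: str) -> str:
    """Zero-shot frame labeling in the spirit of the study: send one key
    frame to gpt-4o-mini and ask for exactly one label from the
    seven-class taxonomy."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify the facial emotion in this frame. "
                         f"Answer with exactly one of: {', '.join(LABELS)}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# client = OpenAI()
# print(label_frame(client, "frame_0001.jpg"))  # hypothetical key-frame file
```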

[CV-60] YUNet: Improved YOLOv11 Network for Skyline Detection

【Quick Read】: This paper addresses sky-region segmentation and skyline detection under complicated and variable weather or illumination. The key is the improved YUNet algorithm, which extends the YOLOv11 architecture into a UNet-like structure consisting of encoder, neck, and decoder submodules: the encoder extracts multi-scale features, the neck fuses them, and the decoder uses the fused features for prediction rebuilding, improving sky segmentation and skyline detection accuracy in complex environments.

Link: https://arxiv.org/abs/2502.12449
Authors: Gang Yang, Miao Wang, Quan Zhou, Jiangchuan Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Skyline detection plays an important role in geolocalization, flight control, visual navigation, port security, etc. The appearance of the sky and non-sky areas is variable, because of different weather or illumination environments, which brings challenges to skyline detection. In this research, we proposed the YUNet algorithm, which improved the YOLOv11 architecture to segment the sky region and extract the skyline in complicated and variable circumstances. To improve the ability of multi-scale and large range contextual feature fusion, the YOLOv11 architecture is extended as a UNet-like architecture, consisting of an encoder, neck and decoder submodule. The encoder extracts the multi-scale features from the given images. The neck makes fusion of these multi-scale features. The decoder applies the fused features to complete the prediction rebuilding. To validate the proposed approach, the YUNet was tested on the Skyfinder and CH1 datasets for segmentation and skyline detection respectively. Our test shows that the IoU of YUNet segmentation can reach 0.9858, and the average error of YUNet skyline detection is just 1.36 pixels. The implementation is published at this https URL.

[CV-61] Multi Image Super Resolution Modeling for Earth System Models

【Quick Read】: This paper addresses the insufficient spatial resolution of Earth System Model (ESM) data, which hinders the understanding of complex environmental processes. The key is a new super-resolution algorithm, ViFOR, which combines Vision Transformers (ViT) and Implicit Neural Representation Networks (INRs) by introducing Fourier-based activation functions into the ViT architecture, capturing both global context and high-frequency details. This innovation yields PSNR gains over ViT of up to 4.18 dB, 1.56 dB, and 1.73 dB for full-image Source Temperature, Shortwave Flux, and Longwave Flux, outperforming ViT, Sinusoidal Representation Networks (SIREN), and Super-Resolution Generative Adversarial Networks (SRGANs).

Link: https://arxiv.org/abs/2502.12427
Authors: Ehsan Zeraatkar, Salah A Faroughi, Jelena Tešić
Affiliations: Texas State University; University of Utah
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Super-resolution (SR) techniques are essential for improving Earth System Model (ESM) data's spatial resolution, which helps better understand complex environmental processes. This paper presents a new algorithm, ViFOR, which combines Vision Transformers (ViT) and Implicit Neural Representation Networks (INRs) to generate High-Resolution (HR) images from Low-Resolution (LR) inputs. ViFOR introduces a novel integration of Fourier-based activation functions within the Vision Transformer architecture, enabling it to effectively capture global context and high-frequency details critical for accurate SR reconstruction. The results show that ViFOR outperforms state-of-the-art methods such as ViT, Sinusoidal Representation Networks (SIREN), and SR Generative Adversarial Networks (SRGANs) on metrics such as Peak Signal-to-Noise Ratio (PSNR) and Mean Squared Error (MSE), for both global and local imagery. ViFOR improves PSNR by up to 4.18 dB, 1.56 dB, and 1.73 dB over ViT for full images of Source Temperature, Shortwave Flux, and Longwave Flux.
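The Fourier-based activation is the paper's key architectural twist. Below is a minimal sketch of a sine-activated MLP block of the kind that could replace a ViT feed-forward layer; the exact placement and the frequency scale omega are our assumptions.

```python
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    """Hypothetical Fourier-activation block: sine activations let the
    network represent high-frequency detail that ReLU/GELU MLPs tend to
    smooth out, in the spirit of SIREN-style layers."""
    def __init__(self, dim: int = 256, hidden: int = 512, omega: float = 30.0):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.omega = omega  # frequency scale (assumed value)

    def forward(self, x):
        return self.fc2(torch.sin(self.omega * self.fc1(x)))

tokens = torch.randn(2, 196, 256)   # ViT patch tokens
out = FourierMLP()(tokens)          # drop-in replacement for an MLP block
```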

[CV-62] Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

【Quick Read】: This paper addresses two problems in physical audiovisual commonsense reasoning: how to fully exploit the different characteristics of multimodal data, and how to strengthen the model's causal reasoning so it can infer implicit physical knowledge. The key solution, Robust Disentangled Counterfactual Learning (RDCL), uses a disentangled sequential encoder to decouple videos into static (time-invariant) and dynamic (time-varying) factors in the latent space, with a variational autoencoder (VAE) maximizing mutual information via a contrastive loss. A counterfactual learning module models physical knowledge relationships among objects under counterfactual intervention, and a robust multimodal learning method recovers missing modality data by decomposing shared and model-specific features. The method is a plug-and-play module that can be incorporated into any baseline, including vision-language models (VLMs).

Link: https://arxiv.org/abs/2502.12425
Authors: Mengshi Qi, Changsheng Lv, Huadong Ma
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects’ physics commonsense based on both video and audio input, with the main challenge being how to imitate the reasoning ability of humans, even under the scenario of missing modalities. Most of the current methods fail to take full advantage of different characteristics in multi-modal data, and lacking causal reasoning ability in models impedes the progress of implicit physical knowledge inferring. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space by the disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model’s reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. To alleviate the incomplete modality data issue, we introduce a robust multimodal learning method to recover the missing data by decomposing the shared features and model-specific features. Our proposed method is a plug-and-play module that can be incorporated into any baseline including VLMs. In experiments, we show that our proposed method improves the reasoning accuracy and robustness of baseline methods and achieves the state-of-the-art performance.

[CV-63] Boosting Illuminant Estimation in Deep Color Constancy through Enhancing Brightness Robustness

【速读】:该论文旨在解决现有深度神经网络驱动的颜色恒常性(DNNCC)模型在亮度变化下的鲁棒性不足的问题。解决方案的关键在于提出了一种名为BRE的增强策略,通过自适应步长对抗亮度增强技术来识别高风险亮度变化,并生成经过显式亮度调整的增强图像。BRE进一步发展了一种亮度鲁棒性感知的模型优化策略,结合对抗亮度训练和亮度对比损失,显著提高了DNNCC模型的亮度鲁棒性。BRE无需额外超参数且在测试阶段不增加开销,实验结果表明BRE能够一致提升六种主流DNNCC模型的照明估计性能,平均减少误差5.04%。

链接: https://arxiv.org/abs/2502.12418
作者: Mengda Xie,Chengzhi Zhong,Yiling He,Zhan Qin,Meie Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Color constancy estimates illuminant chromaticity to correct color-biased images. Recently, Deep Neural Network-driven Color Constancy (DNNCC) models have made substantial advancements. Nevertheless, the potential risks in DNNCC due to the vulnerability of deep neural networks have not yet been explored. In this paper, we conduct the first investigation into the impact of a key factor in color constancy, brightness, on DNNCC from a robustness perspective. Our evaluation reveals that several mainstream DNNCC models exhibit high sensitivity to brightness despite their focus on chromaticity estimation. This sheds light on a potential limitation of existing DNNCC models: their sensitivity to brightness may hinder performance given the widespread brightness variations in real-world datasets. From the insights of our analysis, we propose a simple yet effective brightness robustness enhancement strategy for DNNCC models, termed BRE. The core of BRE is built upon the adaptive step-size adversarial brightness augmentation technique, which identifies high-risk brightness variation and generates augmented images via explicit brightness adjustment. Subsequently, BRE develops a brightness-robustness-aware model optimization strategy that integrates adversarial brightness training and brightness contrastive loss, significantly bolstering the brightness robustness of DNNCC models. BRE is hyperparameter-free and can be integrated into existing DNNCC models, without incurring additional overhead during the testing phase. Experiments on two public color constancy datasets, ColorChecker and Cube+, demonstrate that the proposed BRE consistently enhances the illuminant estimation performance of existing DNNCC models, reducing the estimation error by an average of 5.04% across six mainstream DNNCC models, underscoring the critical role of enhancing brightness robustness in these models.
zh
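下面给出"自适应步长对抗亮度增强"核心步骤的一个最小示意(PyTorch):沿使光源估计误差增大的方向迭代调整全局亮度系数。函数名、迭代步数与幅度上限均为本文假设,并非论文官方实现。

```python
# 对抗亮度增强的最小示意:沿使预测误差增大的方向调整全局亮度系数,
# 步长由梯度符号驱动;变量名、损失与幅度上限均为假设。
import torch

def adversarial_brightness(model, img, target, steps=3, step_size=0.05):
    """img: (B,3,H,W);target: 目标光源色度 (B,3)。返回亮度增强后的图像。"""
    delta = torch.zeros(img.size(0), 1, 1, 1, requires_grad=True)
    for _ in range(steps):
        pred = model((img * (1 + delta)).clamp(0, 1))
        loss = torch.nn.functional.mse_loss(pred, target)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()   # 朝"高风险亮度"方向走一步
            delta.clamp_(-0.3, 0.3)            # 限制亮度扰动幅度
    return (img * (1 + delta)).detach().clamp(0, 1)

if __name__ == "__main__":
    # 用一个占位线性模型演示调用方式
    net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 3))
    imgs, targets = torch.rand(2, 3, 8, 8), torch.rand(2, 3)
    print(adversarial_brightness(net, imgs, targets).shape)
```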

[CV-64] Gaseous Object Detection

【速读】:该论文致力于探索物体检测技术能否从固态物质扩展到气态物质,即进行气体物体检测(Gaseous Object Detection, GOD)。为解决此问题,论文的关键在于设计了一个基于高斯分散模型的物理启发式体素位移场(Voxel Shift Field, VSF),用于建模潜在三维空间中的几何不规则性和不断变化的形状。通过将VSF集成到Faster R-CNN中,提出了一个简明但强大的基线方法VSF R-CNN,以实现对气体物体的有效检测。

链接: https://arxiv.org/abs/2502.12415
作者: Kailai Zhou,Yibo Wang,Tao Lv,Qiu Shen,Xun Cao
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

点击查看摘要

Abstract:Object detection, a fundamental and challenging problem in computer vision, has experienced rapid development due to the effectiveness of deep learning. The current objects to be detected are mostly rigid solid substances with apparent and distinct visual characteristics. In this paper, we endeavor on a scarcely explored task named Gaseous Object Detection (GOD), which is undertaken to explore whether the object detection techniques can be extended from solid substances to gaseous substances. Nevertheless, the gas exhibits significantly different visual characteristics: 1) saliency deficiency, 2) arbitrary and ever-changing shapes, 3) lack of distinct boundaries. To facilitate the study on this challenging task, we construct a GOD-Video dataset comprising 600 videos (141,017 frames) that cover various attributes with multiple types of gases. A comprehensive benchmark is established based on this dataset, allowing for a rigorous evaluation of frame-level and video-level detectors. Deduced from the Gaussian dispersion model, the physics-inspired Voxel Shift Field (VSF) is designed to model geometric irregularities and ever-changing shapes in potential 3D space. By integrating VSF into Faster RCNN, the VSF RCNN serves as a simple but strong baseline for gaseous object detection. Our work aims to attract further research into this valuable albeit challenging area.
zh

[CV-65] Multi-vision-based Picking Point Localisation of Target Fruit for Harvesting Robots

【速读】:该论文旨在解决机器人采摘过程中精准识别采摘点的问题。解决方案的关键在于采用了多视觉系统结合分析方法和基于模型的算法来提高采摘点定位的准确性。具体而言,通过使用两个RGB-D相机提取不同表面点,并采用Adaboost回归等模型预测目标果实的几何中心点,从而显著提高了采摘成功率和定位精度。

链接: https://arxiv.org/abs/2502.12406
作者: C. Beldek,A. Dunn,J. Cunningham,E. Sariyildiz,S. L. Phung,G.Alici
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:This paper presents multi-vision-based localisation strategies for harvesting robots. Identifying picking points accurately is essential for robotic harvesting because insecure grasping can lead to economic loss through fruit damage and dropping. In this study, two multi-vision-based localisation methods, namely the analytical approach and model-based algorithms, were employed. The actual geometric centre points of fruits were collected using a motion capture system (mocap), and two different surface points Cfix and Ceih were extracted using two Red-Green-Blue-Depth (RGB-D) cameras. First, the picking points of the target fruit were detected using analytical methods. Second, various primary and ensemble learning methods were employed to predict the geometric centre of target fruits by taking surface points as input. Adaboost regression, the most successful model-based localisation algorithm, achieved 88.8% harvesting accuracy with a Mean Euclidean Distance (MED) of 4.40 mm, while the analytical approach reached 81.4% picking success with a MED of 14.25 mm, both demonstrating better performance than the single-camera, which had a picking success rate of 77.7% with a MED of 24.02 mm. To evaluate the effect of picking point accuracy in collecting fruits, a series of robotic harvesting experiments were performed utilising a collaborative robot (cobot). It is shown that multi-vision systems can improve picking point localisation, resulting in higher success rates of picking in robotic harvesting.
zh
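下面用合成数据给出"以两台相机的表面点预测几何中心"这一模型驱动定位思路的最小示意(scikit-learn):数据为随机生成,特征构造与超参数均为演示用假设。

```python
# 用 AdaBoost 回归由两台 RGB-D 相机的表面点预测果实几何中心的最小示意;
# 数据为随机合成,仅演示"模型驱动定位"的训练/评估流程。
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
centers = rng.uniform(0, 500, size=(200, 3))             # 真实几何中心 (mm)
offsets = rng.normal(0, 10, size=(200, 6))               # 表面点相对中心的偏移
X = np.hstack([centers + offsets[:, :3], centers + offsets[:, 3:]])  # Cfix+Ceih

model = MultiOutputRegressor(AdaBoostRegressor(n_estimators=50, random_state=0))
model.fit(X[:150], centers[:150])
pred = model.predict(X[150:])
med = np.linalg.norm(pred - centers[150:], axis=1).mean()
print(f"Mean Euclidean Distance: {med:.2f} mm")          # 对应论文的 MED 指标
```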

[CV-66] OCT Data is All You Need: How Vision Transformers with and without Pre-training Benefit Imaging

【速读】:该论文旨在探讨ImageNet预训练对基于Vision Transformer (ViT)的眼科光学相干断层扫描(OCT)图像分类性能的影响,特别是在不同数据集大小下的表现。研究涵盖了四种视网膜病理类别(CNV、DME、Drusen、Normal)。关键在于发现虽然在小规模数据集上预训练可以加速收敛并可能提供更好的性能,但在充足OCT数据可用的情况下,从零开始训练可能达到相当或更优的准确性。研究表明,在预训练过程中匹配领域特性的重要性,并呼吁进一步研究针对大规模OCT特定数据的预训练方法。

链接: https://arxiv.org/abs/2502.12379
作者: Zihao Han,Philippe De Wilde
机构: University of Kent (肯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Optical Coherence Tomography (OCT) provides high-resolution cross-sectional images useful for diagnosing various diseases, but their distinct characteristics from natural images raise questions about whether large-scale pre-training on datasets like ImageNet is always beneficial. In this paper, we investigate the impact of ImageNet-based pre-training on Vision Transformer (ViT) performance for OCT image classification across different dataset sizes. Our experiments cover four-category retinal pathologies (CNV, DME, Drusen, Normal). Results suggest that while pre-training can accelerate convergence and potentially offer better performance in smaller datasets, training from scratch may achieve comparable or even superior accuracy when sufficient OCT data is available. Our findings highlight the importance of matching domain characteristics in pre-training and call for further study on large-scale OCT-specific pre-training.
zh
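下面给出论文所比较的两种初始化方式的一个最小示意(torchvision):同一 ViT-B/16 架构分别加载 ImageNet 预训练权重或随机初始化,再替换分类头用于 OCT 四分类。训练与评估流程从略。

```python
# 对比"ImageNet 预训练"与"从零训练"两种 ViT 初始化方式的最小示意;
# 使用 torchvision 的 vit_b_16,OCT 四分类(CNV/DME/Drusen/Normal)。
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

def build_vit(pretrained: bool, num_classes: int = 4):
    weights = ViT_B_16_Weights.IMAGENET1K_V1 if pretrained else None  # 需联网下载权重
    model = vit_b_16(weights=weights)
    model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)
    return model

vit_pretrained = build_vit(pretrained=True)   # 预训练:小数据集上收敛更快
vit_scratch = build_vit(pretrained=False)     # 从零训练:OCT 数据充足时可比或更优
```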

[CV-67] Alignment and Adversarial Robustness: Are More Human-Like Models More Secure?

【速读】:该论文旨在探究表征对齐(Representational Alignment)与对抗鲁棒性(Adversarial Robustness)之间的关系。研究通过大规模实证分析,评估了涵盖多种架构和训练范式的118个模型,并测量其神经和行为对齐度以及在106个基准任务中的工程性能和对抗鲁棒性。研究发现虽然平均对齐度和鲁棒性之间整体相关性较弱,但特定的对齐基准,特别是那些衡量纹理或形状选择性的基准,可以作为对抗鲁棒性的强预测因子。关键在于识别不同形式的对齐在模型鲁棒性中所起的不同作用,从而为进一步利用对齐驱动的方法构建更安全且感知基础的视觉模型提供依据。

链接: https://arxiv.org/abs/2502.12377
作者: Blaine Hoak,Kunyang Li,Patrick McDaniel
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Representational alignment refers to the extent to which a model’s internal representations mirror biological vision, offering insights into both neural similarity and functional correspondence. Recently, some more aligned models have demonstrated higher resiliency to adversarial examples, raising the question of whether more human-aligned models are inherently more secure. In this work, we conduct a large-scale empirical analysis to systematically investigate the relationship between representational alignment and adversarial robustness. We evaluate 118 models spanning diverse architectures and training paradigms, measuring their neural and behavioral alignment and engineering task performance across 106 benchmarks as well as their adversarial robustness via AutoAttack. Our findings reveal that while average alignment and robustness exhibit a weak overall correlation, specific alignment benchmarks serve as strong predictors of adversarial robustness, particularly those that measure selectivity towards texture or shape. These results suggest that different forms of alignment play distinct roles in model robustness, motivating further investigation into how alignment-driven approaches can be leveraged to build more secure and perceptually-grounded vision models.
zh
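论文的核心分析可概括为"对齐度指标与对抗鲁棒性之间的相关性检验"。下面是该分析形式的一个最小示意(SciPy):数据为随机占位,实际应替换为 118 个模型在对齐基准与 AutoAttack 下的真实测量值。

```python
# 计算"对齐度得分"与"对抗鲁棒性"相关性的最小示意;数据为随机占位。
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
alignment = rng.uniform(0, 1, size=118)                  # 每个模型的对齐度(占位)
robust_acc = 0.3 * alignment + rng.normal(0, 0.2, 118)   # 鲁棒精度(占位)
rho, p = spearmanr(alignment, robust_acc)
print(f"Spearman rho={rho:.3f}, p={p:.3g}")
```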

[CV-68] Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions

【速读】:该论文旨在解决深度神经网络(DNNs)在处理未结构化数据时,系统性弱点的发现难题。这些系统性弱点指的是DNN在语义上连贯的数据子集或切片(slice)上的低性能表现,切片发现方法旨在找出其中表现最差的top-k个切片。论文的关键解决方案在于提出了一种结合现代基础模型与组合搜索算法的工作流程,该流程能够处理结构化数据及DNN错误,从而识别出符合预定义人类可理解维度的弱切片。此外,为了应对元数据噪声的影响,该工作流程内嵌了相应的处理方法。论文通过在四个流行的计算机视觉数据集上进行评估,验证了所提方法的有效性。

链接: https://arxiv.org/abs/2502.12360
作者: Sujan Sai Gannamaneni,Rohil Prakash Rao,Michael Mock,Maram Akila,Stefan Wrobel
机构: Fraunhofer IAIS(弗劳恩霍夫协会人工智能研究所); Lamarr Institute(拉玛尔学院); University of Bonn(波恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Studying systematic weaknesses of DNNs has gained prominence in the last few years with the rising focus on building safe AI systems. Slice discovery methods (SDMs) are prominent algorithmic approaches for finding such systematic weaknesses. They identify top-k semantically coherent slices/subsets of data where a DNN-under-test has low performance. For being directly useful, e.g., as evidences in a safety argumentation, slices should be aligned with human-understandable (safety-relevant) dimensions, which, for example, are defined by safety and domain experts as parts of the operational design domain (ODD). While straightforward for structured data, the lack of semantic metadata makes these investigations challenging for unstructured data. Therefore, we propose a complete workflow which combines contemporary foundation models with algorithms for combinatorial search that consider structured data and DNN errors for finding systematic weaknesses in images. In contrast to existing approaches, ours identifies weak slices that are in line with predefined human-understandable dimensions. As the workflow includes foundation models, its intermediate and final results may not always be exact. Therefore, we build into our workflow an approach to address the impact of noisy metadata. We evaluate our approach w.r.t. its quality on four popular computer vision datasets, including autonomous driving datasets like Cityscapes, BDD100k, and RailSem19, while using multiple state-of-the-art models as DNNs-under-test.
zh
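下面给出"沿预定义维度做组合搜索找弱切片"这一思路的最小示意(pandas):按元数据维度的组合分组、统计错误率并取样本量足够的 top-k 切片。列名与阈值均为假设,未包含论文中的基础模型与元数据噪声处理部分。

```python
# 在预定义的人类可理解维度上搜索"弱切片"的最小示意。
import itertools
import pandas as pd

def weak_slices(df: pd.DataFrame, dims, k=5, min_size=30):
    results = []
    for r in range(1, len(dims) + 1):
        for combo in itertools.combinations(dims, r):        # 维度组合
            stats = df.groupby(list(combo))["correct"].agg(["mean", "size"])
            stats = stats[stats["size"] >= min_size]          # 过滤过小切片
            for key, row in stats.iterrows():
                key = key if isinstance(key, tuple) else (key,)
                results.append((dict(zip(combo, key)), int(row["size"]),
                                1.0 - row["mean"]))           # 切片错误率
    return sorted(results, key=lambda t: -t[2])[:k]

df = pd.DataFrame({                                           # 占位示例数据
    "weather": ["rain", "sun"] * 60,
    "time_of_day": (["day"] * 30 + ["night"] * 30) * 2,
    "correct": [True] * 60 + [False] * 60,
})
print(weak_slices(df, dims=["weather", "time_of_day"], k=3))
```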

[CV-69] LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理部分遮挡物体的问题时,由于语言先验(language priors)不够强而导致的问答准确率低下的问题。关键在于通过提出一个名为LanP的基准测试来重新评估当前LVLMs中的语言先验强度,以探究这些模型是否具备足够的语言先验能力来有效辅助在视觉信息不足的情况下完成任务。

链接: https://arxiv.org/abs/2502.12359
作者: Zongyu Wu,Yuwei Niu,Hongcheng Gao,Minhua Lin,Zhiwei Zhang,Zhifang Zhang,Qi Shi,Yilong Wang,Sike Fu,Junjie Xu,Junjie Ao,Enyan Dai,Lei Feng,Xiang Zhang,Suhang Wang
机构: The Pennsylvania State University; Singapore University of Technology and Design; Peking University; Rensselaer Polytechnic Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have shown impressive performance in various tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies emphasized that the strong language priors of LVLMs can overpower visual information, causing hallucinations. However, the positive role of language priors is the key to a powerful LVLM. If the language priors are too weak, LVLMs will struggle to leverage rich parameter knowledge and instruction understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. Therefore, we propose a benchmark called LanP to rethink the impact of Language Priors in LVLMs. It is designed to investigate how strong language priors are in current LVLMs. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25 popular LVLMs reveal that many LVLMs’ language priors are not strong enough to effectively aid question answering when objects are partially hidden. Many models, including GPT-4 Turbo, exhibit an accuracy below 0.5 in such a scenario.
zh

[CV-70] REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

【速读】:该论文旨在解决多模态文档检索在 Retrieval-Augmented Generation (RAG) 系统中的准确性问题,并提出 REAL-MM-RAG 基准以应对实际应用中的挑战。论文的关键解决方案在于引入一个包含多模态文档、增强难度、真实RAG查询及精确标注的自动基准,以及基于查询重述的多难度等级方案,用于评估模型的语义理解能力。此外,通过整理重述训练集和引入新的金融领域、表格密集型数据集,进一步提升了模型在处理表格密集型文档和查询重述方面的性能。

链接: https://arxiv.org/abs/2502.12342
作者: Navve Wasserman,Roi Pony,Oshri Naparstek,Adi Raz Goldfarb,Eli Schwartz,Udi Barzelay,Leonid Karlinsky
机构: IBM Research Israel (IBM以色列研究院); Weizmann Institute of Science (魏茨曼科学研究学院)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges with their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models’ semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.
zh

[CV-71] Towards Fusing Point Cloud and Visual Representations for Imitation Learning

【速读】:该论文旨在解决在模仿学习中利用点云与RGB图像融合以实现高效操作任务的问题。现有方法通常将二维图像特征分配到点云上,但这些方法往往丢失原始图像中的全局上下文信息。论文的关键解决方案在于提出了一种新的模仿学习方法,通过自适应层规范化条件来结合点云编码器与全局及局部图像标记,从而有效发挥两种模态的优势。这一方法通过在RoboCasa基准上的广泛实验验证,展示了其在所有任务中达到最先进性能的能力。

链接: https://arxiv.org/abs/2502.12320
作者: Atalay Donat,Xiaogang Jia,Xi Huang,Aleksandar Taranovic,Denis Blessing,Ge Li,Hongyi Zhou,Hanyi Zhang,Rudolf Lioutikov,Gerhard Neumann
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning for manipulation requires using policies that have access to rich sensory information such as point clouds or RGB images. Point clouds efficiently capture geometric structures, making them essential for manipulation tasks in imitation learning. In contrast, RGB images provide rich texture and semantic information that can be crucial for certain tasks. Existing approaches for fusing both modalities assign 2D image features to point clouds. However, such approaches often lose global contextual information from the original images. In this work, we propose a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning, leveraging the beneficial properties of both modalities. Through extensive experiments on the challenging RoboCasa benchmark, we demonstrate the limitations of relying on either modality alone and show that our method achieves state-of-the-art performance across all tasks.
zh
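下面给出"自适应层归一化(adaLN)条件化"的一个最小示意(PyTorch):由全局图像 token 回归缩放/平移参数来调制点云特征。类名与维度均为假设,并非论文官方实现。

```python
# 自适应层归一化(adaLN)条件化的最小示意:用全局图像 token 调制点云特征。
import torch
import torch.nn as nn

class AdaLNConditioning(nn.Module):
    def __init__(self, pc_dim=256, img_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(pc_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(img_dim, pc_dim * 2)  # 回归调制参数

    def forward(self, pc_feat, img_token):
        # pc_feat: (B, N, pc_dim) 点云特征;img_token: (B, img_dim) 全局图像 token
        scale, shift = self.to_scale_shift(img_token).chunk(2, dim=-1)
        return self.norm(pc_feat) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

out = AdaLNConditioning()(torch.randn(2, 1024, 256), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 1024, 256])
```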

[CV-72] From Gaming to Research: GTA V for Synthetic Data Generation for Robotics and Navigations

【速读】:该论文旨在解决在计算机视觉领域中,获取大规模多样化环境条件下的真实数据集耗时、昂贵且有时不可行的问题。论文的关键解决方案是利用合成数据,特别是通过使用虚拟游戏环境Grand Theft Auto V (GTA V) 创建的合成数据集,并设计了一种无需人工监督的算法来生成用于Visual Place Recognition (VPR) 的数据集。实验结果表明,这些从GTA V衍生出的合成数据在Simultaneous Localization and Mapping (SLAM) 和VPR应用中与真实数据具有可比性,能够补充甚至替代真实数据。

链接: https://arxiv.org/abs/2502.12303
作者: Matteo Scucchia,Matteo Ferrara,Davide Maltoni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In computer vision, the development of robust algorithms capable of generalizing effectively in real-world scenarios more and more often requires large-scale datasets collected under diverse environmental conditions. However, acquiring such datasets is time-consuming, costly, and sometimes unfeasible. To address these limitations, the use of synthetic data has gained attention as a viable alternative, allowing researchers to generate vast amounts of data while simulating various environmental contexts in a controlled setting. In this study, we investigate the use of synthetic data in robotics and navigation, specifically focusing on Simultaneous Localization and Mapping (SLAM) and Visual Place Recognition (VPR). In particular, we introduce a synthetic dataset created using the virtual environment of the video game Grand Theft Auto V (GTA V), along with an algorithm designed to generate a VPR dataset, without human supervision. Through a series of experiments centered on SLAM and VPR, we demonstrate that synthetic data derived from GTA V are qualitatively comparable to real-world data. Furthermore, these synthetic data can complement or even substitute real-world data in these applications. This study sets the stage for the creation of large-scale synthetic datasets, offering a cost-effective and scalable solution for future research and development.
zh

[CV-73] Per-channel autoregressive linear prediction padding in tiled CNN processing of 2D spatial data

【速读】:该论文旨在解决卫星图像超分辨率重建过程中不同填充方法带来的误差问题。论文的关键在于提出了一种基于线性预测的可微分填充方法。通过最小化噪声项的最小二乘误差,该方法针对每个通道拟合了一个随机自回归线性模型。实验表明,与零填充和复制填充相比,线性预测填充略微降低了均方超分辨率误差,并更好地逼近了卫星图像数据及RVSR特征图数据。

链接: https://arxiv.org/abs/2502.12300
作者: Olli Niemitalo,Otto Rosenberg,Nathaniel Narra,Olli Koskela,Iivari Kunttu
机构: HAMK Häme University of Applied Sciences (哈米应用科学大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 20 figures including appendix; to be submitted for review; for source code, see this https URL

点击查看摘要

Abstract:We present linear prediction as a differentiable padding method. For each channel, a stochastic autoregressive linear model is fitted to the padding input by minimizing its noise terms in the least-squares sense. The padding is formed from the expected values of the autoregressive model given the known pixels. We trained the convolutional RVSR super-resolution model from scratch on satellite image data, using different padding methods. Linear prediction padding slightly reduced the mean square super-resolution error compared to zero and replication padding, with a moderate increase in time cost. Linear prediction padding better approximated satellite image data and RVSR feature map data. With zero padding, RVSR appeared to use more of its capacity to compensate for the high approximation error. Cropping the network output by a few pixels reduced the super-resolution error and the effect of the choice of padding method on the error, favoring output cropping with the faster replication and zero padding methods, for the studied workload.
zh
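下面以一维单通道信号为例给出"线性预测填充"的最小示意(NumPy):最小二乘拟合自回归系数,再用模型的期望值向边界外推。论文处理的是二维逐通道情形,此处的阶数 p 与实现细节均为假设简化。

```python
# 单通道一维信号的线性预测填充最小示意:最小二乘拟合 AR 系数后外推边界。
import numpy as np

def lp_pad_right(x, pad, p=3):
    # 构造 (样本数, p) 设计矩阵,最小二乘求自回归系数
    X = np.stack([x[i:len(x) - p + i] for i in range(p)], axis=1)
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    out = list(x)
    for _ in range(pad):                        # 用模型期望值逐点向右外推
        out.append(float(np.dot(coef, out[-p:])))
    return np.array(out)

sig = np.sin(np.linspace(0, 4 * np.pi, 64))
print(lp_pad_right(sig, pad=4)[-6:])            # 末尾 4 个值为外推填充
```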

[CV-74] Duo Streamers: A Streaming Gesture Recognition Framework

【速读】:该论文旨在解决在资源受限场景下手势识别面临的高精度与低延迟难以兼顾的挑战。解决方案的关键在于提出了一种名为Duo Streamers的流式手势识别框架,该框架通过三阶段稀疏识别机制、带有外部隐藏状态的轻量级RNN模型以及专门设计的训练和后处理管道,实现了实时性能和轻量化设计的创新进展。

链接: https://arxiv.org/abs/2502.12297
作者: Boxuan Zhu,Sicheng Yang,Zhuo Wang,Haining Liang,Junxiao Shen
机构: OpenInterX; University of Liverpool; HKUST (Guangzhou); University of Bristol
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Gesture recognition in resource-constrained scenarios faces significant challenges in achieving high accuracy and low latency. The streaming gesture recognition framework, Duo Streamers, proposed in this paper, addresses these challenges through a three-stage sparse recognition mechanism, an RNN-lite model with an external hidden state, and specialized training and post-processing pipelines, thereby making innovative progress in real-time performance and lightweight design. Experimental results show that Duo Streamers matches mainstream methods in accuracy metrics, while reducing the real-time factor by approximately 92.3%, i.e., delivering a nearly 13-fold speedup. In addition, the framework shrinks parameter counts to 1/38 (idle state) and 1/9 (busy state) compared to mainstream models. In summary, Duo Streamers not only offers an efficient and practical solution for streaming gesture recognition in resource-constrained devices but also lays a solid foundation for extended applications in multimodal and diverse scenarios.
zh
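下面给出"带外部隐藏状态的轻量 RNN"做流式逐帧推理的最小示意(PyTorch):隐藏状态由调用方持有并逐帧传入,便于在空闲/忙碌状态间切换。结构与维度均为本文假设。

```python
# "带外部隐藏状态的轻量 RNN"流式推理最小示意:每帧只做一步前向。
import torch
import torch.nn as nn

class RNNLite(nn.Module):
    def __init__(self, in_dim=64, hidden=32, num_classes=10):
        super().__init__()
        self.cell = nn.GRUCell(in_dim, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frame_feat, h):
        h = self.cell(frame_feat, h)            # 隐藏状态外部传入、外部保存
        return self.head(h), h

model, h = RNNLite(), torch.zeros(1, 32)
for _ in range(5):                              # 模拟逐帧流式输入
    logits, h = model(torch.randn(1, 64), h)
print(logits.shape)                             # torch.Size([1, 10])
```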

[CV-75] Data-Efficient Limited-Angle CT Using Deep Priors and Regularization

【速读】:该论文旨在解决在严重角度限制条件下从Radon变换重建图像的问题。在实际应用场景中,如X射线扫描,完全180度扫描不可行或需要减少辐射暴露时,该问题尤为突出。由于逆问题的病态特性,现有方法往往会产生显著的伪影。论文的关键解决方案在于结合多种正则化方法(包括全变分Total Variation、sinogram滤波器、深度图像先验Deep Image Prior以及基于补丁的自动编码器)来处理这一病态问题,并采用可微分Radon变换实现梯度驱动的逆问题求解技术。通过这种方法,仅使用少量数据点(总计12个),论文展示了与最佳的数据驱动合成方法相当的结果。

链接: https://arxiv.org/abs/2502.12293
作者: Ilmari Vahteristo,Zhi-Song Liu,Andreas Rupp
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 reference pages, 5 figures, submitted to SCIA 2024

点击查看摘要

Abstract:Reconstructing an image from its Radon transform is a fundamental computed tomography (CT) task arising in applications such as X-ray scans. In many practical scenarios, a full 180-degree scan is not feasible, or there is a desire to reduce radiation exposure. In these limited-angle settings, the problem becomes ill-posed, and methods designed for full-view data often leave significant artifacts. We propose a very low-data approach to reconstruct the original image from its Radon transform under severe angle limitations. Because the inverse problem is ill-posed, we combine multiple regularization methods, including Total Variation, a sinogram filter, Deep Image Prior, and a patch-level autoencoder. We use a differentiable implementation of the Radon transform, which allows us to use gradient-based techniques to solve the inverse problem. Our method is evaluated on a dataset from the Helsinki Tomography Challenge 2022, where the goal is to reconstruct a binary disk from its limited-angle sinogram. We only use a total of 12 data points (eight for learning a prior and four for hyperparameter selection) and achieve results comparable to the best synthetic data-driven approaches.
zh
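下面给出"可微前向算子 + 全变分正则"梯度反演思想的最小示意(PyTorch):为保持自包含,用随机矩阵占位代替论文中的可微有限角 Radon 变换;其余正则项(sinogram 滤波、Deep Image Prior、补丁自编码器)从略。

```python
# 梯度驱动的正则化反演最小示意:数据项 + 全变分(TV)。
import torch

torch.manual_seed(0)
n = 32
A = torch.randn(300, n * n) / n                 # 占位的"有限角投影"算子
x_true = torch.zeros(n, n); x_true[8:24, 8:24] = 1.0
y = A @ x_true.flatten()                        # 模拟观测 sinogram

x = torch.zeros(n, n, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(300):
    data = ((A @ x.flatten() - y) ** 2).mean()  # 数据一致性项
    tv = (x[:, 1:] - x[:, :-1]).abs().mean() + (x[1:, :] - x[:-1, :]).abs().mean()
    loss = data + 0.01 * tv                     # TV 正则抑制伪影
    opt.zero_grad(); loss.backward(); opt.step()
print(f"重建误差: {(x.detach() - x_true).abs().mean().item():.4f}")
```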

[CV-76] SmokeNet: Efficient Smoke Segmentation Leveraging Multiscale Convolutions and Multiview Attention Mechanisms

【速读】:该论文旨在解决烟雾羽流高效分割的问题,特别是在环境监测和工业安全领域,现有模型通常面临高计算需求和对多样烟雾形态适应性有限的挑战。论文的关键解决方案是引入SmokeNet,一种新颖的深度学习架构,利用多尺度卷积和多视角线性注意力机制,并结合分层特定损失函数,以应对复杂多样的烟雾羽流动态,从而实现在不同环境中的高效且准确的分割。

链接: https://arxiv.org/abs/2502.12258
作者: Xuesong Liu,Emmett J. Ientilucci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficient segmentation of smoke plumes is crucial for environmental monitoring and industrial safety, enabling the detection and mitigation of harmful emissions from activities like quarry blasts and wildfires. Accurate segmentation facilitates environmental impact assessments, timely interventions, and compliance with safety standards. However, existing models often face high computational demands and limited adaptability to diverse smoke appearances, restricting their deployment in resource-constrained environments. To address these issues, we introduce SmokeNet, a novel deep learning architecture that leverages multiscale convolutions and multiview linear attention mechanisms combined with layer-specific loss functions to handle the complex dynamics of diverse smoke plumes, ensuring efficient and accurate segmentation across varied environments. Additionally, we evaluate SmokeNet’s performance and versatility using four datasets, including our quarry blast smoke dataset made available to the community. The results demonstrate that SmokeNet maintains a favorable balance between computational efficiency and segmentation accuracy, making it suitable for deployment in environmental monitoring and safety management systems. By contributing a new dataset and offering an efficient segmentation model, SmokeNet advances smoke segmentation capabilities in diverse and challenging environments.
zh
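下面给出"多尺度卷积块"的一个最小示意(PyTorch):并联 1×1、3×3、5×5 卷积分支后按通道拼接,以适配形态多变的烟羽。通道划分与分支设置均为假设,非论文原始结构。

```python
# 多尺度卷积块的最小示意:并联不同感受野的分支后按通道拼接。
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch = out_ch // 3
        self.b1 = nn.Conv2d(in_ch, branch, 1)                     # 细节
        self.b3 = nn.Conv2d(in_ch, branch, 3, padding=1)          # 中等感受野
        self.b5 = nn.Conv2d(in_ch, out_ch - 2 * branch, 5, padding=2)  # 大感受野
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1))

print(MultiScaleBlock(3, 48)(torch.randn(1, 3, 64, 64)).shape)    # (1, 48, 64, 64)
```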

[CV-77] PUGS: Zero-shot Physical Understanding with Gaussian Splatting ICRA2025

【速读】:该论文旨在解决机器人系统在理解物体物理属性(如质量、摩擦力和硬度)方面面临的挑战。论文的关键解决方案在于提出了一种名为PUGS的新框架,利用高斯点阵表示法(Gaussian Splatting Representation)进行三维物体重建,并以零样本方式预测多种物理属性。为了改善形状质量和区域亲和性,论文提出了几何感知正则化损失函数和区域感知特征对比损失函数。此外,在推理阶段设计了基于特征的属性传播模块和针对高斯表示优化的体积整合模块。这些创新使PUGS在ABO-500质量预测标准基准测试中达到了新的最先进水平。

链接: https://arxiv.org/abs/2502.12231
作者: Yinghao Shuai,Ran Yu,Yuantao Chen,Zijian Jiang,Xiaowei Song,Nan Wang,Jv Zheng,Jianzhu Ma,Meng Yang,Zhicheng Wang,Wenbo Ding,Hao Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025, Project page: this https URL

点击查看摘要

Abstract:Current robotic systems can understand the categories and poses of objects well. But understanding physical properties like mass, friction, and hardness, in the wild, remains challenging. We propose a new method that reconstructs 3D objects using the Gaussian splatting representation and predicts various physical properties in a zero-shot manner. We propose two techniques during the reconstruction phase: a geometry-aware regularization loss function to improve the shape quality and a region-aware feature contrastive loss function to promote region affinity. Two other new techniques are designed during inference: a feature-based property propagation module and a volume integration module tailored for the Gaussian representation. Our framework is named as zero-shot physical understanding with Gaussian splatting, or PUGS. PUGS achieves new state-of-the-art results on the standard benchmark of ABO-500 mass prediction. We provide extensive quantitative ablations and qualitative visualization to demonstrate the mechanism of our designs. We show the proposed methodology can help address challenging real-world grasping tasks. Our codes, data, and models are available at this https URL
zh

[CV-78] AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors ICLR2025

【速读】:该论文旨在解决因低标准化的触觉传感器数据特性导致难以建立强大的触觉感知系统的问题。解决方案的关键在于学习统一的多传感器表示,从而实现传感器之间的整合以及促进触觉知识在它们之间的迁移。为此,论文引入了TacQuad数据集,并提出了AnyTouch框架,通过静态和动态视角的学习,结合触觉图像和视频,利用多层次结构捕捉像素级细节并增强语义级传感器无关特征的学习,以提升综合感知能力和跨传感器的有效迁移能力。

链接: https://arxiv.org/abs/2502.12191
作者: Ruoxuan Feng,Jiangyu Hu,Wenke Xia,Tianci Gao,Ao Shen,Yuhao Sun,Bin Fang,Di Hu
机构: Renmin University of China; Wuhan University of Science and Technology; Beijing University of Posts and Telecommunications
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICLR 2025

点击查看摘要

Abstract:Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to precisely understand and manipulate objects. Over time, numerous meticulously designed visuo-tactile sensors have been integrated into robotic systems, aiding in completing various tasks. However, the distinct data characteristics of these low-standardized visuo-tactile sensors hinder the establishment of a powerful tactile perception system. We consider that the key to addressing this issue lies in learning unified multi-sensor representations, thereby integrating the sensors and promoting tactile knowledge transfer between them. To achieve unified representation of this nature, we introduce TacQuad, an aligned multi-modal multi-sensor tactile dataset from four different visuo-tactile sensors, which enables the explicit integration of various sensors. Recognizing that humans perceive the physical environment by acquiring diverse tactile information such as texture and pressure changes, we further propose to learn unified multi-sensor representations from both static and dynamic perspectives. By integrating tactile images and videos, we present AnyTouch, a unified static-dynamic multi-sensor representation learning framework with a multi-level structure, aimed at both enhancing comprehensive perceptual abilities and enabling effective cross-sensor transfer. This multi-level architecture captures pixel-level details from tactile data via masked modeling and enhances perception and transferability by learning semantic-level sensor-agnostic features through multi-modal alignment and cross-sensor matching. We provide a comprehensive analysis of multi-sensor transferability, and validate our method on various datasets and in the real-world pouring task. Experimental results show that our method outperforms existing methods, exhibits outstanding static and dynamic perception capabilities across various sensors.
zh

[CV-79] CFIRSTNET: Comprehensive Features for Static IR Drop Estimation with Neural Network

【速读】:该论文旨在解决现代电子产品中与可靠性和性能相关的IR降(IR drop,即电压降)估计问题。传统方法需要长时间的迭代和仿真流程,而论文的关键在于利用现代AI加速技术,结合图像特征和网表特征,开发了一个定制的卷积神经网络(CNN),以有效提取PDN(电源分配网络)特征并进行静态IR降预测。实验结果表明,该方法在ICCAD 2023 CAD竞赛中取得了最佳性能,证明了其有效性。

链接: https://arxiv.org/abs/2502.12168
作者: Yu-Tung Liu,Yu-Hao Cheng,Shao-Yu Wu,Hung-Ming Chen
机构: National Yang Ming Chiao Tung University(阳明交通大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/ACM International Conference on Computer-Aided Design (ICCAD '24), October 27–31, 2024

点击查看摘要

Abstract:IR drop estimation is now considered a first-order metric due to the concern about reliability and performance in modern electronic products. Since traditional solution involves lengthy iteration and simulation flow, how to achieve fast yet accurate estimation has become an essential demand. In this work, with the help of modern AI acceleration techniques, we propose a comprehensive solution to combine both the advantages of image-based and netlist-based features in neural network framework and obtain high-quality IR drop prediction very effectively in modern designs. A customized convolutional neural network (CNN) is developed to extract PDN features and make static IR drop estimations. Trained and evaluated with the open-source dataset, experiment results show that we have obtained the best quality in the benchmark on the problem of IR drop estimation in ICCAD CAD Contest 2023, proving the effectiveness of this important design topic.
zh
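下面给出"以 PDN 特征图为输入、逐像素回归静态 IR 降"这类全卷积网络的最小示意(PyTorch):输入通道数与层数等均为假设,并非竞赛原模型。

```python
# 逐像素回归静态 IR 降的全卷积网络最小示意;通道数与层数均为假设。
import torch
import torch.nn as nn

ir_net = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),    # 输入:电流/电阻等 PDN 特征图
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1),                          # 输出:IR 降热力图
)
print(ir_net(torch.randn(1, 4, 128, 128)).shape)  # torch.Size([1, 1, 128, 128])
```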

[CV-80] Texture Image Synthesis Using Spatial GAN Based on Vision Transformers

【速读】:该论文旨在解决传统纹理合成方法在处理复杂纹理时存在的局限性。解决方案的关键在于提出了一种名为ViT-SGAN的新混合模型,该模型融合了视觉变换器(Vision Transformers, ViTs)与空间生成对抗网络(Spatial Generative Adversarial Network, SGAN)。通过将专门的纹理描述符如均值-方差(mean-variance, mu, sigma)和纹理基元(textons)融入到ViTs的自注意力机制中,该模型能够更有效地捕捉复杂的空间依赖关系,从而生成质量更高的纹理,尤其对于规则和不规则纹理的合成表现更为优越。

链接: https://arxiv.org/abs/2502.01842
作者: Elahe Salari,Zohreh Azimifar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published at the 2nd International Conference on Artificial Intelligence and Software Engineering (AI-SOFT), Shiraz University, Shiraz, Iran, 2024

点击查看摘要

Abstract:Texture synthesis is a fundamental task in computer vision, whose goal is to generate visually realistic and structurally coherent textures for a wide range of applications, from graphics to scientific simulations. While traditional methods like tiling and patch-based techniques often struggle with complex textures, recent advancements in deep learning have transformed this field. In this paper, we propose ViT-SGAN, a new hybrid model that fuses Vision Transformers (ViTs) with a Spatial Generative Adversarial Network (SGAN) to address the limitations of previous methods. By incorporating specialized texture descriptors such as mean-variance (mu, sigma) and textons into the self-attention mechanism of ViTs, our model achieves superior texture synthesis. This approach enhances the model’s capacity to capture complex spatial dependencies, leading to improved texture quality that is superior to state-of-the-art models, especially for regular and irregular textures. Comparison experiments with metrics such as FID, IS, SSIM, and LPIPS demonstrate the substantial improvement of ViT-SGAN, which underlines its efficiency in generating diverse realistic textures.
zh

[CV-81] 3D ReX: Causal Explanations in 3D Neuroimaging Classification

【速读】:该论文旨在解决医学影像领域中AI模型的可解释性问题,这使得临床医生难以信任基于AI的预测。论文的关键解决方案是引入3D ReX,这是一种基于因果关系的事后(post-hoc)可解释性工具,适用于三维模型。3D ReX利用实际因果理论生成责任图,突出显示对模型决策最为关键的区域。通过在中风检测模型上的测试,3D ReX提供了有关与中风相关的特征空间分布的见解。

链接: https://arxiv.org/abs/2502.12181
作者: Melane Navaratnarajah,Sophie A. Martin,David A. Kelly,Nathan Blake,Hana Chocker
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Explainability remains a significant problem for AI models in medical imaging, making it challenging for clinicians to trust AI-driven predictions. We introduce 3D ReX, the first causality-based post-hoc explainability tool for 3D models. 3D ReX uses the theory of actual causality to generate responsibility maps which highlight the regions most crucial to the model’s decision. We test 3D ReX on a stroke detection model, providing insight into the spatial distribution of features relevant to stroke.
zh

[CV-82] ClusMFL: A Cluster-Enhanced Framework for Modality-Incomplete Multimodal Federated Learning in Brain Imaging Analysis

【速读】:本文旨在解决在联邦学习框架下多模态脑影像分析中因模态缺失导致的数据分布不均与知识转移难题。为应对这一挑战,论文提出了一种名为ClusMFL的新框架,关键在于利用FINCH算法进行特征聚类,并通过有监督对比学习实现特征对齐,同时采用模态感知聚合策略以优化模型性能。这些方法有效克服了模态不完整性的限制,实现了跨机构的脑影像分析。

链接: https://arxiv.org/abs/2502.12180
作者: Xinpeng Wang,Rong Zhou,Han Xie,Xiaoying Tang,Lifang He,Carl Yang
机构: School of Science and Engineering, the Chinese University of Hong Kong, Shenzhen(香港中文大学深圳分校理工学院); Department of Computer Science and Engineering, Lehigh University(里海大学计算机科学与工程系); Department of Computer Science, Emory University(埃默里大学计算机科学系)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Federated Learning (MFL) has emerged as a promising approach for collaboratively training multimodal models across distributed clients, particularly in healthcare domains. In the context of brain imaging analysis, modality incompleteness presents a significant challenge, where some institutions may lack specific imaging modalities (e.g., PET, MRI, or CT) due to privacy concerns, device limitations, or data availability issues. While existing work typically assumes modality completeness or oversimplifies missing-modality scenarios, we simulate a more realistic setting by considering both client-level and instance-level modality incompleteness in this study. Building on this realistic simulation, we propose ClusMFL, a novel MFL framework that leverages feature clustering for cross-institutional brain imaging analysis under modality incompleteness. Specifically, ClusMFL utilizes the FINCH algorithm to construct a pool of cluster centers for the feature embeddings of each modality-label pair, effectively capturing fine-grained data distributions. These cluster centers are then used for feature alignment within each modality through supervised contrastive learning, while also acting as proxies for missing modalities, allowing cross-modal knowledge transfer. Furthermore, ClusMFL employs a modality-aware aggregation strategy, further enhancing the model’s performance in scenarios with severe modality incompleteness. We evaluate the proposed framework on the ADNI dataset, utilizing structural MRI and PET scans. Extensive experimental results demonstrate that ClusMFL achieves state-of-the-art performance compared to various baseline methods across varying levels of modality incompleteness, providing a scalable solution for cross-institutional brain imaging analysis.
zh

人工智能

[AI-0] Pre-training Auto-regressive Robotic Models with 4D Representations

链接: https://arxiv.org/abs/2502.13142
作者: Dantong Niu,Yuvan Sharma,Haoru Xue,Giscard Biamby,Junyi Zhang,Ziteng Ji,Trevor Darrell,Roei Herzig
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
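下面给出"借助单目深度把二维点轨迹提升为三维(随时间即构成 4D 表征)"这一步的最小示意(NumPy):标准针孔相机反投影,内参与深度均为随机占位假设,并非论文官方实现。

```python
# 借助单目深度将二维点轨迹提升为三维的最小示意(针孔相机反投影)。
import numpy as np

def lift_tracks(tracks_2d, depths, fx, fy, cx, cy):
    """tracks_2d: (T, N, 2) 像素坐标;depths: (T, N) 对应深度 -> 返回 (T, N, 3)"""
    u, v = tracks_2d[..., 0], tracks_2d[..., 1]
    x = (u - cx) / fx * depths
    y = (v - cy) / fy * depths
    return np.stack([x, y, depths], axis=-1)

T, N = 8, 5                                      # 8 帧、5 个跟踪点(占位数据)
pts = lift_tracks(np.random.rand(T, N, 2) * 640,
                  np.random.rand(T, N) * 3 + 1, 500, 500, 320, 240)
print(pts.shape)   # (8, 5, 3):随时间变化的三维点轨迹,即 4D 表征
```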

[AI-1] AIDE: AI-Driven Exploration in the Space of Code

链接: https://arxiv.org/abs/2502.13138
作者: Zhengyao Jiang,Dominik Schmidt,Dhruv Srikanth,Dixing Xu,Ian Kaplan,Deniss Jacenko,Yuxiang Wu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind advancements lies a complex and often tedious process requiring labor and compute intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.
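下面给出"把试错视为解空间树搜索"这一框架的最小示意(Python):propose/evaluate 为占位函数,实际应分别由 LLM 生成候选代码与真实训练评估实现;贪心扩展策略也是本文的假设简化。

```python
# 解空间树搜索的最小示意:每个节点是一份候选方案,按评估分数贪心扩展。
import random
from typing import Optional

def propose(parent_code: Optional[str]) -> str:   # 占位:实际应调用 LLM 生成/改进代码
    return (parent_code or "baseline") + "+tweak"

def evaluate(code: str) -> float:                 # 占位:实际应训练并评估该方案
    return random.random() + 0.1 * code.count("tweak")

def tree_search(budget=20, k_children=2):
    tree = [{"code": propose(None)}]
    tree[0]["score"] = evaluate(tree[0]["code"])
    for _ in range(budget):
        parent = max(tree, key=lambda n: n["score"])   # 贪心选最优节点继续改进
        for _ in range(k_children):
            child = {"code": propose(parent["code"])}
            child["score"] = evaluate(child["code"])
            tree.append(child)
    return max(tree, key=lambda n: n["score"])

print(tree_search()["score"])
```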

[AI-2] Theorem Prover as a Judge for Synthetic Data Generation

链接: https://arxiv.org/abs/2502.13137
作者: Joshua Ong Jun Leang,Giwon Hong,Wenda Li,Shay B. Cohen
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.

[AI-3] Learning to Defer for Causal Discovery with Imperfect Experts

链接: https://arxiv.org/abs/2502.13132
作者: Oscar Clivio,Divyat Mahajan,Perouz Taslakian,Sara Magliacane,Ioannis Mitliagkas,Valentina Zantedeschi,Alexandre Drouin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert’s performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
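下面给出"学习延迟函数以在数据驱动结果与专家建议之间做选择"的最小示意(scikit-learn):元特征、标签构造与门控模型均为演示用假设,并非论文官方实现。

```python
# learning-to-defer 延迟函数的最小示意:基于元特征预测"专家是否更可靠"。
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
meta = rng.normal(size=(300, 4))                  # 每个因果对的元特征(占位)
expert_correct = (meta[:, 0] > 0).astype(int)     # 专家在哪些域更可靠(占位标签)
gate = LogisticRegression().fit(meta, expert_correct)

def decide(meta_x, data_pred, expert_pred):
    # 延迟函数判定"专家更可能正确"时采信专家,否则用数据驱动结果
    return expert_pred if gate.predict(meta_x[None])[0] else data_pred

print(decide(meta[0], data_pred="X->Y", expert_pred="Y->X"))
```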

[AI-4] SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

链接: https://arxiv.org/abs/2502.13128
作者: Zihan Liu,Shuangrui Ding,Zhixiong Zhang,Xiaoyi Dong,Pan Zhang,Yuhang Zang,Yuhang Cao,Dahua Lin,Jiaqi Wang
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at this https URL , and the code will be available at this https URL .

[AI-5] Near-Optimal Private Learning in Linear Contextual Bandits

链接: https://arxiv.org/abs/2502.13115
作者: Fan Chen,Jiachun Li,Alexander Rakhlin,David Simchi-Levi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We analyze the problem of private learning in generalized linear contextual bandits. Our approach is based on a novel method of re-weighted regression, yielding an efficient algorithm with regret of order $\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ in the joint and local model of $\alpha$-privacy, respectively. Further, we provide near-optimal private procedures that achieve dimension-independent rates in private linear models and linear contextual bandits. In particular, our results imply that joint privacy is almost “for free” in all the settings we consider, partially addressing the open problem posed by Azize and Basu (2024).

[AI-6] MatterChat: A Multi-Modal LLM for Material Science

链接: https://arxiv.org/abs/2502.13107
作者: Yingheng Tang,Wenbin Xu,Jie Cao,Jianzhu Ma,Weilu Gao,Steve Farrell,Benjamin Erichson,Michael W. Mahoney,Andy Nonaka,Zhi Yao
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding and predicting the properties of inorganic materials is crucial for accelerating advancements in materials science and driving applications in energy, electronics, and beyond. Integrating material structure data with language-based information through multi-modal large language models (LLMs) offers great potential to support these efforts by enhancing human-AI interaction. However, a key challenge lies in integrating atomic structures at full resolution into LLMs. In this work, we introduce MatterChat, a versatile structure-aware multi-modal LLM that unifies material structural data and textual inputs into a single cohesive model. MatterChat employs a bridging module to effectively align a pretrained machine learning interatomic potential with a pretrained LLM, reducing training costs and enhancing flexibility. Our results demonstrate that MatterChat significantly improves performance in material property prediction and human-AI interaction, surpassing general-purpose LLMs such as GPT-4. We also demonstrate its usefulness in applications such as more advanced scientific reasoning and step-by-step material synthesis.

[AI-7] BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification

链接: https://arxiv.org/abs/2502.13080
作者: Bich-Chung Phan,Thanh Ma,Huu-Hoa Nguyen,and Thanh-Nghi Do
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Gene expression classification is a pivotal yet challenging task in bioinformatics, primarily due to the high dimensionality of genomic data and the risk of overfitting. To bridge this gap, we propose BOLIMES, a novel feature selection algorithm designed to enhance gene expression classification by systematically refining the feature subset. Unlike conventional methods that rely solely on statistical ranking or classifier-specific selection, we integrate the robustness of Boruta with the interpretability of LIME, ensuring that only the most relevant and influential genes are retained. BOLIMES first employs Boruta to filter out non-informative genes by comparing each feature against its randomized counterpart, thus preserving valuable information. It then uses LIME to rank the remaining genes based on their local importance to the classifier. Finally, an iterative classification evaluation determines the optimal feature subset by selecting the number of genes that maximizes predictive accuracy. By combining exhaustive feature selection with interpretability-driven refinement, our solution effectively balances dimensionality reduction with high classification performance, offering a powerful solution for high-dimensional gene expression analysis.
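下面给出 BOLIMES 三阶段流程的一个最小可运行示意(scikit-learn + boruta 包):第一阶段用 Boruta 过滤特征;第二阶段本应由 LIME 做局部重要性排序,此处为保持简短用随机森林特征重要性近似代替;第三阶段迭代评估取交叉验证最优特征数。数据为合成数据,所有超参数均为假设。

```python
# BOLIMES 流程示意:Boruta 过滤 -> 重要性排序(以 RF 重要性近似 LIME)
# -> 迭代评估取最优特征子集。需要 pip install Boruta。
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from boruta import BorutaPy

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0, max_iter=20)
boruta.fit(X, y)
kept = np.where(boruta.support_)[0]               # 第一阶段:保留有信息特征

rf.fit(X[:, kept], y)                             # 第二阶段:按重要性排序(近似)
order = kept[np.argsort(-rf.feature_importances_)]
best = max(range(1, len(order) + 1),              # 第三阶段:迭代取交叉验证最优
           key=lambda k: cross_val_score(rf, X[:, order[:k]], y, cv=3).mean())
print(f"选中 {best} 个基因特征")
```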

[AI-8] Interactive Agents to Overcome Ambiguity in Software Engineering

链接: https://arxiv.org/abs/2502.13069
作者: Sanidhya Vijayvargiya,Xuhui Zhou,Akhila Yerukola,Maarten Sap,Graham Neubig
类目: Artificial Intelligence (cs.AI)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) leveraging interactivity to improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c) asking targeted questions. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user, leading to significant improvements in performance and underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle ambiguity in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.

[AI-9] AI-Assisted Decision Making with Human Learning

链接: https://arxiv.org/abs/2502.13062
作者: Gali Noti,Kate Donahue,Jon Kleinberg,Sigal Oren
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:AI systems increasingly support human decision-making. In many cases, despite the algorithm’s superior performance, the final decision remains in human hands. For example, an AI may assist doctors in determining which diagnostic tests to run, but the doctor ultimately makes the diagnosis. This paper studies such AI-assisted decision-making settings, where the human learns through repeated interactions with the algorithm. In our framework, the algorithm – designed to maximize decision accuracy according to its own model – determines which features the human can consider. The human then makes a prediction based on their own less accurate model. We observe that the discrepancy between the algorithm’s model and the human’s model creates a fundamental tradeoff. Should the algorithm prioritize recommending more informative features, encouraging the human to recognize their importance, even if it results in less accurate predictions in the short term until learning occurs? Or is it preferable to forgo educating the human and instead select features that align more closely with their existing understanding, minimizing the immediate cost of learning? This tradeoff is shaped by the algorithm’s time-discounted objective and the human’s learning ability. Our results show that optimal feature selection has a surprisingly clean combinatorial characterization, reducible to a stationary sequence of feature subsets that is tractable to compute. As the algorithm becomes more “patient” or the human’s learning improves, the algorithm increasingly selects more informative features, enhancing both prediction accuracy and the human’s understanding. Notably, early investment in learning leads to the selection of more informative features than a later investment. We complement our analysis by showing that the impact of errors in the algorithm’s knowledge is limited as it does not make the prediction directly.

[AI-10] LAMD: Context-driven Android Malware Detection and Classification with LLMs

链接: https://arxiv.org/abs/2502.13055
作者: Xingzhi Qian,Xinran Zheng,Yiling He,Shuo Yang,Lorenzo Cavallaro
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of mobile applications has escalated Android malware threats. Although there are numerous detection methods, they often struggle with evolving attacks, dataset biases, and limited explainability. Large Language Models (LLMs) offer a promising alternative with their zero-shot inference and reasoning capabilities. However, applying LLMs to Android malware detection presents two key challenges: (1) the extensive support code in Android applications, often spanning thousands of classes, exceeds LLMs’ context limits and obscures malicious behavior within benign functionality; (2) the structural complexity and interdependencies of Android applications surpass LLMs’ sequence-based reasoning, fragmenting code analysis and hindering malicious intent inference. To address these challenges, we propose LAMD, a practical context-driven framework to enable LLM-based Android malware detection. LAMD integrates key context extraction to isolate security-critical code regions and construct program structures, then applies tier-wise code reasoning to analyze application behavior progressively, from low-level instructions to high-level semantics, providing final prediction and explanation. A well-designed factual consistency verification mechanism is equipped to mitigate LLM hallucinations from the first tier. Evaluation in real-world settings demonstrates LAMD’s effectiveness over conventional detectors, establishing a feasible basis for LLM-driven malware analysis in dynamic threat landscapes.

[AI-11] LLM-Powered Proactive Data Systems

链接: https://arxiv.org/abs/2502.13016
作者: Sepanta Zeighami,Yiming Lin,Shreya Shankar,Aditya Parameswaran
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the power of LLMs, we now have the ability to query data that was previously impossible to query, including text, images, and video. However, despite this enormous potential, most present-day data systems that leverage LLMs are reactive, reflecting our community’s desire to map LLMs to known abstractions. Most data systems treat LLMs as an opaque black box that operates on user inputs and data as is, optimizing them much like any other approximate, expensive UDFs, in conjunction with other relational operators. Such data systems do as they are told, but fail to understand and leverage what the LLM is being asked to do (i.e. the underlying operations, which may be error-prone), the data the LLM is operating on (e.g., long, complex documents), or what the user really needs. They don’t take advantage of the characteristics of the operations and/or the data at hand, or ensure correctness of results when there are imprecisions and ambiguities. We argue that data systems instead need to be proactive: they need to be given more agency – armed with the power of LLMs – to understand and rework the user inputs and the data and to make decisions on how the operations and the data should be represented and processed. By allowing the data system to parse, rewrite, and decompose user inputs and data, or to interact with the user in ways that go beyond the standard single-shot query-result paradigm, the data system is able to address user needs more efficiently and effectively. These new capabilities lead to a rich design space where the data system takes more initiative: they are empowered to perform optimization based on the transformation operations, data characteristics, and user intent. We discuss various successful examples of how this framework has been and can be applied in real-world tasks, and present future directions for this ambitious research agenda.

[AI-12] HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit

链接: https://arxiv.org/abs/2502.13013
作者: Qingwei Ben,Feiyu Jia,Jia Zeng,Junting Dong,Dahua Lin,Jiangmiao Pang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Current humanoid teleoperation systems either lack reliable low-level control policies or struggle to acquire accurate whole-body control commands, making it difficult to teleoperate humanoids for loco-manipulation tasks. To solve these issues, we propose HOMIE, a novel humanoid teleoperation cockpit that integrates a humanoid loco-manipulation policy and a low-cost exoskeleton-based hardware system. The policy enables humanoid robots to walk and squat to specific heights while accommodating arbitrary upper-body poses. This is achieved through our novel reinforcement learning-based training framework that incorporates upper-body pose curriculum, height-tracking reward, and symmetry utilization, without relying on any motion priors. Complementing the policy, the hardware system integrates isomorphic exoskeleton arms, a pair of motion-sensing gloves, and a pedal, allowing a single operator to achieve full control of the humanoid robot. Our experiments show our cockpit facilitates more stable, rapid, and precise humanoid loco-manipulation teleoperation, accelerating task completion and eliminating retargeting errors compared to inverse kinematics-based methods. We also validate the effectiveness of the data collected by our cockpit for imitation learning. Our project is fully open-sourced; demos and code can be found at this https URL.

[AI-13] Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks

链接: https://arxiv.org/abs/2502.13006
作者: Yarin Benyamin,Argaman Mordoch,Shahaf S. Shperberg,Roni Stern
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Automated Planning algorithms require a model of the domain that specifies the preconditions and effects of each action. Obtaining such a domain model is notoriously hard. Algorithms for learning domain models exist, yet it remains unclear whether learning a domain model and planning is an effective approach for numeric planning environments, i.e., where states include discrete and numeric state variables. In this work, we explore the benefits of learning a numeric domain model and compare it with alternative model-free solutions. As a case study, we use two tasks in Minecraft, a popular sandbox game that has been used as an AI challenge. First, we consider an offline learning setting, where a set of expert trajectories are available to learn from. This is the standard setting for learning domain models. We used the Numeric Safe Action Model Learning (NSAM) algorithm to learn a numeric domain model and solve new problems with the learned domain model and a numeric planner. We call this model-based solution NSAM_(+p), and compare it to several model-free Imitation Learning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical results show that some IL algorithms can learn faster to solve simple tasks, while NSAM_(+p) allows solving tasks that require long-term planning and enables generalizing to solve problems in larger environments. Then, we consider an online learning setting, where learning is done by moving an agent in the environment. For this setting, we introduce RAMP. In RAMP, observations collected during the agent’s execution are used to simultaneously train an RL policy and learn a planning domain action model. This forms a positive feedback loop between the RL policy and the learned domain model. We demonstrate experimentally the benefits of using RAMP, showing that it finds more efficient plans and solves more problems than several RL baselines.
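
To make the RAMP loop concrete, here is a schematic, runnable sketch in which the environment, the RL policy update, and the action-model learner (standing in for NSAM) are all toy stubs; it only illustrates how a single stream of transitions feeds both learners, not the paper's actual algorithm.

```python
# Toy environment: the state is an integer position; "move" adds one to it.
def env_step(state):
    action = "move"                            # stub for the RL policy's choice
    return action, state + 1

def ramp(n_rounds=10):
    state, transitions, model = 0, [], set()
    for _ in range(n_rounds):
        action, next_state = env_step(state)             # act in the environment
        transitions.append((state, action, next_state))  # experience for the RL learner
        model.add((action, next_state - state))          # NSAM stub: numeric effect of `action`
        # A numeric planner over `model` would now propose plans, steering
        # which transitions the RL policy collects next (the feedback loop).
        state = next_state
    return transitions, model

_, learned_model = ramp()
print(learned_model)    # {('move', 1)}: the learned (action, effect) pair
```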

[AI-14] Personalized Top-k Set Queries Over Predicted Scores

链接: https://arxiv.org/abs/2502.12998
作者: Sohrab Namazi Nia,Subhodeep Ghosh,Senjuti Basu Roy,Sihem Amer-Yahia
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work studies the applicability of expensive external oracles such as large language models in answering top-k queries over predicted scores. Such scores are produced by user-defined functions to answer personalized queries over multi-modal data. We propose a generic computational framework that handles arbitrary set-based scoring functions, as long as the functions can be decomposed into constructs, each of which is sent to an oracle (in our case an LLM) to predict partial scores. At a given point in time, the framework assumes a set of responses and their partial predicted scores, and it maintains a collection of possible sets that are likely to be the true top-k. Since calling oracles is costly, our framework judiciously identifies the next construct, i.e., the next best question to ask the oracle, so as to maximize the likelihood of identifying the true top-k. We present a principled probabilistic model that quantifies that likelihood. We study efficiency opportunities in designing algorithms. We run an evaluation with three large scale datasets, scoring functions, and baselines. Experiments indicate the efficacy of our framework, as it achieves an order-of-magnitude reduction in required LLM calls over baselines while ensuring result accuracy. Scalability experiments further indicate that our framework could be used in large-scale applications.

[AI-15] Free Argumentative Exchanges for Explaining Image Classifiers AAMAS2025

链接: https://arxiv.org/abs/2502.12995
作者: Avinash Kori,Antonio Rago,Francesca Toni
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures. To be published at AAMAS 2025

点击查看摘要

Abstract:Deep learning models are powerful image classifiers but their opacity hinders their trustworthiness. Explanation methods for capturing the reasoning process within these classifiers faithfully and in a clear manner are scarce, due to their sheer complexity and size. We provide a solution for this problem by defining a novel method for explaining the outputs of image classifiers with debates between two agents, each arguing for a particular class. We obtain these debates as concrete instances of Free Argumentative eXchanges (FAXs), a novel argumentation-based multi-agent framework allowing agents to internalise opinions by other agents differently than originally stated. We define two metrics (consensus and persuasion rate) to assess the usefulness of FAXs as argumentative explanations for image classifiers. We then conduct a number of empirical experiments showing that FAXs perform well along these metrics as well as being more faithful to the image classifiers than conventional, non-argumentative explanation methods. All our implementations can be found at this https URL.

[AI-16] Towards more Contextual Agents: An Extractor-Generator Optimization Framework

链接: https://arxiv.org/abs/2502.12926
作者: Mourad Aouini,Jinan Loubani
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based agents have demonstrated remarkable success in solving complex tasks across a wide range of general-purpose applications. However, their performance often degrades in context-specific scenarios, such as specialized industries or research domains, where the absence of domain-relevant knowledge leads to imprecise or suboptimal outcomes. To address this challenge, our work introduces a systematic approach to enhance the contextual adaptability of LLM-based agents by optimizing their underlying prompts, the critical components that govern agent behavior, roles, and interactions. Manually crafting optimized prompts for context-specific tasks is labor-intensive, error-prone, and lacks scalability. In this work, we introduce an Extractor-Generator framework designed to automate the optimization of contextual LLM-based agents. Our method operates through two key stages: (i) feature extraction from a dataset of gold-standard input-output examples, and (ii) prompt generation via a high-level optimization strategy that iteratively identifies underperforming cases and applies self-improvement techniques. This framework substantially improves prompt adaptability by enabling more precise generalization across diverse inputs, particularly in context-specific tasks where maintaining semantic consistency and minimizing error propagation are critical for reliable performance. Although developed with single-stage workflows in mind, the approach naturally extends to multi-stage workflows, offering broad applicability across various agent-based systems. Empirical evaluations demonstrate that our framework significantly enhances the performance of prompt-optimized agents, providing a structured and efficient approach to contextual LLM-based agents.

[AI-17] Keep what you need: extracting efficient subnetworks from large audio representation models

链接: https://arxiv.org/abs/2502.12925
作者: David Genova,Philippe Esling,Tom Hurlin
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, research on audio foundation models has witnessed notable advances, as illustrated by the ever-improving results on complex downstream tasks. Subsequently, those pretrained networks have quickly been used for various audio applications. These improvements have however resulted in a considerable increase both in size and complexity of these models. Alongside the environmental concerns this issue raises, this prevents the deployment of such networks on consumer-level devices, and precludes their use for real-time applications. Moreover, this appears at odds with the specificity of the tasks for which these models are used, which are often simpler than extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple, yet effective method to extract lightweight specialist subnetworks from large foundation models. Specifically, we introduce learnable binary masks in-between the layers of a pretrained representation model. When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective, hence learning a compact subnetwork specialized on a single task. Importantly, the weights of the foundation model are kept frozen, resulting in low additional training costs. Once trained, the masked computational units can then be removed from the network, yielding significant performance gains. We assess our method on three widespread audio foundation models, each based on a different backbone architecture, and illustrate its effectiveness on common audio representation evaluation tasks, as well as its versatility across speech, music, and general audio. Code for reproducing the results and supporting webpage are available at this https URL
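
The masking mechanism is straightforward to prototype. Below is a minimal sketch under stated assumptions: the binary masks are relaxed to sigmoid gates driven toward zero by an L1-style penalty (the paper's exact relaxation may differ), and the frozen backbone is a toy MLP rather than an audio foundation model.

```python
import torch
import torch.nn as nn

class MaskedBackbone(nn.Module):
    """Frozen pretrained layers with a learnable gate vector after each one."""
    def __init__(self, layers, widths):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        for p in self.layers.parameters():
            p.requires_grad = False                     # foundation model stays frozen
        self.gate_logits = nn.ParameterList(
            [nn.Parameter(torch.zeros(w)) for w in widths]
        )

    def forward(self, x):
        for layer, logits in zip(self.layers, self.gate_logits):
            x = layer(x) * torch.sigmoid(logits)        # soft per-unit mask
        return x

    def sparsity_loss(self):
        # Drives gates toward zero so masked units can be pruned after training.
        return sum(torch.sigmoid(g).mean() for g in self.gate_logits)

backbone = MaskedBackbone(
    layers=[nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(3)],
    widths=[64, 64, 64],
)
head = nn.Linear(64, 10)                                # downstream task head
opt = torch.optim.Adam(
    list(backbone.gate_logits.parameters()) + list(head.parameters()), lr=1e-3
)

x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(head(backbone(x)), y) + 0.1 * backbone.sparsity_loss()
loss.backward()
opt.step()
```

After training, units whose gates sit near zero can be physically removed, which is where the inference speedups come from.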

[AI-18] Graph Neural Networks for Databases: A Survey

链接: https://arxiv.org/abs/2502.12908
作者: Ziming Li,Youhuan Li,Yuyu Luo,Guoliang Li,Chuxu Zhang
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: A survey focus on GNNs and databases. 9 pages, 4 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) are powerful deep learning models for graph-structured data, demonstrating remarkable success across diverse domains. Recently, the database (DB) community has increasingly recognized the potential of GNNs, prompting a surge of research focused on improving database systems through GNN-based approaches. However, despite notable advances, there remains a lack of a comprehensive review and understanding of how GNNs could improve DB systems. Therefore, this survey aims to bridge this gap by providing a structured and in-depth overview of GNNs for DB systems. Specifically, we propose a new taxonomy that classifies existing methods into two key categories: (1) Relational Databases, which includes tasks like performance prediction, query optimization, and text-to-SQL, and (2) Graph Databases, addressing challenges like efficient graph query processing and graph similarity computation. We systematically review key methods in each category, highlighting their contributions and practical implications. Finally, we suggest promising avenues for integrating GNNs into Database systems.

[AI-19] Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning

链接: https://arxiv.org/abs/2502.12876
作者: Nandakishor M,Anjali M
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Creating personalized and adaptable conversational AI remains a key challenge. This paper introduces a Continuous Learning Conversational AI (CLCA) approach, implemented using A2C reinforcement learning, to move beyond static Large Language Models (LLMs). We use simulated sales dialogues, generated by LLMs, to train an A2C agent. This agent learns to optimize conversation strategies for personalization, focusing on engagement and delivering value. Our system architecture integrates reinforcement learning with LLMs for both data creation and response selection. This method offers a practical way to build personalized AI companions that evolve through continuous learning, advancing beyond traditional static LLM techniques.

[AI-20] Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

链接: https://arxiv.org/abs/2502.12842
作者: Kathrin Seßler,Arne Bewersdorff,Claudia Nerdel,Enkelejda Kasneci
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: This work has been submitted to the IJAIED for possible publication

点击查看摘要

Abstract:Effective feedback is essential for fostering students’ success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real-world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student-written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and 3) the science education experts using a five-point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM-generated feedback shows no significant difference to that of teachers and experts in overall quality. However, the LLM agent’s performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student’s work context. Qualitative analysis highlighted the LLM agent’s limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM-generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.

[AI-21] Envious Explore and Exploit

链接: https://arxiv.org/abs/2502.12798
作者: Omer Ben-Porat,Yotam Gafni,Or Markovetzki
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explore-and-exploit tradeoffs play a key role in recommendation systems (RSs), aiming at serving users better by learning from previous interactions. Despite their commercial success, the societal effects of explore-and-exploit mechanisms are not well understood, especially regarding the utility discrepancy they generate between different users. In this work, we measure such discrepancy using the economic notion of envy. We present a multi-armed bandit-like model in which every round consists of several sessions, and rewards are realized once per round. We call the latter property reward consistency, and show that the RS can leverage this property for better societal outcomes. On the downside, doing so also generates envy, as late-to-arrive users enjoy the information gathered by early-to-arrive users. We examine the generated envy under several arrival order mechanisms and virtually any anonymous algorithm, i.e., any algorithm that treats all similar users similarly without leveraging their identities. We provide tight envy bounds on uniform arrival and upper bound the envy for nudged arrival, in which the RS can affect the order of arrival by nudging its users. Furthermore, we study the efficiency-fairness trade-off by devising an algorithm that allows constant envy and approximates the optimal welfare in restricted settings. Finally, we validate our theoretical results empirically using simulations.

[AI-22] VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

链接: https://arxiv.org/abs/2502.12782
作者: Xinlong Chen,Yuanxing Zhang,Chongling Rao,Yushuo Guan,Jiaheng Liu,Fuzheng Zhang,Chengru Song,Qiang Liu,Di Zhang,Tieniu Tan
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches. Verification with off-the-shelf T2V models reveals a significant positive correlation between scores on VidCapBench and the T2V quality evaluation metrics, indicating that VidCapBench can provide valuable guidance for training T2V models. The project is available at this https URL.

[AI-23] Evaluating link prediction: New perspectives and recommendations

链接: https://arxiv.org/abs/2502.12777
作者: Bhargavi Kalyani I,A Rama Prasad Mathi,Niladri Sett
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Link prediction (LP) is an important problem in network science and machine learning research. The state-of-the-art LP methods are usually evaluated in a uniform setup, ignoring several factors associated with the data and application specific needs. We identify a number of such factors, such as network type, problem type, geodesic distance between the end nodes and its distribution over the classes, the nature and applicability of LP methods, class imbalance and its impact on early retrieval, evaluation metric, etc., and present an experimental setup that allows us to evaluate LP methods in a rigorous and controlled manner. We perform extensive experiments with a variety of LP methods over real network datasets in this controlled setup, and gather valuable insights on the interactions of these factors with the performance of LP through an array of carefully designed hypotheses. Following the insights, we provide recommendations to be followed as best practice for evaluating LP methods.

[AI-24] Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models

链接: https://arxiv.org/abs/2502.12776
作者: Daiki Chijiwa,Taku Hasegawa,Kyosuke Nishida,Kuniko Saito,Susumu Takeuchi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While foundation models have been exploited for various expert tasks through fine-tuning, any foundation model will become outdated due to its old knowledge or limited capability. Thus the underlying foundation model should be eventually replaced by new ones, which leads to repeated cost of fine-tuning these new models. Existing work addresses this problem by inference-time tuning, i.e., modifying the output probabilities from the new foundation model with the outputs from the old foundation model and its fine-tuned model, which involves an additional overhead in inference by the latter two models. In this paper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT), that reduces the inference overhead by its nature, based on the reformulation of fine-tuning as the reward maximization. Specifically, instead of fine-tuning parameters of the foundation models, PRT trains the reward model explicitly through the same loss function as in fine-tuning. During inference, the reward model can be used with any foundation model (with the same set of vocabularies or labels) through the formulation of reward maximization. Experimental results, covering both vision and language models, demonstrate that the PRT-trained model can achieve comparable accuracy to the existing work of inference-time tuning, with less inference cost.
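
The reward-maximization view is easy to demonstrate on a toy classifier. In the sketch below, the linear models standing in for the old and new foundation models, the loop length, and the learning rate are all illustrative assumptions; only the structure follows the idea described above: a reward head is trained with the ordinary fine-tuning loss on top of the frozen old model, then reused additively with a new model at inference.

```python
import torch
import torch.nn.functional as F

# Toy setup: linear classifiers stand in for the old/new foundation models.
torch.manual_seed(0)
dim, num_classes = 16, 5
W_old = torch.randn(dim, num_classes)            # frozen "old" foundation model
W_new = torch.randn(dim, num_classes)            # frozen "new" foundation model
reward = torch.nn.Linear(dim, num_classes)       # the portable reward model
opt = torch.optim.Adam(reward.parameters(), lr=1e-2)

x = torch.randn(64, dim)
y = torch.randint(0, num_classes, (64,))

# Train the reward through the same loss as ordinary fine-tuning,
# applied on top of the frozen OLD model's logits.
for _ in range(200):
    loss = F.cross_entropy(x @ W_old + reward(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: reuse the learned reward with the NEW model, no re-fine-tuning;
# pi(y|x) is proportional to pi_new(y|x) * exp(r(x, y)).
probs = F.softmax(x @ W_new + reward(x), dim=-1)
print((probs.argmax(-1) == y).float().mean())    # accuracy on the toy data
```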

[AI-25] TREND: A Whitespace Replacement Information Hiding Method

链接: https://arxiv.org/abs/2502.12710
作者: Malte Hellmeier,Hendrik Norkowski,Ernst-Christoph Schrewe,Haydar Qarawlus,Falk Howar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained significant popularity in recent years. Differentiating between a text written by a human and a text generated by an LLM has become almost impossible. Information hiding techniques such as digital watermarking or steganography can help by embedding information inside text without being noticed. However, existing techniques, such as linguistic-based or format-based methods, change the semantics or do not work on pure, unformatted text. In this paper, we introduce a novel method for information hiding termed TREND, which is able to conceal any byte-encoded sequence within a cover text. The proposed method is implemented as a multi-platform library using the Kotlin programming language, accompanied by a command-line tool and a web interface provided as examples of usage. By substituting conventional whitespace characters with visually similar Unicode whitespace characters, our proposed scheme preserves the semantics of the cover text without increasing the number of characters. Furthermore, we propose a specified structure for secret messages that enables configurable compression, encryption, hashing, and error correction. Our experimental benchmark comparison on a dataset of one million Wikipedia articles compares ten algorithms from literature and practice. It proves the robustness of our proposed method in various applications while remaining imperceptible to humans. We discuss the limitations of the method, namely its bounded embedding capacity and the need for further robustness, which point to directions for future work.
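
The core substitution trick can be illustrated in a few lines of Python (the actual library is in Kotlin). The sketch below encodes one payload bit per space by choosing between U+0020 and a visually similar Unicode space; the particular replacement character (U+2004) and the flat bit layout are assumptions for illustration, whereas TREND itself specifies a richer message structure with compression, encryption, hashing, and error correction.

```python
ZERO, ONE = "\u0020", "\u2004"   # U+2004 (three-per-em space) is an assumed choice

def embed(cover: str, payload: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in payload)
    out, i = [], 0
    for ch in cover:
        if ch == ZERO and i < len(bits):     # spend one space per payload bit
            out.append(ONE if bits[i] == "1" else ZERO)
            i += 1
        else:
            out.append(ch)
    if i < len(bits):
        raise ValueError("cover text has too few spaces for this payload")
    return "".join(out)

def extract(stego: str, n_bytes: int) -> bytes:
    bits = ["1" if ch == ONE else "0" for ch in stego if ch in (ZERO, ONE)]
    bits = bits[: n_bytes * 8]
    return bytes(int("".join(bits[i:i + 8]), 2) for i in range(0, len(bits), 8))

cover = ("the quick brown fox jumps over the lazy dog near the old river bank "
         "while the evening sun sets slowly")
stego = embed(cover, b"hi")
# Semantics and character count are unchanged; only whitespace code points differ.
assert extract(stego, 2) == b"hi" and len(stego) == len(cover)
```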

[AI-26] Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

链接: https://arxiv.org/abs/2502.12669
作者: Xiang Liu,Penglei Sun,Shuyan Chen,Longhan Zhang,Peijie Dong,Huajie You,Yongqi Zhang,Chang Yan,Xiaowen Chu,Tong-yi Zhang
类目: Artificial Intelligence (cs.AI)
*备注: 23pages

点击查看摘要

Abstract:The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge-enhanced system for PSCs that integrates three key components. First, we develop Perovskite-KG, a domain-specific knowledge graph constructed from 1,517 research papers, containing 23,789 entities and 22,272 relationships. Second, we create two complementary datasets: Perovskite-Chat, comprising 55,101 high-quality question-answer pairs generated through a novel multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully curated materials science problems. Third, we introduce two specialized large language models: Perovskite-Chat-LLM for domain-specific knowledge assistance and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental results demonstrate that our system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks, providing researchers with effective tools for literature review, experimental design, and complex problem-solving in PSC research.

[AI-27] The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1

链接: https://arxiv.org/abs/2502.12659
作者: Kaiwen Zhou,Chengzhi Liu,Xuandong Zhao,Shreedhar Jangam,Jayanth Srinivasa,Gaowen Liu,Dawn Song,Xin Eric Wang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid development of large reasoning models, such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source R1 models and the o3-mini model, on both safety benchmarks and attacks, suggesting more safety effort on R1 is needed. (2) The distilled reasoning model shows poorer safety performance compared to its safety-aligned base models. (3) The stronger the model’s reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models’ safety to close the gap.

[AI-28] Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport

链接: https://arxiv.org/abs/2502.12631
作者: Mingyang Sun,Pengxiang Ding,Weinan Zhang,Donglin Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR’s superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning. The code will be released at this https URL.

[AI-29] Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

链接: https://arxiv.org/abs/2502.12630
作者: Tvrtko Sternak,Davor Runje,Dorian Granoša,Chi Wang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach to evaluating the security of large language models (LLMs) against prompt leakage, i.e., the exposure of system-level prompts or proprietary configurations. We define prompt leakage as a critical threat to secure LLM deployment and introduce a framework for testing the robustness of LLMs using agentic teams. Leveraging AG2 (formerly AutoGen), we implement a multi-agent system where cooperative agents are tasked with probing and exploiting the target LLM to elicit its prompt. Guided by traditional definitions of security in cryptography, we further define a prompt leakage-safe system as one in which an attacker cannot distinguish between two agents: one initialized with an original prompt and the other with a prompt stripped of all sensitive information. In a safe system, the agents’ outputs will be indistinguishable to the attacker, ensuring that sensitive information remains secure. This cryptographically inspired framework provides a rigorous standard for evaluating and designing secure LLMs. This work establishes a systematic methodology for adversarial testing of prompt leakage, bridging the gap between automated threat modeling and practical LLM security. You can find the implementation of our prompt leakage probing on GitHub.

[AI-30] A Graph-Enhanced Deep-Reinforcement Learning Framework for the Aircraft Landing Problem

链接: https://arxiv.org/abs/2502.12617
作者: Vatsal Maru
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: This paper presents a novel deep reinforcement learning framework combining graph neural networks with actor-critic architectures to address the aircraft landing problem. The framework achieves a 99.95% reduction in computational time compared to Mixed Integer Programming while maintaining safety compliance, and 38% higher runway throughput over First Come First Serve

点击查看摘要

Abstract:The Aircraft Landing Problem (ALP) is one of the challenging problems in aircraft transportation and management. The challenge is to schedule the arriving aircraft in a sequence so that the cost and delays are optimized. There are various solution approaches to solving this problem, most of which are based on operations research algorithms and meta-heuristics. Although traditional methods perform well on one factor or another, solving real-time rescheduling and computational scalability together remains an open problem. This paper presents a novel deep reinforcement learning (DRL) framework that combines graph neural networks with actor-critic architectures to address the ALP. This paper introduces three key contributions: a graph-based state representation that efficiently captures temporal and spatial relationships between aircraft, a specialized actor-critic architecture designed to handle multiple competing objectives in landing scheduling, and a runway balance strategy that ensures efficient resource utilization while maintaining safety constraints. The results show that the trained algorithm can be tested on different problem sets and is competitive with operations research algorithms. The experimental results on standard benchmark data sets demonstrate a 99.95% reduction in computational time compared to Mixed Integer Programming (MIP) and 38% higher runway throughput over First Come First Serve (FCFS) approaches. Therefore, the proposed solution is competitive with traditional approaches and achieves substantial advancements. Notably, it does not require retraining, making it particularly suitable for industrial deployment. The framework's capability to generate solutions within 1 second enables real-time rescheduling, addressing critical requirements of air traffic management.

[AI-31] Unveiling Mode Connectivity in Graph Neural Networks

链接: https://arxiv.org/abs/2502.12608
作者: Bingheng Li,Zhikai Chen,Haoyu Han,Shenglai Zeng,Jingzhe Liu,Jiliang Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A fundamental challenge in understanding graph neural networks (GNNs) lies in characterizing their optimization dynamics and loss landscape geometry, critical for improving interpretability and robustness. While mode connectivity, a lens for analyzing geometric properties of loss landscapes, has proven insightful for other deep learning architectures, its implications for GNNs remain unexplored. This work presents the first investigation of mode connectivity in GNNs. We uncover that GNNs exhibit distinct non-linear mode connectivity, diverging from patterns observed in fully-connected networks or CNNs. Crucially, we demonstrate that graph structure, rather than model architecture, dominates this behavior, with graph properties like homophily correlating with mode connectivity patterns. We further establish a link between mode connectivity and generalization, proposing a generalization bound based on loss barriers and revealing its utility as a diagnostic tool. Our findings further bridge theoretical insights with practical implications: they rationalize domain alignment strategies in graph learning and provide a foundation for refining GNN training paradigms.

[AI-32] Disentangling Long-Short Term State Under Unknown Interventions for Online Time Series Forecasting

链接: https://arxiv.org/abs/2502.12603
作者: Ruichu Cai,Haiqin Huang,Zhifang Jiang,Zijian Li,Changze Zhou,Yuequn Liu,Yuming Liu,Zhifeng Hao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current methods for time series forecasting struggle in the online scenario, since it is difficult to preserve long-term dependency while adapting to short-term changes as data arrive sequentially. Although some recent methods solve this problem by controlling the updates of latent states, they cannot disentangle the long/short-term states, leading to an inability to adapt effectively to nonstationarity. To tackle this challenge, we propose a general framework to disentangle long/short-term states for online time series forecasting. Our idea is inspired by the observations where short-term changes can be led by unknown interventions like abrupt policies in the stock market. Based on this insight, we formalize a data generation process with unknown interventions on short-term states. Under mild assumptions, we further leverage the independence of short-term states led by unknown interventions to establish the identification theory to achieve the disentanglement of long/short-term states. Built on this theory, we develop a long short-term disentanglement model (LSTD) to extract the long/short-term states with long/short-term encoders, respectively. Furthermore, the LSTD model incorporates a smooth constraint to preserve the long-term dependencies and an interrupted dependency constraint to enforce the forgetting of short-term dependencies, together boosting the disentanglement of long/short-term states. Experimental results on several benchmark datasets show that our LSTD model outperforms existing methods for online time series forecasting, validating its efficacy in real-world applications.

[AI-33] RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts

链接: https://arxiv.org/abs/2502.12589
作者: Yu Zhang,Shujun Peng,Nengwu Wu,Xinhan Lin,Yang Hu,Jie Tang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, substantial advancements have been made in training language models to carry out step-by-step reasoning for solving intricate numerical reasoning tasks. Beyond the methods used to solve these problems, the structure and formulation of the problems themselves also play a crucial role in determining the performance of large language models. We observe that even small changes in the surface form of mathematical problems can have a profound impact on both the answer distribution and solve rate. This highlights the vulnerability of LLMs to surface-level variations, revealing its limited robustness when reasoning through complex problems. In this paper, we propose RM-PoT, a three-stage framework that integrates problem reformulation (RM), code-aided reasoning (PoT), and domain-aware few-shot learning to address these limitations. Our approach first reformulates the input problem into diverse surface forms to reduce structural bias, then retrieves five semantically aligned examples from a pre-constructed domain-specific question bank to provide contextual guidance, and finally generates executable Python code for precise computation.

[AI-34] Enhancing Semi-supervised Learning with Noisy Zero-shot Pseudolabels ICML2025

链接: https://arxiv.org/abs/2502.12584
作者: Jichan Chung,Irene Y. Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review for ICML 2025

点击查看摘要

Abstract:Semi-supervised learning (SSL) leverages limited labeled data alongside abundant unlabeled data to address labeling costs in machine learning. While recent foundation models enable zero-shot inference, attempts to integrate these capabilities into SSL through pseudo-labeling have shown mixed results due to unreliable zero-shot predictions. We present ZMT (Zero-Shot Multi-Task Learning), a framework that jointly optimizes zero-shot pseudo-labels and unsupervised representation learning objectives from contemporary SSL approaches. Our method introduces a multi-task learning-based mechanism that incorporates pseudo-labels while ensuring robustness to varying pseudo-label quality. Experiments across 8 datasets in vision, language, and audio domains demonstrate that ZMT reduces error by up to 56% compared to traditional SSL methods, with particularly compelling results when pseudo-labels are noisy and unreliable. ZMT represents a significant step toward making semi-supervised learning more effective and accessible in resource-constrained environments.

[AI-35] DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent

链接: https://arxiv.org/abs/2502.12575
作者: Pengyu Zhu,Zhenhong Zhou,Yuanhe Zhang,Shilinlu Yan,Kun Wang,Sen Su
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As LLM-based agents become increasingly prevalent, backdoors can be implanted into agents through user queries or environment feedback, raising critical concerns regarding safety vulnerabilities. However, backdoor attacks are typically detectable by safety audits that analyze the reasoning process of agents. To this end, we propose a novel backdoor implantation strategy called Dynamically Encrypted Multi-Backdoor Implantation Attack. Specifically, we introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. To enhance stealthiness, we further decompose the backdoor into multiple sub-backdoor fragments. Together, these advancements allow backdoors to largely bypass safety audits. Additionally, we present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks. Experimental results across multiple datasets demonstrate that our method achieves an attack success rate nearing 100% while maintaining a detection rate of 0%, illustrating its effectiveness in evading safety audits. Our findings highlight the limitations of existing safety mechanisms in detecting advanced attacks, underscoring the urgent need for more robust defenses against backdoor threats. Code and data are available at this https URL.

[AI-36] HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

链接: https://arxiv.org/abs/2502.12574
作者: Cheng Luo,Zefan Cai,Hanshi Sun,Jinqi Xiao,Bo Yuan,Wen Xiao,Junjie Hu,Jiawei Zhao,Beidi Chen,Anima Anandkumar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformer-based large language models (LLMs) demonstrate impressive performance in long context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise offloading strategy, maintaining only selected attention heads' KV cache on the GPU while computing attention output dynamically. Through roofline analysis, we demonstrate that HEADINFER maintains computational efficiency while significantly reducing memory footprint. We evaluate HEADINFER on the Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage from 207 GB to 17 GB, achieving a 92% reduction compared to BF16 baseline inference. Notably, HEADINFER enables 4-million-token inference with an 8B model on a single consumer GPU with 24GB memory (e.g., NVIDIA RTX 4090) without approximation methods.
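
A stripped-down version of head-wise offloading looks like the sketch below: the per-head K/V tensors live in CPU memory and only one head's cache is resident on the accelerator while its attention output is computed. The shapes, the synchronous per-head transfer, and the absence of asynchronous prefetching are simplifying assumptions relative to the paper's pipeline.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n_heads, seq, d_head = 8, 4096, 64
pin = torch.cuda.is_available()

# The whole KV cache lives in (optionally pinned) CPU memory, one entry per head.
kv_cpu = [(torch.randn(seq, d_head, pin_memory=pin),
           torch.randn(seq, d_head, pin_memory=pin)) for _ in range(n_heads)]

def attend(q):                       # q: (n_heads, 1, d_head) for one new token
    outs = []
    for h in range(n_heads):         # only one head's K/V is on the GPU at a time
        k = kv_cpu[h][0].to(device, non_blocking=True)
        v = kv_cpu[h][1].to(device, non_blocking=True)
        att = torch.softmax(q[h] @ k.T / d_head ** 0.5, dim=-1)
        outs.append(att @ v)         # k and v become collectable after this head
    return torch.cat(outs, dim=-1)   # (1, n_heads * d_head)

out = attend(torch.randn(n_heads, 1, d_head, device=device))
print(out.shape)
```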

[AI-37] Exploring the Impact of Personality Traits on LLM Bias and Toxicity

链接: https://arxiv.org/abs/2502.12566
作者: Shuo Wang,Renhao Li,Xi Chen,Yulin Yuan,Derek F. Wong,Min Yang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the different roles that AI is expected to play in human life, imbuing large language models (LLMs) with different personalities has attracted increasing research interests. While the “personification” enhances human experiences of interactivity and adaptability of LLMs, it gives rise to critical concerns about content safety, particularly regarding bias, sentiment and toxicity of LLM generation. This study explores how assigning different personality traits to LLMs affects the toxicity and biases of their outputs. Leveraging the widely accepted HEXACO personality framework developed in social psychology, we design experimentally sound prompts to test three LLMs’ performance on three toxic and bias benchmarks. The findings demonstrate the sensitivity of all three models to HEXACO personality traits and, more importantly, a consistent variation in the biases, negative sentiment and toxicity of their output. In particular, adjusting the levels of several personality traits can effectively reduce bias and toxicity in model performance, similar to humans’ correlations between personality traits and toxic behaviors. The findings highlight the additional need to examine content safety besides the efficiency of training or fine-tuning methods for LLM personification. They also suggest a potential for the adjustment of personalities to be a simple and low-cost method to conduct controlled text generation.

[AI-38] LLM Safety for Children

链接: https://arxiv.org/abs/2502.12552
作者: Prasanjit Rath,Hari Shrawgi,Parag Agrawal,Sandipan Dandapat
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper analyzes the safety of Large Language Models (LLMs) in interactions with children below the age of 18 years. Despite the transformative applications of LLMs in various aspects of children’s lives, such as education and therapy, there remains a significant gap in understanding and mitigating potential content harms specific to this demographic. The study acknowledges the diverse nature of children, often overlooked by standard safety evaluations, and proposes a comprehensive approach to evaluating LLM safety specifically for children. We list potential risks that children may encounter when using LLM-powered applications. Additionally, we develop Child User Models that reflect the varied personalities and interests of children, informed by literature in child care and psychology. These user models aim to bridge the existing gap in child safety literature across various fields. We utilize Child User Models to evaluate the safety of six state-of-the-art LLMs. Our observations reveal significant safety gaps in LLMs, particularly in categories that are harmful to children but not to adults.

[AI-39] Improving the Stability of GNN Force Field Models by Reducing Feature Correlation

链接: https://arxiv.org/abs/2502.12548
作者: Yujie Zeng,Wenlong He,Ihor Vasyltsov,Jiaxin Wei,Ying Zhang,Lin Chen,Yuehua Dai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, Graph Neural Network based Force Field (GNNFF) models have been widely used in Molecular Dynamics (MD) simulation, which is one of the most cost-effective means in semiconductor material research. However, even though such models provide high accuracy in energy and force Mean Absolute Error (MAE) over trained (in-distribution) datasets, they often become unstable during long-time MD simulation when used on out-of-distribution datasets. In this paper, we propose a feature correlation based method for GNNFF models to enhance the stability of MD simulation. We reveal the negative relationship between feature correlation and the stability of GNNFF models, and design a loss function with a dynamic loss coefficient scheduler to reduce edge feature correlation that can be applied in general GNNFF training. We also propose an empirical metric to evaluate the stability in MD simulation. Experiments show our method can significantly improve stability for GNNFF models, especially on out-of-distribution data, with less than 3% computational overhead. For example, we can extend the stable MD simulation time from 0.03 ps to 10 ps for the Allegro model.
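
The decorrelation idea can be prototyped as a penalty on the feature correlation matrix, as in the sketch below; the dynamic loss-coefficient scheduler is reduced here to a simple linear ramp, and the quadratic task loss and manual gradient step are stand-ins, all assumptions for illustration.

```python
import torch

def correlation_loss(feats):                  # feats: (n_edges, d)
    z = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    corr = (z.T @ z) / feats.shape[0]         # (d, d) feature correlation matrix
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return (off_diag ** 2).mean()             # penalize off-diagonal correlation

feats = torch.randn(128, 16, requires_grad=True)   # stand-in for edge features
for step in range(3):
    coef = 0.01 * (step + 1)                  # stand-in for the dynamic scheduler
    loss = feats.pow(2).mean() + coef * correlation_loss(feats)   # task-loss stub
    loss.backward()
    feats = (feats - 0.1 * feats.grad).detach().requires_grad_()
print(float(correlation_loss(feats)))         # decorrelation term after updates
```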

[AI-40] Computing Voting Rules with Improvement Feedback

链接: https://arxiv.org/abs/2502.12542
作者: Evi Micha,Vasilis Varsamis
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aggregating preferences under incomplete or constrained feedback is a fundamental problem in social choice and related domains. While prior work has established strong impossibility results for pairwise comparisons, this paper extends the inquiry to improvement feedback, where voters express incremental adjustments rather than complete preferences. We provide a complete characterization of the positional scoring rules that can be computed given improvement feedback. Interestingly, while plurality is learnable under improvement feedback–unlike with pairwise feedback–strong impossibility results persist for many other positional scoring rules. Furthermore, we show that improvement feedback, unlike pairwise feedback, does not suffice for the computation of any Condorcet-consistent rule. We complement our theoretical findings with experimental results, providing further insights into the practical implications of improvement feedback for preference aggregation.

[AI-41] Finding Optimal Trading History in Reinforcement Learning for Stock Market Trading

链接: https://arxiv.org/abs/2502.12537
作者: Sina Montazeri,Haseebullah Jumakhan,Amir Mirzaeinia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates the optimization of temporal windows in Financial Deep Reinforcement Learning (DRL) models using 2D Convolutional Neural Networks (CNNs). We introduce a novel approach to treating the temporal field as a hyperparameter and examine its impact on model performance across various datasets and feature arrangements. We introduce a new hyperparameter for the CNN policy, proposing that this temporal field can and should be treated as a hyperparameter for these models. We examine the significance of this temporal field by iteratively expanding the window of observations presented to the CNN policy during the deep reinforcement learning process. Our iterative process involves progressively increasing the observation period from two weeks to twelve weeks, allowing us to examine the effects of different temporal windows on the model’s performance. This window expansion is implemented in two settings. In one setting, we rearrange the features in the dataset to group them by company, allowing the model to have a full view of company data in its observation window and CNN kernel. In the second setting, we do not group the features by company, and features are arranged by category. Our study reveals that shorter temporal windows are most effective when no feature rearrangement to group per company is in effect. However, the model will utilize longer temporal windows and yield better performance once we introduce the feature rearrangement. To examine the consistency of our findings, we repeated our experiment on two datasets containing the same thirty companies from the Dow Jones Index but with different features in each dataset and consistently observed the above-mentioned patterns. The result is a trading model significantly outperforming global financial services firms such as the Global X Guru by the established Mirae Asset.

[AI-42] An Algorithm Board in Neural Decoding

链接: https://arxiv.org/abs/2502.12536
作者: Jingyi Feng,Kai Yang
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 16 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Understanding the mechanisms of neural encoding and decoding has always been a highly interesting research topic in fields such as neuroscience and cognitive intelligence. In prior studies, some researchers identified a symmetry in neural data decoded by unsupervised methods in motor scenarios and constructed a cognitive learning system based on this pattern (i.e., symmetry). Nevertheless, the distribution state of the data flow that significantly influences neural decoding positions still remains a mystery within the system, which further restricts the enhancement of the system’s interpretability. Based on this, this paper mainly explores changes in the distribution state within the system from the machine learning and mathematical statistics perspectives. In the experiment, we assessed the correctness of this symmetry using various tools and indicators commonly utilized in mathematics and statistics. According to the experimental results, the normal distribution (or Gaussian distribution) plays a crucial role in the decoding of prediction positions within the system. Eventually, an algorithm board similar to the Galton board was built to serve as the mathematical foundation of the discovered symmetry.

[AI-43] CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

链接: https://arxiv.org/abs/2502.12532
作者: Yong Zhao,Kai Xu,Zhengqiu Zhu,Yue Hu,Zhiheng Zheng,Yingfeng Chen,Yatai Ji,Chen Gao,Yong Li,Jincai Huang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings - spanning environment, action, and perception - largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming frontier-based baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at this https URL.

[AI-44] GSCE: A Prompt Framework with Enhanced Reasoning for Reliable LLM -driven Drone Control

链接: https://arxiv.org/abs/2502.12531
作者: Wenhao Wang,Yanyan Li,Long Jiao,Jiawei Yuan
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into robotic control, including drones, has the potential to revolutionize autonomous systems. Research studies have demonstrated that LLMs can be leveraged to support robotic operations. However, when facing tasks with complex reasoning, concerns and challenges are raised about the reliability of solutions produced by LLMs. In this paper, we propose a prompt framework with enhanced reasoning to enable reliable LLM-driven control for drones. Our framework consists of novel technical components designed using Guidelines, Skill APIs, Constraints, and Examples, namely GSCE. GSCE is featured by its reliable and constraint-compliant code generation. We performed thorough experiments using GSCE for the control of drones with a wide level of task complexities. Our experiment results demonstrate that GSCE can significantly improve task success rates and completeness compared to baseline approaches, highlighting its potential for reliable LLM-driven autonomous drone systems.
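
Assembling such a prompt is mostly a matter of structured concatenation. The sketch below shows one plausible layout of the four GSCE components; the section markers, the example API names (takeoff, fly_to, land), and the sample constraints are all hypothetical, not the paper's actual guidelines or drone API.

```python
def build_gsce_prompt(task, guidelines, skill_apis, constraints, examples):
    """Compose a GSCE-style prompt from its four components plus the task."""
    sections = [
        "## Guidelines\n" + "\n".join(guidelines),
        "## Skill APIs\n" + "\n".join(skill_apis),
        "## Constraints\n" + "\n".join(constraints),
        "## Examples\n" + "\n".join(examples),
        "## Task\n" + task,
    ]
    return "\n\n".join(sections)

prompt = build_gsce_prompt(
    task="Fly to the red marker, then land.",
    guidelines=["Reason step by step before emitting code."],
    skill_apis=["takeoff()", "fly_to(x, y, z)", "land()"],       # hypothetical API
    constraints=["Never exceed 10 m altitude.", "Only call listed APIs."],
    examples=["# task: hover briefly\ntakeoff()\nland()"],
)
print(prompt.splitlines()[0])   # "## Guidelines"
```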

[AI-45] From Abstract to Actionable: Pairwise Shapley Values for Explainable AI

链接: https://arxiv.org/abs/2502.12525
作者: Jiaxin Xu,Hung Chau,Angela Burden
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) is critical for ensuring transparency, accountability, and trust in machine learning systems as black-box models are increasingly deployed within high-stakes domains. Among XAI methods, Shapley values are widely used for their fairness and consistency axioms. However, prevalent Shapley value approximation methods commonly rely on abstract baselines or computationally intensive calculations, which can limit their interpretability and scalability. To address such challenges, we propose Pairwise Shapley Values, a novel framework that grounds feature attributions in explicit, human-relatable comparisons between pairs of data instances proximal in feature space. Our method introduces pairwise reference selection combined with single-value imputation to deliver intuitive, model-agnostic explanations while significantly reducing computational overhead. Here, we demonstrate that Pairwise Shapley Values enhance interpretability across diverse regression and classification scenarios–including real estate pricing, polymer property prediction, and drug discovery datasets. We conclude that the proposed methods enable more transparent AI systems and advance the real-world applicability of XAI.
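
The pairwise formulation can be reproduced exactly for a handful of features. In the sketch below, the reference is simply the nearest neighbour of the explained instance, "feature absent" means imputing the reference's value, and Shapley values are computed by brute-force subset enumeration; the toy model and the nearest-neighbour reference rule are assumptions for illustration.

```python
import itertools, math
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
model = lambda A: A[:, 0] * 2 + A[:, 1] * A[:, 2] - A[:, 3]   # stand-in model

x = X[0]
z = X[1 + np.argmin(np.linalg.norm(X[1:] - x, axis=1))]      # pairwise reference

def value(subset):              # model output with only `subset` taken from x
    a = z.copy()
    idx = list(subset)
    if idx:
        a[idx] = x[idx]
    return float(model(a[None, :])[0])

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for r in range(n):
        for S in itertools.combinations(others, r):
            w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
            phi[i] += w * (value(S + (i,)) - value(S))

print(np.round(phi, 3), "sum:", round(float(phi.sum()), 3),
      "f(x) - f(z):", round(value(tuple(range(n))) - value(()), 3))
```

By the efficiency axiom the attributions sum to f(x) - f(z), so each value reads as "how much this feature explains the difference from that concrete reference instance", which is what makes the pairwise framing human-relatable.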

[AI-46] Inference-Time Computations for LLM Reasoning and Planning : A Benchmark and Insights

链接: https://arxiv.org/abs/2502.12521
作者: Shubham Parashar,Blake Olson,Sambhav Khurana,Eric Li,Hongyi Ling,James Caverlee,Shuiwang Ji
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by exploring intermediate steps during inference. Notably, OpenAI’s o1 model shows promising performance through its novel use of multi-step reasoning and verification. Here, we explore how scaling inference-time techniques can improve reasoning and planning, focusing on understanding the tradeoff between computational cost and performance. To this end, we construct a comprehensive benchmark, known as Sys2Bench, and perform extensive experiments evaluating existing inference-time techniques on eleven diverse tasks across five categories, including arithmetic reasoning, logical reasoning, common sense reasoning, algorithmic reasoning, and planning. Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.

[AI-47] Myna: Masking-Based Contrastive Learning of Musical Representations ICML2025

链接: https://arxiv.org/abs/2502.12511
作者: Ori Yonay,Tracy Hammond,Tianbao Yang
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to ICML 2025

点击查看摘要

Abstract:We present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations: (1) the use of a Vision Transformer (ViT) on mel-spectrograms as the backbone and (2) a novel data augmentation strategy, token masking, that masks 90 percent of spectrogram tokens. These innovations deliver both effectiveness and efficiency: (i) Token masking enables a significant increase in per-GPU batch size, from 48 or 120 in prior methods (CLMR, MULE) to 4096. (ii) By avoiding traditional augmentations, Myna retains pitch sensitivity, enhancing performance in tasks like key detection. (iii) The use of vertical patches allows the model to better capture critical features for key detection. Our hybrid model, Myna-22M-Hybrid, processes both 16x16 and 128x2 patches, achieving state-of-the-art results. Trained on a single GPU, it outperforms MULE (62M) on average and rivals MERT-95M, which was trained on 16 and 64 GPUs, respectively. Additionally, it surpasses MERT-95M-public, establishing itself as the best-performing model trained on publicly available data. We release our code and models to promote reproducibility and facilitate future research.
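
As a rough illustration of the token-masking augmentation, the sketch below drops a random 90% of spectrogram patch tokens so that two independent maskings of one clip form a contrastive pair; the shapes and mask ratio follow the abstract, while everything else is an assumption rather than Myna's released code.

```python
import torch

def mask_tokens(tokens: torch.Tensor, mask_ratio: float = 0.9) -> torch.Tensor:
    """Keep a random 10% of tokens; tokens has shape (batch, n_tokens, dim)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - mask_ratio)))
    # Independent random subset per batch element.
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(b, n_keep, d))

# Two independently masked views of the same clip form a positive pair.
spec_tokens = torch.randn(8, 256, 192)           # (batch, patch tokens, embed dim)
view_a, view_b = mask_tokens(spec_tokens), mask_tokens(spec_tokens)
print(view_a.shape)                              # torch.Size([8, 25, 192])
```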

[AI-48] Mixture of Attention Yields Accurate Results for Tabular Data

链接: https://arxiv.org/abs/2502.12507
作者: Xuechen Li,Yupeng Li,Jian Liu,Xiaolin Jin,Tian Yang,Xin Hu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Mixture of Attention (MOA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and compare it with other state-of-the-art transformer-based methods. Extensive experiments demonstrate that our model achieves superior performance among transformer-based methods in both tabular classification and regression tasks.
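
A minimal sketch of the Mixture of Attention idea, several parallel self-attention branches whose outputs are averaged; the branch count, head count, and plain `nn.MultiheadAttention` branches are placeholders, and MAYA's consistency constraint and decoder are omitted.

```python
import torch
import torch.nn as nn

class MixtureOfAttention(nn.Module):
    """Runs parallel self-attention branches and averages their outputs."""
    def __init__(self, dim: int, n_branches: int = 4, n_heads: int = 4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_branches)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [attn(x, x, x)[0] for attn in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)   # average across branches

x = torch.randn(32, 10, 64)      # (batch, feature tokens, dim)
print(MixtureOfAttention(64)(x).shape)   # torch.Size([32, 10, 64])
```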

[AI-49] EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness

链接: https://arxiv.org/abs/2502.12494
作者: Yunxiao Zhang,Guanming Xiong,Haochen Li,Wen Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities as AI agents. However, existing methods for enhancing LLM-agent abilities often lack a focus on data quality, leading to inefficiencies and suboptimal results in both fine-tuning and prompt engineering. To address this issue, we introduce EDGE, a novel approach for identifying informative samples without needing golden answers. We propose the Guideline Effectiveness (GE) metric, which selects challenging samples by measuring the impact of human-provided guidelines in multi-turn interaction tasks. A low GE score indicates that the human expertise required for a sample is missing from the guideline, making the sample more informative. By selecting samples with low GE scores, we can improve the efficiency and outcomes of both prompt engineering and fine-tuning processes for LLMs. Extensive experiments validate the performance of our method. Our method achieves competitive results on the HotpotQA and WebShop datasets, requiring 75% and 50% less data, respectively, while outperforming existing methods. We also provide a fresh perspective on the data quality of LLM-agent fine-tuning.

[AI-50] Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation

链接: https://arxiv.org/abs/2502.12492
作者: Kounianhua Du,Hanjing Wang,Jianxing Liu,Jizheng Chen,Xinyi Dai,Yasheng Wang,Ruiming Tang,Yong Yu,Jun Wang,Weinan Zhang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in various domains, particularly in system 1 tasks, yet the intricacies of their problem-solving mechanisms in system 2 tasks are not sufficiently explored. Research on System2-to-System1 methods has recently surged, exploring System 2 reasoning knowledge via inference-time computation and compressing the explored knowledge into the System 1 process. In this paper, we focus on code generation, which is a representative System 2 task, and identify two primary challenges: (1) the complex hidden reasoning processes and (2) the heterogeneous data distributions that complicate the exploration and training of robust LLM solvers. To tackle these issues, we propose a novel BDC framework that explores insightful System 2 knowledge of LLMs using a MC-Tree-Of-Agents algorithm with mutual Boosting, Disentangles the heterogeneous training data into composable LoRA-experts, and obtains a Customized problem solver for each data instance with an input-aware hypernetwork that weights over the LoRA-experts, offering effectiveness, flexibility, and robustness. This framework leverages multiple LLMs through mutual verification and boosting, integrated into a Monte-Carlo Tree Search process enhanced by reflection-based pruning and refinement. Additionally, we introduce the DisenLora algorithm, which clusters heterogeneous data to fine-tune LLMs into composable LoRA experts, enabling the adaptive generation of customized problem solvers through an input-aware hypernetwork. This work lays the groundwork for advancing LLM capabilities in complex reasoning tasks, offering a novel System2-to-System1 solution.

[AI-51] LocalEscaper: A Weakly-supervised Framework with Regional Reconstruction for Scalable Neural TSP Solvers

链接: https://arxiv.org/abs/2502.12484
作者: Junrui Wen,Yifei Li,Bart Selman,Kun He
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural solvers have shown significant potential in solving the Traveling Salesman Problem (TSP), yet current approaches face significant challenges. Supervised learning (SL)-based solvers require large amounts of high-quality labeled data, while reinforcement learning (RL)-based solvers, though less dependent on such data, often suffer from inefficiencies. To address these limitations, we propose LocalEscaper, a novel weakly-supervised learning framework for large-scale TSP. LocalEscaper effectively combines the advantages of both SL and RL, enabling effective training on datasets with low-quality labels. To further enhance solution quality, we introduce a regional reconstruction strategy, which mitigates the problem of local optima, a common issue in existing local reconstruction methods. Additionally, we propose a linear-complexity attention mechanism that reduces computational overhead, enabling the efficient solution of large-scale TSPs without sacrificing performance. Experimental results on both synthetic and real-world datasets demonstrate that LocalEscaper outperforms existing neural solvers, achieving state-of-the-art results. Notably, it sets a new benchmark for scalability and efficiency, solving TSP instances with up to 50,000 cities.

[AI-52] MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

链接: https://arxiv.org/abs/2502.12468
作者: Yutong Wang,Pengliang Ji,Chaoqun Yang,Kaixin Li,Ming Hu,Jiaoyang Li,Guillaume Sartoretti
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model’s accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
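
For intuition, here is a generic UCT selection rule of the kind the node-selection strategy builds on; the exploration constant, the value source, and the data layout are assumptions, not the paper's exact formulation.

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.41):
    """Upper Confidence Bound for Trees: mean value (exploitation) plus an
    exploration bonus that shrinks as the child is visited more often."""
    if child_visits == 0:
        return float("inf")                     # always try unvisited children
    exploit = child_value_sum / child_visits    # e.g. aggregated self-assessments
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children):
    """children: list of dicts with 'value_sum' and 'visits' statistics."""
    parent_visits = sum(ch["visits"] for ch in children) + 1
    return max(children,
               key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits))

print(select_child([{"value_sum": 3.0, "visits": 5},
                    {"value_sum": 1.0, "visits": 1}]))
```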

[AI-53] UniMatch: Universal Matching from Atom to Task for Few-Shot Drug Discovery ICLR2025

链接: https://arxiv.org/abs/2502.12453
作者: Ruifeng Li,Mingqian Li,Wei Liu,Yuhua Zhou,Xiangxin Zhou,Yuan Yao,Qiang Zhang,Hongyang Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: accepted as ICLR 2025 Spotlight

点击查看摘要

Abstract:Drug discovery is crucial for identifying candidate drugs for various diseases. However, its low success rate often results in a scarcity of annotations, posing a few-shot learning problem. Existing methods primarily focus on single-scale features, overlooking the hierarchical molecular structures that determine different molecular properties. To address these issues, we introduce Universal Matching Networks (UniMatch), a dual matching framework that integrates explicit hierarchical molecular matching with implicit task-level matching via meta-learning, bridging multi-level molecular representations and task-level generalization. Specifically, our approach explicitly captures structural features across multiple levels, such as atoms, substructures, and molecules, via hierarchical pooling and matching, facilitating precise molecular representation and comparison. Additionally, we employ a meta-learning strategy for implicit task-level matching, allowing the model to capture shared patterns across tasks and quickly adapt to new ones. This unified matching framework ensures effective molecular alignment while leveraging shared meta-knowledge for fast adaptation. Our experimental results demonstrate that UniMatch outperforms state-of-the-art methods on the MoleculeNet and FS-Mol benchmarks, achieving improvements of 2.87% in AUROC and 6.52% in delta AUPRC. UniMatch also shows excellent generalization ability on the Meta-MolNet benchmark.

[AI-54] Investigating and Extending Homans' Social Exchange Theory with Large Language Model based Agents

链接: https://arxiv.org/abs/2502.12450
作者: Lei Wang,Zheqing Zhang,Xu Chen
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Homans’ Social Exchange Theory (SET) is widely recognized as a basic framework for understanding the formation and emergence of human civilizations and social structures. In social science, this theory is typically studied based on simple simulation experiments or real-world human studies, both of which either lack realism or are too expensive to control. In artificial intelligence, recent advances in large language models (LLMs) have shown promising capabilities in simulating human behaviors. Inspired by these insights, we adopt an interdisciplinary research perspective and propose using LLM-based agents to study Homans’ SET. Specifically, we construct a virtual society composed of three LLM agents and have them engage in a social exchange game to observe their behaviors. Through extensive experiments, we found that Homans’ SET is well validated in our agent society, demonstrating the consistency between the agent and human behaviors. Building on this foundation, we intentionally alter the settings of the agent society to extend the traditional Homans’ SET, making it more comprehensive and detailed. To the best of our knowledge, this paper marks the first step in studying Homans’ SET with LLM-based agents. More importantly, it introduces a novel and feasible research paradigm that bridges the fields of social science and computer science through LLM-based agents. Code is available at this https URL.

[AI-55] Computational Safety for Generative AI: A Signal Processing Perspective

链接: https://arxiv.org/abs/2502.12445
作者: Pin-Yu Chen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: preprint for an invited paper

点击查看摘要

Abstract:AI safety is a rapidly growing area of research that seeks to prevent the harm and misuse of frontier AI technology, particularly with respect to generative AI (GenAI) tools that are capable of creating realistic and high-quality content through text prompts. Examples of such tools include large language models (LLMs) and text-to-image (T2I) diffusion models. As the performance of various leading GenAI models approaches saturation due to similar training data sources and neural network architecture designs, the development of reliable safety guardrails has become a key differentiator for responsibility and sustainability. This paper presents a formalization of the concept of computational safety, which is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI through the lens of signal processing theory and methods. In particular, we explore two exemplary categories of computational safety challenges in GenAI that can be formulated as hypothesis testing problems. For the safety of model input, we show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. For the safety of model output, we elucidate how statistical signal processing and adversarial learning can be used to detect AI-generated content. Finally, we discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.

[AI-56] SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

链接: https://arxiv.org/abs/2502.12444
作者: Ahmed F. AbouElhamayed,Jordan Dotzel,Yash Akhauri,Chi-Chih Chang,Sameh Gobriel,J. Pablo Muñoz,Vui Seng Chua,Nilesh Jain,Mohamed S. Abdelfattah
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a 1.42x reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation, achieving a 1.14x speedup over the current systems without compromising accuracy. Code: this https URL
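
The "automatically replacing all linear layers" step can be sketched as a recursive module swap; the magnitude-pruning factory below is only a toy stand-in for the actual AMX sparse kernels, and all names are assumptions rather than the released API.

```python
import torch
import torch.nn as nn

def swap_linear_layers(model: nn.Module, factory) -> nn.Module:
    """Recursively replace every nn.Linear in `model` with factory(old_layer)."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, factory(child))
        else:
            swap_linear_layers(child, factory)
    return model

def prune_to_sparse(linear: nn.Linear, sparsity: float = 0.5) -> nn.Linear:
    """Toy stand-in: zero out the smallest-magnitude weights (unstructured)."""
    with torch.no_grad():
        w = linear.weight
        k = max(1, int(w.numel() * sparsity))
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).to(w.dtype))
    return linear

model = swap_linear_layers(
    nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)),
    prune_to_sparse,
)
```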

[AI-57] Bridge the Gaps between Machine Unlearning and AI Regulation

链接: https://arxiv.org/abs/2502.12430
作者: Bill Marino,Meghdad Kurmanji,Nicholas D. Lane
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The “right to be forgotten” and the data privacy laws that encode it have motivated machine unlearning since its earliest days. Now, an inbound wave of artificial intelligence regulations - like the European Union’s Artificial Intelligence Act (AIA) - potentially offer important new use cases for machine unlearning. However, this position paper argues, this opportunity will only be realized if researchers, aided by policymakers, proactively bridge the (sometimes sizable) gaps between machine unlearning’s state of the art and its potential applications to AI regulation. To demonstrate this point, we use the AIA as an example. Specifically, we deliver a “state of the union” as regards machine unlearning’s current potential for aiding compliance with the AIA. This starts with a precise cataloging of the potential applications of machine unlearning to AIA compliance. For each, we flag any legal ambiguities clouding the potential application and, moreover, flag the technical gaps that exist between the potential application and the state of the art of machine unlearning. Finally, we end with a call to action: for both machine learning researchers and policymakers, to, respectively, solve the open technical and legal questions that will unlock machine unlearning’s potential to assist compliance with the AIA - and other AI regulation like it.

[AI-58] Solving the Cold Start Problem on One's Own as an End User via Preference Transfer

链接: https://arxiv.org/abs/2502.12398
作者: Ryoma Sato
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:We propose a new approach that enables end users to directly solve the cold start problem by themselves. The cold start problem is a common issue in recommender systems, and many methods have been proposed to address the problem on the service provider’s side. However, when the service provider does not take action, users are left with poor recommendations and no means to improve their experience. We propose an algorithm, Pretender, that allows end users to proactively solve the cold start problem on their own. Pretender does not require any special support from the service provider and can be deployed independently by users. We formulate the problem as minimizing the distance between the source and target distributions and optimize item selection from the target service accordingly. Furthermore, we establish theoretical guarantees for Pretender based on a discrete quadrature problem. We conduct experiments on real-world datasets to demonstrate the effectiveness of Pretender.

[AI-59] Could AI Leapfrog the Web? Evidence from Teachers in Sierra Leone

链接: https://arxiv.org/abs/2502.12397
作者: Daniel Björkegren,Jun Ho Choi,Divya Budihal,Dominic Sobhani,Oliver Garrod,Paul Atherton
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
*备注:

点击查看摘要

Abstract:Access to digital information is a driver of economic development. But although 85% of sub-Saharan Africa’s population is covered by mobile broadband signal, only 37% use the internet, and those who do seldom use the web. We investigate whether AI can bridge this gap by analyzing how 469 teachers use an AI chatbot in Sierra Leone. The chatbot, accessible via a common messaging app, is compared against traditional web search. Teachers use AI more frequently than web search for teaching assistance. Data cost is the most frequently cited reason for low internet usage across Africa. The average web search result consumes 3,107 times more data than an AI response, making AI 87% less expensive than web search. Additionally, only 2% of results for corresponding web searches contain content from Sierra Leone. In blinded evaluations, an independent sample of teachers rate AI responses as more relevant, helpful, and correct than web search results. These findings suggest that AI-driven solutions can cost-effectively bridge information gaps in low-connectivity regions.

[AI-60] Hybrid Machine Learning Models for Intrusion Detection in IoT: Leveraging a Real-World IoT Dataset

链接: https://arxiv.org/abs/2502.12382
作者: Md Ahnaf Akif,Ismail Butun,Andre Williams,Imadeldin Mahgoub
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 9 pages, 8 figures, 2 tables, journal submission

点击查看摘要

Abstract:The rapid growth of the Internet of Things (IoT) has revolutionized industries, enabling unprecedented connectivity and functionality. However, this expansion also increases vulnerabilities, exposing IoT networks to increasingly sophisticated cyberattacks. Intrusion Detection Systems (IDS) are crucial for mitigating these threats, and recent advancements in Machine Learning (ML) offer promising avenues for improvement. This research explores a hybrid approach, combining several standalone ML models such as Random Forest (RF), XGBoost, K-Nearest Neighbors (KNN), and AdaBoost, in a voting-based hybrid classifier for effective IoT intrusion detection. This ensemble method leverages the strengths of individual algorithms to enhance accuracy and address challenges related to data complexity and scalability. Using the widely-cited IoT-23 dataset, a prominent benchmark in IoT cybersecurity research, we evaluate our hybrid classifiers for both binary and multi-class intrusion detection problems, ensuring a fair comparison with existing literature. Results demonstrate that our proposed hybrid models, designed for robustness and scalability, outperform standalone approaches in IoT environments. This work contributes to the development of advanced, intelligent IDS frameworks capable of addressing evolving cyber threats.
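
A voting-based hybrid of this kind is straightforward to express in scikit-learn; in this sketch XGBoost is swapped for sklearn's GradientBoosting to keep the example dependency-free, and all hyperparameters are placeholders rather than the paper's tuned values.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier

# Soft voting averages each member's predicted class probabilities.
hybrid_ids = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("gb", GradientBoostingClassifier()),
                ("knn", KNeighborsClassifier(n_neighbors=5)),
                ("ada", AdaBoostClassifier())],
    voting="soft",
)
# Usage: hybrid_ids.fit(X_train, y_train); hybrid_ids.predict(X_test)
```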

[AI-61] Soft Robotics for Search and Rescue: Advancements Challenges and Future Directions

链接: https://arxiv.org/abs/2502.12373
作者: Abhishek Sebastian
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Soft robotics has emerged as a transformative technology in Search and Rescue (SAR) operations, addressing challenges in navigating complex, hazardous environments that often limit traditional rigid robots. This paper critically examines advancements in soft robotic technologies tailored for SAR applications, focusing on their unique capabilities in adaptability, safety, and efficiency. By leveraging bio-inspired designs, flexible materials, and advanced locomotion mechanisms, such as crawling, rolling, and shape morphing, soft robots demonstrate exceptional potential in disaster scenarios. However, significant barriers persist, including material durability, power inefficiency, sensor integration, and control complexity. This comprehensive review highlights the current state of soft robotics in SAR, discusses simulation methodologies and hardware validations, and introduces performance metrics essential for their evaluation. By bridging the gap between theoretical advancements and practical deployment, this study underscores the potential of soft robotic systems to revolutionize SAR missions and advocates for continued interdisciplinary innovation to overcome existing limitations.

[AI-62] IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation

链接: https://arxiv.org/abs/2502.12371
作者: Krishan Rana,Robert Lee,David Pershouse,Niko Suenderhauf
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Videos and code are available at this https URL

点击查看摘要

Abstract:Recent advances in imitation learning, particularly using generative modelling techniques like diffusion, have enabled policies to capture complex multi-modal action distributions. However, these methods often require large datasets and multiple inference steps for action generation, posing challenges in robotics where the cost for data collection is high and computation resources are limited. To address this, we introduce IMLE Policy, a novel behaviour cloning approach based on Implicit Maximum Likelihood Estimation (IMLE). IMLE Policy excels in low-data regimes, effectively learning from minimal demonstrations and requiring 38% less data on average to match the performance of baseline methods in learning complex multi-modal behaviours. Its simple generator-based architecture enables single-step action generation, improving inference speed by 97.3% compared to Diffusion Policy, while outperforming single-step Flow Matching. We validate our approach across diverse manipulation tasks in simulated and real-world environments, showcasing its ability to capture complex behaviours under data constraints. Videos and code are provided on our project page: this https URL.

[AI-63] Human-centered explanation does not fit all: The interplay of sociotechnical, cognitive, and individual factors in the effect of AI explanations in algorithmic decision-making

链接: https://arxiv.org/abs/2502.12354
作者: Yongsu Ahn,Yu-Run Lin,Malihe Alikhani,Eunjeong Cheon
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Recent XAI studies have investigated what constitutes a good explanation in AI-assisted decision-making. Despite the widely accepted human-friendly properties of explanations, such as being contrastive and selective, existing studies have yielded inconsistent findings. To address these gaps, our study focuses on the cognitive dimensions of explanation evaluation, evaluating six explanations with different contrastive strategies and information selectivity and scrutinizing the factors behind their valuation process. Our analysis finds that contrastive explanations are not the most preferable or understandable in general; rather, different contrastive and selective explanations were appreciated to different extents depending on who is explained to and when, how, and what is explained, with different levels of cognitive load and engagement across sociotechnical contexts. Given these findings, we call for a nuanced view of explanation strategies, with implications for designing AI interfaces to accommodate individual and contextual differences in AI-assisted decision-making.

[AI-64] Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs

链接: https://arxiv.org/abs/2502.12352
作者: Batu El,Deepro Choudhury,Pietro Liò,Chaitanya K. Joshi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: this https URL

[AI-65] QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

链接: https://arxiv.org/abs/2502.12346
作者: Jiajun Zhou,Yifan Yang,Kai Zhen,Ziyue Liu,Yequan Zhao,Ershad Banijamali,Athanasios Mouchtaris,Ngai Wong,Zheng Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various downstream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam optimization require backpropagation, which is error-prone in low-precision settings. To overcome these limitations, we propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision (e.g., 4- or 8-bit) forward passes. Our method can avoid the error-prone low-precision straight-through estimator, and utilizes optimized stochastic rounding to mitigate the increased bias. QuZO simplifies the training process, while achieving results comparable to first-order methods in FP8 and superior accuracy in INT8 and INT4 training. Experiments demonstrate that low-bit training with QuZO achieves performance comparable to MeZO optimization on GLUE, Multi-Choice, and Generation tasks, while reducing memory cost by 2.94x in LLaMA2-7B fine-tuning compared to quantized first-order methods.

[AI-66] A Novel Unified Parametric Assumption for Nonconvex Optimization

链接: https://arxiv.org/abs/2502.12329
作者: Artem Riabinin,Ahmed Khaled,Peter Richtárik
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Nonconvex optimization is central to modern machine learning, but the general framework of nonconvex optimization yields weak convergence guarantees that are too pessimistic compared to practice. On the other hand, while convexity enables efficient optimization, it is of limited applicability to many practical problems. To bridge this gap and better understand the practical success of optimization algorithms in nonconvex settings, we introduce a novel unified parametric assumption. Our assumption is general enough to encompass a broad class of nonconvex functions while also being specific enough to enable the derivation of a unified convergence theorem for gradient-based methods. Notably, by tuning the parameters of our assumption, we demonstrate its versatility in recovering several existing function classes as special cases and in identifying functions amenable to efficient optimization. We derive our convergence theorem for both deterministic and stochastic optimization, and conduct experiments to verify that our assumption can hold practically over optimization trajectories.

[AI-67] Connecting Large Language Model Agent to High Performance Computing Resource

链接: https://arxiv.org/abs/2502.12280
作者: Heng Ma,Alexander Brace,Carlo Siebenschuh,Greg Pauloski,Ian Foster,Arvind Ramanathan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:The Large Language Model agent workflow enables the LLM to invoke tool functions to improve performance on domain-specific scientific questions. Tackling scientific research at large scale requires access to computing resources and a parallel computing setup. In this work, we integrated Parsl into the LangChain/LangGraph tool-call setup to bridge the gap between the LLM agent and the computing resources. Two tool-call implementations were set up and tested, both on a local workstation and in an HPC environment on Polaris/ALCF. The first implementation, a Parsl-enabled LangChain tool node, queues the tool functions concurrently to the Parsl workers for parallel execution. The second configuration converts the tool functions into Parsl ensemble functions and is better suited to large tasks in a supercomputer environment. The LLM agent workflow was prompted to run molecular dynamics simulations with different protein structures and simulation conditions. These results show that the LLM agent's tools were managed and executed concurrently by Parsl on the available computing resources.
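
A minimal sketch of the first configuration's core mechanic, submitting tool functions to Parsl workers as concurrent futures; the local thread-pool config and the `run_md` tool are illustrative assumptions, not the paper's HPC setup.

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# Local stand-in config; on an HPC system this would target the cluster.
parsl.load(Config(executors=[ThreadPoolExecutor(max_threads=4)]))

@python_app
def run_md(structure: str, temperature: float) -> str:
    # Stand-in for a molecular-dynamics tool function the LLM agent can call.
    return f"simulated {structure} at {temperature} K"

# A tool node can queue several tool calls at once; Parsl runs them in parallel.
futures = [run_md(pdb, 300.0) for pdb in ["1ubq", "2lyz"]]
print([f.result() for f in futures])
```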

[AI-68] Towards Practical First-Order Model Counting

链接: https://arxiv.org/abs/2502.12278
作者: Ananth K. Kidambi,Guramrit Singh,Paulius Dilkas,Kuldeep S. Meel
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: 18 pages, 1 figure

点击查看摘要

Abstract:First-order model counting (FOMC) is the problem of counting the number of models of a sentence in first-order logic. Since lifted inference techniques rely on reductions to variants of FOMC, the design of scalable methods for FOMC has attracted attention from both theoreticians and practitioners over the past decade. Recently, a new approach based on first-order knowledge compilation was proposed. This approach, called Crane, instead of simply providing the final count, generates definitions of (possibly recursive) functions that can be evaluated with different arguments to compute the model count for any domain size. However, this approach is not fully automated, as it requires manual evaluation of the constructed functions. The primary contribution of this work is a fully automated compilation algorithm, called Gantry, which transforms the function definitions into C++ code equipped with arbitrary-precision arithmetic. These additions allow the new FOMC algorithm to scale to domain sizes over 500,000 times larger than the current state of the art, as demonstrated through experimental results.

[AI-69] NeuroStrata: Harnessing Neurosymbolic Paradigms for Improved Design Testability and Verifiability of Autonomous CPS

链接: https://arxiv.org/abs/2502.12267
作者: Xi Zheng,Ziyang Li,Ivan Ruchkin,Ruzica Piskac,Miroslav Pajic
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous cyber-physical systems (CPSs) leverage AI for perception, planning, and control but face trust and safety certification challenges due to inherent uncertainties. The neurosymbolic paradigm replaces stochastic layers with interpretable symbolic AI, enabling determinism. While promising, challenges like multisensor fusion, adaptability, and verification remain. This paper introduces NeuroStrata, a neurosymbolic framework to enhance the testing and verification of autonomous CPS. We outline its key components, present early results, and detail future plans.

[AI-70] Identifying the Best Transition Law

链接: https://arxiv.org/abs/2502.12227
作者: Mehrasa Ahmadipour,élise Crepon,Aurélien Garivier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Motivated by recursive learning in Markov Decision Processes, this paper studies best-arm identification in bandit problems where each arm's reward is drawn from a multinomial distribution with known support. We compare the performance of several strategies, notably LUCB, with and without the use of this knowledge. In the first case, we use classical non-parametric approaches for the confidence intervals. In the second case, where a probability distribution is to be estimated, we first apply classical deviation bounds (Hoeffding and Bernstein) to each dimension independently, and then the Empirical Likelihood method (EL-LUCB) to the joint probability vector. The effectiveness of these methods is demonstrated through simulations on scenarios with varying levels of structural complexity.
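
For intuition, here is a minimal sketch of an LUCB-style stopping rule with Hoeffding radii; this is one common form of the bound, and the paper's exact constants and the Empirical Likelihood variant differ.

```python
import math

def hoeffding_radius(pulls: int, t: int, delta: float = 0.05) -> float:
    """Confidence radius for the mean of `pulls` bounded samples at round t."""
    return math.sqrt(math.log(4.0 * t * t / delta) / (2.0 * pulls))

def should_stop(means, pulls, t, delta=0.05):
    """Stop when the empirical best arm's lower bound clears every rival's upper bound."""
    radii = [hoeffding_radius(n, t, delta) for n in pulls]
    best = max(range(len(means)), key=lambda i: means[i])
    lcb_best = means[best] - radii[best]
    ucb_rival = max(means[i] + radii[i] for i in range(len(means)) if i != best)
    return lcb_best > ucb_rival

print(should_stop(means=[0.8, 0.5], pulls=[500, 500], t=1000))  # True
```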

[AI-71] On Creating a Causally Grounded Usable Rating Method for Assessing the Robustness of Foundation Models Supporting Time Series

链接: https://arxiv.org/abs/2502.12226
作者: Kausik Lakkaraju,Rachneet Kaur,Parisa Zehtabi,Sunandita Patra,Siva Likitha Valluru,Zhen Zeng,Biplav Srivastava,Marco Valtorta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Foundation Models (FMs) have improved time series forecasting in various sectors, such as finance, but their vulnerability to input disturbances can hinder their adoption by stakeholders, such as investors and analysts. To address this, we propose a causally grounded rating framework to study the robustness of Foundational Models for Time Series (FMTS) with respect to input perturbations. We apply our approach to the stock price prediction problem, a well-studied task with easily accessible public data, evaluating six state-of-the-art (some multi-modal) FMTS across six prominent stocks spanning three industries. The ratings proposed by our framework effectively assess the robustness of FMTS and also offer actionable insights for model selection and deployment. Within the scope of our study, we find that (1) multi-modal FMTS exhibit better robustness and accuracy compared to their uni-modal versions and (2) FMTS pre-trained on the time series forecasting task exhibit better robustness and forecasting accuracy compared to general-purpose FMTS pre-trained across diverse settings. Further, to validate our framework's usability, we conduct a user study showcasing FMTS prediction errors along with our computed ratings. The study confirmed that our ratings reduced the difficulty for users in comparing the robustness of different systems.

[AI-72] Subjective Logic Encodings

链接: https://arxiv.org/abs/2502.12225
作者: Jake Vasilakes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: We make our code publicly available at this https URL

点击查看摘要

Abstract:Many existing approaches for learning from labeled data assume the existence of gold-standard labels. According to these approaches, inter-annotator disagreement is seen as noise to be removed, either through refinement of annotation guidelines, label adjudication, or label filtering. However, annotator disagreement can rarely be totally eradicated, especially on more subjective tasks such as sentiment analysis or hate speech detection where disagreement is natural. Therefore, a new approach to learning from labeled data, called data perspectivism, seeks to leverage inter-annotator disagreement to learn models that stay true to the inherent uncertainty of the task by treating annotations as opinions of the annotators, rather than gold-standard facts. Despite this conceptual grounding, existing methods under data perspectivism are limited to using disagreement as the sole source of annotation uncertainty. To expand the possibilities of data perspectivism, we introduce Subjective Logic Encodings (SLEs), a flexible framework for constructing classification targets that explicitly encodes annotations as opinions of the annotators. Based on Subjective Logic Theory, SLEs encode labels as Dirichlet distributions and provide principled methods for encoding and aggregating various types of annotation uncertainty – annotator confidence, reliability, and disagreement – into the targets. We show that SLEs are a generalization of other types of label encodings as well as how to estimate models to predict SLEs using a distribution matching objective.
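
A count-based special case of the encoding can be sketched in a few lines; the paper's SLEs additionally fold in annotator confidence and reliability, which this toy version omits.

```python
import numpy as np

def sle_from_votes(votes, n_classes, prior=1.0):
    """Encode a set of annotator votes as a Dirichlet: alpha_k = prior + count_k.
    The Dirichlet mean is a soft target; its concentration reflects evidence."""
    alpha = np.full(n_classes, prior)
    for v in votes:
        alpha[v] += 1.0
    return alpha, alpha / alpha.sum()

alpha, soft_target = sle_from_votes(votes=[0, 0, 1], n_classes=3)
print(alpha)        # [3. 2. 1.]: disagreement keeps mass on both voted classes
print(soft_target)  # ~[0.50 0.33 0.17], a distribution-matching target
```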

[AI-73] Accurate Expert Predictions in MoE Inference via Cross-Layer Gate

链接: https://arxiv.org/abs/2502.12224
作者: Zhiyuan Fang,Zicong Hong,Yuegui Huang,Yufeng Lyu,Wuhui Chen,Yue Yu,Fan Yu,Zibin Zheng
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional GPU overhead. Furthermore, Fate employs a shallow-favoring expert caching strategy that increases the expert hit rate to 99%. Additionally, Fate integrates tailored quantization strategies for cache optimization and IO efficiency. Experimental results show that, compared to Load on Demand and Expert Activation Path-based methods, Fate achieves up to 4.5x and 1.9x speedups in prefill speed and up to 4.1x and 2.2x speedups in decoding speed, respectively, while maintaining inference quality. Moreover, Fate’s performance improvements are scalable across different memory budgets.

[AI-74] IMPACTX: Improving Model Performance by Appropriately predicting CorrecT eXplanations

链接: https://arxiv.org/abs/2502.12222
作者: Andrea Apicella,Salvatore Giugliano,Francesco Isgrò,Roberto Prevete
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: in peer review

点击查看摘要

Abstract:The eXplainable Artificial Intelligence (XAI) research predominantly concentrates on providing explanations about AI model decisions, especially Deep Learning (DL) models. However, there is a growing interest in using XAI techniques to automatically improve the performance of the AI systems themselves. This paper proposes IMPACTX, a novel approach that leverages XAI as a fully automated attention mechanism, without requiring external knowledge or human feedback. Experimental results show that IMPACTX improves performance with respect to the standalone ML model by integrating an attention mechanism based on an XAI method's outputs during model training. Furthermore, IMPACTX directly provides proper feature attribution maps for the model's decisions, without relying on external XAI methods during the inference process. Our proposal is evaluated using three widely recognized DL models (EfficientNet-B2, MobileNet, and LeNet-5) along with three standard image datasets: CIFAR-10, CIFAR-100, and STL-10. The results show that IMPACTX consistently improves the performance of all the inspected DL models across all evaluated datasets, and it directly provides appropriate explanations for its responses.

[AI-75] Revisiting the Test-Time Scaling of o1 -like Models: Do they Truly Possess Test-Time Scaling Capabilities?

链接: https://arxiv.org/abs/2502.12215
作者: Zhiyuan Zeng,Qinyuan Cheng,Zhangyue Yin,Yunhua Zhou,Xipeng Qiu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI’s o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models’ self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models’ test-time scalability compared to conventional majority voting approaches.
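
One plausible reading of Shortest Majority Vote, sketched below (the paper's exact weighting may differ): group parallel samples by final answer, rank by vote count, and break ties toward shorter chains of thought.

```python
from collections import defaultdict

def shortest_majority_vote(samples):
    """samples: list of (answer, cot_length) pairs from parallel sampling.
    Prefer the most frequent answer; among equals, prefer shorter CoTs."""
    groups = defaultdict(list)
    for answer, length in samples:
        groups[answer].append(length)
    # Rank by vote count first, then by shorter mean CoT length.
    return max(groups.items(),
               key=lambda kv: (len(kv[1]), -sum(kv[1]) / len(kv[1])))[0]

print(shortest_majority_vote([("42", 120), ("42", 90), ("41", 60)]))  # "42"
```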

[AI-76] Spatiotemporal-aware Trend-Seasonality Decomposition Network for Traffic Flow Forecasting

链接: https://arxiv.org/abs/2502.12213
作者: Lingxiao Cao,Bin Wang,Guiyuan Jiang,Yanwei Yu,Junyu Dong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traffic prediction is critical for optimizing travel scheduling and enhancing public safety, yet the complex spatial and temporal dynamics within traffic data present significant challenges for accurate forecasting. In this paper, we introduce a novel model, the Spatiotemporal-aware Trend-Seasonality Decomposition Network (STDN). This model begins by constructing a dynamic graph structure to represent traffic flow and incorporates novel spatio-temporal embeddings to jointly capture global traffic dynamics. The representations learned are further refined by a specially designed trend-seasonality decomposition module, which disentangles the trend-cyclical component and seasonal component for each traffic node at different times within the graph. These components are subsequently processed through an encoder-decoder network to generate the final predictions. Extensive experiments conducted on real-world traffic datasets demonstrate that STDN achieves superior performance with remarkable computation cost. Furthermore, we have released a new traffic dataset named JiNan, which features unique inner-city dynamics, thereby enriching the scenario comprehensiveness in traffic prediction evaluation.
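
STDN learns its decomposition end-to-end; the classical analogue of what it disentangles can be sketched as follows, with the period and moving-average kernel as assumptions (boundary effects of the convolution are ignored for brevity).

```python
import numpy as np

def decompose(series: np.ndarray, period: int = 24):
    """Split a traffic series into trend (moving average), seasonal, and residual parts."""
    kernel = np.ones(period) / period
    trend = np.convolve(series, kernel, mode="same")
    detrended = series - trend
    # Average each within-period position to estimate the seasonal component.
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    residual = detrended - np.tile(seasonal, len(series) // period + 1)[: len(series)]
    return trend, seasonal, residual

t = np.arange(24 * 7, dtype=float)
series = 0.05 * t + 10 * np.sin(2 * np.pi * t / 24)   # trend plus daily seasonality
trend, seasonal, residual = decompose(series, period=24)
```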

[AI-77] PAR-AdvGAN: Improving Adversarial Attack Capability with Progressive Auto-Regression AdvGAN

链接: https://arxiv.org/abs/2502.12207
作者: Jiayu Zhang,Zhiyu Zhu,Xinyi Wang,Silin Liao,Zhibo Jin,Flora D. Salim,Huaming Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage the generator and discriminator models to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the potential of the methods. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN. Moreover, PAR-AdvGAN significantly accelerates adversarial example generation, achieving speeds of up to 335.5 frames per second on the Inception-v3 model and outperforming gradient-based transferable attack algorithms. Our code is available at: this https URL

[AI-78] An Interpretable Automated Mechanism Design Framework with Large Language Models

链接: https://arxiv.org/abs/2502.12203
作者: Jiayuan Liu,Mingyu Guo,Vincent Conitzer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Mechanism design has long been a cornerstone of economic theory, with traditional approaches relying on mathematical derivations. Recently, automated approaches, including differentiable economics with neural networks, have emerged for designing payments and allocations. While both analytical and automated methods have advanced the field, they each face significant weaknesses: mathematical derivations are not automated and often struggle to scale to complex problems, while automated and especially neural-network-based approaches suffer from limited interpretability. To address these challenges, we introduce a novel framework that reformulates mechanism design as a code generation task. Using large language models (LLMs), we generate heuristic mechanisms described in code and evolve them to optimize over some evaluation metrics while ensuring key design criteria (e.g., strategy-proofness) through a problem-specific fixing process. This fixing process ensures any mechanism violating the design criteria is adjusted to satisfy them, albeit with some trade-offs in performance metrics. These trade-offs are factored in during the LLM-based evolution process. The code generation capabilities of LLMs enable the discovery of novel and interpretable solutions, bridging the symbolic logic of mechanism design and the generative power of modern AI. Through rigorous experimentation, we demonstrate that LLM-generated mechanisms achieve competitive performance while offering greater interpretability compared to previous approaches. Notably, our framework can rediscover existing manually designed mechanisms and provide insights into neural-network based solutions through Programming-by-Example. These results highlight the potential of LLMs to not only automate but also enhance the transparency and scalability of mechanism design, ensuring safe deployment of the mechanisms in society.

[AI-79] Maximize Your Diffusion: A Study into Reward Maximization and Alignment for Diffusion-based Control

链接: https://arxiv.org/abs/2502.12198
作者: Dom Huh,Prasant Mohapatra
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion-based planning, learning, and control methods present a promising branch of powerful and expressive decision-making solutions. Given the growing interest, such methods have undergone numerous refinements over the past years. However, despite these advancements, existing methods are limited in their investigations regarding general methods for reward maximization within the decision-making process. In this work, we study extensions of fine-tuning approaches for control applications. Specifically, we explore extensions and various design choices for four fine-tuning approaches: reward alignment through reinforcement learning, direct preference optimization, supervised fine-tuning, and cascading diffusion. We optimize their usage to merge these independent efforts into one unified paradigm. We show the utility of such propositions in offline RL settings and demonstrate empirical improvements over a rich array of control tasks.

[AI-80] Boosting Generalization in Diffusion-Based Neural Combinatorial Solver via Energy-guided Sampling

链接: https://arxiv.org/abs/2502.12188
作者: Haoyu Lei,Kaiwen Zhou,Yinchuan Li,Zhitang Chen,Farzan Farnia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion-based Neural Combinatorial Optimization (NCO) has demonstrated effectiveness in solving NP-complete (NPC) problems by learning discrete diffusion models for solution generation, eliminating hand-crafted domain knowledge. Despite their success, existing NCO methods face significant challenges in both cross-scale and cross-problem generalization, and high training costs compared to traditional solvers. While recent studies have introduced training-free guidance approaches that leverage pre-defined guidance functions for zero-shot conditional generation, such methodologies have not been extensively explored in combinatorial optimization. To bridge this gap, we propose a general energy-guided sampling framework during inference time that enhances both the cross-scale and cross-problem generalization capabilities of diffusion-based NCO solvers without requiring additional training. We provide theoretical analysis that helps understanding the cross-problem transfer capability. Our experimental results demonstrate that a diffusion solver, trained exclusively on the Traveling Salesman Problem (TSP), can achieve competitive zero-shot solution generation on TSP variants, such as Prize Collecting TSP (PCTSP) and the Orienteering Problem (OP), through energy-guided sampling across different problem scales.
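
A generic sketch of training-free energy guidance at inference time; the guidance scale, the identity "denoiser" in the toy usage, and the continuous relaxation are all assumptions, since the actual solver guides a discrete diffusion model over solution heatmaps.

```python
import torch

def energy_guided_step(x, denoise_fn, energy_fn, guidance_scale=0.1):
    """One inference-time step: apply the pretrained denoiser, then nudge the
    sample down the gradient of a problem-specific energy (e.g. tour length)."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
    with torch.no_grad():
        return denoise_fn(x) - guidance_scale * grad

# Toy usage: guide samples toward low squared norm with an identity "denoiser".
x = torch.randn(4, 8)
for _ in range(50):
    x = energy_guided_step(x, denoise_fn=lambda z: z,
                           energy_fn=lambda z: (z ** 2).sum(dim=-1))
print(x.norm(dim=-1))   # norms shrink toward zero as the energy is minimized
```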

[AI-81] E2CB2former: Effecitve and Explainable Transformer for CB2 Receptor Ligand Activity Prediction

链接: https://arxiv.org/abs/2502.12186
作者: Jiacheng Xie,Yingrui Ji,Linghuan Zeng,Xi Xiao,Gaofei Chen,Lijing Zhu,Joyanta Jyoti Mondal,Jiansheng Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Accurate prediction of CB2 receptor ligand activity is pivotal for advancing drug discovery targeting this receptor, which is implicated in inflammation, pain management, and neurodegenerative conditions. Although conventional machine learning and deep learning techniques have shown promise, their limited interpretability remains a significant barrier to rational drug design. In this work, we introduce CB2former, a framework that combines a Graph Convolutional Network with a Transformer architecture to predict CB2 receptor ligand activity. By leveraging the Transformer’s self attention mechanism alongside the GCN’s structural learning capability, CB2former not only enhances predictive performance but also offers insights into the molecular features underlying receptor activity. We benchmark CB2former against diverse baseline models including Random Forest, Support Vector Machine, K Nearest Neighbors, Gradient Boosting, Extreme Gradient Boosting, Multilayer Perceptron, Convolutional Neural Network, and Recurrent Neural Network and demonstrate its superior performance with an R squared of 0.685, an RMSE of 0.675, and an AUC of 0.940. Moreover, attention weight analysis reveals key molecular substructures influencing CB2 receptor activity, underscoring the model’s potential as an interpretable AI tool for drug discovery. This ability to pinpoint critical molecular motifs can streamline virtual screening, guide lead optimization, and expedite therapeutic development. Overall, our results showcase the transformative potential of advanced AI approaches exemplified by CB2former in delivering both accurate predictions and actionable molecular insights, thus fostering interdisciplinary collaboration and innovation in drug discovery.

[AI-82] Ten Challenging Problems in Federated Foundation Models

链接: https://arxiv.org/abs/2502.12176
作者: Tao Fan,Hanlin Gu,Xuemei Cao,Chee Seng Chan,Qian Chen,Yiqiang Chen,Yihui Feng,Yang Gu,Jiaxiang Geng,Bing Luo,Shuoling Liu,Win Kent Ong,Chao Ren,Jiaqi Shao,Chuan Sun,Xiaoli Tang,Hong Xi Tae,Yongxin Tong,Shuyue Wei,Fan Wu,Wei Xi,Mingcong Xu,He Yang,Xin Yang,Jiangpeng Yan,Hao Yu,Han Yu,Teng Zhang,Yifei Zhang,Xiaojin Zhang,Zhenzhe Zheng,Lixin Fan,Qiang Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated Foundation Models (FedFMs) represent a distributed learning paradigm that fuses general competences of foundation models as well as privacy-preserving capabilities of federated learning. This combination allows the large foundation models and the small local domain models at the remote clients to learn from each other in a teacher-student learning setting. This paper provides a comprehensive summary of the ten challenging problems inherent in FedFMs, encompassing foundational theory, utilization of private data, continual learning, unlearning, Non-IID and graph data, bidirectional knowledge transfer, incentive mechanism design, game mechanism design, model watermarking, and efficiency. The ten challenging problems manifest in five pivotal aspects: "Foundational Theory", which aims to establish a coherent and unifying theoretical framework for FedFMs; "Data", addressing the difficulties in leveraging domain-specific knowledge from private data while maintaining privacy; "Heterogeneity", examining variations in data, model, and computational resources across clients; "Security and Privacy", focusing on defenses against malicious attacks and model theft; and "Efficiency", highlighting the need for improvements in training, communication, and parameter efficiency. For each problem, we offer a clear mathematical definition on the objective function, analyze existing methods, and discuss the key challenges and potential solutions. This in-depth exploration aims to advance the theoretical foundations of FedFMs, guide practical implementations, and inspire future research to overcome these obstacles, thereby enabling robust, efficient, and privacy-preserving FedFMs in various real-world applications.

[AI-83] Spatiotemporal Graph Neural Networks in short term load forecasting: Does adding Graph Structure in Consumption Data Improve Predictions?

链接: https://arxiv.org/abs/2502.12175
作者: Quoc Viet Nguyen,Joaquin Delgado Fernandez,Sergio Potenciano Menci
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, conference

点击查看摘要

Abstract:Short-term Load Forecasting (STLF) plays an important role in traditional and modern power systems. Most STLF models predominantly exploit temporal dependencies from historical data to predict future consumption. Nowadays, with the widespread deployment of smart meters, their data can contain spatiotemporal dependencies. In particular, their consumption data is not only correlated with historical values but also with the values of neighboring smart meters. This new characteristic motivates researchers to explore and experiment with new models that can effectively integrate spatiotemporal interrelations to increase forecasting performance. Spatiotemporal Graph Neural Networks (STGNNs) can leverage such interrelations by modeling relationships between smart meters as a graph and using these relationships as additional features to predict future energy consumption. While extensively studied in other spatiotemporal forecasting domains such as traffic, the environment, or renewable energy generation, their application to load forecasting remains relatively unexplored, particularly in scenarios where the graph structure is not inherently available. This paper overviews the current literature focusing on STGNNs with application in STLF. Additionally, from a technical perspective, it also benchmarks selected STGNN models for STLF at the residential and aggregate levels. The results indicate that incorporating graph features can improve forecasting accuracy at the residential level; however, this effect is not reflected at the aggregate level.

[AI-84] nanoML for Human Activity Recognition

链接: https://arxiv.org/abs/2502.12173
作者: Alan T. L. Bacellar,Mugdha P. Jadhao,Shashank Nag,Priscila M. V. Lima,Felipe M. G. Franca,Lizy K. John
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin

点击查看摘要

Abstract:Human Activity Recognition (HAR) is critical for applications in healthcare, fitness, and IoT, but deploying accurate models on resource-constrained devices remains challenging due to high energy and memory demands. This paper demonstrates the application of Differentiable Weightless Neural Networks (DWNs) to HAR, achieving competitive accuracies of 96.34% and 96.67% while consuming only 56nJ and 104nJ per sample, with an inference time of just 5ns per sample. The DWNs were implemented and evaluated on an FPGA, showcasing their practical feasibility for energy-efficient hardware deployment. DWNs achieve up to 926,000x energy savings and 260x memory reduction compared to state-of-the-art deep learning methods. These results position DWNs as a nano-machine-learning (nanoML) model for HAR, setting a new benchmark in energy efficiency and compactness for edge and wearable devices, paving the way for ultra-efficient edge AI.

[AI-85] TastePepAI, An artificial intelligence platform for taste peptide de novo design

链接: https://arxiv.org/abs/2502.12167
作者: Jianda Yue,Tingting Li,Jian Ouyang,Jiawei Xu,Hua Tan,Zihui Chen,Changsheng Han,Huanyu Li,Songping Liang,Zhonghua Liu,Zhonghua Liu,Ying Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 40 pages, 6 figures, research article

点击查看摘要

Abstract:Taste peptides have emerged as promising natural flavoring agents attributed to their unique organoleptic properties, high safety profile, and potential health benefits. However, the de novo identification of taste peptides derived from animal, plant, or microbial sources remains a time-consuming and resource-intensive process, significantly impeding their widespread application in the food industry. Here, we present TastePepAI, a comprehensive artificial intelligence framework for customized taste peptide design and safety assessment. As the key element of this framework, a loss-supervised adaptive variational autoencoder (LA-VAE) is implemented to efficiently optimize the latent representation of sequences during training and facilitate the generation of target peptides with desired taste profiles. Notably, our model incorporates a novel taste-avoidance mechanism, allowing for selective flavor exclusion. Subsequently, our in-house developed toxicity prediction algorithm (SpepToxPred) is integrated into the framework to subject generated peptides to rigorous safety evaluation. Using this integrated platform, we successfully identified 73 peptides exhibiting sweet, salty, and umami tastes, significantly expanding the current repertoire of taste peptides. This work demonstrates the potential of TastePepAI in accelerating taste peptide discovery for food applications and provides a versatile framework adaptable to broader peptide engineering challenges.

[AI-86] Calibration Error Estimation Using Fuzzy Binning

链接: https://arxiv.org/abs/2305.00543
作者: Geetanjali Bihani,Julia Taylor Rayz
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 4 figures, Accepted at NAFIPS 2023

点击查看摘要

Abstract:Neural network-based decisions tend to be overconfident, where their raw outcome probabilities do not align with the true decision probabilities. Calibration of neural networks is an essential step towards more reliable deep learning frameworks. Prior metrics of calibration error primarily utilize crisp bin membership-based measures. This exacerbates skew in model probabilities and portrays an incomplete picture of calibration error. In this work, we propose a Fuzzy Calibration Error metric (FCE) that utilizes a fuzzy binning approach to calculate calibration error. This approach alleviates the impact of probability skew and provides a tighter estimate while measuring calibration error. We compare our metric with ECE across different data populations and class memberships. Our results show that FCE offers better calibration error estimation, especially in multi-class settings, alleviating the effects of skew in model confidence scores on calibration error estimation. We make our code and supplementary materials available at: this https URL
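
To make the fuzzy-binning idea concrete, here is a minimal sketch contrasting standard ECE (crisp bins) with a fuzzy variant in which each sample contributes to neighboring bins via triangular membership weights. The membership function and normalization here are assumptions for illustration; the paper's exact FCE definition may differ.

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Standard Expected Calibration Error with crisp (hard) bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)  # hard bin per sample
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err

def fce(conf, correct, n_bins=10):
    """Fuzzy variant: each sample is spread over nearby bins with triangular
    membership weights instead of belonging to a single crisp bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    centers = (np.arange(n_bins) + 0.5) / n_bins
    width = 1.0 / n_bins
    m = np.clip(1.0 - np.abs(conf[:, None] - centers[None, :]) / width, 0.0, 1.0)
    m /= m.sum(axis=1, keepdims=True)            # each row sums to 1
    w = m.sum(axis=0)                            # fuzzy mass per bin
    acc = m.T @ correct / np.maximum(w, 1e-12)   # fuzzy per-bin accuracy
    avg = m.T @ conf / np.maximum(w, 1e-12)      # fuzzy per-bin confidence
    return float(np.sum(w / len(conf) * np.abs(acc - avg)))
```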

[AI-87] Performance Evaluation of Large Language Models in Statistical Programming

链接: https://arxiv.org/abs/2502.13117
作者: Xinyi Song,Kexin Xie,Lina Lee,Ruizhe Chen,Jared M. Clark,Hao He,Haoran He,Jie Min,Xinlei Zhang,Simin Zheng,Zhiyang Zhang,Xinwei Deng,Yili Hong
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
*备注: 27 pages, 8 figures

点击查看摘要

Abstract:The programming capabilities of large language models (LLMs) have revolutionized automatic code generation and opened new avenues for automatic statistical analysis. However, the validity and quality of these generated codes need to be systematically evaluated before they can be widely adopted. Despite their growing prominence, a comprehensive evaluation of statistical code generated by LLMs remains scarce in the literature. In this paper, we assess the performance of LLMs, including two versions of ChatGPT and one version of Llama, in the domain of SAS programming for statistical analysis. Our study utilizes a set of statistical analysis tasks encompassing diverse statistical topics and datasets. Each task includes a problem description, dataset information, and human-verified SAS code. We conduct a comprehensive assessment of the quality of SAS code generated by LLMs through human expert evaluation based on correctness, effectiveness, readability, executability, and the accuracy of output results. The analysis of rating scores reveals that while LLMs demonstrate usefulness in generating syntactically correct code, they struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results. This study offers valuable insights into the capabilities and limitations of LLMs in statistical programming, providing guidance for future advancements in AI-assisted coding systems for statistical analysis.

[AI-88] Likelihood-Ratio Regularized Quantile Regression: Adapting Conformal Prediction to High-Dimensional Covariate Shifts

链接: https://arxiv.org/abs/2502.13030
作者: Sunay Joshi,Shayan Kiyani,George Pappas,Edgar Dobriban,Hamed Hassani
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of conformal prediction under covariate shift. Given labeled data from a source domain and unlabeled data from a covariate shifted target domain, we seek to construct prediction sets with valid marginal coverage in the target domain. Most existing methods require estimating the unknown likelihood ratio function, which can be prohibitive for high-dimensional data such as images. To address this challenge, we introduce the likelihood ratio regularized quantile regression (LR-QR) algorithm, which combines the pinball loss with a novel choice of regularization in order to construct a threshold function without directly estimating the unknown likelihood ratio. We show that the LR-QR method has coverage at the desired level in the target domain, up to a small error term that we can control. Our proofs draw on a novel analysis of coverage via stability bounds from learning theory. Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, and an image classification task from the WILDS repository.
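
For reference, the pinball loss that LR-QR builds on is easy to state. The PyTorch sketch below shows only this standard ingredient; the likelihood-ratio regularizer, which is the paper's actual novelty, is not reproduced here.

```python
import torch

def pinball_loss(pred, target, alpha=0.9):
    # Quantile (pinball) loss: its minimizer is the alpha-quantile of target
    # given the input, which is why it underlies quantile/threshold estimation
    # in conformal-style methods.
    diff = target - pred
    return torch.mean(torch.maximum(alpha * diff, (alpha - 1.0) * diff))
```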

[AI-89] Time-series attribution maps with regularized contrastive learning AISTATS2025

链接: https://arxiv.org/abs/2502.12977
作者: Steffen Schneider,Rodrigo González Laiz,Anastasiia Filippova,Markus Frey,Mackenzie Weygandt Mathis
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Accepted at The 28th International Conference on Artificial Intelligence and Statistics (AISTATS 2025). Code is available at this https URL

点击查看摘要

Abstract:Gradient-based attribution methods aim to explain decisions of deep learning models but so far lack identifiability guarantees. Here, we propose a method to generate attribution maps with identifiability guarantees by developing a regularized contrastive learning algorithm trained on time-series data plus a new attribution method called Inverted Neuron Gradient (collectively named xCEBRA). We show theoretically that xCEBRA has favorable properties for identifying the Jacobian matrix of the data generating process. Empirically, we demonstrate robust approximation of zero vs. non-zero entries in the ground-truth attribution map on synthetic datasets, and significant improvements across previous attribution methods based on feature ablation, Shapley values, and other gradient-based methods. Our work constitutes a first example of identifiable inference of time-series attribution maps and opens avenues to a better understanding of time-series data, such as for neural dynamics and decision-processes within neural networks.

[AI-90] Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport

链接: https://arxiv.org/abs/2502.12793
作者: Eduardo Fernandes Montesuma,Adel El Habazi,Fred Ngole Mboula
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 9 figures, 1 table, under review

点击查看摘要

Abstract:Detecting anomalies in datasets is a longstanding problem in machine learning. In this context, an anomaly is defined as a sample that significantly deviates from the remaining data. Meanwhile, optimal transport (OT) is a field of mathematics concerned with transporting one probability measure to another with the least effort. In classical OT, the optimal strategy for transporting a measure to itself is the identity. In this paper, we tackle anomaly detection by forcing samples to displace their mass, while keeping the least-effort objective. We call this new transportation problem Mass Repulsing Optimal Transport (MROT). Naturally, samples lying in low-density regions of space will be forced to displace mass very far, incurring a higher transportation cost. We use these concepts to design a new anomaly score. Through a series of experiments on existing benchmarks and fault detection problems, we show that our algorithm improves over existing methods.
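
A rough sketch of the mass-repulsion idea using the POT library: make short displacements prohibitively expensive so each sample must ship its mass beyond a radius, then score samples by their share of the transport cost. The radius rule, cost choice, and score normalization here are assumptions, not the paper's exact construction.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def mrot_scores(X, radius):
    X = np.asarray(X, dtype=float)
    n = len(X)
    C = ot.dist(X, X)                 # pairwise squared Euclidean costs
    C[C < radius ** 2] = 1e9          # repel mass: forbid short displacements
    a = np.full(n, 1.0 / n)           # uniform empirical measure
    P = ot.emd(a, a, C)               # optimal plan transporting the measure to itself
    return n * (P * C).sum(axis=1)    # per-sample transport cost as anomaly score
```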

[AI-91] The Majority Vote Paradigm Shift: When Popular Meets Optimal

链接: https://arxiv.org/abs/2502.12581
作者: Antonio Purificato,Maria Sofia Bucarelli,Anil Kumar Nelakanti,Andrea Bacciu,Fabrizio Silvestri,Amin Mantrach
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 33 pages, 7 figures

点击查看摘要

Abstract:Reliably labelling data typically requires annotations from multiple human workers. However, humans are far from perfect. Hence, it is common practice to aggregate labels gathered from multiple annotators to make a more confident estimate of the true label. Among many aggregation methods, the simple and well-known Majority Vote (MV) selects the class label polling the highest number of votes. However, despite its importance, the optimality of MV's label aggregation has not been extensively studied. We address this gap in our work by characterising the conditions under which MV achieves the theoretically optimal lower bound on label estimation error. Our results capture the tolerable limits on annotation noise under which MV can optimally recover labels for a given class distribution. This certificate of optimality provides a more principled approach to model selection for label aggregation, as an alternative to otherwise inefficient practices that sometimes rely on senior experts, gold labels, etc., all of which are marred by the same human uncertainty despite huge time and monetary costs. Experiments on both synthetic and real-world data corroborate our theoretical findings.
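
The MV aggregator under study is the simple rule below; the paper's contribution is a characterization of when this rule matches the optimal label-estimation error, not the rule itself.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with the most votes for one item (ties broken arbitrarily)."""
    return Counter(votes).most_common(1)[0][0]

# e.g. three annotators label one item:
assert majority_vote(["cat", "dog", "cat"]) == "cat"
```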

[AI-92] A Comprehensive Survey on Generative AI for Video-to-Music Generation

链接: https://arxiv.org/abs/2502.12489
作者: Shulei Ji,Songruoyao Wu,Zihao Wang,Shuyu Li,Kejun Zhang
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:The burgeoning growth of video-to-music generation can be attributed to the ascendancy of multimodal generative models. However, there is a lack of literature that comprehensively combs through the work in this field. To fill this gap, this paper presents a comprehensive review of video-to-music generation using deep generative AI techniques, focusing on three key components: visual feature extraction, music generation frameworks, and conditioning mechanisms. We categorize existing approaches based on their designs for each component, clarifying the roles of different strategies. Preceding this, we provide a fine-grained classification of video and music modalities, illustrating how different categories influence the design of components within the generation pipelines. Furthermore, we summarize available multimodal datasets and evaluation metrics while highlighting ongoing challenges in the field.

[AI-93] Time Series Treatment Effects Analysis with Always-Missing Controls

链接: https://arxiv.org/abs/2502.12393
作者: Juan Shu,Qiyu Han,George Chen,Xihao Cao,Kangming Luo,Dan Pallotta,Shivam Agrawal,Yuping Lu,Xiaoyu Zhang,Jawad Mansoor,Jyoti Anand
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating treatment effects in time series data presents a significant challenge, especially when the control group is always unobservable. For example, in analyzing the effects of Christmas on retail sales, we lack direct observation of what would have occurred in late December without the Christmas impact. To address this, we try to recover the control group in the event period while accounting for confounders and temporal dependencies. Experimental results on the M5 Walmart retail sales data demonstrate robust estimation of the potential outcomes of the control group as well as accurate prediction of the holiday effect. Furthermore, we provide theoretical guarantees for the estimated treatment effect, proving its consistency and asymptotic normality. The proposed methodology is applicable not only to this always-missing-control scenario but also to other conventional time series causal inference settings.
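
A drastically simplified sketch of the always-missing-control setup: fit an outcome model on non-event periods, predict the counterfactual during the event, and take the difference. The paper's estimator additionally handles confounders and temporal dependence, which this toy version ignores; all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def estimate_event_effect(y, X, event_mask):
    """y: observed outcome series; X: covariates; event_mask: True during the event.
    Returns observed-minus-counterfactual differences over the event window."""
    model = LinearRegression().fit(X[~event_mask], y[~event_mask])
    y_cf = model.predict(X[event_mask])   # recovered 'control group' outcomes
    return y[event_mask] - y_cf           # estimated per-period treatment effect
```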

[AI-94] Bridging the Data Gap in AI Reliability Research and Establishing DR-AIR, a Comprehensive Data Repository for AI Reliability

链接: https://arxiv.org/abs/2502.12386
作者: Simin Zheng,Jared M. Clark,Fatemeh Salboukh,Priscila Silva,Karen da Mata,Fenglian Pan,Jie Min,Jiayi Lian,Caleb B. King,Lance Fiondella,Jian Liu,Xinwei Deng,Yili Hong
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
*备注: 34 pages, 12 figures

点击查看摘要

Abstract:Artificial intelligence (AI) technology and systems have been advancing rapidly. However, ensuring the reliability of these systems is crucial for fostering public confidence in their use. This necessitates the modeling and analysis of reliability data specific to AI systems. A major challenge in AI reliability research, particularly for those in academia, is the lack of readily available AI reliability data. To address this gap, this paper focuses on conducting a comprehensive review of available AI reliability data and establishing DR-AIR: a data repository for AI reliability. Specifically, we introduce key measurements and data types for assessing AI reliability, along with the methodologies used to collect these data. We also provide a detailed description of the currently available datasets with illustrative examples. Furthermore, we outline the setup of the DR-AIR repository and demonstrate its practical applications. This repository provides easy access to datasets specifically curated for AI reliability research. We believe these efforts will significantly benefit the AI research community by facilitating access to valuable reliability data and promoting collaboration across various academic domains within AI. We conclude our paper with a call to action, encouraging the research community to contribute and share AI reliability data to further advance this critical field of study.

[AI-95] Learning Plasma Dynamics and Robust Rampdown Trajectories with Predict-First Experiments at TCV

链接: https://arxiv.org/abs/2502.12327
作者: Allen M. Wang,Alessandro Pau,Cristina Rea,Oswin So,Charles Dawson,Olivier Sauter,Mark D. Boyer,Anna Vu,Cristian Galperti,Chuchu Fan,Antoine Merle,Yoeri Poels,Cristina Venturini,Stefano Marchioni, theTCV Team
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The rampdown in tokamak operations is a difficult-to-simulate phase during which the plasma is often pushed towards multiple instability limits. To address this challenge, and reduce the risk of disrupting operations, we leverage recent advances in Scientific Machine Learning (SciML) to develop a neural state-space model (NSSM) that predicts plasma dynamics during Tokamak à Configuration Variable (TCV) rampdowns. By integrating simple physics structure and data-driven models, the NSSM efficiently learns plasma dynamics during the rampdown from a modest dataset of 311 pulses with only five pulses in the reactor-relevant high-performance regime. The NSSM is parallelized across uncertainties, and reinforcement learning (RL) is applied to design trajectories that avoid multiple instability limits with high probability. Experiments at TCV ramping down high-performance plasmas show statistically significant improvements in current and energy at plasma termination, with improvements in speed through continuous re-training. A predict-first experiment, increasing plasma current by 20% from baseline, demonstrates the NSSM's ability to make small extrapolations with sufficient accuracy to design trajectories that successfully terminate the pulse. The developed approach paves the way for designing tokamak controls with robustness to considerable uncertainty, and demonstrates the relevance of the SciML approach to learning plasma dynamics for rapidly developing robust trajectories and controls during the incremental campaigns of upcoming burning plasma tokamaks.

[AI-96] Suboptimal Shapley Value Explanations

链接: https://arxiv.org/abs/2502.12209
作者: Xiaolei Lu
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have demonstrated strong capacity in supporting a wide variety of applications. Shapley value has emerged as a prominent tool to analyze feature importance to help people understand the inference process of deep neural models. Computing the Shapley value function requires choosing a baseline to represent a feature's missingness. However, existing random and conditional baselines could negatively influence the explanation. In this paper, by analyzing the suboptimality of different baselines, we identify the problematic baseline where the asymmetric interaction between \bm{x}'_i (the replacement of the faithful influential feature) and other features has significant directional bias toward the model's output, and conclude that p(y|\bm{x}'_i) = p(y) potentially minimizes the asymmetric interaction involving \bm{x}'_i. We further generalize the uninformativeness of \bm{x}'_i toward the label space L to avoid estimating p(y) and design a simple uncertainty-based reweighting mechanism to accelerate the computation process. We conduct experiments on various NLP tasks and our quantitative analysis demonstrates the effectiveness of the proposed uncertainty-based reweighting mechanism. Furthermore, by measuring the consistency of explanations generated by explainable methods and humans, we highlight the disparity between model inference and human understanding.
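
To see where the baseline x'_i enters, here is a textbook exact Shapley computation in which a feature's absence is simulated by substituting a baseline value, the design choice whose suboptimality the paper analyzes. It is exponential in the number of features, so it is illustrative only; the paper's reweighting mechanism is not shown.

```python
import itertools, math
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at instance x (1-D arrays), where a
    feature's 'missingness' is represented by the corresponding baseline value."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    d = len(x)
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(len(others) + 1):
            for S in itertools.combinations(others, k):
                w = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
                z_without = baseline.copy()
                z_without[list(S)] = x[list(S)]   # features in S are 'present'
                z_with = z_without.copy()
                z_with[i] = x[i]                  # additionally make feature i present
                phi[i] += w * (f(z_with) - f(z_without))
    return phi
```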

[AI-97] Towards Transparent and Accurate Plasma State Monitoring at JET

链接: https://arxiv.org/abs/2502.12182
作者: Andrin Bürli,Alessandro Pau,Thomas Koller,Olivier Sauter,JET Contributors
类目: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Controlling and monitoring plasma within a tokamak device is complex and challenging. Plasma off-normal events, such as disruptions, are hindering steady-state operation. For large devices, they can even endanger the machine’s integrity and it represents in general one of the most serious concerns for the exploitation of the tokamak concept for future power plants. Effective plasma state monitoring carries the potential to enable an understanding of such phenomena and their evolution which is crucial for the successful operation of tokamaks. This paper presents the application of a transparent and data-driven methodology to monitor the plasma state in a tokamak. Compared to previous studies in the field, supervised and unsupervised learning techniques are combined. The dataset consisted of 520 expert-validated discharges from JET. The goal was to provide an interpretable plasma state representation for the JET operational space by leveraging multi-task learning for the first time in the context of plasma state monitoring. When evaluated as disruption predictors, a sequence-based approach showed significant improvements compared to the state-based models. The best resulting network achieved a promising cross-validated success rate when combined with a physical indicator and accounting for nearby instabilities. Qualitative evaluations of the learned latent space uncovered operational and disruptive regions as well as patterns related to learned dynamics and global feature importance. The applied methodology provides novel possibilities for the definition of triggers to switch between different control scenarios, data analysis, and learning as well as exploring latent dynamics for plasma state monitoring. It also showed promising quantitative and qualitative results with warning times suitable for avoidance purposes and distributions that are consistent with known physical mechanisms.

[AI-98] Integrating Artificial Intelligence and Geophysical Insights for Earthquake Forecasting: A Cross-Disciplinary Review

链接: https://arxiv.org/abs/2502.12161
作者: Zhang Ying,Wen Congcong,Sornette Didier,Zhan Chengxiang
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Earthquake forecasting remains a significant scientific challenge, with current methods falling short of achieving the performance necessary for meaningful societal benefits. Traditional models, primarily based on past seismicity and geomechanical data, struggle to capture the complexity of seismic patterns and often overlook valuable non-seismic precursors such as geophysical, geochemical, and atmospheric anomalies. The integration of such diverse data sources into forecasting models, combined with advancements in AI technologies, offers a promising path forward. AI methods, particularly deep learning, excel at processing complex, large-scale datasets, identifying subtle patterns, and handling multidimensional relationships, making them well-suited for overcoming the limitations of conventional approaches. This review highlights the importance of combining AI with geophysical knowledge to create robust, physics-informed forecasting models. It explores current AI methods, input data types, loss functions, and practical considerations for model development, offering guidance to both geophysicists and AI researchers. While many AI-based studies oversimplify earthquake prediction, neglecting critical features such as data imbalance and spatio-temporal clustering, the integration of specialized geophysical insights into AI models can address these shortcomings. We emphasize the importance of interdisciplinary collaboration, urging geophysicists to experiment with AI architectures thoughtfully and encouraging AI experts to deepen their understanding of seismology. By bridging these disciplines, we can develop more accurate, reliable, and societally impactful earthquake forecasting tools.

机器学习

[LG-0] RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations

链接: https://arxiv.org/abs/2502.13134
作者: Jingxiao Chen,Xinyao Li,Jiahang Cao,Zhengbang Zhu,Wentao Dong,Minghuan Liu,Ying Wen,Yong Yu,Liqing Zhang,Weinan Zhang
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Humanoid robots have shown success in locomotion and manipulation. Despite these basic abilities, humanoids are still required to quickly understand human instructions and react based on human interaction signals to become valuable assistants in human daily life. Unfortunately, most existing works only focus on multi-stage interactions, treating each task separately, and neglecting real-time feedback. In this work, we aim to empower humanoid robots with real-time reaction abilities to achieve various tasks, allowing humans to interrupt robots at any time, and making robots respond to humans immediately. To support such abilities, we propose a general humanoid-human-object interaction framework, named RHINO, i.e., Real-time Humanoid-human Interaction and Object manipulation. RHINO provides a unified view of reactive motion, instruction-based manipulation, and safety concerns, over multiple human signal modalities, such as languages, images, and motions. RHINO is a hierarchical learning framework, enabling humanoids to learn reaction skills from human-human-object demonstrations and teleoperation data. In particular, it decouples the interaction process into two levels: 1) a high-level planner inferring human intentions from real-time human behaviors; and 2) a low-level controller achieving reactive motion behaviors and object manipulation skills based on the predicted intentions. We evaluate the proposed framework on a real humanoid robot and demonstrate its effectiveness, flexibility, and safety in various scenarios.

[LG-1] Constrained Online Convex Optimization with Polyak Feasibility Steps

链接: https://arxiv.org/abs/2502.13112
作者: Spencer Hutchinson,Mahnoosh Alizadeh
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 20 pages, 2 figures

点击查看摘要

Abstract:In this work, we study online convex optimization with a fixed constraint function g: \mathbb{R}^d \rightarrow \mathbb{R}. Prior work on this problem has shown O(\sqrt{T}) regret and cumulative constraint satisfaction \sum_{t=1}^{T} g(x_t) \leq 0, while only accessing the constraint value and subgradient at the played actions, g(x_t) and \partial g(x_t). Using the same constraint information, we show a stronger guarantee of anytime constraint satisfaction, g(x_t) \leq 0 for all t \in [T], and matching O(\sqrt{T}) regret guarantees. These contributions are thanks to our approach of using Polyak feasibility steps to ensure constraint satisfaction, without sacrificing regret. Specifically, after each step of online gradient descent, our algorithm applies a subgradient descent step on the constraint function where the step size is chosen according to the celebrated Polyak step size. We further validate this approach with numerical experiments.
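
A minimal sketch of the scheme as described in the abstract: an OGD step on the loss followed by a Polyak-step-size subgradient step on the constraint. Step-size tuning and any projections from the paper are omitted; `grad_f_t` is an assumed callback supplying the round-t loss subgradient.

```python
import numpy as np

def ogd_polyak(grad_f_t, g, grad_g, x0, eta, T):
    """Online gradient descent with a Polyak feasibility step on constraint g.
    After each OGD update, if g(x) > 0 the iterate is pulled back toward the
    feasible set {g <= 0} with the Polyak step size g(x) / ||grad g(x)||^2."""
    x = np.array(x0, dtype=float)
    trajectory = []
    for t in range(T):
        x = x - eta * grad_f_t(t, x)              # OGD step on the round-t loss
        gx, dg = g(x), grad_g(x)
        if gx > 0:                                # Polyak feasibility step
            x = x - (gx / (dg @ dg + 1e-12)) * dg
        trajectory.append(x.copy())
    return trajectory
```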

[LG-2] MLPs at the EOC: Dynamics of Feature Learning

链接: https://arxiv.org/abs/2502.13110
作者: Dávid Terjék
类目: Machine Learning (cs.LG)
*备注: 15 pages, 2 figures

点击查看摘要

Abstract:Since infinitely wide neural networks in the kernel regime are random feature models, the success of contemporary deep learning lies in the rich regime, where a satisfying theory should explain not only the convergence of gradient descent but the learning of features along the way. Such a theory should also cover phenomena observed by practitioners, including the Edge of Stability (EOS) and the catapult mechanism. For a practically relevant theory in the limit, neural network parameterizations have to efficiently reproduce limiting behavior as width and depth are scaled up. While widthwise scaling is mostly settled, depthwise scaling is solved only at initialization by the Edge of Chaos (EOC). During training, scaling up depth is either done by inversely scaling the learning rate or adding residual connections. We propose (1) the Normalized Update Parameterization (\nu P) to solve this issue by growing hidden layer sizes depthwise, inducing the regularized evolution of preactivations, (2) a hypothetical explanation for feature learning via the cosine of new and cumulative parameter updates, and (3) a geometry-aware learning rate schedule that is able to prolong the catapult phase indefinitely. We support our hypotheses and demonstrate the usefulness of \nu P and the learning rate schedule with empirical evidence.

[LG-3] Enhanced uncertainty quantification variational autoencoders for the solution of Bayesian inverse problems

链接: https://arxiv.org/abs/2502.13105
作者: Andrea Tonini,Luca Dede’
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Among other uses, neural networks are a powerful tool for solving deterministic and Bayesian inverse problems in real time. In the Bayesian framework, variational autoencoders, a specialized type of neural network, enable the estimation of model parameters and their distribution from observational data, allowing real-time inverse uncertainty quantification. In this work, we build upon existing research [Goh, H. et al., Proceedings of Machine Learning Research, 2022] by proposing a novel loss function to train variational autoencoders for Bayesian inverse problems. When the forward map is affine, we provide a theoretical proof of the convergence of the latent states of variational autoencoders to the posterior distribution of the model parameters. We validate this theoretical result through numerical tests and compare the proposed variational autoencoder with the existing one in the literature. Finally, we test the proposed variational autoencoder on the Laplace equation.

[LG-4] tn4ml: Tensor Network Training and Customization for Machine Learning

链接: https://arxiv.org/abs/2502.13090
作者: Ema Puljak,Sergio Sanchez-Ramirez,Sergi Masot-Llima,Jofre Vallès-Muns,Artur Garcia-Saez,Maurizio Pierini
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Tensor Networks have emerged as a prominent alternative to neural networks for addressing Machine Learning challenges in foundational sciences, paving the way for their applications to real-life problems. This paper introduces tn4ml, a novel library designed to seamlessly integrate Tensor Networks into optimization pipelines for Machine Learning tasks. Inspired by existing Machine Learning frameworks, the library offers a user-friendly structure with modules for data embedding, objective function definition, and model training using diverse optimization strategies. We demonstrate its versatility through two examples: supervised learning on tabular data and unsupervised learning on an image dataset. Additionally, we analyze how customizing the parts of the Machine Learning pipeline for Tensor Networks influences performance metrics.

[LG-5] k-Graph: A Graph Embedding for Interpretable Time Series Clustering

链接: https://arxiv.org/abs/2502.13049
作者: Paul Boniol,Donato Tiano,Angela Bonifati,Themis Palpanas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series clustering poses a significant challenge with diverse applications across domains. A prominent drawback of existing solutions lies in their limited interpretability, often confined to presenting users with centroids. In addressing this gap, our work presents k-Graph, an unsupervised method explicitly crafted to augment interpretability in time series clustering. Leveraging a graph representation of time series subsequences, k-Graph constructs multiple graph representations based on different subsequence lengths. This feature accommodates variable-length time series without requiring users to predetermine subsequence lengths. Our experimental results reveal that k-Graph outperforms current state-of-the-art time series clustering algorithms in accuracy, while providing users with meaningful explanations and interpretations of the clustering outcomes.

[LG-6] Fragility-aware Classification for Understanding Risk and Improving Generalization

链接: https://arxiv.org/abs/2502.13024
作者: Chen Yang,Zheng Cui,Daniel Zhuoyu Long,Jin Qi,Ruohan Zhan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Classification models play a critical role in data-driven decision-making applications such as medical diagnosis, user profiling, recommendation systems, and default detection. Traditional performance metrics, such as accuracy, focus on overall error rates but fail to account for the confidence of incorrect predictions, thereby overlooking the risk of confident misjudgments. This risk is particularly significant in cost-sensitive and safety-critical domains like medical diagnosis and autonomous driving, where overconfident false predictions may cause severe consequences. To address this issue, we introduce the Fragility Index (FI), a novel metric that evaluates classification performance from a risk-averse perspective by explicitly capturing the tail risk of confident misjudgments. To enhance generalizability, we define FI within the robust satisficing (RS) framework, incorporating data uncertainty. We further develop a model training approach that optimizes FI while maintaining tractability for common loss functions. Specifically, we derive exact reformulations for cross-entropy loss, hinge-type loss, and Lipschitz loss, and extend the approach to deep learning models. Through synthetic experiments and real-world medical diagnosis tasks, we demonstrate that FI effectively identifies misjudgment risk and FI-based training improves model robustness and generalizability. Finally, we extend our framework to deep neural network training, further validating its effectiveness in enhancing deep learning models.
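
As rough intuition for the tail risk FI captures (not the paper's robust-satisficing definition, which is formulated over data uncertainty), one can look at the upper tail of the model's confidence on misclassified samples, CVaR-style:

```python
import numpy as np

def confident_misjudgment_tail(conf, correct, beta=0.95):
    """Illustrative only: the mean confidence in the upper beta-tail of
    misclassified samples, a CVaR-like proxy for confident misjudgments.
    The paper's Fragility Index is defined differently (robust satisficing)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    wrong_conf = conf[~correct]
    if wrong_conf.size == 0:
        return 0.0
    q = np.quantile(wrong_conf, beta)
    return float(wrong_conf[wrong_conf >= q].mean())
```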

[LG-7] Efficient and Sharp Off-Policy Learning under Unobserved Confounding

链接: https://arxiv.org/abs/2502.13022
作者: Konstantin Hess,Dennis Frauen,Valentyn Melnychuk,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. Thereby, we address a key limitation of standard policy learning: standard policy learning assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated, in which case standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a statistically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is statistically efficient. (3) We prove that our estimator leads to the optimal confounding-robust policy. Finally, we extend our theory to the related task of policy improvement under unobserved confounding, i.e., when a baseline policy such as the standard of care is available. We show in experiments with synthetic and real-world data that our method outperforms simple plug-in approaches and existing baselines. Our method is highly relevant for decision-making where unobserved confounding can be problematic, such as in healthcare and public policy.

[LG-8] Edge-Colored Clustering in Hypergraphs: Beyond Minimizing Unsatisfied Edges

链接: https://arxiv.org/abs/2502.13000
作者: Alex Crane,Thomas Stanley,Blair D. Sullivan,Nate Veldt
类目: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a framework for clustering edge-colored hypergraphs, where the goal is to cluster (equivalently, to color) objects based on the primary type of multiway interactions they participate in. One well-studied objective is to color nodes to minimize the number of unsatisfied hyperedges – those containing one or more nodes whose color does not match the hyperedge color. We motivate and present advances for several directions that extend beyond this minimization problem. We first provide new algorithms for maximizing satisfied edges, which is the same at optimality but is much more challenging to approximate, with all prior work restricted to graphs. We develop the first approximation algorithm for hypergraphs, and then refine it to improve the best-known approximation factor for graphs. We then introduce new objective functions that incorporate notions of balance and fairness, and provide new hardness results, approximations, and fixed-parameter tractability results.
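
For concreteness, the satisfied-edge objective being maximized can be scored as below; the approximation algorithms themselves are the paper's contribution and are not reproduced here.

```python
def satisfied_edges(hyperedges, coloring):
    """hyperedges: iterable of (nodes, color) pairs; coloring: dict node -> color.
    A hyperedge is satisfied when every node it contains carries the edge's color."""
    return sum(all(coloring[v] == c for v in nodes) for nodes, c in hyperedges)

# e.g. two hyperedges over nodes {1, 2, 3}:
edges = [({1, 2}, "red"), ({2, 3}, "blue")]
print(satisfied_edges(edges, {1: "red", 2: "red", 3: "blue"}))  # 1
```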

[LG-9] Approximate Tree Completion and Learning-Augmented Algorithms for Metric Minimum Spanning Trees

链接: https://arxiv.org/abs/2502.12993
作者: Nate Veldt,Thomas Stanley,Benjamin W. Priest,Trevor Steil,Keita Iwabuchi,T.S. Jayram,Geoffrey Sanders
类目: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finding a minimum spanning tree (MST) for n points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but this takes \Omega(n^2) time to even approximate. We introduce a framework for metric MSTs that first (1) finds a forest of disconnected components using practical heuristics, and then (2) finds a small weight set of edges to connect disjoint components of the forest into a spanning tree. We prove that optimally solving the second step still takes \Omega(n^2) time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
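
The exact but quadratic version of step (2) can be phrased as an MST over contracted components, as sketched below with SciPy; the paper's contribution is a subquadratic 2.62-approximation of this step, which is not shown.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def connect_forest(points, labels):
    """Contract each forest component to a super-node whose pairwise distance is
    the minimum point-to-point distance between components, then take an MST of
    the contracted graph. Exact, but quadratic in the worst case."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    comps = np.unique(labels)
    k = len(comps)
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d = cdist(points[labels == comps[i]], points[labels == comps[j]]).min()
            D[i, j] = D[j, i] = d
    return minimum_spanning_tree(D)  # sparse matrix of connecting edges
```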

[LG-10] Ensemble Kalman filter in latent space using a variational autoencoder pair

链接: https://arxiv.org/abs/2502.12987
作者: Ivo Pasmans,Yumeng Chen,Tobias Sebastian Finn,Marc Bocquet,Alberto Carrassi
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Popular (ensemble) Kalman filter data assimilation (DA) approaches assume that the errors in both the a priori estimate of the state and those in the observations are Gaussian. For constrained variables, e.g. sea ice concentration or stress, such an assumption does not hold. The variational autoencoder (VAE) is a machine learning (ML) technique that allows mapping an arbitrary distribution to/from a latent space in which the distribution is supposedly closer to a Gaussian. We propose a novel hybrid DA-ML approach in which VAEs are incorporated in the DA procedure. Specifically, we introduce a variant of the popular ensemble transform Kalman filter (ETKF) in which the analysis is applied in the latent space of a single VAE or a pair of VAEs. In twin experiments with a simple circular model, whereby the circle represents an underlying submanifold to be respected, we find that the use of a VAE ensures that a posteriori ensemble members lie close to the manifold containing the truth. Furthermore, online updating of the VAE is necessary and achievable when this manifold varies in time, i.e. when it is non-stationary. We demonstrate that introducing an additional second latent space for the observational innovations improves robustness against detrimental effects of non-Gaussianity and bias in the observational errors, but it slightly lessens the performance if observational errors are strictly Gaussian.

[LG-11] Towards Variational Flow Matching on General Geometries

链接: https://arxiv.org/abs/2502.12981
作者: Olga Zaghen,Floor Eijkelboom,Alison Pouplin,Erik J. Bekkers
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注:

点击查看摘要

Abstract:We introduce Riemannian Gaussian Variational Flow Matching (RG-VFM), an extension of Variational Flow Matching (VFM) that leverages Riemannian Gaussian distributions for generative modeling on structured manifolds. We derive a variational objective for probability flows on manifolds with closed-form geodesics, making RG-VFM comparable to, though fundamentally different from, Riemannian Flow Matching (RFM) in this geometric setting. Experiments on a checkerboard dataset wrapped on the sphere demonstrate that RG-VFM captures geometric structure more effectively than Euclidean VFM and baseline methods, establishing it as a robust framework for manifold-aware generative modeling.

[LG-12] Electron flow matching for generative reaction mechanism prediction obeying conservation laws

链接: https://arxiv.org/abs/2502.12979
作者: Joonyoung F. Joung,Mun Hong Fong,Nicholas Casetti,Jordan P. Liles,Ne S. Dassanayake,Connor W. Coley
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Central to our understanding of chemical reactivity is the principle of mass conservation, which is fundamental for ensuring physical consistency, balancing equations, and guiding reaction design. However, data-driven computational models for tasks such as reaction product prediction rarely abide by this most basic constraint. In this work, we recast the problem of reaction prediction as a problem of electron redistribution using the modern deep generative framework of flow matching. Our model, FlowER, overcomes limitations inherent in previous approaches by enforcing exact mass conservation, thereby resolving hallucinatory failure modes, recovering mechanistic reaction sequences for unseen substrate scaffolds, and generalizing effectively to out-of-domain reaction classes with extremely data-efficient fine-tuning. FlowER additionally enables estimation of thermodynamic or kinetic feasibility and manifests a degree of chemical intuition in reaction prediction tasks. This inherently interpretable framework represents a significant step in bridging the gap between predictive accuracy and mechanistic understanding in data-driven reaction outcome prediction.

[LG-13] Does Training with Synthetic Data Truly Protect Privacy? ICLR2025

链接: https://arxiv.org/abs/2502.12976
作者: Yunpeng Zhao,Jie Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to ICLR 2025

点击查看摘要

Abstract:As synthetic data becomes increasingly popular in machine learning tasks, numerous methods–without formal differential privacy guarantees–use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

[LG-14] Preventing the Popular Item Embedding Based Attack in Federated Recommendations ICDE2024

链接: https://arxiv.org/abs/2502.12958
作者: Jun Zhang,Huan Li,Dazhong Rong,Yan Zhao,Ke Chen,Lidan Shou
类目: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
*备注: Accepted at ICDE 2024, Extension

点击查看摘要

Abstract:Privacy concerns have led to the rise of federated recommender systems (FRS), which can create personalized models across distributed clients. However, FRS is vulnerable to poisoning attacks, where malicious users manipulate gradients to promote their target items intentionally. Existing attacks against FRS have limitations, as they depend on specific models and prior knowledge, restricting their real-world applicability. In our exploration of practical FRS vulnerabilities, we devise a model-agnostic and prior-knowledge-free attack, named PIECK (Popular Item Embedding based Attack). The core module of PIECK is popular item mining, which leverages embedding changes during FRS training to effectively identify the popular items. Built upon the core module, PIECK branches into two diverse solutions: The PIECKIPE solution employs an item popularity enhancement module, which aligns the embeddings of targeted items with the mined popular items to increase item exposure. The PIECKUEA further enhances the robustness of the attack by using a user embedding approximation module, which approximates private user embeddings using mined popular items. Upon identifying PIECK, we evaluate existing federated defense methods and find them ineffective against PIECK, as poisonous gradients inevitably overwhelm the cold target items. We then propose a novel defense method by introducing two regularization terms during user training, which constrain item popularity enhancement and user embedding approximation while preserving FRS performance. We evaluate PIECK and its defense across two base models, three real datasets, four top-tier attacks, and six general defense methods, affirming the efficacy of both PIECK and its defense.

[LG-15] Guaranteed Conditional Diffusion: 3D Block-based Models for Scientific Data Compression

链接: https://arxiv.org/abs/2502.12951
作者: Jaemoon Lee,Xiao Li,Liangji Zhu,Sanjay Ranka,Anand Rangarajan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes a new compression paradigm, Guaranteed Conditional Diffusion with Tensor Correction (GCDTC), for lossy scientific data compression. The framework is based on recent conditional diffusion (CD) generative models, and it consists of a conditional diffusion model, tensor correction, and error guarantee. Our diffusion model is a mixture of 3D conditioning and 2D denoising U-Net. The approach leverages a 3D block-based compressing module to address spatiotemporal correlations in structured scientific data. Then, the reverse diffusion process for 2D spatial data is conditioned on the "slices" of content latent variables produced by the compressing module. After training, the denoising decoder reconstructs the data with zero noise and content latent variables, and thus it is entirely deterministic. The reconstructed outputs of the CD model are further post-processed by our tensor correction and error guarantee steps to control and ensure a maximum error distortion, which is an inevitable requirement in lossy scientific data compression. Our experiments involving two datasets generated by climate and chemical combustion simulations show that our framework outperforms standard convolutional autoencoders and yields competitive compression quality with an existing scientific data compression algorithm.

[LG-16] Efficient Learning Under Density Shift in Incremental Settings Using Cramér-Rao-Based Regularization

链接: https://arxiv.org/abs/2502.12949
作者: Behraj Khan,Behroz Mirza,Nouman Durrani,Tahir Syed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The continuous surge in data volume and velocity is often dealt with using data orchestration and distributed processing approaches, abstracting away the machine learning challenges that exist at the algorithmic level. With growing interest in automating the learning loop, training with data that arrive in a sequence rather than in the classical in-memory training data form will face a machine learning challenge because of evolving feature distributions across batches of training data biasing the cross-validation step (Sugiyama et al., 2012). This work takes a distributed density estimation angle to the problem where data are temporally distributed. It processes data in batches and allows a neural network to treat a batch as training data. The method accumulates knowledge about the data density via posterior probability absorption using the Fisher Information Matrix, which contains information about the local optimization gradients for the batch. This is then used as a regularizer for the loss in the following batch, and therefore the density estimate for the entire dataset constructively gets more robust to the non-iid distribution shift. This needs the presence of only a pair of batches in memory at a time, so the space cost is not a function of the size of the complete, distributed dataset. We propose a novel regularization-based approach, Covariate Shift Correction (C^2A), that leverages Fisher information and Kullback-Leibler divergence to adapt to both natural and sequential covariate shift caused by dataset fragmentation. C^2A achieves up to 19% higher accuracy than state-of-the-art methods.
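
An EWC-style sketch of the Fisher-anchored regularization described here: estimate a diagonal Fisher on the previous batch, then penalize movement away from the previously learned parameters while training on the next batch. The KL-divergence component of C^2A and its exact weighting are omitted; this shows only the Fisher-penalty mechanism.

```python
import torch

def fisher_diagonal(model, loss_fn, batch):
    """Diagonal Fisher information approximated by squared gradients on a batch."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    return {n: p.grad.detach() ** 2 for n, p in model.named_parameters()
            if p.grad is not None}

def regularized_loss(model, loss_fn, batch, fisher, prev_params, lam=1.0):
    """Current-batch loss plus a Fisher-weighted penalty anchoring parameters
    to the values learned on the previous batch (EWC-style)."""
    penalty = sum((fisher[n] * (p - prev_params[n]) ** 2).sum()
                  for n, p in model.named_parameters() if n in fisher)
    return loss_fn(model, batch) + lam * penalty
```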

[LG-17] Performance of Zero-Shot Time Series Foundation Models on Cloud Data

链接: https://arxiv.org/abs/2502.12944
作者: William Toner,Thomas L. Lee,Artjom Joosen,Rajkarn Singh,Martin Asenov
类目: Machine Learning (cs.LG)
*备注: 5 pages, Preprint

点击查看摘要

Abstract:Time series foundation models (FMs) have emerged as a popular paradigm for zero-shot multi-domain forecasting. FMs are trained on numerous diverse datasets and claim to be effective forecasters across multiple different time series domains, including cloud data. In this work we investigate this claim, exploring the effectiveness of FMs on cloud data. We demonstrate that many well-known FMs fail to generate meaningful or accurate zero-shot forecasts in this setting. We support this claim empirically, showing that FMs are outperformed consistently by simple linear baselines. We also illustrate a number of interesting pathologies, including instances where FMs suddenly output seemingly erratic, random-looking forecasts. Our results suggest a widespread failure of FMs to model cloud data.

[LG-18] Tuning Algorithmic and Architectural Hyperparameters in Graph-Based Semi-Supervised Learning with Provable Guarantees

链接: https://arxiv.org/abs/2502.12937
作者: Ally Yalei Du,Eric Huang,Dravyansh Sharma
类目: Machine Learning (cs.LG)
*备注: 31 pages (11 pages main body), 2 figures

点击查看摘要

Abstract:Graph-based semi-supervised learning is a powerful paradigm in machine learning for modeling and exploiting the underlying graph structure that captures the relationship between labeled and unlabeled data. A large number of classical as well as modern deep learning based algorithms have been proposed for this problem, often having tunable hyperparameters. We initiate a formal study of tuning algorithm hyperparameters from parameterized algorithm families for this problem. We obtain novel O(\log n) pseudo-dimension upper bounds for hyperparameter selection in three classical label propagation-based algorithm families, where n is the number of nodes, implying bounds on the amount of data needed for learning provably good parameters. We further provide matching \Omega(\log n) pseudo-dimension lower bounds, thus asymptotically characterizing the learning-theoretic complexity of the parameter tuning problem. We extend our study to selecting architectural hyperparameters in modern graph neural networks. We bound the Rademacher complexity for tuning the self-loop weighting in recently proposed Simplified Graph Convolution (SGC) networks. We further propose a tunable architecture that interpolates graph convolutional neural networks (GCN) and graph attention networks (GAT) in every layer, and provide Rademacher complexity bounds for tuning the interpolation coefficient.
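
The tunable self-loop weighting in SGC mentioned here amounts to a one-parameter family of propagation matrices, sketched below; alpha = 1 recovers standard SGC, and the paper studies the sample complexity of tuning such parameters (the GCN/GAT interpolation is not shown).

```python
import numpy as np

def sgc_features(A, X, k=2, alpha=1.0):
    """Simplified Graph Convolution with tunable self-loop weight alpha:
    S = D^{-1/2} (A + alpha * I) D^{-1/2}, features = S^k X (no nonlinearity)."""
    A = np.asarray(A, dtype=float)
    A_hat = A + alpha * np.eye(A.shape[0])
    d = np.maximum(A_hat.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt
    out = np.asarray(X, dtype=float)
    for _ in range(k):
        out = S @ out        # k propagation steps
    return out
```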

[LG-19] Universal Embedding Function for Traffic Classification via QUIC Domain Recognition Pretraining: A Transfer Learning Success

链接: https://arxiv.org/abs/2502.12930
作者: Jan Luxemburk,Karel Hynek,Richard Plný,Tomáš Čejka
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Encrypted traffic classification (TC) methods must adapt to new protocols and extensions as well as to advancements in other machine learning fields. In this paper, we follow a transfer learning setup best known from computer vision. We first pretrain an embedding model on a complex task with a large number of classes and then transfer it to five well-known TC datasets. The pretraining task is recognition of SNI domains in encrypted QUIC traffic, which in itself is a problem for network monitoring due to the growing adoption of TLS Encrypted Client Hello. Our training pipeline – featuring a disjoint class setup, ArcFace loss function, and a modern deep learning architecture – aims to produce universal embeddings applicable across tasks. The proposed solution, based on nearest neighbors search in the embedding space, surpasses SOTA performance on four of the five TC datasets. A comparison with a baseline method utilizing raw packet sequences revealed unexpected findings with potential implications for the broader TC field. We published the model architecture, trained weights, and transfer learning experiments.

[LG-20] Lightweight Online Adaption for Time Series Foundation Model Forecasts

链接: https://arxiv.org/abs/2502.12920
作者: Thomas L. Lee,William Toner,Rajkarn Singh,Artjom Joosem,Martin Asenov
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8 pages, Preprint

点击查看摘要

Abstract:Foundation models (FMs) have emerged as a promising approach for time series forecasting. While effective, FMs typically remain fixed during deployment due to the high computational costs of learning them online. Consequently, deployed FMs fail to adapt their forecasts to current data characteristics, despite the availability of online feedback from newly arriving data. This raises the question of whether FM performance can be enhanced by the efficient usage of this feedback. We propose AdapTS to answer this question. AdapTS is a lightweight mechanism for the online adaption of FM forecasts in response to online feedback. AdapTS consists of two parts: a) the AdapTS-Forecaster which is used to learn the current data distribution; and b) the AdapTS-Weighter which is used to combine the forecasts of the FM and the AdapTS-Forecaster. We evaluate the performance of AdapTS in conjunction with several recent FMs across a suite of standard time series datasets. In all of our experiments we find that using AdapTS improves performance. This work demonstrates how efficient usage of online feedback can be used to improve FM forecasts.
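
To illustrate the weighting idea, here is a hedged sketch that combines a frozen FM's forecast with a lightweight online forecaster, shifting weight toward whichever has been more accurate recently. The exponentiated-gradient update and all names are illustrative assumptions, not the AdapTS-Weighter's actual rule.

```python
import numpy as np

class ForecastCombiner:
    """Maintain an online convex weight between a frozen FM forecast and a
    lightweight online forecaster, updated from realized forecast errors."""
    def __init__(self, eta=0.1):
        self.w = 0.5          # weight on the FM forecast
        self.eta = eta

    def combine(self, fm_pred, lw_pred):
        return self.w * fm_pred + (1.0 - self.w) * lw_pred

    def update(self, fm_pred, lw_pred, actual):
        # Exponentiated-gradient style update: down-weight the worse forecaster.
        e_fm = np.mean((fm_pred - actual) ** 2)
        e_lw = np.mean((lw_pred - actual) ** 2)
        a = self.w * np.exp(-self.eta * e_fm)
        b = (1.0 - self.w) * np.exp(-self.eta * e_lw)
        self.w = a / (a + b)
```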

[LG-21] A Smooth Transition Between Induction and Deduction: Fast Abductive Learning Based on Probabilistic Symbol Perception

链接: https://arxiv.org/abs/2502.12919
作者: Lin-Han Jia,Si-Yu Han,Lan-Zhe Guo,Zhi Zhou,Zhao-Long Li,Yu-Feng Li,Zhi-Hua Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Abductive learning (ABL), which integrates the strengths of machine learning and logical reasoning to improve learning generalization, has recently been shown to be effective. However, its efficiency is affected by the transition between numerical induction and symbolic deduction, leading to high computational costs in the worst-case scenario. Efforts on this issue remain limited. In this paper, we identify three reasons why previous optimization algorithms for ABL were not effective: insufficient utilization of predictions, symbol relationships, and accumulated experience from successful abductive processes, resulting in redundant queries to the knowledge base. To address these challenges, we introduce an optimization algorithm named Probabilistic Symbol Perception (PSP), which makes a smooth transition between induction and deduction and keeps the correctness of ABL unchanged. We leverage probability as a bridge and present an efficient data structure, achieving the transfer from a continuous probability sequence to discrete Boolean sequences with low computational complexity. Experiments demonstrate promising results.

[LG-22] Probabilistic neural operators for functional uncertainty quantification

链接: https://arxiv.org/abs/2502.12902
作者: Christopher Bülte,Philipp Scholl,Gitta Kutyniok
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural operators aim to approximate the solution operator of a system of differential equations purely from data. They have shown immense success in modeling complex dynamical systems across various domains. However, the occurrence of uncertainties inherent in both model and data has so far rarely been taken into account\textemdasha critical limitation in complex, chaotic systems such as weather forecasting. In this paper, we introduce the probabilistic neural operator (PNO), a framework for learning probability distributions over the output function space of neural operators. PNO extends neural operators with generative modeling based on strictly proper scoring rules, integrating uncertainty information directly into the training process. We provide a theoretical justification for the approach and demonstrate improved performance in quantifying uncertainty across different domains and with respect to different baselines. Furthermore, PNO requires minimal adjustment to existing architectures, shows improved performance for most probabilistic prediction tasks, and leads to well-calibrated predictive distributions and adequate uncertainty representations even for long dynamical trajectories. Implementing our approach into large-scale models for physical applications can lead to improvements in corresponding uncertainty quantification and extreme event identification, ultimately leading to a deeper understanding of the prediction of such surrogate models.

[LG-23] The Relationship Between Head Injury and Alzheimer's Disease: A Causal Analysis with Bayesian Networks

链接: https://arxiv.org/abs/2502.12898
作者: Andrei Lixandru
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study examines the potential causal relationship between head injury and the risk of developing Alzheimer’s disease (AD) using Bayesian networks and regression models. Using a dataset of 2,149 patients, we analyze key medical history variables, including head injury history, memory complaints, cardiovascular disease, and diabetes. Logistic regression results suggest an odds ratio of 0.88 for head injury, indicating a potential but statistically insignificant protective effect against AD. In contrast, memory complaints exhibit a strong association with AD, with an odds ratio of 4.59. Linear regression analysis further confirms the lack of statistical significance for head injury (coefficient: -0.0245, p = 0.469) while reinforcing the predictive importance of memory complaints. These findings highlight the complex interplay of medical history factors in AD risk assessment and underscore the need for further research utilizing larger datasets and advanced causal modeling techniques.
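
The odds ratios quoted above are obtained by exponentiating logistic regression coefficients. The snippet below reproduces that computation pattern on synthetic data with the same variable names; the effect sizes used to simulate the outcome are invented for illustration, not the study's estimates.

```python
# Sketch: odds ratios as exp(logistic-regression coefficients).
# The data is synthetic; only the computation pattern mirrors the study.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2149
df = pd.DataFrame({
    "head_injury": rng.integers(0, 2, n),
    "memory_complaints": rng.integers(0, 2, n),
})
# Simulated outcome with invented effect sizes (illustration only).
logit = -1.0 - 0.13 * df["head_injury"] + 1.52 * df["memory_complaints"]
df["AD"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(df[["head_injury", "memory_complaints"]])
fit = sm.Logit(df["AD"], X).fit(disp=0)
print(np.exp(fit.params))  # odds ratios: exp(coefficient)
```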

[LG-24] Pushing the Limits of the Reactive Affine Shaker Algorithm to Higher Dimensions

链接: https://arxiv.org/abs/2502.12877
作者: Roberto Battiti,Mauro Brunato
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: Submitted to: the 19th Learning and Intelligent Optimization Conference (LION19), June 15-19 2025, Prague, Czech Republic ( this https URL )

点击查看摘要

Abstract:Bayesian Optimization (BO) for the minimization of expensive functions of continuous variables uses all the knowledge acquired from previous samples ( \boldsymbol x_i and f(\boldsymbol x_i) values) to build a surrogate model based on Gaussian processes. The surrogate is then exploited to define the next point to sample, through a careful balance of exploration and exploitation. Initially intended for low-dimensional spaces, BO has recently been modified and used also for very large-dimensional spaces (up to about one thousand dimensions). In this paper we consider a much simpler algorithm, called “Reactive Affine Shaker” (RAS). The next sample is always generated with a uniform probability distribution inside a parallelepiped (the “box”). At each iteration, the form of the box is adapted during the search through an affine transformation, based only on the point \boldsymbol x position and on the success or failure in improving the function. The function values are therefore not used directly to modify the search area and to generate the next sample. The entire dimensionality is kept (no active subspaces). Despite its extreme simplicity and its use of only stochastic local search, surprisingly the produced results are comparable to and not too far from the state-of-the-art results of high-dimensional versions of BO, although with some more function evaluations. An ablation study and an analysis of probability distribution of directions (improving steps and prevailing box orientation) in very large-dimensional spaces are conducted to understand more about the behavior of RAS and to assess the relative importance of the algorithmic building blocks for the final results.
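
To make the box-adaptation idea concrete, here is a toy variant of the search loop: sample uniformly in a box around the current point, also try the mirrored step, widen the box after an improving step, and shrink it otherwise. The per-coordinate scaling below simplifies RAS's full affine box transformation, and the growth/shrink constants are arbitrary assumptions for illustration.

```python
# Toy RAS-style local search: uniform samples in an adaptive box.
# Per-coordinate scaling simplifies RAS's affine box update; the
# grow/shrink factors are arbitrary choices for illustration.
import numpy as np

def ras_like_minimize(f, x0, iters=5000, grow=2.0, shrink=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    half = np.full_like(x, 0.5)              # box half-widths
    for _ in range(iters):
        step = rng.uniform(-half, half)
        for cand in (x + step, x - step):    # "double shot": try mirrored step
            fc = f(cand)
            if fc < fx:                      # success: accept and widen the box
                x, fx = cand, fc
                half *= grow
                break
        else:                                # both failed: narrow the box
            half = np.maximum(half * shrink, 1e-12)
    return x, fx

sphere = lambda v: float(np.sum(v * v))
_, best = ras_like_minimize(sphere, np.full(50, 3.0))
print(f"best sphere value in 50 dimensions: {best:.6f}")
```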

[LG-25] Testing for Causal Fairness

链接: https://arxiv.org/abs/2502.12874
作者: Jiarun Fu,LiZhong Ding,Pengqi Li,Qiuning Wei,Yurong Cheng,Xu Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causality is widely used in fairness analysis to prevent discrimination on sensitive attributes, such as gender in career recruitment and race in crime prediction. However, the current data-based Potential Outcomes Framework (POF) often leads to untrustworthy fairness analysis results when handling high-dimensional data. To address this, we introduce a distribution-based POF that transforms fairness analysis into Distributional Closeness Testing (DCT) by intervening on sensitive attributes. We define counterfactual closeness fairness as the null hypothesis of DCT, where a sensitive attribute is considered fair if its factual and counterfactual potential outcome distributions are sufficiently close. We introduce the Norm-Adaptive Maximum Mean Discrepancy Treatment Effect (N-TE) as a statistic for measuring distributional closeness and apply DCT using the empirical estimator of N-TE, referred to as Counterfactual Fairness-CLOseness Testing (\textrm{CF-CLOT}). To ensure the trustworthiness of the testing results, we establish the testing consistency of N-TE through rigorous theoretical analysis. \textrm{CF-CLOT} demonstrates sensitivity in fairness analysis through the flexibility of the closeness parameter \epsilon. Unfair sensitive attributes have been successfully detected by \textrm{CF-CLOT} in extensive experiments across various real-world scenarios, which validates the consistency of the testing.
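
For intuition about distributional closeness testing, the snippet below computes a plain (biased) empirical MMD with an RBF kernel between factual and counterfactual outcome samples; large values signal distributional discrepancy. This is only the textbook MMD statistic, not the paper's norm-adaptive N-TE or its testing procedure.

```python
# Sketch: biased empirical MMD^2 with an RBF kernel between two samples.
# This is the textbook statistic, not the paper's norm-adaptive N-TE.
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    def k(A, B):
        d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
factual = rng.normal(0.0, 1.0, size=(500, 3))
counterfactual = rng.normal(0.3, 1.0, size=(500, 3))  # shifted distribution
print(f"MMD^2 estimate: {mmd2_rbf(factual, counterfactual):.4f}")
```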

[LG-26] Malware Detection based on API calls

链接: https://arxiv.org/abs/2502.12863
作者: Christofer Fellicious,Manuel Bischof,Kevin Mayer,Dorian Eikenberg,Stefan Hausotte,Hans P. Reiser,Michael Granitzer
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Malware attacks pose a significant threat in today’s interconnected digital landscape, causing billions of dollars in damages. Detecting and identifying families as early as possible provides an edge in protecting against such malware. We explore a lightweight, order-invariant approach to detecting and mitigating malware threats: analyzing API calls without regard to their sequence. We publish a public dataset of over three hundred thousand samples and their function call parameters for this task, annotated with labels indicating benign or malicious activity. The complete dataset is above 550GB uncompressed in size. We leverage machine learning algorithms, such as random forests, and conduct behavioral analysis by examining patterns and anomalies in API call sequences. By investigating how the function calls occur regardless of their order, we can identify discriminating features that can help us identify malware early on. The models we’ve developed are not only effective but also efficient. They are lightweight and can run on any machine with minimal performance overhead, while still achieving an impressive F1-Score of over 85%. We also empirically show that we only need a subset of the function call sequence, specifically calls to the this http URL library, to identify malware. Our research demonstrates the efficacy of this approach through empirical evaluations, underscoring its accuracy and scalability. The code is open source and available at Github along with the dataset on Zenodo.
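
The order-invariant representation the authors describe can be approximated with a simple bag-of-API-calls count vector feeding a random forest, as in the toy sketch below. The traces and labels are invented for illustration; the published dataset on Zenodo is far richer.

```python
# Toy sketch: order-invariant bag-of-API-calls features + random forest.
# Traces and labels are invented for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

traces = [
    "CreateFileW ReadFile CloseHandle",
    "VirtualAlloc WriteProcessMemory CreateRemoteThread",
    "ReadFile ReadFile CloseHandle CreateFileW",
    "CreateRemoteThread VirtualAlloc VirtualAlloc",
]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = malicious (toy)

vec = CountVectorizer(token_pattern=r"\S+")  # counts ignore call order
X = vec.fit_transform(traces)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(vec.transform(["VirtualAlloc CreateRemoteThread ReadFile"])))
```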

[LG-27] MOLLM: Multi-Objective Large Language Model for Molecular Design – Optimizing with Experts

链接: https://arxiv.org/abs/2502.12845
作者: Nian Ran,Yue Wang,Richard Allmendinger
类目: Machine Learning (cs.LG)
*备注: 8 pages, under review

点击查看摘要

Abstract:Molecular design plays a critical role in advancing fields such as drug discovery, materials science, and chemical engineering. This work introduces the Multi-Objective Large Language Model for Molecular Design (MOLLM), a novel framework that combines domain-specific knowledge with the adaptability of Large Language Models to optimize molecular properties across multiple objectives. Leveraging in-context learning and multi-objective optimization, MOLLM achieves superior efficiency, innovation, and performance, significantly surpassing state-of-the-art (SOTA) methods. Recognizing the substantial impact of initial populations on evolutionary algorithms, we categorize them into three types: best initial, worst initial, and random initial, to ensure the initial molecules are the same for each method across experiments. Our results demonstrate that MOLLM consistently outperforms SOTA models in all of our experiments. We also provide extensive ablation studies to evaluate the superiority of our components.

[LG-28] NTP-INT: Network Traffic Prediction-Driven In-band Network Telemetry for High-load Switches

链接: https://arxiv.org/abs/2502.12834
作者: Penghui Zhang,Hua Zhang,Yuqi Dai,Cheng Zeng,Jingyu Wang,Jianxin Liao
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In-band network telemetry (INT) is essential to network management due to its real-time visibility. However, because of the rapid increase in network devices and services, it has become crucial to have targeted access to detailed network information in a dynamic network environment. This paper proposes an intelligent network telemetry system called NTP-INT to obtain more fine-grained network information on high-load switches. Specifically, NTP-INT consists of three modules: network traffic prediction module, network pruning module, and probe path planning module. Firstly, the network traffic prediction module adopts a Multi-Temporal Graph Neural Network (MTGNN) to predict future network traffic and identify high-load switches. Then, we design the network pruning algorithm to generate a subnetwork covering all high-load switches to reduce the complexity of probe path planning. Finally, the probe path planning module uses an attention-mechanism-based deep reinforcement learning (DRL) model to plan efficient probe paths in the network slice. The experimental results demonstrate that NTP-INT can acquire more precise network information on high-load switches while decreasing the control overhead by 50%.

[LG-29] Frequency-domain alignment of heterogeneous multidimensional separations data through complex orthogonal Procrustes analysis

链接: https://arxiv.org/abs/2502.12810
作者: Michael Sorochan Armstrong
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 12 pages, 1 figure

点击查看摘要

Abstract:Multidimensional separations data have the capacity to reveal detailed information about complex biological samples. However, data analysis has been an ongoing challenge in the area since the peaks that represent chemical factors may drift over the course of several analytical runs along the first and second dimension retention times. This makes higher-level analyses of the data difficult, since a 1-1 comparison of samples is seldom possible without sophisticated pre-processing routines. Further complicating the issue is the fact that closely co-eluting components will need to be resolved, typically using some variants of Parallel Factor Analysis (PARAFAC), Multivariate Curve Resolution (MCR), or the recently explored Shift-Invariant Multi-linearity. These algorithms work with a user-specified number of components, and regions of interest that are then summarized as a peak table that is invariant to shift. However, identifying regions of interest across truly heterogeneous data remains an ongoing issue for automated deployment of these algorithms. This work offers a very simple solution to the alignment problem through an orthogonal Procrustes analysis of the frequency-domain representation of synthetic multidimensional separations data, for peaks that are logarithmically transformed to simulate shift while preserving the underlying topology of the data. Using this very simple method for analysis, two synthetic chromatograms can be compared under close to the worst possible scenarios for alignment.
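
For a feel of the frequency-domain alignment idea, note that for a pure shift the optimal complex rotation decouples per frequency, with a closed-form unit-modulus solution. The 1D sketch below exploits that special case; it is not the paper's full complex orthogonal Procrustes analysis of 2D separations data, nor its logarithmic transform.

```python
# Sketch: per-frequency 1-D complex Procrustes alignment of a shifted peak.
# This special case has a closed form; the paper's method is more general.
import numpy as np

def phase_align(x, ref):
    # Rotate each Fourier coefficient of x by the unit-modulus factor that
    # best matches the reference spectrum (optimal 1-D complex rotation).
    X, R = np.fft.rfft(x), np.fft.rfft(ref)
    inner = np.conj(X) * R
    rot = inner / np.maximum(np.abs(inner), 1e-12)
    return np.fft.irfft(X * rot, n=x.size)

t = np.linspace(0, 1, 512)
ref = np.exp(-((t - 0.40) ** 2) / 0.001)      # reference peak
drifted = np.exp(-((t - 0.47) ** 2) / 0.001)  # same peak, drifted retention time

aligned = phase_align(drifted, ref)
print(f"L1 error before: {np.abs(drifted - ref).sum():.3f}, "
      f"after: {np.abs(aligned - ref).sum():.3f}")
```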

[LG-30] An improved wind power prediction via a novel wind ramp identification algorithm

链接: https://arxiv.org/abs/2502.12807
作者: Yifan Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conventional wind power prediction methods often struggle to provide accurate and reliable predictions in the presence of sudden changes in wind speed and power output. To address this challenge, this study proposes an integrated algorithm that combines a wind speed mutation identification algorithm, an optimized similar-period matching algorithm, and a wind power prediction algorithm. By exploiting the convergence properties of meteorological events, the method significantly improves the accuracy of wind power prediction under sudden meteorological changes. Firstly, a novel adaptive model based on variational mode decomposition, the VMD-IC model, is developed for identifying and labelling key turning points in the historical wind power data, representing abrupt meteorological environments. At the same time, this paper proposes Ramp Factor (RF) indicators and a wind speed similarity coefficient to refine the definition of the current wind power ramp event (WPRE). After refining the ramp definition and denoising algorithm, this paper uses the Informer deep learning algorithm, together with the outputs of the first two models and multimodal data such as NWP numerical weather forecasts, to achieve accurate wind power forecasts. The experimental results of the ablation study confirm the effectiveness and reliability of the proposed wind ramp identification method. Compared with existing methods, the proposed model exhibits excellent performance and provides valuable guidance for the safe and cost-effective operation of power systems.

[LG-31] Reinforcement Learning for Dynamic Resource Allocation in Optical Networks: Hype or Hope?

链接: https://arxiv.org/abs/2502.12804
作者: Michael Doherty,Robin Matzner,Rasoul Sadeghi,Polina Bayvel,Alejandra Beghelli
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The application of reinforcement learning (RL) to dynamic resource allocation in optical networks has been the focus of intense research activity in recent years, with almost 100 peer-reviewed papers. We present a review of progress in the field, and identify significant gaps in benchmarking practices and reproducibility. To determine the strongest benchmark algorithms, we systematically evaluate several heuristics across diverse network topologies. We find that path count and sort criteria for path selection significantly affect the benchmark performance. We meticulously recreate the problems from five landmark papers and apply the improved benchmarks. Our comparisons demonstrate that simple heuristics consistently match or outperform the published RL solutions, often with an order of magnitude lower blocking probability. Furthermore, we present empirical lower bounds on network blocking using a novel defragmentation-based method, revealing that potential improvements over the benchmark heuristics are limited to 19–36% increased traffic load for the same blocking performance in our examples. We make our simulation framework and results publicly available to promote reproducible research and standardized evaluation at this https URL.

[LG-32] PPGF: Probability Pattern-Guided Time Series Forecasting

链接: https://arxiv.org/abs/2502.12802
作者: Yanru Sun,Zongxia Xie,Haoyu Xing,Hualong Yu,Qinghua Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting (TSF) is an essential branch of machine learning with various applications. Most methods for TSF focus on constructing different networks to extract better information and improve performance. However, practical application data contain different internal mechanisms, resulting in a mixture of multiple patterns. That is, the model's ability to fit each pattern differs, producing different errors. In order to solve this problem, we propose an end-to-end framework, namely probability pattern-guided time series forecasting (PPGF). PPGF reformulates the TSF problem as a forecasting task guided by probabilistic pattern classification. Firstly, we propose the grouping strategy to approach forecasting problems as classification and alleviate the impact of data imbalance on classification. Secondly, we predict in the corresponding class interval to guarantee the consistency of classification and forecasting. In addition, True Class Probability (TCP) is introduced to pay more attention to the difficult samples to improve the classification accuracy. Specifically, PPGF classifies the different patterns to determine which one the target value may belong to and estimates it accurately within the corresponding interval. To demonstrate the effectiveness of the proposed framework, we conduct extensive experiments on real-world datasets, and PPGF achieves significant performance improvements over several baseline methods. Furthermore, the effectiveness of TCP and the necessity of consistency between classification and forecasting are proved in the experiments. All data and codes are available online: this https URL.

[LG-33] Learning Counterfactually Fair Models via Improved Generation with Neural Causal Models

链接: https://arxiv.org/abs/2502.12796
作者: Krishn Vishwas Kher,Aditya Varun V,Shantanu Das,SakethaNath Jagarlapudi
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:One of the main concerns while deploying machine learning models in real-world applications is fairness. Counterfactual fairness has emerged as an intuitive and natural definition of fairness. However, existing methodologies for enforcing counterfactual fairness seem to have two limitations: (i) generating counterfactual samples faithful to the underlying causal graph, and (ii) as we argue in this paper, existing regularizers are mere proxies and do not directly enforce the exact definition of counterfactual fairness. In this work, our aim is to mitigate both issues. Firstly, we propose employing Neural Causal Models (NCMs) for generating the counterfactual samples. For implementing the abduction step in NCMs, the posteriors of the exogenous variables need to be estimated given a counterfactual query, as they are not readily available. As a consequence, \mathcal{L}_3 consistency with respect to the underlying causal graph cannot be guaranteed in practice due to the estimation errors involved. To mitigate this issue, we propose a novel kernel least squares loss term that enforces the \mathcal{L}_3 constraints explicitly. Thus, we obtain an improved counterfactual generation suitable for the counterfactual fairness task. Secondly, we propose a new MMD-based regularizer term that explicitly enforces the counterfactual fairness conditions into the base model while training. We show an improved trade-off between counterfactual fairness and generalization over existing baselines on synthetic and benchmark datasets.

[LG-34] One-bit Compressed Sensing using Generative Models

链接: https://arxiv.org/abs/2502.12762
作者: Swatantra Kafle,Geethu Joseph,Pramod K. Varshney
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper addresses the classical problem of one-bit compressed sensing using a deep learning-based reconstruction algorithm that leverages a trained generative model to enhance the signal reconstruction performance. The generator, a pre-trained neural network, learns to map from a low-dimensional latent space to a higher-dimensional set of sparse vectors. This generator is then used to reconstruct sparse vectors from their one-bit measurements by searching over its range. The presented algorithm provides an excellent reconstruction performance because the generative model can learn additional structural information about the signal beyond sparsity. Furthermore, we provide theoretical guarantees on the reconstruction accuracy and sample complexity of the algorithm. Through numerical experiments using three publicly available image datasets, MNIST, Fashion-MNIST, and Omniglot, we demonstrate the superior performance of the algorithm compared to other existing algorithms and show that our algorithm can recover both the amplitude and the direction of the signal from one-bit measurements.
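
The core reconstruction idea — searching the generator's range for a signal whose measurement signs match the observations — can be sketched as follows. The tiny random linear "generator" and the gradient-free perturbation search are simplifying assumptions; the paper uses a trained deep generative model and a different optimization procedure.

```python
# Toy sketch: one-bit compressed sensing with a generative prior, recovering
# the signal direction by searching the latent space for sign consistency.
# The linear "generator" and the crude search are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 5, 100, 80                  # latent dim, signal dim, measurements
G = rng.normal(size=(n, k))           # stand-in "generator": z -> G @ z
A = rng.normal(size=(m, n))           # measurement matrix

z_true = rng.normal(size=k)
y = np.sign(A @ (G @ z_true))         # observed one-bit measurements

def loss(z, margin=1.0):
    # Hinge penalty on measurements whose signs disagree (with a margin).
    return float(np.sum(np.maximum(0.0, margin - y * (A @ (G @ z)))))

# Gradient-free perturbation search over the latent space (for brevity).
best_z = np.zeros(k)
best_l = loss(best_z)
for _ in range(5000):
    cand = best_z + 0.1 * rng.normal(size=k)
    l = loss(cand)
    if l < best_l:
        best_z, best_l = cand, l

x_hat, x_true = G @ best_z, G @ z_true
cos = x_hat @ x_true / (np.linalg.norm(x_hat) * np.linalg.norm(x_true) + 1e-12)
print(f"direction recovery (cosine similarity): {cos:.3f}")
```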

[LG-35] High-Fidelity Music Vocoder using Neural Audio Codecs ICASSP2025

链接: https://arxiv.org/abs/2502.12759
作者: Luca A. Lanzendörfer,Florian Grötschla,Michael Ungersböck,Roger Wattenhofer
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted at ICASSP 2025

点击查看摘要

Abstract:While neural vocoders have made significant progress in high-fidelity speech synthesis, their application on polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it to an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.

[LG-36] Navigating Demand Uncertainty in Container Shipping: Deep Reinforcement Learning for Enabling Adaptive and Feasible Master Stowage Planning IJCAI2025

链接: https://arxiv.org/abs/2502.12756
作者: Jaike van Twiller,Yossiri Adulyasak,Erick Delage,Djordje Grbic,Rune Møller Jensen
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: This paper is currently under review for IJCAI 2025

点击查看摘要

Abstract:Reinforcement learning (RL) has shown promise in solving various combinatorial optimization problems. However, conventional RL faces challenges when dealing with real-world constraints, especially when action space feasibility is explicit and dependent on the corresponding state or trajectory. In this work, we focus on using RL in container shipping, often considered the cornerstone of global trade, by dealing with the critical challenge of master stowage planning. The main objective is to maximize cargo revenue and minimize operational costs while navigating demand uncertainty and various complex operational constraints, namely vessel capacity and stability, which must be dynamically updated along the vessel’s voyage. To address this problem, we implement a deep reinforcement learning framework with feasibility projection to solve the master stowage planning problem (MPP) under demand uncertainty. The experimental results show that our architecture efficiently finds adaptive, feasible solutions for this multi-stage stochastic optimization problem, outperforming traditional mixed-integer programming and RL with feasibility regularization. Our AI-driven decision-support policy enables adaptive and feasible planning under uncertainty, optimizing operational efficiency and capacity utilization while contributing to sustainable and resilient global supply chains.

[LG-37] Architect of the Bits World: Masked Autoregressive Modeling for Circuit Generation Guided by Truth Table

链接: https://arxiv.org/abs/2502.12751
作者: Haoyuan Wu,Haisheng Zheng,Shoubo Hu,Zhuolun He,Bei Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Logic synthesis, a critical stage in electronic design automation (EDA), optimizes gate-level circuits to minimize power consumption and area occupancy in integrated circuits (ICs). Traditional logic synthesis tools rely on human-designed heuristics, often yielding suboptimal results. Although differentiable architecture search (DAS) has shown promise in generating circuits from truth tables, it faces challenges such as high computational complexity, convergence to local optima, and extensive hyperparameter tuning. Consequently, we propose a novel approach integrating conditional generative models with DAS for circuit generation. Our approach first introduces CircuitVQ, a circuit tokenizer trained based on our Circuit AutoEncoder. We then develop CircuitAR, a masked autoregressive model leveraging CircuitVQ as the tokenizer. CircuitAR can generate preliminary circuit structures from truth tables, which guide DAS in producing functionally equivalent circuits. Notably, we observe the scalability and emergent capability in generating complex circuit structures of our CircuitAR models. Extensive experiments also show the superior performance of our method. This research bridges the gap between probabilistic generative models and precise circuit generation, offering a robust solution for logic synthesis.

[LG-38] Circuit Representation Learning with Masked Gate Modeling and Verilog-AIG Alignment

链接: https://arxiv.org/abs/2502.12732
作者: Haoyuan Wu,Haisheng Zheng,Yuan Pu,Bei Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the structure and function of circuits is crucial for electronic design automation (EDA). Circuits can be formulated as And-Inverter graphs (AIGs), enabling efficient implementation of representation learning through graph neural networks (GNNs). Masked modeling paradigms have been proven effective in graph representation learning. However, masking augmentation to original circuits will destroy their logical equivalence, which is unsuitable for circuit representation learning. Moreover, existing masked modeling paradigms often prioritize structural information at the expense of abstract information such as circuit function. To address these limitations, we introduce MGVGA, a novel constrained masked modeling paradigm incorporating masked gate modeling (MGM) and Verilog-AIG alignment (VGA). Specifically, MGM preserves logical equivalence by masking gates in the latent space rather than in the original circuits, subsequently reconstructing the attributes of these masked gates. Meanwhile, large language models (LLMs) have demonstrated an excellent understanding of the Verilog code functionality. Building upon this capability, VGA performs masking operations on original circuits and reconstructs masked gates under the constraints of equivalent Verilog codes, enabling GNNs to learn circuit functions from LLMs. We evaluate MGVGA on various logic synthesis tasks for EDA and show the superior performance of MGVGA compared to previous state-of-the-art methods. Our code is available at this https URL.

[LG-39] Learning the symmetric group: large from small

链接: https://arxiv.org/abs/2502.12717
作者: Max Petschack,Alexandr Garbali,Jan de Gier
类目: Machine Learning (cs.LG); Combinatorics (math.CO); Representation Theory (math.RT)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Machine learning explorations can make significant inroads into solving difficult problems in pure mathematics. One advantage of this approach is that mathematical datasets do not suffer from noise, but a challenge is the amount of data required to train these models and that this data can be computationally expensive to generate. Key challenges further comprise difficulty in a posteriori interpretation of statistical models and the implementation of deep and abstract mathematical problems. We propose a method for scalable tasks, by which models trained on simpler versions of a task can then generalize to the full task. Specifically, we demonstrate that a transformer neural-network trained on predicting permutations from words formed by general transpositions in the symmetric group S_10 can generalize to the symmetric group S_25 with near 100% accuracy. We also show that S_10 generalizes to S_16 with similar performance if we only use adjacent transpositions. We employ identity augmentation as a key tool to manage variable word lengths, and partitioned windows for training on adjacent transpositions. Finally we compare variations of the method used and discuss potential challenges with extending the method to other tasks.
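
The task's training data is easy to picture: a word of transpositions and the permutation it composes to. The sketch below generates such a pair in S_n; the transformer that learns this mapping, and the paper's identity augmentation, are not shown.

```python
# Sketch: generating a (word-of-transpositions, permutation) training pair
# in S_n. The transformer model itself is not reproduced here.
import numpy as np

def word_to_permutation(word, n):
    """Compose a list of transpositions (i, j) into a permutation of range(n)."""
    perm = np.arange(n)
    for i, j in word:
        perm[[i, j]] = perm[[j, i]]
    return perm

rng = np.random.default_rng(0)
n, word_len = 10, 6
word = [tuple(rng.choice(n, size=2, replace=False)) for _ in range(word_len)]
print("word:", word)
print("permutation:", word_to_permutation(word, n))
```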

[LG-40] CausalMan: A physics-based simulator for large-scale causality

链接: https://arxiv.org/abs/2502.12707
作者: Nicholas Tagliapietra,Juergen Luettin,Lavdim Halilaj,Moritz Willig,Tim Pychynski,Kristian Kersting
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:A comprehensive understanding of causality is critical for navigating and operating within today’s complex real-world systems. The absence of realistic causal models with known data generating processes complicates fair benchmarking. In this paper, we present the CausalMan simulator, modeled after a real-world production line. The simulator features a diverse range of linear and non-linear mechanisms and challenging-to-predict behaviors, such as discrete mode changes. We demonstrate the inadequacy of many state-of-the-art approaches and analyze the significant differences in their performance and tractability, both in terms of runtime and memory complexity. As a contribution, we will release the CausalMan large-scale simulator. We present two derived datasets, and perform an extensive evaluation of both.

[LG-41] Scalable Model Merging with Progressive Layer-wise Distillation

链接: https://arxiv.org/abs/2502.12706
作者: Jing Xu,Jiazheng Li,Jingzhao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when none or few data are available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14% and 6.61% improvements in vision and NLU tasks. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.

[LG-42] SATA: Safe and Adaptive Torque-Based Locomotion Policies Inspired by Animal Learning

链接: https://arxiv.org/abs/2502.12674
作者: Peizhuo Li,Hongyi Li,Ge Sun,Jin Cheng,Xinrong Yang,Guillaume Bellegarda,Milad Shafiee,Yuhong Cao,Auke Ijspeert,Guillaume Sartoretti
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite recent advances in learning-based controllers for legged robots, deployments in human-centric environments remain limited by safety concerns. Most of these approaches use position-based control, where policies output target joint angles that must be processed by a low-level controller (e.g., PD or impedance controllers) to compute joint torques. Although impressive results have been achieved in controlled real-world scenarios, these methods often struggle with compliance and adaptability when encountering environments or disturbances unseen during training, potentially resulting in extreme or unsafe behaviors. Inspired by how animals achieve smooth and adaptive movements by controlling muscle extension and contraction, torque-based policies offer a promising alternative by enabling precise and direct control of the actuators in torque space. In principle, this approach facilitates more effective interactions with the environment, resulting in safer and more adaptable behaviors. However, challenges such as a highly nonlinear state space and inefficient exploration during training have hindered their broader adoption. To address these limitations, we propose SATA, a bio-inspired framework that mimics key biomechanical principles and adaptive learning mechanisms observed in animal locomotion. Our approach effectively addresses the inherent challenges of learning torque-based policies by significantly improving early-stage exploration, leading to high-performance final policies. Remarkably, our method achieves zero-shot sim-to-real transfer. Our experimental results indicate that SATA demonstrates remarkable compliance and safety, even in challenging environments such as soft/slippery terrain or narrow passages, and under significant external disturbances, highlighting its potential for practical deployments in human-centric and safety-critical scenarios.

[LG-43] Implicit Repair with Reinforcement Learning in Emergent Communication AAMAS2025

链接: https://arxiv.org/abs/2502.12624
作者: Fábio Vital,Alberto Sardinha,Francisco S. Melo
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: AAMAS 2025 - full paper

点击查看摘要

Abstract:Conversational repair is a mechanism used to detect and resolve miscommunication and misinformation problems when two or more agents interact. One particular and underexplored form of repair in emergent communication is the implicit repair mechanism, where the interlocutor purposely conveys the desired information in such a way as to prevent misinformation from any other interlocutor. This work explores how redundancy can modify the emergent communication protocol to continue conveying the necessary information to complete the underlying task, even with additional external environmental pressures such as noise. We focus on extending the signaling game, called the Lewis Game, by adding noise in the communication channel and inputs received by the agents. Our analysis shows that agents add redundancy to the transmitted messages in order to prevent the negative impact of noise on task success. Moreover, we observe that the emergent communication protocol's generalization capabilities remain equivalent to architectures employed in simpler games that are entirely deterministic. Finally, our method is the only one suitable for producing robust communication protocols that can handle cases with and without noise while maintaining increased generalization performance.

[LG-44] Uncertainty-Aware Graph Structure Learning

链接: https://arxiv.org/abs/2502.12618
作者: Shen Han,Zhiyao Zhou,Jiawei Chen,Zhezheng Hao,Sheng Zhou,Gang Wang,Yan Feng,Chun Chen,Can Wang
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by TheWebConf 2025

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become a prominent approach for learning from graph-structured data. However, their effectiveness can be significantly compromised when the graph structure is suboptimal. To address this issue, Graph Structure Learning (GSL) has emerged as a promising technique that refines node connections adaptively. Nevertheless, we identify two key limitations in existing GSL methods: 1) Most methods primarily focus on node similarity to construct relationships, while overlooking the quality of node information. Blindly connecting low-quality nodes and aggregating their ambiguous information can degrade the performance of other nodes. 2) The constructed graph structures are often constrained to be symmetric, which may limit the model's flexibility and effectiveness. To overcome these limitations, we propose an Uncertainty-aware Graph Structure Learning (UnGSL) strategy. UnGSL estimates the uncertainty of node information and utilizes it to adjust the strength of directional connections, where the influence of nodes with high uncertainty is adaptively reduced. Moreover, UnGSL serves as a plug-in module that can be seamlessly integrated into existing GSL methods with minimal additional computational cost. In our experiments, we implement UnGSL into six representative GSL methods, demonstrating consistent performance improvements. The code is available at this https URL.
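
The uncertainty-gating idea can be sketched in a few lines: estimate per-node uncertainty (here, entropy of softmax predictions) and down-weight the outgoing edges of high-uncertainty nodes before aggregation. UnGSL's learned thresholds and its integration into specific GSL methods are not reproduced; this only illustrates the weighting pattern.

```python
# Sketch: down-weight outgoing edges of high-uncertainty nodes.
# Uncertainty here is softmax entropy; UnGSL's actual mechanism differs.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_classes = 6, 3
A = (rng.random((n_nodes, n_nodes)) < 0.4).astype(float)  # directed adjacency
logits = rng.normal(size=(n_nodes, n_classes))

probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
entropy = -(probs * np.log(probs + 1e-12)).sum(1)
confidence = 1.0 - entropy / np.log(n_classes)  # 1 = certain, 0 = max entropy

# Scale each source node's outgoing influence by its confidence.
A_weighted = A * confidence[:, None]
print(np.round(A_weighted, 2))
```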

[LG-45] Hypernetwork-based approach for optimal composition design in partially controlled multi-agent systems

链接: https://arxiv.org/abs/2502.12605
作者: Kyeonghyeon Park,David Molina Concha,Hyun-Rok Lee,Chi-Guhn Lee,Taesik Lee
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Partially Controlled Multi-Agent Systems (PCMAS) are comprised of controllable agents, managed by a system designer, and uncontrollable agents, operating autonomously. This study addresses an optimal composition design problem in PCMAS, which involves the system designer’s problem, determining the optimal number and policies of controllable agents, and the uncontrollable agents’ problem, identifying their best-response policies. Solving this bi-level optimization problem is computationally intensive, as it requires repeatedly solving multi-agent reinforcement learning problems under various compositions for both types of agents. To address these challenges, we propose a novel hypernetwork-based framework that jointly optimizes the system’s composition and agent policies. Unlike traditional methods that train separate policy networks for each composition, the proposed framework generates policies for both controllable and uncontrollable agents through a unified hypernetwork. This approach enables efficient information sharing across similar configurations, thereby reducing computational overhead. Additional improvements are achieved by incorporating reward parameter optimization and mean action networks. Using real-world New York City taxi data, we demonstrate that our framework outperforms existing methods in approximating equilibrium policies. Our experimental results show significant improvements in key performance metrics, such as order response rate and served demand, highlighting the practical utility of controlling agents and their potential to enhance decision-making in PCMAS.

[LG-46] Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum

链接: https://arxiv.org/abs/2502.12599
作者: Yihong Liu,Dongyeop Kang,Sehoon Ha
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous robotic wiping is an important task in various industries, ranging from industrial manufacturing to sanitization in healthcare. Deep reinforcement learning (Deep RL) has emerged as a promising algorithm, however, it often suffers from a high demand for repetitive reward engineering. Instead of relying on manual tuning, we first analyze the convergence of quality-critical robotic wiping, which requires both high-quality wiping and fast task completion, to show the poor convergence of the problem and propose a new bounded reward formulation to make the problem feasible. Then, we further improve the learning process by proposing a novel visual-language model (VLM) based curriculum, which actively monitors the progress and suggests hyperparameter tuning. We demonstrate that the combined method can find a desirable wiping policy on surfaces with various curvatures, frictions, and waypoints, which cannot be learned with the baseline formulation. The demo of this project can be found at: this https URL.

[LG-47] Sample Efficient Omniprediction and Downstream Swap Regret for Non-Linear Losses

链接: https://arxiv.org/abs/2502.12564
作者: Jiuyao Lu,Aaron Roth,Mirah Shi
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We define “decision swap regret” which generalizes both prediction for downstream swap regret and omniprediction, and give algorithms for obtaining it for arbitrary multi-dimensional Lipschitz loss functions in online adversarial settings. We also give sample complexity bounds in the batch setting via an online-to-batch reduction. When applied to omniprediction, our algorithm gives the first polynomial sample-complexity bounds for Lipschitz loss functions – prior bounds either applied only to linear loss (or binary outcomes) or scaled exponentially with the error parameter even under the assumption that the loss functions were convex. When applied to prediction for downstream regret, we give the first algorithm capable of guaranteeing swap regret bounds for all downstream agents with non-linear loss functions over a multi-dimensional outcome space: prior work applied only to linear loss functions, modeling risk neutral agents. Our general bounds scale exponentially with the dimension of the outcome space, but we give improved regret and sample complexity bounds for specific families of multidimensional functions of economic interest: constant elasticity of substitution (CES), Cobb-Douglas, and Leontief utility functions.

[LG-48] Design and Implementation of a Dual Uncrewed Surface Vessel Platform for Bathymetry Research under High-flow Conditions

链接: https://arxiv.org/abs/2502.12539
作者: Dinesh Kumar,Amin Ghorbanpour,Kin Yen,Iman Soltani
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Corresponding author: Iman Soltani (isoltani@ucdavis.edu)

点击查看摘要

Abstract:Bathymetry, the study of underwater topography, relies on sonar mapping of submerged structures. These measurements, critical for infrastructure health monitoring, often require expensive instrumentation. The high financial risk associated with sensor damage or vessel loss creates a reluctance to deploy uncrewed surface vessels (USVs) for bathymetry. However, the crewed-boat bathymetry operations, are costly, pose hazards to personnel, and frequently fail to achieve the stable conditions necessary for bathymetry data collection, especially under high currents. Further research is essential to advance autonomous control, navigation, and data processing technologies, with a particular focus on bathymetry. There is a notable lack of accessible hardware platforms that allow for integrated research in both bathymetry-focused autonomous control and navigation, as well as data evaluation and processing. This paper addresses this gap through the design and implementation of two complementary USV systems tailored for uncrewed bathymetry research. This includes a low-cost USV for Navigation And Control research (NAC-USV) and a second, high-end USV equipped with a high-resolution multi-beam sonar and the associated hardware for Bathymetry data quality Evaluation and Post-processing research (BEP-USV). The NAC-USV facilitates the investigation of autonomous, fail-safe navigation and control, emphasizing the stability requirements for high-quality bathymetry data collection while minimizing the risk to equipment. The BEP-USV, which mirrors the NAC-USV hardware, is then used for additional control validation and in-depth exploration of bathymetry data evaluation and post-processing methodologies. We detail the design and implementation of both systems, and open source the design. Furthermore, we demonstrate the system’s effectiveness in a range of operational scenarios.

[LG-49] Alternating Regret for Online Convex Optimization

链接: https://arxiv.org/abs/2502.12529
作者: Soumita Hait,Ping Li,Haipeng Luo,Mengxiao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Motivated by alternating learning dynamics in two-player games, a recent work by Cevher et al. (2024) shows that o(\sqrt{T}) alternating regret is possible for any T-round adversarial Online Linear Optimization (OLO) problem, and left as an open question whether the same is true for general Online Convex Optimization (OCO). We answer this question in the affirmative by showing that the continuous Hedge algorithm achieves \tilde{\mathcal{O}}(d^{2/3}T^{1/3}) alternating regret for any adversarial d-dimensional OCO problem. We show that this implies an alternating learning dynamic that finds a Nash equilibrium for any convex-concave zero-sum game or a coarse correlated equilibrium for any convex two-player general-sum game at a rate of \tilde{\mathcal{O}}(d^{2/3}/T^{2/3}). To further improve the time complexity and/or the dimension dependence, we propose another simple algorithm, Follow-the-Regularized-Leader with a regularizer whose convex conjugate is 3rd-order smooth, for OCO with smooth and self-concordant loss functions (such as linear or quadratic losses). We instantiate our algorithm with different regularizers and show that, for example, when the decision set is the \ell_2 ball, our algorithm achieves \tilde{\mathcal{O}}(T^{2/5}) alternating regret with no dimension dependence (and a better \tilde{\mathcal{O}}(T^{1/3}) bound for quadratic losses). We complement our results by showing some algorithm-specific alternating regret lower bounds, including a somewhat surprising \Omega(\sqrt{T}) lower bound for a Regret Matching variant that is widely used in alternating learning dynamics.

[LG-50] Contextual Linear Bandits with Delay as Payoff

链接: https://arxiv.org/abs/2502.12528
作者: Mengxiao Zhang,Yingfei Wang,Haipeng Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most D\Delta_\max\log T , where T is the total horizon, D is the maximum delay, and \Delta_\max is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.

[LG-51] Understanding Generalization in Transformers: Error Bounds and Training Dynamics Under Benign and Harmful Overfitting

链接: https://arxiv.org/abs/2502.12508
作者: Yingying Zhang,Zhenyu Wu,Jian Li,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers serve as the foundational architecture for many successful large-scale models, demonstrating the ability to overfit the training data while maintaining strong generalization on unseen data, a phenomenon known as benign overfitting. However, research on how the training dynamics influence error bounds within the context of benign overfitting has been limited. This paper addresses this gap by developing a generalization theory for a two-layer transformer with label flip noise. Specifically, we present generalization error bounds for both benign and harmful overfitting under varying signal-to-noise ratios (SNR), where the training dynamics are categorized into three distinct stages, each with its corresponding error bounds. Additionally, we conduct extensive experiments to identify key factors that influence test errors in transformers. Our experimental results align closely with the theoretical predictions, validating our findings.

[LG-52] GPU Memory Usage Optimization for Backward Propagation in Deep Network Training

链接: https://arxiv.org/abs/2502.12499
作者: Ding-Yong Hong,Tzu-Hsien Tsai,Ning Wang,Pangfeng Liu,Jan-Jan Wu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: To appear in JPDC 2025

点击查看摘要

Abstract:In modern Deep Learning, it has been a trend to design larger Deep Neural Networks (DNNs) for the execution of more complex tasks and better accuracy. On the other hand, Convolutional Neural Networks (CNNs) have become the standard method for most computer vision tasks. However, the memory allocation for the intermediate data in convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to resolve the problem. Besides hardware-dependent solutions, a general methodology, rematerialization, can reduce GPU memory usage by trading computation for memory efficiently. The idea is to select a set of intermediate results during the forward phase as checkpoints, and only save them in memory to reduce memory usage. The backward phase recomputes the intermediate data from the closest checkpoints in memory as needed. This recomputation increases execution time but saves memory by not storing all intermediate results during the forward phase. In this paper, we focus on efficiently finding the optimal checkpoint subset to achieve the least peak memory usage during model training. We first describe the theoretical background of neural network training using mathematical equations, and use these equations to identify all essential data required during both forward and backward phases to compute the gradients of the model weights. We then formulate the checkpoint selection problem and propose a dynamic programming algorithm with time complexity O(n^3) to find the optimal checkpoint subset. With extensive experiments, we formulate a more accurate description of the problem using our theoretical analysis, revise the objective function based on the tracing, and propose an O(n)-time algorithm for finding the optimal checkpoint subset.
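
To see what "checkpoint selection" is optimizing, the sketch below brute-forces a toy cost model on a tiny layer list: peak memory is taken as the activations kept as checkpoints plus the largest segment that must be recomputed between consecutive checkpoints. This toy objective and the brute-force search are illustrative assumptions; the paper works with a more precise model and derives O(n^3)- and O(n)-time algorithms.

```python
# Toy sketch of the checkpoint-selection objective:
# peak = (checkpointed activation sizes) + (largest recomputed segment).
# Brute force over tiny cases; the paper gives efficient exact algorithms
# for its (different, more precise) cost model.
from itertools import combinations

mem = [4, 1, 3, 2, 5, 1, 2]  # per-layer activation sizes (toy numbers)

def peak_memory(checkpoints):
    kept = sum(mem[i] for i in checkpoints)
    bounds = [-1] + sorted(checkpoints) + [len(mem)]
    seg = max(sum(mem[a + 1:b]) for a, b in zip(bounds, bounds[1:]))
    return kept + seg

best = min(
    (subset for r in range(len(mem) + 1)
     for subset in combinations(range(len(mem)), r)),
    key=peak_memory,
)
print("best checkpoint set:", best, "peak:", peak_memory(best))
```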

[LG-53] MotifBench: A standardized protein design benchmark for motif-scaffolding problems

链接: https://arxiv.org/abs/2502.12479
作者: Zhuoqi Zheng,Bo Zhang,Kieran Didi,Kevin K. Yang,Jason Yim,Joseph L. Watson,Hai-Feng Chen,Brian L. Trippe
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: Associated content available at this http URL

点击查看摘要

Abstract:The motif-scaffolding problem is a central task in computational protein design: Given the coordinates of atoms in a geometry chosen to confer a desired biochemical function (a motif), the task is to identify diverse protein structures (scaffolds) that include the motif and maintain its geometry. Significant recent progress on motif-scaffolding has been made due to computational evaluation with reliable protein structure prediction and fixed-backbone sequence design methods. However, significant variability in evaluation strategies across publications has hindered comparability of results, challenged reproducibility, and impeded robust progress. In response we introduce MotifBench, comprising (1) a precisely specified pipeline and evaluation metrics, (2) a collection of 30 benchmark problems, and (3) an implementation of this benchmark and leaderboard at this http URL. The MotifBench test cases are more difficult compared to earlier benchmarks, and include protein design problems for which solutions are known but on which, to the best of our knowledge, state-of-the-art methods fail to identify any solution.

[LG-54] Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

链接: https://arxiv.org/abs/2502.12465
作者: Dhruv Rohatgi,Adam Block,Audrey Huang,Akshay Krishnamurthy,Dylan J. Foster
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 75 pages

点击查看摘要

Abstract:Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length H increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification – where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor C \geq 1 – we confirm that C indeed grows with H for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve C = O(1). (2) Next-token prediction can be made robust so as to achieve C = \tilde{O}(H), representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer C = \Omega(H). (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve sub-polynomial approximation factor C = e^{(\log H)^{1-\Omega(1)}}; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on C = \Omega(H) in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning algorithm generalizes next-token prediction.

[LG-55] LMN: A Tool for Generating Machine Enforceable Policies from Natural Language Access Control Rules using LLM s

链接: https://arxiv.org/abs/2502.12460
作者: Pratik Sonune(Indian Institute of Technology Kharagpur, India),Ritwik Rai(Indian Institute of Technology Kharagpur, India),Shamik Sural(Indian Institute of Technology Kharagpur, India),Vijayalakshmi Atluri(Rutgers University, Newark, USA),Ashish Kundu(CISCO Research, USA)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Organizations often lay down rules or guidelines called Natural Language Access Control Policies (NLACPs) for specifying who gets access to which information and when. However, these cannot be directly used in a target access control model like Attribute-based Access Control (ABAC). Manually translating the NLACP rules into Machine Enforceable Security Policies (MESPs) is both time-consuming and resource intensive, rendering it infeasible especially for large organizations. Automated machine translation workflows, on the other hand, require information security officers to be adept at using such processes. To effectively address this problem, we have developed a free web-based publicly accessible tool called LMN (LLMs for generating MESPs from NLACPs) that takes an NLACP as input and converts it into a corresponding MESP. Internally, LMN uses GPT-3.5 API calls with an appropriately chosen prompt. Extensive experiments with different prompts and performance metrics firmly establish the usefulness of LMN.
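
The core mechanism described above - an appropriately chosen prompt around GPT-3.5 API calls - can be sketched in a few lines. The prompt wording, model name string, and JSON output schema below are our illustrative assumptions, not LMN's actual internals:

```python
# Hedged sketch of the LMN idea: prompt an LLM to turn a natural-language
# access control rule (NLACP) into a machine-enforceable ABAC policy.
# The prompt and output schema are illustrative assumptions, not LMN's own.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Convert the following natural-language access control rule into an "
    "ABAC policy, as JSON with keys 'subject', 'resource', 'action', and "
    "'condition'.\n\nRule: {rule}\n\nJSON policy:"
)

def nlacp_to_mesp(rule: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(rule=rule)}],
        temperature=0,  # deterministic output aids policy reproducibility
    )
    return response.choices[0].message.content

print(nlacp_to_mesp("Nurses may read patient records of their own ward during a shift."))
```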

[LG-56] DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization

链接: https://arxiv.org/abs/2502.12413
作者: Jiaqi Wang,Yuhang Zhou,Zhixiong Zhang,Qiguang Chen,Yongqiang Chen,James Cheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-distribution generalization is a common problem that expects the model to perform well on distributions that differ from, and may lie far from, the training data. A popular approach to addressing this issue is invariant learning (IL), in which the model is compelled to focus on invariant features instead of spurious features by adding strong constraints during training. However, strong invariant constraints have potential pitfalls. Due to the limited number of diverse environments and over-regularization in the feature space, they may cause a loss of important details in the invariant features while alleviating the spurious correlations, namely over-invariance, which can also degrade generalization performance. We theoretically define over-invariance and observe that this issue occurs in various classic IL methods. To alleviate this issue, we propose a simple approach, Diverse Invariant Learning (DivIL), which adds unsupervised contrastive learning and a random masking mechanism to compensate for the invariant constraints, and which can be applied to various IL methods. Furthermore, we conduct experiments across multiple modalities on 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of the DivIL framework. Our code is available at this https URL.

[LG-57] Incomplete Graph Learning: A Comprehensive Survey

链接: https://arxiv.org/abs/2502.12412
作者: Riting Xia,Huibo Liu,Anchen Li,Xueyan Liu,Yan Zhang,Chunxu Zhang,Bo Yang
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are not robust to missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve more accurate and representative results. In this paper, we conduct a comprehensive review of the literature on incomplete graph learning. Initially, we categorize incomplete graphs and provide precise definitions of relevant concepts, terminologies, and techniques, thereby establishing a solid understanding for readers. Subsequently, we classify incomplete graph learning methods according to the types of incompleteness: (1) attribute-incomplete graph learning methods, (2) attribute-missing graph learning methods, and (3) hybrid-absent graph learning methods. By systematically classifying and summarizing incomplete graph learning methods, we highlight the commonalities and differences among existing approaches, aiding readers in selecting methods and laying the groundwork for further advancements. In addition, we summarize the datasets, incomplete processing modes, evaluation metrics, and application domains used by the current methods. Lastly, we discuss the current challenges and propose future directions for incomplete graph learning, with the aim of stimulating further innovations in this crucial field. To our knowledge, this is the first review dedicated to incomplete graph learning, aiming to offer valuable insights for researchers in related areas. We have developed an online resource to follow relevant research based on this review, available at this https URL

[LG-58] Efficient Neural SDE Training using Wiener-Space Cubature

链接: https://arxiv.org/abs/2502.12395
作者: Luke Snow,Vikram Krishnamurthy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A neural stochastic differential equation (SDE) is an SDE with drift and diffusion terms parametrized by neural networks. The training procedure for neural SDEs consists of optimizing the SDE vector field (neural network) parameters to minimize the expected value of an objective functional on infinite-dimensional path-space. Existing training techniques focus on methods to efficiently compute path-wise gradients of the objective functional with respect to these parameters, then pair this with Monte-Carlo simulation to estimate the expectation, and stochastic gradient descent to optimize. In this work we introduce a novel training technique which bypasses and improves upon Monte-Carlo simulation; we extend results in the theory of Wiener-space cubature to approximate the expected objective functional by a weighted sum of deterministic ODE solutions. This allows us to compute gradients by efficient ODE adjoint methods. Furthermore, we exploit a high-order recombination scheme to drastically reduce the number of ODE solutions necessary to achieve a reasonable approximation. We show that this Wiener-space cubature approach can surpass the O(1/\sqrt{n}) rate of Monte-Carlo simulation, or the O(\log(n)/n) rate of quasi-Monte-Carlo, to achieve an O(1/n) rate under reasonable assumptions.
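
To make the core idea concrete: a cubature formula replaces the Wiener-space expectation with a weighted sum over finitely many deterministic paths, each term of which reduces to an ODE solve. Schematically (notation ours, not the paper's):

```latex
% Expected objective over Wiener paths W, approximated by n cubature
% paths \omega_j with weights \lambda_j; each term is deterministic.
\mathbb{E}_{W}\big[\Phi(X^{\theta}(W))\big]
  \;\approx\; \sum_{j=1}^{n} \lambda_j \,\Phi\big(X^{\theta}(\omega_j)\big),
\qquad \lambda_j \ge 0,\quad \sum_{j=1}^{n} \lambda_j = 1.
```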

[LG-59] Reward-Safety Balance in Offline Safe RL via Diffusion Regularization

链接: https://arxiv.org/abs/2502.12391
作者: Junyu Guo,Zhi Zheng,Donghao Ying,Ming Jin,Shangding Gu,Costas Spanos,Javad Lavaei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Constrained reinforcement learning (RL) seeks high-performance policies under safety constraints. We focus on an offline setting where the agent has only a fixed dataset – common in realistic tasks to prevent unsafe exploration. To address this, we propose Diffusion-Regularized Constrained Offline Reinforcement Learning (DRCORL), which first uses a diffusion model to capture the behavioral policy from offline data and then extracts a simplified policy to enable efficient inference. We further apply gradient manipulation for safety adaptation, balancing the reward objective and constraint satisfaction. This approach leverages high-quality offline data while incorporating safety requirements. Empirical results show that DRCORL achieves reliable safety performance, fast inference, and strong reward outcomes across robot learning tasks. Compared to existing safe offline RL methods, it consistently meets cost limits and performs well with the same hyperparameters, indicating practical applicability in real-world scenarios.

[LG-60] Achieving Upper Bound Accuracy of Joint Training in Continual Learning

链接: https://arxiv.org/abs/2502.12388
作者: Saleh Momeni,Bing Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning has been an active research area in machine learning, focusing on incrementally learning a sequence of tasks. A key challenge is catastrophic forgetting (CF), and most research efforts have been directed toward mitigating this issue. However, a significant gap remains between the accuracy achieved by state-of-the-art continual learning algorithms and the ideal or upper-bound accuracy achieved by training all tasks together jointly. This gap has hindered or even prevented the adoption of continual learning in applications, as accuracy is often of paramount importance. Recently, another challenge, termed inter-task class separation (ICS), was also identified, which spurred a theoretical study into principled approaches for solving continual learning. Further research has shown that by leveraging the theory and the power of large foundation models, it is now possible to achieve upper-bound accuracy, which has been empirically validated using both text and image classification datasets. Continual learning is now ready for real-life applications. This paper surveys the main research leading to this achievement, justifies the approach both intuitively and from neuroscience research, and discusses insights gained.

[LG-61] Scalable Back-Propagation-Free Training of Optical Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2502.12384
作者: Yequan Zhao,Xinling Yu,Xian Xiao,Zhixiong Chen,Ziyue Liu,Geza Kurczveil,Raymond G. Beausoleil,Sijia Liu,Zheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have shown promise in solving partial differential equations (PDEs), with growing interest in their energy-efficient, real-time training on edge devices. Photonic computing offers a potential solution to achieve this goal because of its ultra-high operation speed. However, the lack of photonic memory and the large device sizes prevent training real-size PINNs on photonic chips. This paper proposes a completely back-propagation-free (BP-free) and highly scalable framework for training real-size PINNs on silicon photonic platforms. Our approach involves three key innovations: (1) a sparse-grid Stein derivative estimator to avoid the BP in the loss evaluation of a PINN, (2) a dimension-reduced zeroth-order optimization via tensor-train decomposition to achieve better scalability and convergence in BP-free training, and (3) a scalable on-chip photonic PINN training accelerator design using photonic tensor cores. We validate our numerical methods on both low- and high-dimensional PDE benchmarks. Through circuit simulation based on real device parameters, we further demonstrate the significant performance benefit (e.g., real-time training, huge chip area reduction) of our photonic accelerator.

[LG-62] Locally-Deployed Chain-of-Thought (CoT) Reasoning Model in Chemical Engineering: Starting from 30 Experimental Data

链接: https://arxiv.org/abs/2502.12383
作者: Tianhang Zhou,Yingchun Niu,Xingying Lan,Chunming Xu
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Code is avaliable upon request

点击查看摘要

Abstract:In the field of chemical engineering, traditional data-processing and prediction methods face significant challenges. Machine-learning and large-language models (LLMs) also have their respective limitations. This paper explores the application of the Chain-of-Thought (CoT) reasoning model in chemical engineering, starting from 30 experimental data points. By integrating traditional surrogate models like Gaussian processes and random forests with powerful LLMs such as DeepSeek-R1, a hierarchical architecture is proposed. Two CoT-building methods, Large Language Model-Chain of Thought (LLM-CoT) and Machine Learning-Large Language Model-Chain of Thought (ML-LLM-CoT), are studied. The LLM-CoT combines local models DeepSeek-r1:14b and Qwen2:7b with Ollama. The ML-LLM-CoT integrates a pre-trained Gaussian ML model with the LLM-based CoT framework. Our results show that ML-LLM-CoT is more efficient to construct: only 2 data points required rethinking, for a total of 4 rethinking iterations, whereas LLM-CoT required rethinking at 5 points, for a total of 34 iterations. In predicting the solubility of 20 molecules with dissimilar structures, the number of molecules with a prediction deviation higher than 100% is 7 for the Gaussian model, 6 for LLM-CoT, and 4 for ML-LLM-CoT. These results indicate that ML-LLM-CoT performs better at limiting the number of high-deviation molecules, optimizing the average deviation, and achieving a higher success rate in solubility judgment, providing a more reliable method for chemical engineering and molecular property prediction. This study breaks through the limitations of traditional methods and offers new solutions for rapid property prediction and process optimization in chemical engineering.

[LG-63] DiffuRNN: Harnessing Diffusion Processes for Global Interactions

链接: https://arxiv.org/abs/2502.12381
作者: Jacob Fein-Ashley
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion kernels capture global dependencies. We present DiffuRNN, a novel architecture that reinterprets sequential data processing as a unified diffusion process. Our model integrates adaptive diffusion modules with localized nonlinear updates and a diffusion-inspired attention mechanism. This design enables efficient global information propagation while preserving fine-grained temporal details. DiffuRNN overcomes the limitations of conventional recurrent and transformer models by allowing full parallelization across time steps and supporting robust multi-scale temporal representations. Experiments on benchmark sequence modeling tasks demonstrate that DiffuRNN delivers superior performance and scalability, setting a new standard for global interaction in sequential data.

[LG-64] Positional Encoding in Transformer-Based Time Series Models: A Survey

链接: https://arxiv.org/abs/2502.12370
作者: Habib Irani,Vangelis Metsis
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Recent advancements in transformer-based models have greatly improved time series analysis, providing robust solutions for tasks such as forecasting, anomaly detection, and classification. A crucial element of these models is positional encoding, which allows transformers to capture the intrinsic sequential nature of time series data. This survey systematically examines existing techniques for positional encoding in transformer-based time series models. We investigate a variety of methods, including fixed, learnable, relative, and hybrid approaches, and evaluate their effectiveness in different time series classification tasks. Furthermore, we outline key challenges and suggest potential research directions to enhance positional encoding strategies. By delivering a comprehensive overview and quantitative benchmarking, this survey intends to assist researchers and practitioners in selecting and designing effective positional encoding methods for transformer-based time series models.
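
As a concrete reference point for the "fixed" family the survey covers, the classic sinusoidal encoding of the original Transformer can be written in a few lines of NumPy (a standard construction, not code from the survey):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encoding: even dims use sine, odd dims use cosine."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Added to time-series embeddings before the first transformer layer.
pe = sinusoidal_positional_encoding(seq_len=96, d_model=64)
print(pe.shape)  # (96, 64)
```

Learnable, relative, and hybrid schemes reviewed in the survey replace or augment this fixed matrix.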

[LG-65] ScriptoriumWS: A Code Generation Assistant for Weak Supervision ICLR’23

链接: https://arxiv.org/abs/2502.12366
作者: Tzu-Heng Huang,Catherine Cao,Spencer Schoenberg,Harit Vishwakarma,Nicholas Roberts,Frederic Sala
类目: Machine Learning (cs.LG)
*备注: Appeared in ICLR’23 Deep Learning for Code (DL4C) Workshop 2023 Midwest Machine Learning Symposium

点击查看摘要

Abstract:Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts – and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
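
The weak-supervision loop the paper builds on is easy to picture: several small labeling functions each vote (or abstain), and the votes are aggregated into pseudolabels. Below is a minimal majority-vote aggregator with -1 marking abstention; this is our simplification for illustration, as real systems (and ScriptoriumWS) typically use learned label models:

```python
import numpy as np

ABSTAIN = -1

def lf_contains_urgent(text: str) -> int:   # toy hand-written source
    return 1 if "urgent" in text.lower() else ABSTAIN

def lf_long_message(text: str) -> int:      # toy generated source
    return 0 if len(text.split()) > 30 else ABSTAIN

def majority_vote(texts, lfs, n_classes=2):
    """Aggregate noisy labeling-function votes into pseudolabels."""
    labels = []
    for text in texts:
        votes = [lf(text) for lf in lfs]
        votes = [v for v in votes if v != ABSTAIN]
        if not votes:
            labels.append(ABSTAIN)          # no source fired: leave unlabeled
        else:
            labels.append(int(np.bincount(votes, minlength=n_classes).argmax()))
    return labels

print(majority_vote(["URGENT: reply now", "hello"],
                    [lf_contains_urgent, lf_long_message]))  # [1, -1]
```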

[LG-66] Hovering Flight of Soft-Actuated Insect-Scale Micro Aerial Vehicles using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2502.12355
作者: Yi-Hsuan Hsiao,Wei-Tung Chen,Yun-Sheng Chang,Pulkit Agrawal,YuFeng Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 7 pages, 7 figures, accepted to 2025 IEEE International Conference on Soft Robotics (RoboSoft)

点击查看摘要

Abstract:Soft-actuated insect-scale micro aerial vehicles (IMAVs) pose unique challenges for designing robust and computationally efficient controllers. At the millimeter scale, fast robot dynamics (\sim ms), together with system delay, model uncertainty, and external disturbances, significantly affect flight performance. Here, we design a deep reinforcement learning (RL) controller that addresses system delay and uncertainties. To initialize this neural network (NN) controller, we propose a modified behavior cloning (BC) approach with state-action re-matching to account for delay and domain-randomized expert demonstration to tackle uncertainty. Then we apply proximal policy optimization (PPO) to fine-tune the policy during RL, enhancing performance and smoothing commands. In simulations, our modified BC substantially increases the mean reward compared to baseline BC; and RL with PPO improves flight quality and reduces command fluctuations. We deploy this controller on two different insect-scale aerial robots that weigh 720 mg and 850 mg, respectively. The robots demonstrate multiple successful zero-shot hovering flights, with the longest lasting 50 seconds and root-mean-square errors of 1.34 cm in the lateral direction and 0.05 cm in altitude, marking the first end-to-end deep RL-based flight on soft-driven IMAVs.

[LG-67] Stability-based Generalization Bounds for Variational Inference

链接: https://arxiv.org/abs/2502.12353
作者: Yadi Wei,Roni Khardon
类目: Machine Learning (cs.LG)
*备注: 20 pages, 3 figures

点击查看摘要

Abstract:Variational inference (VI) is widely used for approximate inference in Bayesian machine learning. In addition to this practical success, generalization bounds for variational inference and related algorithms have been developed, mostly through the connection to PAC-Bayes analysis. A second line of work has provided algorithm-specific generalization bounds through stability arguments or using mutual information bounds, and has shown that the bounds are tight in practice, but unfortunately these bounds do not directly apply to approximate Bayesian algorithms. This paper fills this gap by developing algorithm-specific stability based generalization bounds for a class of approximate Bayesian algorithms that includes VI, specifically when using stochastic gradient descent to optimize their objective. As in the non-Bayesian case, the generalization error is bounded by expected parameter differences on a perturbed dataset. The new approach complements PAC-Bayes analysis and can provide tighter bounds in some cases. An experimental illustration shows that the new approach yields non-vacuous bounds on modern neural network architectures and datasets and that it can shed light on performance differences between variant approximate Bayesian algorithms.

[LG-68] Understanding Silent Data Corruption in LLM Training

链接: https://arxiv.org/abs/2502.12340
作者: Jeffrey Ma,Hengzhi Pei,Leonard Lausen,George Karypis
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:As the scale of training large language models (LLMs) increases, one emergent failure is silent data corruption (SDC), where hardware produces incorrect computations without explicit failure signals. In this work, we are the first to investigate the impact of real-world SDCs on LLM training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs. With the help from a cloud computing platform, we access the unhealthy nodes that were swept out from production by automated fleet management. Using deterministic execution via XLA compiler and our proposed synchronization mechanisms, we isolate and analyze the impact of SDC errors on these nodes at three levels: at each submodule computation, at a single optimizer step, and at a training period. Our results reveal that the impact of SDCs on computation varies on different unhealthy nodes. Although in most cases the perturbations from SDCs on submodule computation and gradients are relatively small, SDCs can lead models to converge to different optima with different weights and even cause spikes in the training loss. Our analysis sheds light on further understanding and mitigating the impact of SDCs.

[LG-69] X-IL: Exploring the Design Space of Imitation Learning Policies

链接: https://arxiv.org/abs/2502.12330
作者: Xiaogang Jia,Atalay Donat,Xi Huang,Xuan Zhao,Denis Blessing,Hongyi Zhou,Hanyi Zhang,Han A. Wang,Qian Wang,Rudolf Lioutikov,Gerhard Neumann
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing modern imitation learning (IL) policies requires making numerous decisions, including the selection of feature encoding, architecture, policy representation, and more. As the field rapidly advances, the range of available options continues to grow, creating a vast and largely unexplored design space for IL policies. In this work, we present X-IL, an accessible open-source framework designed to systematically explore this design space. The framework’s modular design enables seamless swapping of policy components, such as backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques (e.g., Score-matching, Flow-matching). This flexibility facilitates comprehensive experimentation and has led to the discovery of novel policy configurations that outperform existing methods on recent robot learning benchmarks. Our experiments demonstrate not only significant performance gains but also provide valuable insights into the strengths and weaknesses of various design choices. This study serves as both a practical reference for practitioners and a foundation for guiding future research in imitation learning.

[LG-70] Adversarial Debiasing for Unbiased Parameter Recovery

链接: https://arxiv.org/abs/2502.12323
作者: Luke C Sanford,Megan Ayers,Matthew Gordon,Eliana Stone
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Advances in machine learning and the increasing availability of high-dimensional data have led to the proliferation of social science research that uses the predictions of machine learning models as proxies for measures of human activity or environmental outcomes. However, prediction errors from machine learning models can lead to bias in the estimates of regression coefficients. In this paper, we show how this bias can arise, propose a test for detecting bias, and demonstrate the use of an adversarial machine learning algorithm in order to de-bias predictions. These methods are applicable to any setting where machine-learned predictions are the dependent variable in a regression. We conduct simulations and empirical exercises using ground truth and satellite data on forest cover in Africa. Using the predictions from a naive machine learning model leads to biased parameter estimates, while the predictions from the adversarial model recover the true coefficients.
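
The failure mode the paper targets - noisy ML predictions used as a regression outcome biasing the estimated coefficient - can be reproduced in a short simulation. Here the "prediction error" is additive noise correlated with the regressor, a deliberately crude stand-in for a naive model's systematic errors (our toy setup, not the paper's forest-cover data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                      # regressor (e.g., a policy variable)
y = 2.0 * x + rng.normal(size=n)            # ground-truth outcome, true slope = 2

# A "naive model" whose prediction error depends on x
# (systematic under-prediction for large x) -- illustrative assumption.
y_hat = y - 0.5 * x + 0.3 * rng.normal(size=n)

def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(ols_slope(x, y))      # ~2.0: unbiased with ground truth
print(ols_slope(x, y_hat))  # ~1.5: biased when the proxy's errors depend on x
```

An adversarially debiased predictor, in the paper's sense, is trained so that its residuals carry no such dependence on the regressors, restoring the true coefficient.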

[LG-71] Mean-Field Bayesian Optimisation

链接: https://arxiv.org/abs/2502.12315
作者: Petar Steinberg,Juliusz Ziomek,Matej Jusup,Ilija Bogunovic
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 16 pages, 5 figures, 2 tables

点击查看摘要

Abstract:We address the problem of optimising the average payoff for a large number of cooperating agents, where the payoff function is unknown and treated as a black box. While standard Bayesian Optimisation (BO) methods struggle with the scalability required for high-dimensional input spaces, we demonstrate how leveraging the mean-field assumption on the black-box function can transform BO into an efficient and scalable solution. Specifically, we introduce MF-GP-UCB, a novel efficient algorithm designed to optimise agent payoffs in this setting. Our theoretical analysis establishes a regret bound for MF-GP-UCB that is independent of the number of agents, contrasting sharply with the exponential dependence observed when naive BO methods are applied. We evaluate our algorithm on a diverse set of tasks, including real-world problems, such as optimising the location of public bikes for a bike-sharing programme, distributing taxi fleets, and selecting refuelling ports for maritime vessels. Empirical results demonstrate that MF-GP-UCB significantly outperforms existing benchmarks, offering substantial improvements in performance and scalability, constituting a promising solution for mean-field, black-box optimisation. The code is available at this https URL.
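
For readers unfamiliar with the GP-UCB family that MF-GP-UCB extends, the acquisition step picks the point maximizing the posterior mean plus a scaled posterior standard deviation. A single-agent sketch follows; the paper's mean-field machinery is what makes this scale to many agents:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def ucb_next_point(gp: GaussianProcessRegressor, candidates: np.ndarray,
                   beta: float = 2.0) -> np.ndarray:
    """Upper-confidence-bound acquisition over a finite candidate set."""
    mu, sigma = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(mu + beta * sigma)]

# Toy loop on a 1D black box
f = lambda x: -np.sin(3 * x) - x ** 2 + 0.7 * x
X = np.array([[0.0], [1.0]])
y = f(X).ravel()
cands = np.linspace(-1, 2, 200).reshape(-1, 1)
for _ in range(10):
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X, y)  # jitter for stability
    x_next = ucb_next_point(gp, cands)
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))
print(X[np.argmax(y)])  # best input found so far
```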

[LG-72] Chaotic Map based Compression Approach to Classification

链接: https://arxiv.org/abs/2502.12302
作者: Harikrishnan N B,Anuja Vats,Nithin Nagaraj,Marius Pedersen
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 2 tables, 4 algorithms

点击查看摘要

Abstract:Modern machine learning approaches often prioritize performance at the cost of increased complexity, computational demands, and reduced interpretability. This paper introduces a novel framework that challenges this trend by reinterpreting learning from an information-theoretic perspective, viewing it as a search for encoding schemes that capture intrinsic data structures through compact representations. Rather than following the conventional approach of fitting data to complex models, we propose a fundamentally different method that maps data to intervals of initial conditions in a dynamical system. Our GLS (Generalized Lüroth Series) coding compression classifier employs skew tent maps - a class of chaotic maps - both for encoding data into initial conditions and for subsequent recovery. The effectiveness of this simple framework is noteworthy, with performance closely approaching that of well-established machine learning methods. On the breast cancer dataset, our approach achieves 92.98% accuracy, comparable to Naive Bayes at 94.74%. While these results do not exceed state-of-the-art performance, the significance of our contribution lies not in outperforming existing methods but in demonstrating that a fundamentally simpler, more interpretable approach can achieve competitive results.
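
The skew tent map at the heart of the GLS coding scheme is a one-parameter piecewise-linear chaotic map; iterating it from an initial condition generates the trajectory used for encoding. A minimal implementation of the map itself (the encoding/decoding details of the paper's classifier are omitted):

```python
def skew_tent(x: float, b: float) -> float:
    """Skew tent map on [0, 1] with skew parameter b in (0, 1)."""
    return x / b if x < b else (1.0 - x) / (1.0 - b)

def trajectory(x0: float, b: float, n: int) -> list[float]:
    """Iterate the map n times from initial condition x0."""
    xs = [x0]
    for _ in range(n):
        xs.append(skew_tent(xs[-1], b))
    return xs

# Nearby initial conditions diverge quickly -- the chaotic sensitivity
# that GLS coding exploits when mapping data to intervals of initial
# conditions and recovering them later.
print(trajectory(0.2000, b=0.47, n=5))
print(trajectory(0.2001, b=0.47, n=5))
```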

[LG-73] On the Computational Tractability of the (Many) Shapley Values AISTATS2025

链接: https://arxiv.org/abs/2502.12295
作者: Reda Marzouk,Shahaf Bassan,Guy Katz,Colin de la Higuera
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*备注: To appear in AISTATS 2025

点击查看摘要

Abstract:Recent studies have examined the computational complexity of computing Shapley additive explanations (also known as SHAP) across various models and distributions, revealing their tractability or intractability in different settings. However, these studies primarily focused on a specific variant called Conditional SHAP, though many other variants exist and address different limitations. In this work, we analyze the complexity of computing a much broader range of such variants, including Conditional, Interventional, and Baseline SHAP, while exploring both local and global computations. We show that both local and global Interventional and Baseline SHAP can be computed in polynomial time for various ML models under Hidden Markov Model distributions, extending popular algorithms such as TreeSHAP beyond empirical distributions. On the downside, we prove intractability results for these variants over a wide range of neural networks and tree ensembles. We believe that our results emphasize the intricate diversity of computing Shapley values, demonstrating how their complexity is substantially shaped by both the specific SHAP variant, the model type, and the distribution.

[LG-74] Healthcare cost prediction for heterogeneous patient profiles using deep learning models with administrative claims data

链接: https://arxiv.org/abs/2502.12277
作者: Mohammad Amin Morid,Olivia R. Liu Sheng
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Problem: How can we design patient cost prediction models that effectively address the challenges of heterogeneity in administrative claims (AC) data to ensure accurate, fair, and generalizable predictions, especially for high-need (HN) patients with complex chronic conditions? Relevance: Accurate and equitable patient cost predictions are vital for developing health management policies and optimizing resource allocation, which can lead to significant cost savings for healthcare payers, including government agencies and private insurers. Addressing disparities in prediction outcomes for HN patients ensures better economic and clinical decision-making, benefiting both patients and payers. Methodology: This study is grounded in socio-technical considerations that emphasize the interplay between technical systems (e.g., deep learning models) and humanistic outcomes (e.g., fairness in healthcare decisions). It incorporates representation learning and entropy measurement to address heterogeneity and complexity in data and patient profiles, particularly for HN patients. We propose a channel-wise deep learning framework that mitigates data heterogeneity by segmenting AC data into separate channels based on types of codes (e.g., diagnosis, procedures) and costs. This approach is paired with a flexible evaluation design that uses multi-channel entropy measurement to assess patient heterogeneity. Results: The proposed channel-wise models reduce prediction errors by 23% compared to single-channel models, leading to 16.4% and 19.3% reductions in overpayments and underpayments, respectively. Notably, the reduction in prediction bias is significantly higher for HN patients, demonstrating effectiveness in handling heterogeneity and complexity in data and patient profiles. This demonstrates the potential for applying channel-wise modeling to domains with similar heterogeneity challenges.

[LG-75] GeneralizeFormer: Layer-Adaptive Model Generation across Test-Time Distribution Shifts WACV2025

链接: https://arxiv.org/abs/2502.12195
作者: Sameer Ambekar,Zehao Xiao,Xiantong Zhen,Cees G. M. Snoek
类目: Machine Learning (cs.LG)
*备注: WACV 2025

点击查看摘要

Abstract:We consider the problem of test-time domain generalization, where a model is trained on several source domains and adjusted on target domains never seen during training. Different from the common methods that fine-tune the model or adjust the classifier parameters online, we propose to generate multiple layer parameters on the fly during inference by a lightweight meta-learned transformer, which we call GeneralizeFormer. The layer-wise parameters are generated per target batch without fine-tuning or online adjustment. By doing so, our method is more effective in dynamic scenarios with multiple target distributions and also avoids forgetting valuable source distribution characteristics. Moreover, by considering layer-wise gradients, the proposed method adapts itself to various distribution shifts. To reduce the computational and time cost, we fix the convolutional parameters while only generating parameters of the Batch Normalization layers and the linear classifier. Experiments on six widely used domain generalization datasets demonstrate the benefits and abilities of the proposed method to efficiently handle various distribution shifts, generalize in dynamic scenarios, and avoid forgetting.

[LG-76] Direct Preference Optimization-Enhanced Multi-Guided Diffusion Model for Traffic Scenario Generation

链接: https://arxiv.org/abs/2502.12178
作者: Seungjun Yu,Kisung Kim,Daejung Kim,Haewook Han,Jinhan Lee
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Diffusion-based models are recognized for their effectiveness in using real-world driving data to generate realistic and diverse traffic scenarios. These models employ guided sampling to incorporate specific traffic preferences and enhance scenario realism. However, guiding the sampling process to conform to traffic rules and preferences can cause deviations from real-world traffic priors, potentially leading to unrealistic behaviors. To address this challenge, we introduce a multi-guided diffusion model that utilizes a novel training strategy to closely adhere to traffic priors, even when employing various combinations of guides. This model adopts a multi-task learning framework, enabling a single diffusion model to process various guide inputs. For increased guided sampling precision, our model is fine-tuned using the Direct Preference Optimization (DPO) algorithm. This algorithm optimizes preferences based on guide scores, effectively navigating the complexities and challenges associated with the expensive and often non-differentiable gradient calculations during the guided sampling fine-tuning process. Evaluated on the nuScenes dataset, our model provides a strong baseline for balancing realism, diversity, and controllability in traffic scenario generation.

[LG-77] Recent Advances of NeuroDiffEq – An Open-Source Library for Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2502.12177
作者: Shuheng Liu,Pavlos Protopapas,David Sondak,Feiyu Chen
类目: Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, submitted to Journal of Open Research Software

点击查看摘要

Abstract:Solving differential equations is a critical challenge across a host of domains. While many software packages efficiently solve these equations using classical numerical approaches, there has been less effort in developing a library for researchers interested in solving such systems using neural networks. With PyTorch as its backend, NeuroDiffEq is a software library that exploits neural networks to solve differential equations. In this paper, we highlight the latest features of the NeuroDiffEq library since its debut. We show that NeuroDiffEq can solve complex boundary value problems in arbitrary dimensions, tackle boundary conditions at infinity, and maintain flexibility for dynamic injection at runtime.
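
For context, the physics-informed idea NeuroDiffEq packages - train a network whose output minimizes the residual of the differential equation at sampled points - can be sketched generically in PyTorch. This is NOT the library's API, just the underlying mechanism, shown for the ODE u'(t) = -u(t) with u(0) = 1 (exact solution e^{-t}):

```python
# Generic PINN sketch (not NeuroDiffEq's interface): minimize the ODE
# residual at random collocation points, with the initial condition
# enforced exactly by reparametrization u(t) = 1 + t * net(t).
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    t = torch.rand(256, 1, requires_grad=True)        # collocation points in [0, 1]
    u = 1.0 + t * net(t)                              # u(0) = 1 holds by construction
    du = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    loss = ((du + u) ** 2).mean()                     # residual of u' = -u
    opt.zero_grad()
    loss.backward()
    opt.step()

t1 = torch.tensor([[1.0]])
print((1.0 + t1 * net(t1)).item())  # ~ exp(-1) ≈ 0.368
```

NeuroDiffEq wraps this loop behind solver interfaces and, per the abstract, extends it to arbitrary dimensions and boundary conditions at infinity.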

[LG-78] Application-oriented automatic hyperparameter optimization for spiking neural network prototyping

链接: https://arxiv.org/abs/2502.12172
作者: Vittorio Fra
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperparameter optimization (HPO) is of paramount importance in the development of high-performance, specialized artificial intelligence (AI) models, ranging from well-established machine learning (ML) solutions to the deep learning (DL) domain and the field of spiking neural networks (SNNs). The latter introduce further complexity due to the neuronal computational units and their additional hyperparameters, whose inadequate setting can dramatically impact the final model performance. At the cost of possible reduced generalization capabilities, the most suitable strategy to fully disclose the power of SNNs is to adopt an application-oriented approach and perform extensive HPO experiments. To facilitate these operations, automatic pipelines are fundamental, and their configuration is crucial. In this document, the Neural Network Intelligence (NNI) toolkit is used as reference framework to present one such solution, with a use case example providing evidence of the corresponding results. In addition, a summary of published works employing the presented pipeline is reported as possible source of insights into application-oriented HPO experiments for SNN prototyping.

[LG-79] Scalable and Robust Physics-Informed Graph Neural Networks for Water Distribution Systems

链接: https://arxiv.org/abs/2502.12164
作者: Inaam Ashraf,André Artelt,Barbara Hammer
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Water distribution systems (WDSs) are an important part of critical infrastructure becoming increasingly significant in the face of climate change and urban population growth. We propose a robust and scalable surrogate deep learning (DL) model to enable efficient planning, expansion, and rehabilitation of WDSs. Our approach incorporates an improved graph neural network architecture, an adapted physics-informed algorithm, an innovative training scheme, and a physics-preserving data normalization method. Evaluation results on a number of WDSs demonstrate that our model outperforms the current state-of-the-art DL model. Moreover, our method allows us to scale the model to bigger and more realistic WDSs. Furthermore, our approach makes the model more robust to out-of-distribution input features (demands, pipe diameters). Hence, our proposed method constitutes a significant step towards bridging the simulation-to-real gap in the use of artificial intelligence for WDSs.

[LG-80] Towards Quantum Tensor Decomposition in Biomedical Applications

链接: https://arxiv.org/abs/2502.13140
作者: Myson Burch,Jiasen Zhang,Gideon Idumah,Hakan Doga,Richard Lartey,Lamis Yehia,Mingrui Yang,Murat Yildirim,Mihriban Karaayvaz,Omar Shehab,Weihong Guo,Ying Ni,Laxmi Parida,Xiaojuan Li,Aritra Bose
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 31 pages, 7 figures

点击查看摘要

Abstract:Tensor decomposition has emerged as a powerful framework for feature extraction in multi-modal biomedical data. In this review, we present a comprehensive analysis of tensor decomposition methods such as Tucker, CANDECOMP/PARAFAC, spiked tensor decomposition, etc. and their diverse applications across biomedical domains such as imaging, multi-omics, and spatial transcriptomics. To systematically investigate the literature, we applied a topic modeling-based approach that identifies and groups distinct thematic sub-areas in biomedicine where tensor decomposition has been used, thereby revealing key trends and research directions. We evaluated challenges related to the scalability of latent spaces along with obtaining the optimal rank of the tensor, which often hinder the extraction of meaningful features from increasingly large and complex datasets. Additionally, we discuss recent advances in quantum algorithms for tensor decomposition, exploring how quantum computing can be leveraged to address these challenges. Our study includes a preliminary resource estimation analysis for quantum computing platforms and examines the feasibility of implementing quantum-enhanced tensor decomposition methods on near-term quantum devices. Collectively, this review not only synthesizes current applications and challenges of tensor decomposition in biomedical analyses but also outlines promising quantum computing strategies to enhance its impact on deriving actionable insights from complex biomedical data.
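
As a quick illustration of the CANDECOMP/PARAFAC (CP) decomposition the review surveys, a rank-R CP factorization approximates a multi-way array by a sum of R rank-one outer products. A sketch using the TensorLy library follows (our illustration under the assumption of TensorLy's parafac interface; verify against the library's docs):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Toy 3-way tensor, e.g., (patients x genes x conditions)
rng = np.random.default_rng(0)
X = tl.tensor(rng.random((20, 15, 4)))

# Rank-3 CP: X ≈ sum of 3 rank-one outer products of factor columns
cp = parafac(X, rank=3)
X_hat = tl.cp_to_tensor(cp)
print(tl.norm(X - X_hat) / tl.norm(X))  # relative reconstruction error
```

Choosing the rank and scaling such factorizations to large biomedical tensors are exactly the challenges the review discusses, including the prospective quantum speedups.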

[LG-81] A Neural Difference-of-Entropies Estimator for Mutual Information

链接: https://arxiv.org/abs/2502.13085
作者: Haoran Ni,Martin Lotz
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 23 pages, 17 figures

点击查看摘要

Abstract:Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.
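
The "difference of entropies" in the title refers to the standard decomposition of mutual information, each term of which the estimator approximates with a flow-based density model:

```latex
% Mutual information as a difference of entropies; the estimator fits
% normalizing flows to approximate each (conditional) density term.
I(X;Y) \;=\; H(Y) - H(Y \mid X)
       \;=\; -\,\mathbb{E}\big[\log p(Y)\big]
             + \mathbb{E}\big[\log p(Y \mid X)\big].
```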

[LG-82] Benchmarking MedMNIST dataset on real quantum hardware

链接: https://arxiv.org/abs/2502.13056
作者: Gurinder Singh,Hongni Jin,Kenneth M. Merz Jr
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum machine learning (QML) has emerged as a promising domain that leverages the computational capabilities of quantum systems to solve complex classification tasks. In this work, we present the first comprehensive QML study benchmarking MedMNIST - a diverse collection of medical imaging datasets - on 127-qubit real IBM quantum hardware, to evaluate the feasibility and performance of quantum models (without any classical neural networks) in practical applications. This study explores recent advancements in quantum computing, such as device-aware quantum circuits and error suppression and mitigation, for medical image classification. Our methodology comprises the following stages: preprocessing, generation of noise-resilient and hardware-efficient quantum circuits, optimization/training of the quantum circuits on classical hardware, and inference on real IBM quantum hardware. First, we preprocess all input images to reduce their spatial dimension owing to quantum hardware limitations. We generate hardware-efficient quantum circuits, informed by backend properties, that are expressive enough to learn complex patterns for medical image classification. After classical optimization of the QML models, we perform inference on real quantum hardware. We also incorporate advanced error suppression and mitigation techniques into our QML workflow, including dynamical decoupling (DD), gate twirling, and matrix-free measurement mitigation (M3), to mitigate the effects of noise and improve classification performance. The experimental results showcase the potential of quantum computing for medical imaging and establish a benchmark for future advancements in QML applied to healthcare.

[LG-83] Asymptotic Optimism of Random-Design Linear and Kernel Regression Models

链接: https://arxiv.org/abs/2502.12999
作者: Hengrui Luo,Yunzhang Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 56 pages;

点击查看摘要

Abstract:We derive the closed-form asymptotic optimism of linear regression models under random designs and generalize it to kernel ridge regression. Using scaled asymptotic optimism as a generic predictive model complexity measure, we study the fundamentally different behaviors of the linear regression model, the neural tangent kernel (NTK) regression model, and three-layer fully connected neural networks (NNs). Our contribution is two-fold: we provide theoretical grounds for using scaled optimism as a model predictive complexity measure, and we show empirically that NNs with ReLUs behave differently from kernel models under this measure. With resampling techniques, we can also compute the optimism for regression models on real data.
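
For orientation, optimism in the classical (Efron) sense is the expected gap between prediction error and apparent training error; this is our schematic rendering of the quantity whose asymptotics the paper derives, not notation taken from it:

```latex
% Optimism: expected prediction error minus expected training error,
% so that train error + optimism estimates generalization error.
\mathrm{Opt} \;=\; \mathbb{E}\big[\mathrm{Err}\big]
              \;-\; \mathbb{E}\big[\overline{\mathrm{err}}_{\mathrm{train}}\big].
```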

[LG-84] Statistically Significant kNNAD by Selective Inference

链接: https://arxiv.org/abs/2502.12978
作者: Mizuki Niihori,Teruyuki Katsuoka,Tomohiro Shiraishi,Shuichi Nishino,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 40 pages, 11 figures

点击查看摘要

Abstract:In this paper, we investigate the problem of unsupervised anomaly detection using the k-Nearest Neighbor method. The k-Nearest Neighbor Anomaly Detection (kNNAD) is a simple yet effective approach for identifying anomalies across various domains and fields. A critical challenge in anomaly detection, including kNNAD, is appropriately quantifying the reliability of detected anomalies. To address this, we formulate kNNAD as a statistical hypothesis test and quantify the probability of false detection using p-values. The main technical challenge lies in performing both anomaly detection and statistical testing on the same data, which hinders correct p-value calculation within the conventional statistical testing framework. To resolve this issue, we introduce a statistical hypothesis testing framework called Selective Inference (SI) and propose a method named Statistically Significant kNNAD (Stat-kNNAD). By leveraging SI, the Stat-kNNAD method ensures that detected anomalies are statistically significant with theoretical guarantees. The proposed Stat-kNNAD method is applicable to anomaly detection in both the original feature space and latent feature spaces derived from deep learning models. Through numerical experiments on synthetic data and applications to industrial product anomaly detection, we demonstrate the validity and effectiveness of the Stat-kNNAD method.
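
The base detector underlying Stat-kNNAD is simply the distance to the k-th nearest training point; the paper's contribution is the selective-inference p-value layered on top. A sketch of the base score (the p-value machinery is omitted):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(train: np.ndarray, test: np.ndarray, k: int = 5) -> np.ndarray:
    """Anomaly score = distance to the k-th nearest training point."""
    nn = NearestNeighbors(n_neighbors=k).fit(train)
    dists, _ = nn.kneighbors(test)       # shape (n_test, k), sorted ascending
    return dists[:, -1]

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
test = np.vstack([rng.normal(size=(5, 2)), [[6.0, 6.0]]])  # last point is anomalous
print(knn_anomaly_scores(train, test, k=5))  # the outlier gets a much larger score
```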

[LG-85] A Simplified and Numerically Stable Approach to the BG/NBD Churn Prediction model

链接: https://arxiv.org/abs/2502.12912
作者: Dylan Zammit,Christopher Zerafa
类目: Other Statistics (stat.OT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 4 pages, numerically stable BG/NBD

点击查看摘要

Abstract:This study extends the BG/NBD churn probability model, addressing its limitations in industries where customer behaviour is often influenced by seasonal events and possibly high purchase counts. We propose a modified definition of churn, considering a customer to have churned if they make no purchases within M days. Our contribution is twofold: First, we simplify the general equation for the specific case of zero purchases within M days. Second, we derive an alternative expression using numerical techniques to mitigate numerical overflow or underflow issues. This approach provides a more practical and robust method for predicting customer churn in industries with irregular purchase patterns.
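
The kind of numerical safeguard the paper derives can be illustrated with the standard log-sum-exp trick: likelihood terms that would overflow when exponentiated individually are combined in log space instead (a generic illustration, not the paper's exact expression):

```python
import numpy as np
from scipy.special import logsumexp

# Naive: exponentiating large log-terms overflows before any ratio forms.
log_terms = np.array([800.0, 801.0, 799.5])   # e.g., log-likelihood components
# naive = np.exp(log_terms).sum()             # -> inf (float64 overflow)

# Stable: combine in log space; exponentiate only bounded final quantities.
log_total = logsumexp(log_terms)
weights = np.exp(log_terms - log_total)       # normalized, no overflow
print(log_total, weights)
```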

[LG-86] Composition and Control with Distilled Energy Diffusion Models and Sequential Monte Carlo AISTATS2025

链接: https://arxiv.org/abs/2502.12786
作者: James Thornton,Louis Bethune,Ruixiang Zhang,Arwen Bradley,Preetum Nakkiran,Shuangfei Zhai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Initial submission to openreview on October 3, 2024 ( this https URL;) accepted to AISTATS 2025

点击查看摘要

Abstract:Diffusion models may be formulated as a time-indexed sequence of energy-based models, where the score corresponds to the negative gradient of an energy function. As opposed to learning the score directly, an energy parameterization is attractive as the energy itself can be used to control generation via Monte Carlo samplers. Architectural constraints and training instability in energy parameterized models have so far yielded inferior performance compared to directly approximating the score or denoiser. We address these deficiencies by introducing a novel training regime for the energy function through distillation of pre-trained diffusion models, resembling a Helmholtz decomposition of the score vector field. We further showcase the synergies between energy and score by casting the diffusion sampling procedure as a Feynman-Kac model where sampling is controlled using potentials from the learnt energy functions. The Feynman-Kac model formalism enables composition and low temperature sampling through sequential Monte Carlo.

[LG-87] Green LIME: Improving AI Explainability through Design of Experiments

链接: https://arxiv.org/abs/2502.12753
作者: Alexandra Stadler,Werner G. Müller,Radoslav Harman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:In artificial intelligence (AI), the complexity of many models and processes often surpasses human interpretability, making it challenging to understand why a specific prediction is made. This lack of transparency is particularly problematic in critical fields like healthcare, where trust in a model’s predictions is paramount. As a result, the explainability of machine learning (ML) and other complex models has become a key area of focus. Efforts to improve model interpretability often involve experimenting with AI systems and approximating their behavior through simpler mechanisms. However, these procedures can be resource-intensive. Optimal design of experiments, which seeks to maximize the information obtained from a limited number of observations, offers promising methods for improving the efficiency of these explainability techniques. To demonstrate this potential, we explore Local Interpretable Model-agnostic Explanations (LIME), a widely used method introduced by Ribeiro, Singh, and Guestrin, 2016. LIME provides explanations by generating new data points near the instance of interest and passing them through the model. While effective, this process can be computationally expensive, especially when predictions are costly or require many samples. LIME is highly versatile and can be applied to a wide range of models and datasets. In this work, we focus on models involving tabular data, regression tasks, and linear models as interpretable local approximations. By utilizing optimal design of experiments’ techniques, we reduce the number of function evaluations of the complex model, thereby reducing the computational effort of LIME by a significant amount. We consider this modified version of LIME to be energy-efficient or “green”.
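
LIME's sampling loop, which the authors make more sample-efficient with optimal experimental design, reduces to: perturb around the instance, query the black box, weight by proximity, fit a weighted linear surrogate. A compact version for tabular regression (standard LIME mechanics, not the paper's optimized design):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(black_box, x, n_samples=500, scale=0.5, kernel_width=1.0):
    """Fit a local weighted linear surrogate around instance x."""
    rng = np.random.default_rng(0)
    Z = x + scale * rng.normal(size=(n_samples, x.shape[0]))  # perturbations
    y = black_box(Z)                                          # costly queries
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                 # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    return surrogate.coef_                                    # local feature effects

f = lambda Z: np.sin(Z[:, 0]) + Z[:, 1] ** 2                  # toy black box
print(lime_explain(f, np.array([0.0, 1.0])))                  # ~ [cos(0), 2] = [1, 2]
```

The paper's "green" variant keeps this surrogate but chooses the perturbations Z by optimal design, so far fewer black-box queries are needed.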

[LG-88] Cross-Domain Continual Learning for Edge Intelligence in Wireless ISAC Networks

链接: https://arxiv.org/abs/2502.12736
作者: Jingzhi Hu,Xin Li,Zhou Su,Jun Luo
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In wireless networks with integrated sensing and communications (ISAC), edge intelligence (EI) is expected to be developed at edge devices (ED) for sensing user activities based on channel state information (CSI). However, because the CSI is highly specific to users’ characteristics, the CSI-activity relationship is notoriously domain dependent, essentially demanding that EI learn from sufficient datasets across various domains in order to gain cross-domain sensing capability. This poses a crucial challenge owing to the EDs’ limited resources, for which storing datasets across all domains will be a significant burden. In this paper, we propose the EdgeCL framework, enabling the EI to continually learn-then-discard each incoming dataset, while remaining resilient to catastrophic forgetting. We design a transformer-based discriminator for handling sequences of noisy and nonequispaced CSI samples. Besides, we propose a distilled core-set based knowledge retention method with robustness-enhanced optimization to train the discriminator, preserving its performance for previous domains while preventing future forgetting. Experimental evaluations show that EdgeCL achieves 89% of the performance of cumulative training while consuming only 3% of its memory, mitigating forgetting by 79%.

[LG-89] Neuromorphic Readout for Hadron Calorimeters

链接: https://arxiv.org/abs/2502.12693
作者: Enrico Lupi(1 and 2),Abhishek(7),Max Aehle(6 and 11),Muhammad Awais(1, 2, 3 and 11),Alessandro Breccia(2),Riccardo Carroccio(2),Long Chen(6 and 11),Abhijit Das(9),Andrea De Vita(1 and 2),Tommaso Dorigo(1, 3, 4 and 11),Nicolas R. Gauger(6 and 11),Ralf Keidel(8 and 11),Jan Kieseler(8),Anders Mikkelsen(9),Federico Nardi(2 and 10),Xuan Tung Nguyen(1 and 6),Fredrik Sandin(3 and 11),Kylian Schmidt(8),Pietro Vischia(4, 5 and 11),Joseph Willmore(1) ((1) INFN sezione di Padova, Italy, (2) Università di Padova dipartimento di Fisica e Astronomia, Italy, (3) Luleå University of Technology, Sweden, (4) Universal Scientific Education and Research Network, Italy, (5) Universidad de Oviedo and ICTEA, Spain, (6) University of Kaiserslautern-Landau (RPTU), Germany (7) National Institute of Science Education and Research, India, (8) Karlsruhe Institute of Technology, Germany, (9) Department of Physics and NanoLund, Lund University, Sweden, (10) Laboratoire de Physique Clermont Auvergne, France, (11) MODE Collaboration)
类目: High Energy Physics - Experiment (hep-ex); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 12 figures, submitted to MDPI Particles

点击查看摘要

Abstract:We simulate hadrons impinging on a homogeneous lead-tungstate (PbWO4) calorimeter to investigate how the resulting light yield and its temporal structure, as detected by an array of light-sensitive sensors, can be processed by a neuromorphic computing system. Our model encodes temporal photon distributions as spike trains and employs a fully connected spiking neural network to estimate the total deposited energy, as well as the position and spatial distribution of the light emissions within the sensitive material. The extracted primitives offer valuable topological information about the shower development in the material, achieved without requiring a segmentation of the active medium. A potential nanophotonic implementation using III-V semiconductor nanowires is discussed. It can be both fast and energy efficient.

[LG-90] Federated Variational Inference for Bayesian Mixture Models

链接: https://arxiv.org/abs/2502.12684
作者: Jackie Rao,Francesca L. Crowe,Tom Marshall,Sylvia Richardson,Paul D. W. Kirk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We present a federated learning approach for Bayesian model-based clustering of large-scale binary and categorical datasets. We introduce a principled ‘divide and conquer’ inference procedure using variational inference with local merge and delete moves within batches of the data in parallel, followed by ‘global’ merge moves across batches to find global clustering structures. We show that these merge moves require only summaries of the data in each batch, enabling federated learning across local nodes without requiring the full dataset to be shared. Empirical results on simulated and benchmark datasets demonstrate that our method performs well in comparison to existing clustering algorithms. We validate the practical utility of the method by applying it to large scale electronic health record (EHR) data.

[LG-91] NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation ICLR2025

链接: https://arxiv.org/abs/2502.12638
作者: Zhiyuan Liu,Yanchen Luo,Han Huang,Enzhi Zhang,Sihang Li,Junfeng Fang,Yaorui Shi,Xiang Wang,Kenji Kawaguchi,Tat-Seng Chua
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: ICLR 2025, 10 pages

点击查看摘要

Abstract:3D molecule generation is crucial for drug discovery and material design. While prior efforts focus on 3D diffusion models for their benefits in modeling continuous 3D conformers, they overlook the advantages of 1D SELFIES-based Language Models (LMs), which can generate 100% valid molecules and leverage the billion-scale 1D molecule datasets. To combine these advantages for 3D molecule generation, we propose a foundation model – NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation. NExT-Mol uses an extensively pretrained molecule LM for 1D molecule generation, and subsequently predicts the generated molecule’s 3D conformers with a 3D diffusion model. We enhance NExT-Mol’s performance by scaling up the LM’s model size, refining the diffusion neural architecture, and applying 1D to 3D transfer learning. Notably, our 1D molecule LM significantly outperforms baselines in distributional similarity while ensuring validity, and our 3D diffusion model achieves leading performances in conformer prediction. Given these improvements in 1D and 3D modeling, NExT-Mol achieves a 26% relative improvement in 3D FCD for de novo 3D generation on GEOM-DRUGS, and a 13% average relative gain for conditional 3D generation on QM9-2014. Our codes and pretrained checkpoints are available at this https URL.

[LG-92] Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation

Link: https://arxiv.org/abs/2502.12607
Authors: Tatsuya Aoyama, Hanting Yang, Hiroyuki Hanada, Satoshi Akahane, Tomonari Tanaka, Yoshito Okura, Yu Inatsu, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We propose Duality Gap KIP (DGKIP), an extension of the Kernel Inducing Points (KIP) method for dataset distillation. While existing dataset distillation methods often rely on bi-level optimization, DGKIP eliminates the need for such optimization by leveraging duality theory in convex programming. The KIP method has been introduced as a way to avoid bi-level optimization; however, it is limited to the squared loss and does not support other loss functions (e.g., cross-entropy or hinge loss) that are more suitable for classification tasks. DGKIP addresses this limitation by exploiting an upper bound on parameter changes after dataset distillation using the duality gap, enabling its application to a wider range of loss functions. We also characterize theoretical properties of DGKIP by providing upper bounds on the test error and prediction consistency after dataset distillation. Experimental results on standard benchmarks such as MNIST and CIFAR-10 demonstrate that DGKIP retains the efficiency of KIP while offering broader applicability and robust performance.
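
For orientation, the standard duality-gap argument that this kind of bound builds on can be written schematically as follows (textbook convex analysis, not the paper's exact statement). Let P be a μ-strongly convex primal objective with minimizer θ*, and D its dual; then for any feasible (θ, λ):

```latex
% Schematic duality-gap bound (standard convex analysis, not DGKIP's exact form).
\[
  \frac{\mu}{2}\,\lVert \theta - \theta^\star \rVert^2
  \;\le\; P(\theta) - P(\theta^\star)
  \;\le\; P(\theta) - D(\lambda)
  \;=:\; \mathrm{gap}(\theta, \lambda).
\]
```

So the computable gap upper-bounds how far the parameters can move after the distilled data change, for any loss fitting this convex template, which is what frees the approach from the squared-loss restriction of KIP.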

[LG-93] Scientific Machine Learning of Flow Resistance Using Universal Shallow Water Equations with Differentiable Programming

Link: https://arxiv.org/abs/2502.12396
Authors: Xiaofeng Liu, Yalan Song
Categories: Fluid Dynamics (physics.flu-dyn); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Shallow water equations (SWEs) are the backbone of most hydrodynamics models for flood prediction, river engineering, and many other water resources applications. The estimation of flow resistance, i.e., Manning’s roughness coefficient n, is crucial for ensuring model accuracy, and has previously been determined using empirical formulas or tables. To better account for temporal and spatial variability in channel roughness, inverse modeling of n using observed flow data is more reliable and adaptable; however, it is challenging when using traditional SWE solvers. Based on the concept of universal differential equations (UDEs), which combine physics-based differential equations with neural networks (NNs), we developed a universal SWEs (USWEs) solver, Hydrograd, for hybrid hydrodynamics modeling. It performs accurate forward simulations, supports automatic differentiation (AD) for gradient-based sensitivity analysis and parameter inversion, and enables scientific machine learning for physics discovery. In this work, we first validated the accuracy of its forward modeling, then applied it to a real-world case to demonstrate the ability of USWEs to capture model sensitivity (gradients) and perform inverse modeling of Manning’s n. Furthermore, we used an NN to learn a universal relationship between n, hydraulic parameters, and flow in a real river channel. Unlike inverse modeling using surrogate models, Hydrograd uses a two-dimensional SWEs solver as its physics backbone, which eliminates the need for data-intensive pretraining and resolves the generalization problem when applied to out-of-sample scenarios. This differentiable modeling approach, with seamless integration with NNs, provides a new pathway for solving complex inverse problems and discovering new physics in hydrodynamics.
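
A toy illustration of the differentiable-inversion idea (not Hydrograd itself, which wraps a full 2D SWE solver): treat Manning's equation for a wide rectangular channel as the physics, and recover n from an observed discharge by gradient descent through the model via JAX autodiff. The channel geometry, learning rate, and true n below are made-up numbers.

```python
# Differentiable parameter inversion sketch: recover Manning's n from Q_obs.
import jax
import jax.numpy as jnp

width, slope, depth = 10.0, 1e-3, 2.0       # assumed channel (SI units)

def discharge(n):
    area = width * depth
    radius = depth                           # wide-channel approximation R ~ h
    velocity = (1.0 / n) * radius ** (2.0 / 3.0) * jnp.sqrt(slope)
    return area * velocity                   # Manning: Q = A * V

q_obs = discharge(0.035)                     # synthetic "observation"

def loss(n):
    return (discharge(n) - q_obs) ** 2

grad = jax.grad(loss)                        # AD through the physics model

n = 0.02                                     # initial guess
for _ in range(300):
    n = n - 1e-7 * grad(n)                   # plain gradient descent
print(f"recovered n = {float(n):.4f} (true value 0.035)")
```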

[LG-94] Stability Bounds for Smooth Optimal Transport Maps and their Statistical Implications

Link: https://arxiv.org/abs/2502.12326
Authors: Sivaraman Balakrishnan, Tudor Manole
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments: 26 pages, 1 figure

Click to view abstract

Abstract:We study estimators of the optimal transport (OT) map between two probability distributions. We focus on plugin estimators derived from the OT map between estimates of the underlying distributions. We develop novel stability bounds for OT maps which generalize those in past work, and allow us to reduce the problem of optimally estimating the transport map to that of optimally estimating densities in the Wasserstein distance. In contrast, past work provided a partial connection between these problems and relied on regularity theory for the Monge-Ampère equation to bridge the gap, a step which required unnatural assumptions to obtain sharp guarantees. We also provide some new insights into the connections between stability bounds which arise in the analysis of plugin estimators and growth bounds for the semi-dual functional which arise in the analysis of Brenier potential-based estimators of the transport map. We illustrate the applicability of our new stability bounds by revisiting the smooth setting studied by Manole et al., analyzing two of their estimators under more general conditions. Critically, our bounds do not require smoothness or boundedness assumptions on the underlying measures. As an illustrative application, we develop and analyze a novel tuning-parameter-free estimator for the OT map between two strongly log-concave distributions.
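
Schematically, the reduction described here can be pictured as a bound of the following shape, with T̂ the plugin map between density estimates P̂, Q̂ and T₀ the true map (illustrative form only; the paper's actual bounds and their conditions differ):

```latex
% Illustrative shape of a plugin stability bound (not the paper's statement):
% map-estimation error controlled by Wasserstein density-estimation error.
\[
  \lVert \hat{T} - T_0 \rVert_{L^2(P)}^2
  \;\lesssim\; W_2^2(\hat{P}, P) \;+\; W_2^2(\hat{Q}, Q).
\]
```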

[LG-95] Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization

Link: https://arxiv.org/abs/2502.12298
Authors: Aditya Ranganath, Mukesh Singhal, Roummel Marcia
Categories: Optimization and Control (math.OC); Information Theory (cs.IT); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*Comments: submitted to TMLR

Click to view abstract

Abstract:Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning due to their computational efficiency and low-storage memory requirements. However, these methods do not exploit curvature information. Consequently, iterates can converge to saddle points or poor local minima. On the other hand, quasi-Newton methods compute Hessian approximations which exploit this information with a comparable computational budget. Quasi-Newton methods re-use previously computed iterates and gradients to compute a low-rank structured update. The most widely used quasi-Newton update is L-BFGS, which guarantees a positive semi-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions in deep neural networks (DNNs) are non-convex, where the Hessian is potentially non-positive definite. In this paper, we propose using a limited-memory symmetric rank-one quasi-Newton approach which allows for indefinite Hessian approximations, enabling directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularized cubics approach, which generates a sequence of cubic subproblems that have closed-form solutions with suitable regularization choices. We investigate the performance of our proposed method on autoencoders and feed-forward neural network models and compare our approach to state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.
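
The two ingredients being combined are, in textbook form (the paper's limited-memory and adaptive variants add further structure): the SR1 update, with s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k), and the cubic-regularized subproblem solved at each iterate.

```latex
% Textbook SR1 update and cubic-regularized step (schematic forms).
\[
  B_{k+1} \;=\; B_k \;+\;
  \frac{(y_k - B_k s_k)(y_k - B_k s_k)^{\top}}{(y_k - B_k s_k)^{\top} s_k},
  \qquad
  s_k \;=\; \arg\min_{s}\;
  \nabla f(x_k)^{\top} s + \tfrac{1}{2}\, s^{\top} B_k s
  + \tfrac{\sigma_k}{3}\, \lVert s \rVert^3 .
\]
```

Unlike BFGS, the SR1 update can yield an indefinite B_{k+1}, which is precisely what allows directions of negative curvature to be exploited; the cubic term keeps the subproblem well-posed even then.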

[LG-96] Multi-dimensional Test Design

Link: https://arxiv.org/abs/2502.12264
Authors: Xiaoyun Qiu, Liren Shan
Categories: Theoretical Economics (econ.TH); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:How should one jointly design tests and the arrangement of agencies to administer these tests (testing procedure)? To answer this question, we analyze a model where a principal must use multiple tests to screen an agent with a multi-dimensional type, knowing that the agent can change his type at a cost. We identify a new tradeoff between setting difficult tests and using a difficult testing procedure. We compare two settings: (1) the agent only misrepresents his type (manipulation) and (2) the agent improves his actual type (investment). Examples include interviews, regulations, and data classification. We show that in the manipulation setting, stringent tests combined with an easy procedure, i.e., offering tests sequentially in a fixed order, is optimal. In contrast, in the investment setting, non-stringent tests with a difficult procedure, i.e., offering tests simultaneously, is optimal; however, under mild conditions offering them sequentially in a random order may be as good. Our results suggest that whether the agent manipulates or invests in his type determines which arrangement of agencies is optimal.

[LG-97] On the Learnability of Knot Invariants: Representation Predictability and Neural Similarity

Link: https://arxiv.org/abs/2502.12243
Authors: Audrey Lindsay, Fabian Ruehle
Categories: Geometric Topology (math.GT); Machine Learning (cs.LG)
*Comments: 14 pages, 3 figures, 1 table

Click to view abstract

Abstract:We analyze different aspects of neural network predictions of knot invariants. First, we investigate the impact of different knot representations on the prediction of invariants and find that braid representations generally work best. Second, we study which knot invariants are easy to learn, with invariants derived from hyperbolic geometry and knot diagrams being very easy to learn, while invariants derived from topological or homological data are harder. The Arf invariant could not be learned from any representation. Third, we propose a cosine similarity score based on gradient saliency vectors, and a joint misclassification score to uncover similarities in neural networks trained to predict related topological invariants.
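
A minimal sketch of a saliency-based similarity score in the spirit described (the paper's exact construction may differ): stack per-example input gradients of two trained models on shared inputs and average the cosine similarity of the resulting saliency vectors. The toy gradient functions below are stand-ins for real trained networks.

```python
# Cosine similarity of gradient saliency vectors between two models.
import numpy as np

def saliency(model_grad_fn, inputs):
    """Stack per-example input gradients into saliency vectors."""
    return np.stack([model_grad_fn(x) for x in inputs])

def cosine_similarity_score(sal_a, sal_b):
    num = np.sum(sal_a * sal_b, axis=1)
    den = np.linalg.norm(sal_a, axis=1) * np.linalg.norm(sal_b, axis=1)
    return float(np.mean(num / den))

# Stand-in gradient functions for two networks trained on related invariants.
rng = np.random.default_rng(1)
w1, w2 = rng.normal(size=8), rng.normal(size=8)
grad_f = lambda x: w1 * np.sign(x)   # toy input-gradient of model 1
grad_g = lambda x: w2 * np.sign(x)   # toy input-gradient of model 2

xs = rng.normal(size=(32, 8))
print("similarity:", cosine_similarity_score(saliency(grad_f, xs),
                                             saliency(grad_g, xs)))
```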

[LG-98] Towards Efficient Molecular Property Optimization with Graph Energy Based Models

Link: https://arxiv.org/abs/2502.12219
Authors: Luca Miglior, Lorenzo Simone, Marco Podda, Davide Bacciu
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*Comments: Accepted at ESANN 2025

Click to view abstract

Abstract:Optimizing chemical properties is a challenging task due to the vastness and complexity of chemical space. Here, we present a generative energy-based architecture for implicit chemical property optimization, designed to efficiently generate molecules that satisfy target properties without explicit conditional generation. We use Graph Energy Based Models and a training approach that does not require property labels. We validated our approach on well-established chemical benchmarks, showing results superior to state-of-the-art methods and demonstrating robustness and efficiency for de novo drug design.

[LG-99] Antimatter Annihilation Vertex Reconstruction with Deep Learning for ALPHA-g Radial Time Projection Chamber

Link: https://arxiv.org/abs/2502.12169
Authors: Ashley Ferreira, Mahip Singh, Yukiya Saito, Andrea Capra, Ina Carli, Daniel Duque Quiceno, Wojciech T. Fedorko, Makoto C. Fujiwara, Muyan Li, Lars Martin, Gareth Smith, Anqui Xu
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*Comments:

Click to view abstract

Abstract:The ALPHA-g experiment at CERN aims to precisely measure the terrestrial gravitational acceleration of antihydrogen atoms. A radial Time Projection Chamber (rTPC), which surrounds the ALPHA-g magnetic trap, is employed to determine the annihilation location, called the vertex. The standard approach requires identifying the trajectories of the ionizing particles in the rTPC from the location of their interaction in the gas (spacepoints), and inferring the vertex positions by finding the point where those trajectories (helices) pass closest to one another. In this work, we present a novel approach to vertex reconstruction using an ensemble of models based on the PointNet deep learning architecture. The newly developed model, PointNet Ensemble for Annihilation Reconstruction (PEAR), directly learns the relation between the location of the vertices and the rTPC spacepoints, thus eliminating the need to identify and fit the particle tracks. PEAR shows strong performance in reconstructing vertical vertex positions from simulated data, outperforming the standard approach on all metrics considered. Furthermore, the deep learning approach can reconstruct the vertical vertex position when the standard approach fails.
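
A shape-only sketch of the PointNet idea PEAR builds on (untrained toy weights, not the ALPHA-g model): a shared per-point MLP embeds each rTPC spacepoint, a symmetric max-pool makes the result invariant to point ordering, and a small head regresses the vertical vertex position.

```python
# PointNet-style set regression: per-point MLP -> max pool -> readout.
import numpy as np

rng = np.random.default_rng(0)
n_points, d_in, d_hid = 128, 4, 64   # e.g., (x, y, z, t) per spacepoint

w1 = rng.normal(scale=0.1, size=(d_in, d_hid))
w2 = rng.normal(scale=0.1, size=(d_hid, 1))

def predict_vertex_z(spacepoints):
    h = np.maximum(spacepoints @ w1, 0.0)  # shared per-point MLP (ReLU)
    g = h.max(axis=0)                      # symmetric pooling over points
    return float(g @ w2)                   # regression head -> vertex z

spacepoints = rng.normal(size=(n_points, d_in))  # fake hits
print("predicted vertex z:", predict_vertex_z(spacepoints))
```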

Information Retrieval

[IR-0] Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search

Link: https://arxiv.org/abs/2502.12974
Authors: Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, Maosong Sun
Categories: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Recent dense retrievers usually thrive on the emergent capabilities of Large Language Models (LLMs), using them to encode queries and documents into an embedding space for retrieval. These LLM-based dense retrievers have shown promising performance across various retrieval scenarios. However, relying on a single embedding to represent documents proves less effective in capturing different perspectives of documents for matching. In this paper, we propose the Deliberate Thinking based Dense Retriever (DEBATER), which enhances these LLM-based retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. DEBATER introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations using a continuous chain of thought. To consolidate information from various thinking steps, DEBATER also incorporates the Self Distillation mechanism, which identifies the most informative thinking steps and integrates them into a unified text embedding. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All code is available at this https URL.
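
A loose sketch of the two named mechanisms, shapes only (DEBATER's actual refinement, step scoring, and training differ): a document embedding is refined through several deliberation steps, each step is scored against the consensus of all steps, and the best-scoring steps are averaged into the final embedding.

```python
# Toy chain-of-deliberation + self-distillation-style step selection.
import numpy as np

rng = np.random.default_rng(0)
d, k_steps, k_keep = 16, 4, 2

w_step = rng.normal(scale=0.1, size=(d, d))  # shared refinement transform
doc = rng.normal(size=d)                     # initial document embedding

steps, h = [], doc
for _ in range(k_steps):                     # chain of deliberation
    h = np.tanh(h @ w_step + h)              # residual refinement step
    steps.append(h / np.linalg.norm(h))

# Stand-in for self distillation: keep the steps that agree most with the
# consensus of all steps, then average them into one unified embedding.
consensus = np.mean(steps, axis=0)
scores = np.array([s @ consensus for s in steps])
best = np.argsort(scores)[-k_keep:]
final_embedding = np.mean([steps[i] for i in best], axis=0)
print(final_embedding.shape)                 # (16,)
```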

[IR-1] Introducing Context Information in Lifelong Sequential Modeling using Temporal Convolutional Networks

Link: https://arxiv.org/abs/2502.12634
Authors: Ting Guo, Zhaoyang Yang, Qinsong Zeng, Ming Chen
Categories: Information Retrieval (cs.IR)
*Comments: 10 pages, including 1 page of references, 7 figures

Click to view abstract

Abstract:The importance of lifelong sequential modeling (LSM) is growing in the realm of social media recommendation systems. A key component in this process is the attention module, which derives interest representations with respect to candidate items from the sequence. Typically, attention modules function in a point-wise fashion, concentrating only on the relevance of individual items in the sequence to the candidate item. However, context information from neighboring items, which is useful for more accurately evaluating the significance of each item, has not been taken into account. In this study, we introduce a novel network which employs the Temporal Convolutional Network (TCN) to generate context-aware representations for each item throughout the lifelong sequence. These improved representations are then utilized in the attention module to produce context-aware interest representations. Expanding on this TCN framework, we present an enhancement module which includes multiple TCN layers and their respective attention modules to capture interest representations across different context scopes. Additionally, we incorporate a lightweight sub-network to create convolution filters based on users’ basic profile features. These personalized filters are then applied in the TCN layers instead of the original global filters to produce more user-specific representations. We performed experiments on both a public dataset and a proprietary dataset. The findings indicate that the proposed network surpasses existing methods in terms of prediction accuracy and online performance metrics.
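
A minimal sketch of the central idea with toy shapes (the full network stacks multiple TCN layers, per-scope attention modules, and profile-conditioned filters): a causal temporal convolution mixes each item with its left neighbors before point-wise attention against the candidate item.

```python
# Causal temporal convolution -> context-aware item reps -> attention.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, kernel_size = 20, 8, 3

items = rng.normal(size=(seq_len, d))            # lifelong item embeddings
filt = rng.normal(scale=0.2, size=(kernel_size, d, d))

# Causal convolution: each position sees itself plus kernel_size-1 left
# neighbors, so context from nearby items flows into its representation.
padded = np.vstack([np.zeros((kernel_size - 1, d)), items])
context_aware = np.stack([
    sum(padded[t + j] @ filt[j] for j in range(kernel_size))
    for t in range(seq_len)
])

candidate = rng.normal(size=d)                   # candidate-item embedding
scores = context_aware @ candidate               # point-wise attention logits
weights = np.exp(scores - scores.max())
weights /= weights.sum()
interest = weights @ context_aware               # context-aware interest vec
print(interest.shape)                            # (8,)
```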

[IR-2] From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval

Link: https://arxiv.org/abs/2502.12448
Authors: Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, Kun Gai
Categories: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Discrete tokenizers have emerged as indispensable components in modern machine learning systems, particularly within the context of autoregressive modeling and large language models (LLMs). These tokenizers serve as the critical interface that transforms raw, unstructured data from diverse modalities into discrete tokens, enabling LLMs to operate effectively across a wide range of tasks. Despite their central role in generation, comprehension, and recommendation systems, a comprehensive survey dedicated to discrete tokenizers remains conspicuously absent in the literature. This paper addresses this gap by providing a systematic review of the design principles, applications, and challenges of discrete tokenizers. We begin by dissecting the sub-modules of tokenizers and systematically demonstrate their internal mechanisms to provide a comprehensive understanding of their functionality and design. Building on this foundation, we synthesize state-of-the-art methods, categorizing them into multimodal generation and comprehension tasks, and semantic tokens for personalized recommendations. Furthermore, we critically analyze the limitations of existing tokenizers and outline promising directions for future research. By presenting a unified framework for understanding discrete tokenizers, this survey aims to guide researchers and practitioners in addressing open challenges and advancing the field, ultimately contributing to the development of more robust and versatile AI systems.

[IR-3] Semantica: Decentralized Search using a LLM -Guided Semantic Tree Overlay

Link: https://arxiv.org/abs/2502.10151
Authors: Petru Neague, Quinten Stokkink, Naman Goel, Johan Pouwelse
Categories: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*Comments:

Click to view abstract

Abstract:Centralized search engines are key for the Internet, but lead to undesirable concentration of power. Decentralized alternatives fail to offer equal document retrieval accuracy and speed. Nevertheless, Semantic Overlay Networks can come close to the performance of centralized solutions when the semantics of documents are properly captured. This work uses embeddings from Large Language Models to capture semantics and fulfill the promise of Semantic Overlay Networks. Our proposed algorithm, called Semantica, constructs a prefix tree (trie) utilizing document embeddings calculated by a language model. Users connect to each other based on the embeddings of their documents, ensuring that semantically similar users are directly linked. This construction thereby makes it more likely that a user’s searches are answered by directly connected users, or by users close to them in the network connection graph. The implementation of our algorithm also accommodates the semantic diversity of individual users by spawning “clone” user identifiers in the tree. Our experiments use emulation with a real-world workload to show Semantica’s ability to identify and connect to similar users quickly. Semantica finds up to ten times more semantically similar users than current state-of-the-art approaches. At the same time, Semantica can retrieve more than two times the number of relevant documents given the same network load. We also make our code publicly available to facilitate further research in the area.
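
A hedged sketch of the core idea as the abstract describes it (the real system's discretization, clone identifiers, and overlay maintenance are more involved): map each user's document embedding to a bit-string prefix, store users in a trie, and find semantically close peers via the longest shared prefix.

```python
# Trie over sign-discretized embeddings; peers share long prefixes.
import numpy as np

def embedding_to_prefix(vec, depth=8):
    """Discretize an embedding into a bit string (sign of each coordinate)."""
    return "".join("1" if x >= 0 else "0" for x in vec[:depth])

class TrieNode:
    def __init__(self):
        self.children, self.users = {}, []

def insert(root, prefix, user):
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
        node.users.append(user)   # every ancestor tracks users below it

def closest_peers(root, prefix, me):
    """Return other users at the deepest prefix node that still has any."""
    node, best = root, []
    for bit in prefix:
        if bit not in node.children:
            break
        node = node.children[bit]
        peers = [u for u in node.users if u != me]
        if peers:
            best = peers
    return best

rng = np.random.default_rng(0)
root = TrieNode()
embs = {f"user{i}": rng.normal(size=16) for i in range(50)}
for user, e in embs.items():
    insert(root, embedding_to_prefix(e), user)
print(closest_peers(root, embedding_to_prefix(embs["user0"]), "user0")[:5])
```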

Attachments

Click to download today's full paper list