This blog post presents the latest paper list retrieved from Arxiv.org on 2025-02-20. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org daily, with an automatic update at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-02-20)

A total of 447 new papers today, including:

  • Natural Language Processing: 120 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 147 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 49 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 152 papers (Machine Learning (cs.LG))

(The per-category counts sum to more than 447 because cross-listed papers are counted in every category they belong to.)

Natural Language Processing

[NLP-0] MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads

[Quick Read]: This paper targets the attention distraction that irrelevant information in the input causes in Large Language Models (LLMs) on long-context tasks, which severely impairs their long-context capabilities. The key to the solution is Multi-Document Attention Focusing (MuDAF), a new method that directly optimizes the attention distribution through contrastive learning, making attention heads focus more on relevant information and reducing attention distraction.

Link: https://arxiv.org/abs/2502.13963
Authors: Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang
Affiliations: Microsoft Corporation
Categories: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Large Language Models (LLMs) frequently show distracted attention due to irrelevant information in the input, which severely impairs their long-context capabilities. Inspired by recent studies on the effectiveness of retrieval heads in long-context factuality, we aim at addressing this distraction issue through improving such retrieval heads directly. We propose Multi-Document Attention Focusing (MuDAF), a novel method that explicitly optimizes the attention distribution at the head level through contrastive learning. According to the experimental results, MuDAF can significantly improve the long-context question answering performance of LLMs, especially in multi-document question answering. Extensive evaluations on retrieval scores and attention visualizations show that MuDAF possesses great potential in making attention heads more focused on relevant information and reducing attention distractions.
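
To make the head-level objective concrete, below is a minimal PyTorch sketch of a contrastive loss over an attention head's per-passage attention mass, in the spirit of MuDAF. The pooling of token-level attention into per-passage mass, the temperature, and all tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def head_contrastive_loss(attn_mass, gold_idx, temperature=0.1):
    """Contrastive (InfoNCE-style) loss for one attention head.

    attn_mass: (batch, num_passages) attention probability mass the head
               assigns to each candidate passage (pooled over its tokens).
    gold_idx:  (batch,) index of the passage that is actually relevant.
    Pulls the head's attention toward the gold passage and away from the
    distractor passages.
    """
    logits = attn_mass / temperature          # attention mass as similarity scores
    return F.cross_entropy(logits, gold_idx)  # positive = gold passage, negatives = the rest

# Toy usage: 2 queries, 4 candidate passages each.
attn_mass = torch.softmax(torch.randn(2, 4), dim=-1)
gold_idx = torch.tensor([1, 3])
print(head_contrastive_loss(attn_mass, gold_idx))
```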

[NLP-1] Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

[Quick Read]: This paper tackles two main problems in existing evaluations of test-time scaling: they assume a model should answer every question regardless of circumstances, ignoring whether the model is confident in its response, and they overlook whether always providing a response is appropriate. The key to the solution is to extract confidence scores during reasoning and threshold model responses on them, deciding whether the model should answer at all. The authors find that increasing the inference-time compute budget not only raises the fraction of questions models answer correctly but also increases their confidence in correct answers. The paper further extends the current zero-risk-response evaluation paradigm to settings with non-zero response risk and proposes a recipe for reporting evaluations in such settings.

Link: https://arxiv.org/abs/2502.13962
Authors: William Jurayj, Jeffrey Cheng, Benjamin Van Durme
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
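
A minimal sketch of the thresholding recipe follows: answer only when confidence clears a threshold, abstain otherwise. Using the mean token log-probability of the answer as the confidence score is an assumption made here for illustration; the paper's exact confidence extraction may differ.

```python
def selective_answer(answer, token_logprobs, threshold=-0.5):
    """Return the answer only if its average token log-probability clears
    the threshold; otherwise abstain (return None)."""
    confidence = sum(token_logprobs) / len(token_logprobs)
    return (answer if confidence >= threshold else None), confidence

# Toy usage: a confident answer is emitted, an uncertain one is withheld.
print(selective_answer("Paris", [-0.1, -0.2]))  # ('Paris', -0.15)
print(selective_answer("Lyon", [-1.5, -2.0]))   # (None, -1.75)
```

Raising the threshold trades coverage for accuracy, which is exactly the knob a non-zero-risk evaluation needs to report.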

[NLP-2] LIDDIA: Language-based Intelligent Drug Discovery Agent

[Quick Read]: This paper addresses drug discovery's heavy reliance on human chemists by proposing LIDDiA, a low-cost and highly adaptable intelligent autonomous drug discovery agent. The key is to leverage the reasoning capabilities of large language models so that LIDDiA can intelligently navigate the drug discovery process in silico, substantially improving the efficiency and success rate of drug discovery.

Link: https://arxiv.org/abs/2502.13959
Authors: Reza Averly, Frazier N. Baker, Xia Ning
Affiliations: The Ohio State University
Categories: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDiA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDiA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDiA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it can identify promising novel drug candidates on EGFR, a critical target for cancers.

[NLP-3] RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

[Quick Read]: This paper addresses the limited effectiveness of traditional Retrieval-Augmented Generation (RAG) architectures on complex questions, which stems from their reliance on static retrieval. The key to the solution is RAG-Gym, a framework that strengthens information-seeking agents through fine-grained process supervision at every search step, together with ReSearch, a new architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework. Experiments show that RAG-Gym improves performance by up to 25.6% across agent architectures, and ReSearch outperforms existing baselines.

Link: https://arxiv.org/abs/2502.13957
Authors: Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) has shown great potential for knowledge-intensive tasks, but its traditional architectures rely on static retrieval, limiting their effectiveness for complex questions that require sequential information-seeking. While agentic reasoning and search offer a more adaptive approach, most existing methods depend heavily on prompt engineering. In this work, we introduce RAG-Gym, a unified optimization framework that enhances information-seeking agents through fine-grained process supervision at each search step. We also propose ReSearch, a novel agent architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework. Experiments on four challenging datasets show that RAG-Gym improves performance by up to 25.6% across various agent architectures, with ReSearch consistently outperforming existing baselines. Further analysis highlights the effectiveness of advanced LLMs as process reward judges and the transferability of trained reward models as verifiers for different LLMs. Additionally, we examine the scaling properties of training and inference in agentic RAG. The project homepage is available at this https URL.
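
The fine-grained process supervision can be pictured as a best-of-N selection at every search step. The sketch below is a hypothetical interface invented for illustration (`agent.propose`, `reward_model.score`, `retriever.search`, and the `is_final_answer` flag are not the actual RAG-Gym API):

```python
def process_supervised_search(question, agent, reward_model, retriever,
                              max_steps=4, n_candidates=4):
    """At each step, sample candidate actions (search queries or a final
    answer) and keep the one the process reward model scores highest --
    per-step supervision rather than judging only the final answer."""
    context = []
    for _ in range(max_steps):
        candidates = [agent.propose(question, context) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: reward_model.score(question, context, c))
        if best.is_final_answer:
            return best.text
        context.append(retriever.search(best.text))  # execute the chosen query
    return agent.answer(question, context)           # fall back after the step budget
```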

[NLP-4] Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition

[Quick Read]: This paper addresses aleatoric uncertainty in multimodal multi-label emotion recognition (MMER), i.e., the inherent noise of multimodal data that introduces ambiguity into feature representations and hinders effective modality fusion. It proposes the Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework. The key is a contrastive disentangled distribution mechanism that models multimodal data within the emotion space, together with an uncertainty-aware fusion method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experiments show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M^3ED datasets.

Link: https://arxiv.org/abs/2502.13954
Authors: Jingwang Huang, Jiang Zhong, Qin Lei, Jinpeng Gao, Yuming Yang, Sirui Wang, Peiguang Li, Kaiwen Wei
Affiliations: College of Computer Science, Chongqing University, China; Department of Automation, Tsinghua University, China; Meituan Inc., Beijing, China
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Multimodal multi-label emotion recognition (MMER) aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies primarily focus on improving fusion strategies and modeling modality-to-label dependencies. However, they often overlook the impact of aleatoric uncertainty, which is the inherent noise in the multimodal data and hinders the effectiveness of modality fusion by introducing ambiguity into feature representations. To address this issue and effectively model aleatoric uncertainty, this paper proposes the Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework from a novel perspective of latent emotional space probabilistic modeling. Specifically, we introduce a contrastive disentangled distribution mechanism within the emotion space to model the multimodal data, allowing for the extraction of semantic features and uncertainty. Furthermore, we design an uncertainty-aware fusion multimodal method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experimental results show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M^3ED datasets, highlighting the importance of uncertainty modeling in MMER. Code is available at this https URL.

[NLP-5] Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

[Quick Read]: This paper addresses vulnerabilities in the safety alignment of Large Language Models (LLMs). It shows that these models' safety behavior relies excessively on the fixed template region between the input instruction and the initial output, which leaves them susceptible to even simple attacks. The key to the solution is detaching the safety mechanisms from the template region, mitigating the models' vulnerability to jailbreak attacks.

Link: https://arxiv.org/abs/2502.13946
Authors: Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li
Affiliations: Department of Computing, The Hong Kong Polytechnic University; Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs’ safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models’ safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models’ susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.

[NLP-6] AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

[Quick Read]: This paper addresses a flaw in existing approaches to training Process Reward Models (PRMs): responses are split into reasoning steps with rule-based techniques (predefined placeholder tokens or fixed-length steps), which ignores the fact that particular words do not usually mark true decision points in text. The key to the solution is AdaptiveStep, a method that divides reasoning steps according to the model's confidence in predicting the next word. This provides more decision-relevant information at each step, benefits downstream tasks such as reward model learning, and requires no manual annotation. Experiments show that AdaptiveStep-trained PRMs excel in mathematical reasoning and code generation, achieving state-of-the-art Best-of-N performance while cutting construction costs by over 30% compared with existing open-source PRMs.

Link: https://arxiv.org/abs/2502.13943
Authors: Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step’s length into a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model’s confidence in predicting the next word. This division method provides more decision-making information at each step, enhancing downstream tasks, such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the outcome PRM achieves state-of-the-art Best-of-N performance, surpassing greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study on the PRM’s performance, transferability, and generalization capabilities.
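
The core of AdaptiveStep, cutting a response into steps wherever next-token confidence dips, fits in a few lines. The sketch below assumes access to the sampled tokens and their probabilities; the threshold value is illustrative.

```python
def adaptive_steps(tokens, token_probs, confidence_threshold=0.8):
    """Split a sampled response into reasoning steps at low-confidence tokens:
    a dip below the threshold is treated as a decision point and closes a step."""
    steps, current = [], []
    for tok, p in zip(tokens, token_probs):
        current.append(tok)
        if p < confidence_threshold:      # model is unsure -> likely decision point
            steps.append("".join(current))
            current = []
    if current:
        steps.append("".join(current))
    return steps

# Toy usage: the model hesitates after "=" and after " so".
tokens = ["2", "+", "2", "=", "4", ",", " so", " the", " answer", " is", " 4"]
probs  = [0.99, 0.98, 0.99, 0.60, 0.97, 0.95, 0.50, 0.90, 0.92, 0.95, 0.99]
print(adaptive_steps(tokens, probs))  # ['2+2=', '4, so', ' the answer is 4']
```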

[NLP-7] Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

[Quick Read]: This paper addresses the tendency of large Vision-Language Models (VLMs) to neglect image content and over-rely on language priors during training, which leads to errors and hallucinations in visually grounded tasks. The key to the solution is S-VCO (Symmetrical Visual Contrastive Optimization), a new finetuning objective that steers the model toward capturing important visual details and aligning them with the corresponding text tokens, thereby strengthening visual feedback. To further facilitate this fine-grained alignment, the authors introduce MVC, a dataset built by automatically filtering and augmenting visual counterfactual data, which challenges the model with hard contrastive cases involving minimal visual contrasts.

Link: https://arxiv.org/abs/2502.13928
Authors: Shengguang Wu, Fan-Yun Sun, Kaiyue Wen, Nick Haber
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Website: this https URL

Abstract:Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations, and significant gains in vision-centric and general tasks. Notably, these improvements become increasingly pronounced in benchmarks with higher visual dependency. In short, S-VCO offers a significant enhancement of VLM’s visually-dependent task performance while retaining or even improving the model’s general abilities. We opensource our code at this https URL

[NLP-8] Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?

[Quick Read]: This paper addresses the limitation that existing benchmarks focus on single-image understanding and neglect image sequences. It introduces the StripCipher benchmark, which comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. The key is to use StripCipher to assess the overall ability of Large Multimodal Models (LMMs) on image sequences, especially on tasks that require reordering shuffled sequential images.

Link: https://arxiv.org/abs/2502.13925
Authors: Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Yixin Yang, Qingxiu Dong, Weiyao Luo, Yifan Pu, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui
Affiliations: State Key Laboratory of Multimedia Information Processing, Peking University; Department of Computing, The Hong Kong Polytechnic University; Tsinghua University; ModelTC
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07% lower than human performance. Further quantitative analysis discusses several factors, such as the input format of images, that affect the performance of LMMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.

[NLP-9] Qwen2.5-VL Technical Report

[Quick Read]: This paper aims to advance vision-language models' understanding and interaction capabilities, particularly for complex inputs, precise object localization, and document parsing. The key is the introduction of dynamic resolution processing and absolute time encoding, which let the model handle images of varying sizes and long videos with second-level event localization. In addition, training a native dynamic-resolution Vision Transformer (ViT) from scratch with Window Attention reduces computational overhead while preserving native resolution. These improvements make Qwen2.5-VL excel not only at static image and document understanding but also as an interactive visual agent capable of reasoning, tool use, and task execution in real-world scenarios.

Link: https://arxiv.org/abs/2502.13923
Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, (additional authors not shown)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

[NLP-10] LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization ICLR2025

[Quick Read]: This paper addresses the underperformance of Large Language Models (LLMs) in long-context scenarios. Although large-scale pretraining and alignment make LLMs excel at short-context tasks, insufficient long-context alignment can degrade their performance on long contexts. The key to the solution is LongPO, which lets LLMs self-evolve to excel at long-context tasks by internally transferring short-context capabilities: the model learns from self-generated short-to-long preference data consisting of paired responses to identical instructions with long-context inputs and their compressed short-context counterparts. LongPO additionally introduces a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment.

Link: https://arxiv.org/abs/2502.13922
Authors: Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
Affiliations: National University of Singapore; DAMO Academy, Alibaba Group; Hupan Lab, 310023, Hangzhou, China; Shanda AI Research Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales.
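
A minimal sketch of the training signal described above: a DPO-style preference loss over the short-to-long response pairs plus a KL term that anchors the policy to the reference model on short-context inputs. Tensor shapes, the direction of the KL, and the weighting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def longpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                pol_short_logits, ref_short_logits, beta=0.1, kl_weight=0.1):
    """pol_*/ref_*: summed log-probs of the preferred (short-context-derived)
    and dispreferred responses under the policy and a frozen reference model.
    *_short_logits: (batch, vocab) next-token logits on short-context inputs,
    used for the short-to-long KL constraint."""
    margin = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)
    pref_loss = -F.logsigmoid(beta * margin).mean()        # DPO-style preference term
    kl = F.kl_div(F.log_softmax(pol_short_logits, dim=-1),
                  F.log_softmax(ref_short_logits, dim=-1),
                  log_target=True, reduction="batchmean")  # keep short-context behavior
    return pref_loss + kl_weight * kl
```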

[NLP-11] Exploring Personalized Health Support through Data-Driven Theory-Guided LLMs: A Case Study in Sleep Health

[Quick Read]: This paper addresses the difficulty individuals face in turning sleep-tracking data into actionable improvements in sleep health. Current methods offer data-driven suggestions but may not adapt to real-life constraints and personal contexts. The solution is HealthGuru, a large language model-powered chatbot that enhances sleep health through data-driven, theory-guided, and adaptive recommendations with conversational behavior change support. The key is its multi-agent framework, which integrates wearable device data, contextual information, and a contextual multi-armed bandit model to suggest personalized sleep-enhancing activities, enabling natural conversation while incorporating data-driven insights and theory-based behavior change techniques.

Link: https://arxiv.org/abs/2502.13920
Authors: Xingbo Wang, Janessa Griffith, Daniel A. Adler, Joey Castillo, Tanzeem Choudhury, Fei Wang
Affiliations: Weill Cornell Medicine; Cornell University; Cornell Tech
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Accepted to CHI Conference on Human Factors in Computing Systems (CHI 2025)

Abstract:Despite the prevalence of sleep-tracking devices, many individuals struggle to translate data into actionable improvements in sleep health. Current methods often provide data-driven suggestions but may not be feasible and adaptive to real-life constraints and individual contexts. We present HealthGuru, a novel large language model-powered chatbot to enhance sleep health through data-driven, theory-guided, and adaptive recommendations with conversational behavior change support. HealthGuru’s multi-agent framework integrates wearable device data, contextual information, and a contextual multi-armed bandit model to suggest tailored sleep-enhancing activities. The system facilitates natural conversations while incorporating data-driven insights and theoretical behavior change techniques. Our eight-week in-the-wild deployment study with 16 participants compared HealthGuru to a baseline chatbot. Results show improved metrics like sleep duration and activity scores, higher quality responses, and increased user motivation for behavior change with HealthGuru. We also identify challenges and design considerations for personalization and user engagement in health chatbots.

[NLP-12] TESS 2: A Large-Scale Generalist Diffusion Language Model

[Quick Read]: This paper asks how to train a better instruction-following diffusion language model that surpasses existing instruction-tuned diffusion models and matches or even exceeds strong autoregressive (AR) models. The key is to adapt a strong AR model via continued pretraining with cross-entropy as the diffusion loss, followed by further instruction tuning. The paper also proposes reward guidance, a new inference-time guidance method, and highlights that spending more compute at inference time further improves the model.

Link: https://arxiv.org/abs/2502.13917
Authors: Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan
Affiliations: Yale University; University of Washington; Allen Institute for AI; The Ohio State University
Categories: Computation and Language (cs.CL)
Comments: preprint

Abstract:We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at this https URL.

[NLP-13] How Do LLMs Perform Two-Hop Reasoning in Context?

[Quick Read]: This paper investigates how Large Language Models (LLMs) move from random guessing to accurate reasoning on multi-step reasoning tasks. The key is to train a three-layer transformer on a synthetic two-hop reasoning task, which reveals two learning stages: an initial slow phase where the model performs random guessing like LLMs, followed by an abrupt phase transition to 100% accuracy. Through reverse engineering, the paper explains how the model first learns to guess randomly among distractions and eventually learns to ignore them. A proposed three-parameter model further supports the causal claims about this training dynamic, and experiments suggest the discovered mechanisms generalize across model scales.

Link: https://arxiv.org/abs/2502.13913
Authors: Tianyu Guo, Hanlin Zhu, Ruiqi Zhang, Jiantao Jiao, Song Mei, Michael I. Jordan, Stuart Russell
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:“Socrates is human. All humans are mortal. Therefore, Socrates is mortal.” This classical example demonstrates two-hop reasoning, where a conclusion logically follows from two connected premises. While transformer-based Large Language Models (LLMs) can make two-hop reasoning, they tend to collapse to random guessing when faced with distracting premises. To understand the underlying mechanism, we train a three-layer transformer on synthetic two-hop reasoning tasks. The training dynamics show two stages: a slow learning phase, where the 3-layer transformer performs random guessing like LLMs, followed by an abrupt phase transitions, where the 3-layer transformer suddenly reaches 100% accuracy. Through reverse engineering, we explain the inner mechanisms for how models learn to randomly guess between distractions initially, and how they learn to ignore distractions eventually. We further propose a three-parameter model that supports the causal claims for the mechanisms to the training dynamics of the transformer. Finally, experiments on LLMs suggest that the discovered mechanisms generalize across scales. Our methodologies provide new perspectives for scientific understandings of LLMs and our findings provide new insights into how reasoning emerges during training.

[NLP-14] GroundCap: A Visually Grounded Image Captioning Dataset

[Quick Read]: This paper addresses the inability of current image captioning systems to link descriptive text to specific visual elements, which makes their outputs hard to verify. The key to the solution is an ID-based grounding system that enables consistent object reference tracking and action-object linking, along with a new metric, gMETEOR, that combines caption quality with grounding accuracy. The approach features persistent object IDs for reference tracking, explicit action-object linking, and background segmentation via K-means clustering.

Link: https://arxiv.org/abs/2502.13898
Authors: Daniel A. P. Oliveira, Lourenço Teodoro, David Martins de Matos
Affiliations: INESC-ID
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 37 pages

Abstract:Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. We propose gMETEOR, a metric combining caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our approach’s effectiveness in producing verifiable descriptions with coherent object references.
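
The abstract says only that gMETEOR combines caption quality with grounding accuracy; the harmonic-mean combination below is an assumption chosen because, like F1, it rewards captions only when both components are strong.

```python
def gmeteor(meteor_score, grounding_accuracy):
    """Combine caption quality (METEOR) with grounding accuracy.
    Harmonic mean (an illustrative choice): a fluent caption with poor
    grounding, or vice versa, scores low on the combined metric."""
    if meteor_score + grounding_accuracy == 0:
        return 0.0
    return 2 * meteor_score * grounding_accuracy / (meteor_score + grounding_accuracy)

print(gmeteor(0.42, 0.80))  # ~0.55: strong grounding cannot rescue a weak caption
```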

[NLP-15] DataSciBench: An LLM Agent Benchmark for Data Science

[Quick Read]: This paper addresses the limitations of existing benchmarks for evaluating Large Language Models (LLMs) in data science: they focus mainly on single tasks, easily obtainable ground truth, and straightforward evaluation metrics. The key to the solution is the DataSciBench benchmark, built from a more comprehensive, curated collection of natural and challenging prompts with uncertain ground truth and evaluation metrics. A semi-automated pipeline generates ground truth and validates evaluation metrics, using an LLM-based self-consistency and human verification strategy to produce accurate ground truth. The paper also proposes an innovative Task - Function - Code (TFC) framework to assess each code execution outcome against precisely defined metrics and programmatic rules, aiming for a more comprehensive and rigorous evaluation of LLMs that reveals their strengths and weaknesses.

Link: https://arxiv.org/abs/2502.13897
Authors: Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue
Affiliations: Tsinghua University; Zhipu AI; University of California, Berkeley; California Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 40 pages, 7 figures, 6 tables

Abstract:This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at this https URL.

[NLP-16] PSCon: Toward Conversational Product Search

[Quick Read]: This paper addresses the lack of real-world Conversational Product Search (CPS) datasets that reflect human-like language, as well as existing datasets' inability to support cross-market and multilingual use. The key to the solution is a new data collection protocol and the resulting PSCon dataset, constructed via a coached human-to-human protocol, supporting two languages and dual markets, and covering six CPS subtasks: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation.

Link: https://arxiv.org/abs/2502.13881
Authors: Jie Zou, Mohammad Aliannejadi, Evangelos Kanoulas, Shuxi Han, Heli Ma, Zheng Wang, Yang Yang, Heng Tao Shen
Affiliations: University of Electronic Science and Technology of China; University of Amsterdam; Tongji University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 11 pages

Abstract:Conversational Product Search (CPS) is confined to simulated conversations due to the lack of real-world CPS datasets that reflect human-like language. Additionally, current conversational datasets are limited to support cross-market and multi-lingual usage. In this paper, we introduce a new CPS data collection protocol and present PSCon, a novel CPS dataset designed to assist product search via human-like conversations. The dataset is constructed using a coached human-to-human data collection protocol and supports two languages and dual markets. Also, the dataset enables thorough exploration of six subtasks of CPS: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation. Furthermore, we also offer an analysis of the dataset and propose a benchmark model on the proposed CPS dataset.

[NLP-17] SPEX: Scaling Feature Interaction Explanations for LLMs

[Quick Read]: This paper addresses the fact that existing post-hoc explanation methods for LLMs (such as SHAP) only scale to small input lengths (around 20 features) when explaining long inputs. The key to the solution is Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that scales to large input lengths (around 1000 features). SPEX exploits the natural sparsity of interactions in real-world data and applies a sparse Fourier transform with a channel decoding algorithm to efficiently identify important interactions, enabling faithful reconstruction of LLM outputs. SPEX also identifies key features and interactions that strongly influence model output, and on the HotpotQA dataset its interactions align with human annotations.

Link: https://arxiv.org/abs/2502.13870
Authors: Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Kannan Ramchandran, Bin Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments:

Abstract:Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths (≈ 20). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths (≈ 1000). SPEX exploits underlying natural sparsity among interactions – common in real-world data – and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.

[NLP-18] Fine-grained Fallacy Detection with Human Label Variation NAACL2025

[Quick Read]: This paper addresses multiple plausible answers and natural disagreement in fallacy detection. The key to the solution is the Faina dataset, which contains over 11K annotated spans covering 20 fallacy types in Italian social media posts about migration, climate change, and public health; an extensive multi-round annotation study minimized annotation errors while preserving signals of human label variation. The paper also devises an evaluation framework that goes beyond a single ground truth, simultaneously accounting for multiple equally reliable test sets and task-specific peculiarities such as partial span matches, overlaps, and the varying severity of labeling errors. Experiments show that multi-task, multi-label transformer-based approaches are strong baselines across all settings.

Link: https://arxiv.org/abs/2502.13853
Authors: Alan Ramponi, Agnese Daffara, Sara Tonelli
Affiliations: Fondazione Bruno Kessler; University of Pavia; Institute for Natural Language Processing, University of Stuttgart
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025

Abstract:We introduce Faina, the first dataset for fallacy detection that embraces multiple plausible answers and natural disagreement. Faina includes over 11K span-level annotations with overlaps across 20 fallacy types on social media posts in Italian about migration, climate change, and public health given by two expert annotators. Through an extensive annotation study that allowed discussion over multiple rounds, we minimize annotation errors whilst keeping signals of human label variation. Moreover, we devise a framework that goes beyond “single ground truth” evaluation and simultaneously accounts for multiple (equally reliable) test sets and the peculiarities of the task, i.e., partial span matches, overlaps, and the varying severity of labeling errors. Our experiments across four fallacy detection setups show that multi-task and multi-label transformer-based approaches are strong baselines across all settings. We release our data, code, and annotation guidelines to foster research on fallacy detection and human label variation more broadly.

[NLP-19] DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue

[Quick Read]: This paper addresses the failure of traditional Retrieval-Augmented Generation (RAG) methods to exploit dynamic historical information in multi-turn dialogue. The key to the solution is DH-RAG (a Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue), built around two main components: a History-Learning based Query Reconstruction Module and a Dynamic History Information Updating Module. At its center is a dynamic historical information database, and the query reconstruction module is further refined by three strategies, Historical Query Clustering, Hierarchical Matching, and Chain of Thought Tracking, yielding clear gains in response relevance, coherence, and dialogue quality.

Link: https://arxiv.org/abs/2502.13847
Authors: Feiyuan Zhang, Dezhi Zhu, James Ming, Yilun Jin, Di Chai, Liu Yang, Han Tian, Zhaoxin Fan, Kai Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) systems have shown substantial benefits in applications such as question answering and multi-turn dialogue (Lewis et al., 2020). However, traditional RAG methods, while leveraging static knowledge bases, often overlook the potential of dynamic historical information in ongoing conversations. To bridge this gap, we introduce DH-RAG, a Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue. DH-RAG is inspired by human cognitive processes that utilize both long-term memory and immediate historical context in conversational responses (Stafford, 1987). DH-RAG is structured around two principal components: a History-Learning based Query Reconstruction Module, designed to generate effective queries by synthesizing current and prior interactions, and a Dynamic History Information Updating Module, which continually refreshes historical context throughout the dialogue. The center of DH-RAG is a Dynamic Historical Information database, which is further refined by three strategies within the Query Reconstruction Module: Historical Query Clustering, Hierarchical Matching, and Chain of Thought Tracking. Experimental evaluations show that DH-RAG significantly surpasses conventional models on several benchmarks, enhancing response relevance, coherence, and dialogue quality.

[NLP-20] Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking

[Quick Read]: This paper addresses the performance bottleneck large language models (LLMs) face under parameter constraints when processing critical tokens that demand complex reasoning. The key to the solution is the Inner Thinking Transformer (ITT), a new architecture that dynamically allocates computation via Adaptive Token Routing, iteratively refines representations via Residual Thinking Connections, and distinguishes reasoning phases via Thinking Step Encoding. ITT lets the model process critical tokens more deeply without adding parameters, improving both performance and efficiency.

Link: https://arxiv.org/abs/2502.13842
Authors: Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc.; School of Artificial Intelligence, Beijing Normal University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 11 figures

Abstract:Large language models (LLMs) face inherent performance bottlenecks under parameter constraints, particularly in processing critical tokens that demand complex reasoning. Empirical analysis reveals challenging tokens induce abrupt gradient spikes across layers, exposing architectural stress points in standard Transformers. Building on this insight, we propose Inner Thinking Transformer (ITT), which reimagines layer computations as implicit thinking steps. ITT dynamically allocates computation through Adaptive Token Routing, iteratively refines representations via Residual Thinking Connections, and distinguishes reasoning phases using Thinking Step Encoding. ITT enables deeper processing of critical tokens without parameter expansion. Evaluations across 162M-466M parameter models show ITT achieves 96.5% performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2%, and outperforms Transformer/Loop variants in 11 benchmarks. By enabling elastic computation allocation during inference, ITT balances performance and efficiency through architecture-aware optimization of implicit thinking pathways.
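
The sketch below illustrates two of the mechanisms named above: routing a subset of "hard" tokens through extra passes of the same block (Adaptive Token Routing) and accumulating each pass through a residual connection (Residual Thinking Connections). All shapes, the top-k routing rule, and the reuse of a single encoder layer are illustrative assumptions, not ITT's actual design.

```python
import torch
import torch.nn as nn

class InnerThinkingLayer(nn.Module):
    """Toy ITT-style layer: a router scores tokens, and only the top-k hardest
    tokens receive extra 'thinking' passes through the same block."""
    def __init__(self, d_model=64, n_extra_steps=2, topk_ratio=0.25):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.router = nn.Linear(d_model, 1)   # scores how much a token needs extra thought
        self.n_extra_steps = n_extra_steps
        self.topk_ratio = topk_ratio

    def forward(self, x):                      # x: (batch, seq, d_model)
        x = self.block(x)                      # the ordinary pass every token gets
        scores = self.router(x).squeeze(-1)    # (batch, seq)
        k = max(1, int(x.size(1) * self.topk_ratio))
        idx = scores.topk(k, dim=1).indices    # hardest tokens per sequence
        for _ in range(self.n_extra_steps):
            gather_idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
            selected = x.gather(1, gather_idx)
            refined = self.block(selected)     # reuse the same weights: an implicit thinking step
            x = x.scatter(1, gather_idx, selected + refined)  # residual thinking connection
        return x

print(InnerThinkingLayer()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```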

[NLP-21] Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning

[Quick Read]: This paper addresses how to evaluate the effectiveness of synthetic code verification for training large-scale reasoning models. The key to the solution is a set of new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) that transform existing coding benchmarks into scoring and ranking datasets for evaluating synthetic verification methods. Using these benchmarks, the paper analyzes synthetic verification in standard, reasoning-based, and reward-based LLMs, finding that recent reasoning models markedly improve test case generation and that scaling the number of test cases improves verification accuracy.

Link: https://arxiv.org/abs/2502.13820
Authors: Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
Affiliations: NVIDIA
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Code verification has recently found great success as a critical component in training large scale reasoning models for coding. Synthetic techniques such as self-generated test cases and reward models provide a way to enhance code capabilities beyond predefined tests. Building on these advancements, we propose new benchmarks designed to systematically evaluate the impact of synthetic verification methods on assessing solution correctness. We introduce HE-R, HE-R+, MBPP-R, and MBPP-R+, which transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. Using these benchmarks, we analyze synthetic verification methods in standard, reasoning-based, and reward-based LLMs. Our results show that recent reasoning models significantly improve test case generation and that scaling test cases enhances verification accuracy.

[NLP-22] On the Duality between Gradient Transformations and Adapters

[Quick Read]: This paper targets the memory cost of neural network optimization by introducing linear gradient transformations that map gradients into a space of lower dimension than the full parameter space, reducing the memory needed for gradient accumulation and optimizer state. The key insight is that taking an optimization step in the lower-dimensional space and mapping it back is equivalent to reparameterizing the model with a linear adapter that additively modifies the parameters and optimizing only the adapter's parameters. When the transformation is Kronecker-factored, the method is equivalent to one-sided LoRA (Low-Rank Adaptation), unifying existing memory-efficient training methods and suggesting new techniques for improving training efficiency and memory use.

Link: https://arxiv.org/abs/2502.13811
Authors: Lucas Torroba-Hennigen, Hunter Lang, Han Guo, Yoon Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 17 pages, 2 figures

Abstract:We study memory-efficient optimization of neural networks with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map’s transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter’s parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
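
The stated duality is easy to verify numerically: one SGD step on gradients projected through a fixed linear map P equals one SGD step on a one-sided linear adapter W = W0 + P B. The sketch below checks this on a toy linear layer; the random P and plain SGD are illustrative choices, not the paper's full setup.

```python
import torch

torch.manual_seed(0)
d_out, d_in, r, lr = 6, 5, 2, 0.1
W0 = torch.randn(d_out, d_in)
P = torch.randn(d_out, r)            # fixed linear map into an r-dim gradient space
x = torch.randn(d_in)
grad_out = torch.randn(d_out)        # dL/d(Wx), shared by both views

# View 1: gradient transformation. The full gradient G = grad_out x^T is
# projected into the low-dim space, an SGD step is taken there, and the
# update is mapped back to parameter space via P.
G = torch.outer(grad_out, x)         # (d_out, d_in) full-parameter gradient
g_low = P.T @ G                      # (r, d_in): transformed gradient
W_view1 = W0 - lr * (P @ g_low)

# View 2: one-sided LoRA-style adapter. Reparameterize W = W0 + P @ B with B
# trainable; by the chain rule dL/dB = P^T dL/dW, so SGD on B yields the
# same update to the effective weights.
B = torch.zeros(r, d_in)
grad_B = P.T @ G
B = B - lr * grad_B
W_view2 = W0 + P @ B

print(torch.allclose(W_view1, W_view2))  # True: the two updates coincide
```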

[NLP-23] LESA: Learnable LLM Layer Scaling-Up

[Quick Read]: This paper addresses the poor initialization and slow convergence of existing depth scaling-up methods. The key is LESA, a new learnable method for depth scaling-up that learns cross-layer parameter patterns and uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines during continual pre-training at less than half the computational cost.

Link: https://arxiv.org/abs/2502.13794
Authors: Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao
Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3; School of Computer Science and Engineering, Central South University; ByteDance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
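
Structurally, the pipeline described above is: stack the flattened per-layer parameters, take an SVD to expose cross-layer structure, and let a small network predict the representation of a layer to insert between two adjacent ones. The sketch below mirrors that shape with toy dimensions and an untrained predictor; it is not LESA's actual parameterization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_layers, d, r = 8, 16, 4                  # toy: 8 layers, flattened params of size 16
layer_params = torch.randn(n_layers, d)    # one row per layer's flattened weights

# SVD over the stacked layer parameters exposes latent cross-layer structure;
# keep the top-r components as a low-dimensional "code" per layer.
U, S, Vh = torch.linalg.svd(layer_params, full_matrices=False)
codes = U[:, :r] * S[:r]                   # (n_layers, r): trajectory across depth

# A small network predicts the code of a layer to be inserted between two
# adjacent layers (LESA trains such a predictor; here it is untrained).
predictor = nn.Sequential(nn.Linear(2 * r, 32), nn.ReLU(), nn.Linear(32, r))
mid_code = predictor(torch.cat([codes[3], codes[4]]))
new_layer_params = mid_code @ Vh[:r]       # decode back into parameter space
print(new_layer_params.shape)              # torch.Size([16])
```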

[NLP-24] From Tools to Teammates: Evaluating LLM s in Multi-Session Coding Interactions

【速读】: 该论文旨在探究大型语言模型(Large Language Models, LLMs)在长期交互中协作的有效性。论文通过引入MemoryCode这一合成多会话数据集来测试LLMs跟踪和执行简单编码指令的能力,即使在无关信息干扰下也能模拟现实场景。研究发现,尽管所有模型在处理孤立指令时表现良好,但当指令分布在多个会话中时,即使是最先进的模型如GPT-4o的表现也会显著下降。关键在于这些模型无法有效地检索和整合长链指令中的信息,从而揭示了当前LLMs的一个基本局限性,限制了它们在长时间交互中有效协作的能力。

链接: https://arxiv.org/abs/2502.13791
作者: Nathanaël Carraz Rakotonirina,Mohammed Hamdy,Jon Ander Campos,Lucas Weber,Alberto Testoni,Marzieh Fadaee,Sandro Pezzelle,Marco Del Tredici
机构: Universitat Pompeu Fabra(庞培法布拉大学); University of Amsterdam(阿姆斯特丹大学); Cohere; Cohere For AI; Cohere For AI Community
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.

[NLP-25] Translation in the Hands of Many: Centering Lay Users in Machine Translation Interactions

[Quick Read]: This paper examines how machine translation (MT) is understood and used by non-expert users. With the adoption of multilingual Large Language Models (LLMs), MT is now widely used for cross-lingual services, yet the needs, experiences, and system interactions of lay users remain poorly understood. The key is a user-centered approach to improving MT built around three factors that shape these interactions: usability, trust, and literacy.

Link: https://arxiv.org/abs/2502.13780
Authors: Beatrice Savoldi, Alan Ramponi, Matteo Negri, Luisa Bentivogli
Affiliations: Fondazione Bruno Kessler
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Converging societal and technical factors have transformed language technologies into user-facing applications employed across languages. Machine Translation (MT) has become a global tool, with cross-lingual services now also supported by dialogue systems powered by multilingual Large Language Models (LLMs). This accessibility has expanded MT’s reach to a vast base of lay users, often with little to no expertise in the languages or the technology itself. Despite this, the understanding of MT consumed by this diverse group of users – their needs, experiences, and interactions with these systems – remains limited. This paper traces the shift in MT user profiles, focusing on non-expert users and how their engagement with these systems may change with LLMs. We identify three key factors – usability, trust, and literacy – that shape these interactions and must be addressed to align MT with user needs. By exploring these dimensions, we offer insights to guide future MT with a user-centered approach.

[NLP-26] EHOP: A Dataset of Everyday NP-Hard Optimization Problems

[Quick Read]: This paper studies how well LLMs solve NP-hard optimization problems stated in natural language, focusing on performance differences across problem types: textbook formulations, real-life dressings, and variants with inverted rules. The key is the Everyday Hard Optimization Problems (EHOP) dataset, which reveals that LLMs solve textbook problems (common in training data) far more accurately than real-life or rule-inverted variants, suggesting that LLMs adapt solutions seen during training rather than developing reasoning abilities that generalize to novel problems.

Link: https://arxiv.org/abs/2502.13776
Authors: Alex Duchnowski, Ellie Pavlick, Alexander Koller
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computational Complexity (cs.CC)
Comments: 18 pages, 3 figures

Abstract:We introduce the dataset of Everyday Hard Optimization Problems (EHOP), a collection of NP-hard optimization problems expressed in natural language. EHOP includes problem formulations that could be found in computer science textbooks, versions that are dressed up as problems that could arise in real life, and variants of well-known problems with inverted rules. We find that state-of-the-art LLMs, across multiple prompting strategies, systematically solve textbook problems more accurately than their real-life and inverted counterparts. We argue that this constitutes evidence that LLMs adapt solutions seen during training, rather than leveraging reasoning abilities that would enable them to generalize to novel problems.

[NLP-27] VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare

[Quick Read]: This paper addresses the failure of LLMs to reflect diversity in culture, religion, and personal values in health-related scenarios: existing alignment methods model an averaged or monolithic preference and cannot accommodate plural perspectives. The key to the solution is VITAL, a benchmark dataset of 13.1K value-laden situations and 5.4K multiple-choice questions focused on health. An extensive evaluation of eight LLMs of varying sizes shows that existing pluralistic alignment techniques fall short in accommodating diverse healthcare beliefs, underscoring the need for tailored, domain-specific alignment and laying the groundwork for health-specific alignment solutions.

Link: https://arxiv.org/abs/2502.13775
Authors: Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem
Affiliations: School of Computing and Information System, the University of Melbourne, Australia; School of Computing, FSE, Macquarie University, Australia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Abstract:Alignment techniques have become central to ensuring that Large Language Models (LLMs) generate outputs consistent with human values. However, existing alignment paradigms often model an averaged or monolithic preference, failing to account for the diversity of perspectives across cultures, demographics, and communities. This limitation is particularly critical in health-related scenarios, where plurality is essential due to the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, likely due to the unavailability of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methodologies. Through extensive evaluation of eight LLMs of varying sizes, we demonstrate that existing pluralistic alignment techniques fall short in effectively accommodating diverse healthcare beliefs, underscoring the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.

[NLP-28] GIMMICK – Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking

[Quick Read]: This paper addresses the underperformance of Large Vision-Language Models (LVLMs) in non-Western cultural contexts. It introduces GIMMICK, an extensive multimodal benchmark covering broad cultural knowledge across 144 countries in six global macro-regions. GIMMICK comprises six tasks built on three new datasets, on which 20 LVLMs and 11 LLMs were evaluated. The key is a comprehensive benchmarking framework for assessing model performance across cultural contexts: the systematic analysis reveals strong biases toward Western cultures across models and tasks, strong correlations between model size and performance, and the effectiveness of multimodal inputs and external geographic cues, helping to identify cultural biases and technical limitations of existing models.

Link: https://arxiv.org/abs/2502.13766
Authors: Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.

[NLP-29] SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning

[Quick Read]: This paper addresses the challenge of evaluating large language models' (LLMs) long-context understanding. The key to the solution is SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a new benchmark that leverages academic papers and their citation networks to automatically generate high-quality ground-truth labels without human annotation, with controllable difficulty levels and a dynamic updating mechanism that prevents data contamination.

Link: https://arxiv.org/abs/2502.13753
Authors: Renxi Wang, Honglin Mu, Liqun Ma, Lizhi Lin, Yunlong Feng, Timothy Baldwin, Xudong Han, Haonan Li
Affiliations: MBZUAI; LibrAI; Tsinghua University; Alibaba Group; The University of Melbourne
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating large language models’ (LLMs) long-context understanding capabilities remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.

[NLP-30] Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding

[Quick Read]: This paper addresses the tendency of large language models (LLMs) to overlook input-label mapping information during in-context learning (ICL) and to rely instead on pre-trained knowledge. The key to the solution is In-Context Contrastive Decoding (ICCD), a new method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples.

Link: https://arxiv.org/abs/2502.13738
Authors: Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Yancheng Yuan, Dacheng Tao
Affiliations: Beihang University; The University of Sydney; University of Liverpool; The Hong Kong Polytechnic University; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) excel at a range of tasks through in-context learning (ICL), where only a few task examples guide their predictions. However, prior research highlights that LLMs often overlook input-label mapping information in ICL, relying more on their pre-trained knowledge. To address this issue, we introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples. Experiments on 7 natural language understanding (NLU) tasks show that our ICCD method brings consistent and significant improvement (up to +2.1 improvement on average) upon 6 different scales of LLMs without requiring additional training. Our approach is versatile, enhancing performance with various demonstration selection methods, demonstrating its broad applicability and effectiveness. The code and scripts will be publicly released.
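
A minimal sketch of the contrastive decoding step: score the next token by how much more likely it is under the prompt with positive in-context examples than under a prompt with negative (e.g., label-perturbed) examples. The specific contrast form and the alpha weight are assumptions made here for illustration.

```python
import torch

def iccd_next_token(logits_pos, logits_neg, alpha=1.0):
    """In-Context Contrastive Decoding sketch: favor tokens that gain
    probability from the CORRECT input-label mapping in the positive prompt,
    relative to the negative prompt. alpha controls the contrast strength."""
    log_p_pos = torch.log_softmax(logits_pos, dim=-1)
    log_p_neg = torch.log_softmax(logits_neg, dim=-1)
    contrastive = log_p_pos + alpha * (log_p_pos - log_p_neg)
    return contrastive.argmax(dim=-1)

# Toy usage over a 5-token vocabulary.
pos = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
neg = torch.tensor([2.0, -1.0, 0.5, 0.1, -1.0])  # token 1 depends on the correct labels
print(iccd_next_token(pos, neg))  # tensor(1)
```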
zh
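
下面用一段极简的 PyTorch 代码示意 ICCD 的核心组合步骤:分别以正、负上下文示例计算下一词 logits,再做对比组合。其中组合公式 (1+α)·logits_pos − α·logits_neg 与超参 α 是本文按常见对比解码写法所做的假设,未必是论文的精确形式。

```python
import torch

def iccd_adjust(logits_pos: torch.Tensor,
                logits_neg: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    # Amplify what the input-label mapping contributes: logits obtained
    # with correctly-labeled demonstrations minus logits obtained with
    # mismatched-label demonstrations (mixture form assumed here).
    return (1 + alpha) * logits_pos - alpha * logits_neg

# Toy demo over a 5-token vocabulary.
logits_pos = torch.tensor([2.0, 0.5, 0.1, -1.0, 0.0])
logits_neg = torch.tensor([1.8, 0.6, 0.1, -0.9, 0.0])
print(torch.softmax(iccd_adjust(logits_pos, logits_neg), dim=-1))
```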

[NLP-31] Adapting Large Language Models for Time Series Modeling via a Novel Parameter-efficient Adaptation Method

【速读】: 该论文旨在解决时间序列建模在预训练基础模型领域中因数据稀疏性导致的发展限制问题。论文的关键解决方案在于提出Time-LlaMA框架,该框架通过线性token化机制将时间序列输入转换为token嵌入,并与文本提示进行对齐。此外,引入动态低秩适应技术(Dynamic Low-Rank Adaptation, D-LoRA),动态选择每个Transformer层中最适合的低秩模块以适应时间序列输入,从而增强模型的预测能力。实验结果表明,该方法在一系列具有挑战性的实际时间序列任务中达到了最先进的性能。

链接: https://arxiv.org/abs/2502.13725
作者: Juyuan Zhang,Wei Zhu,Jiechao Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Time series modeling holds significant importance in many real-world applications and has been extensively studied. While pre-trained foundation models have made impressive strides in the fields of natural language processing (NLP) and computer vision (CV), their development in time series domains has been constrained by data sparsity. A series of recent studies have demonstrated that large language models (LLMs) possess robust pattern recognition and reasoning abilities over complex sequences of tokens. However, the current literature has yet to strike a high-quality balance between (a) effectively aligning the time series and natural language modalities, and (b) keeping the inference efficiency. To address the above issues, we propose the Time-LlaMA framework. Time-LlaMA first converts the time series input into token embeddings through a linear tokenization mechanism. Second, the time series token embeddings are aligned with the text prompts. Third, to further adapt the LLM backbone for time series modeling, we have developed a dynamic low-rank adaptation technique (D-LoRA). D-LoRA dynamically chooses the most suitable LoRA modules at each layer of the Transformer backbone for each time series input, enhancing the model’s predictive capabilities. Our experimental results on an extensive collection of challenging real-world time series tasks confirm that our proposed method achieves state-of-the-art (SOTA) performance.
zh
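
以下是 D-LoRA 思路的一个简化示意:在冻结的线性层上挂多组 LoRA 模块,由一个路由器针对每个输入给出混合权重。路由方式(对序列均值做 softmax 软选择)为本文假设,论文中逐层动态选择 LoRA 模块的具体机制可能不同。

```python
import torch
import torch.nn as nn

class DynamicLoRALinear(nn.Module):
    """A frozen linear layer plus several LoRA modules and a router that
    mixes them per input -- a sketch of dynamic low-rank adaptation."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, n_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():       # backbone weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_in)
        # Route on the mean-pooled sequence representation (assumption).
        gate = torch.softmax(self.router(x.mean(dim=1)), dim=-1)   # (b, e)
        low = torch.einsum("bsd,edr->bser", x, self.A)             # (b, s, e, r)
        delta = torch.einsum("bser,ero->bseo", low, self.B)        # (b, s, e, o)
        delta = torch.einsum("bseo,be->bso", delta, gate)          # (b, s, o)
        return self.base(x) + delta

layer = DynamicLoRALinear(64, 64)
print(layer(torch.randn(2, 16, 64)).shape)     # torch.Size([2, 16, 64])
```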

[NLP-32] Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLM s with Refined Values

【速读】: 该论文旨在解决复杂推理任务中大型语言模型优化的问题,传统方法依赖于偏好标签,而论文提出的Direct Value Optimization (DVO)框架通过在每个推理步骤利用价值信号,采用均方误差损失进行模型优化。解决方案的关键在于其细粒度的监督机制,无需劳动密集型的人工标注。目标值通过蒙特卡洛树搜索或结果值模型来估计,从而实现比现有离线偏好优化技术更优的性能,即使在较少的训练步骤下也能如此。这些发现强调了价值信号在提升推理能力方面的重要性,并将DVO突出为在缺乏显式人工偏好信息场景下的优越方法。

链接: https://arxiv.org/abs/2502.13723
作者: Hongbo Zhang,Han Cui,Guangsheng Bao,Linyi Yang,Jun Wang,Yue Zhang
机构: Zhejiang University; School of Engineering, Westlake University; Institute of Advanced Technology, Westlake Institute for Advanced Study; University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:We introduce Direct Value Optimization (DVO), an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within the DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
zh
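
DVO 的核心目标函数可以用几行代码示意:对每个推理步骤的价值预测与目标价值(如来自 MCTS 估计)做均方误差。张量形状与数值仅为演示,论文中价值的具体参数化方式此处从略。

```python
import torch
import torch.nn.functional as F

def dvo_loss(predicted_values: torch.Tensor,
             target_values: torch.Tensor) -> torch.Tensor:
    # Mean-squared error between the policy's per-step value estimates
    # and target values (e.g., estimated via Monte Carlo Tree Search).
    return F.mse_loss(predicted_values, target_values)

# Toy example: 4 reasoning steps, targets from simulated rollouts.
pred = torch.tensor([0.2, 0.5, 0.7, 0.9], requires_grad=True)
target = torch.tensor([0.1, 0.6, 0.8, 1.0])
loss = dvo_loss(pred, target)
loss.backward()
print(float(loss))
```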

[NLP-33] Learning Novel Transformer Architecture for Time-series Forecasting

【速读】: 该论文旨在解决现有基于Transformer的模型在时间序列预测(Time-Series Prediction, TSP)任务中的局限性,并探索替代架构。论文的关键解决方案是提出AutoFormer-TS框架,该框架利用了一个全面的搜索空间来优化针对TSP任务的Transformer架构。解决方案的关键在于引入了一种改进的可微分神经架构搜索方法AB-DARTS,该方法能够更有效地识别架构内的最优操作。这一方法使得AutoFormer-TS能够在多种注意力机制、激活函数和编码操作方面进行系统性的探索与优化,从而超越传统Transformer的设计,实现更优的预测精度和合理的训练效率。

链接: https://arxiv.org/abs/2502.13721
作者: Juyuan Zhang,Wei Zhu,Jiechao Gao
机构: Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the success of Transformer-based models in time-series prediction (TSP) tasks, existing Transformer architectures still face limitations, and the literature lacks comprehensive exploration of alternative architectures. To address these challenges, we propose AutoFormer-TS, a novel framework that leverages a comprehensive search space for Transformer architectures tailored to TSP tasks. Our framework introduces a differentiable neural architecture search (DNAS) method, AB-DARTS, which improves upon existing DNAS approaches by enhancing the identification of optimal operations within the architecture. AutoFormer-TS systematically explores alternative attention mechanisms, activation functions, and encoding operations, moving beyond the traditional Transformer design. Extensive experiments demonstrate that AutoFormer-TS consistently outperforms state-of-the-art baselines across various TSP benchmarks, achieving superior forecasting accuracy while maintaining reasonable training efficiency.
zh
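
AB-DARTS 在标准 DARTS 的可微混合算子之上改进了最优操作的识别方式;下面仅示意标准 DARTS 的混合算子(候选操作集合为本文举例),以说明搜索空间如何做连续松弛。

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style mixed operation: candidate ops are combined with
    softmax-normalized architecture weights learned jointly with the
    network weights. The candidate set below is illustrative only."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv1d(channels, channels, 3, padding=1),
            nn.Conv1d(channels, channels, 5, padding=2),
            nn.AvgPool1d(3, stride=1, padding=1),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, ch, length)
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

op = MixedOp(channels=8)
print(op(torch.randn(2, 8, 32)).shape)   # torch.Size([2, 8, 32])
```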

[NLP-34] Multi-Scale and Multi-Objective Optimization for Cross-Lingual Aspect-Based Sentiment Analysis

【速读】: 该论文旨在解决跨语言方面级情感分析(Cross-lingual Aspect-based Sentiment Analysis, ABSA)中的特征对齐不足和细粒度方面级别对齐不精准的问题。关键解决方案在于提出了一种名为多尺度多目标优化(Multi-Scale and Multi-Objective optimization, MSMO)的新框架。MSMO通过在多尺度对齐阶段实现跨语言句子级别和方面级别的特征对齐,并引入代码转换双语句来增强模型鲁棒性;同时,在多目标优化阶段设计了监督训练和一致性训练两个目标,以提升跨语言语义对齐效果。此外,还融入了目标语言的知识蒸馏以进一步提升模型性能。

链接: https://arxiv.org/abs/2502.13718
作者: Chengyan Wu,Bolei Ma,Ningyuan Deng,Yanqing He,Yun Xue
机构: Guangdong Provincial Key Laboratory of Quantum Engineering and Quantum Materials (广东省量子工程与量子材料重点实验室), School of Electronic Science and Engineering (School of Microelectronics) (电子科学与技术学院 (微电子学院)), South China Normal University (华南师范大学); LMU Munich & Munich Center for Machine Learning (慕尼黑大学 & 慕尼黑机器学习中心); Institute of Scientific and Technical Information of China (中国科学技术信息研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aspect-based sentiment analysis (ABSA) is a sequence labeling task that has garnered growing research interest in multilingual contexts. However, recent studies lack more robust feature alignment and finer aspect-level alignment. In this paper, we propose a novel framework, Multi-Scale and Multi-Objective optimization (MSMO) for cross-lingual ABSA. During multi-scale alignment, we achieve cross-lingual sentence-level and aspect-level alignment, aligning features of aspect terms in different contextual environments. Specifically, we introduce code-switched bilingual sentences into the language discriminator and consistency training modules to enhance the model’s robustness. During multi-objective optimization, we design two optimization objectives: supervised training and consistency training, aiming to enhance cross-lingual semantic alignment. To further improve model performance, we incorporate distilled knowledge of the target language into the model. Results show that MSMO significantly enhances cross-lingual ABSA by achieving state-of-the-art performance across multiple languages and models.
zh
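
MSMO 的多目标优化部分可以示意为监督交叉熵与一致性 KL 散度两项损失之和:后者约束原句与其代码转换(code-switched)版本的预测分布一致。权重 λ 与损失形式为本文假设;论文还包含语言判别器与方面级对齐等组件,此处从略。

```python
import torch
import torch.nn.functional as F

def msmo_loss(logits_src, labels, logits_src_view, logits_cs_view,
              lam: float = 0.5):
    # Supervised objective on the labeled source language.
    sup = F.cross_entropy(logits_src, labels)
    # Consistency objective: predictions for a sentence and its
    # code-switched counterpart should agree (KL form assumed).
    p = F.log_softmax(logits_src_view, dim=-1)
    q = F.softmax(logits_cs_view, dim=-1)
    consistency = F.kl_div(p, q, reduction="batchmean")
    return sup + lam * consistency

logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(float(msmo_loss(logits, labels, torch.randn(4, 3), torch.randn(4, 3))))
```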

[NLP-35] Is This Collection Worth My LLM s Time? Automatically Measuring Information Potential in Text Corpora

【速读】: 该论文旨在解决如何评估文本集合对于大型语言模型 (LLMs) 的潜在信息增益,从而确定哪些文本值得进行数字化、预处理及集成。关键解决方案在于提出了一种自动化流程,通过生成多选题 (MCQs) 并对比模型在有无访问源材料条件下的表现差距,以此作为衡量文本集合信息潜力的代理指标。

链接: https://arxiv.org/abs/2502.13691
作者: Tristan Karch,Luca Engel,Philippe Schwaller,Frédéric Kaplan
机构: EPFL (洛桑联邦理工学院) / DHLab; EPFL (洛桑联邦理工学院) / ILIAC
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM’s performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection’s information potential. We validate our approach using three strategically selected datasets: EPFL PhD manuscripts (likely containing novel specialized knowledge), Wikipedia articles (presumably part of training data), and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
zh
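
信息潜力的度量可以示意为:同一组自动生成的选择题,模型“闭卷”与“开卷”(提供源文)两种作答的准确率之差。记录的字段名为本文假设。

```python
def information_potential(records):
    """Proxy for a collection's information value: the accuracy gap on
    generated MCQs answered with vs. without the source passage.
    Each record: {"correct_closed_book": bool, "correct_open_book": bool}."""
    n = len(records)
    closed = sum(r["correct_closed_book"] for r in records) / n
    opened = sum(r["correct_open_book"] for r in records) / n
    return opened - closed   # large gap -> novel information for the LLM

records = [
    {"correct_closed_book": False, "correct_open_book": True},
    {"correct_closed_book": True,  "correct_open_book": True},
    {"correct_closed_book": False, "correct_open_book": True},
]
print(information_potential(records))   # ~0.667
```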

[NLP-36] MoM: Linear Sequence Modeling with Mixture-of-Memories

【速读】: 该论文旨在解决现有线性序列建模方法(如线性注意力、状态空间模型和线性RNN)在处理需要大量回忆的下游任务时因压缩整个输入序列为单一固定大小内存状态而导致性能不佳的问题。为解决这一问题,论文提出了一种名为记忆混合(Mixture-of-Memories, MoM)的新架构。其关键是利用多个独立的记忆状态,并通过路由网络将输入标记定向到特定的记忆状态中,从而显著提升整体记忆容量并减少记忆干扰。这种方法在保持训练线性复杂度和推理常数复杂度的同时,大幅提升了模型在需要大量回忆的任务中的表现,甚至可与Transformer模型相媲美。

链接: https://arxiv.org/abs/2502.13685
作者: Jusen Du,Weigao Sun,Disen Lan,Jiaxi Hu,Yu Cheng
机构: Shanghai AI Laboratory; Nanjing University; South China University of Technology; The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical report, 14 pages

点击查看摘要

Abstract:Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain’s ability to maintain robust long-term memory while mitigating “memory interference”, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at this https URL and is also released as a part of this https URL.
zh
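
MoM 的“多记忆 + 路由”思想可以用如下玩具实现示意:每个 token 的键值外积只写入路由器选出的 top-k 个记忆状态,读取时聚合所有记忆。门控与读取方式是本文的简化假设,与论文实现不同。

```python
import torch
import torch.nn as nn

class MixtureOfMemories(nn.Module):
    """Toy sketch of MoM: independent fixed-size memory states updated
    with linear-attention-style rank-1 writes, routed per token."""
    def __init__(self, d_model: int, n_mem: int = 4, top_k: int = 2):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.router = nn.Linear(d_model, n_mem)
        self.n_mem, self.top_k, self.d = n_mem, top_k, d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (seq, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        mem = [torch.zeros(self.d, self.d) for _ in range(self.n_mem)]
        outs = []
        for t in range(x.size(0)):
            top = self.router(x[t]).topk(self.top_k)
            gate = torch.softmax(top.values, dim=-1)
            for g, m in zip(gate, top.indices.tolist()):
                # Write: rank-1 update of the routed memories only,
                # which limits interference between memory states.
                mem[m] = mem[m] + g * torch.outer(k[t], v[t])
            # Read: every memory answers the query; outputs are summed.
            outs.append(sum(q[t] @ state for state in mem))
        return torch.stack(outs)

mom = MixtureOfMemories(d_model=16)
print(mom(torch.randn(8, 16)).shape)   # torch.Size([8, 16])
```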

[NLP-37] An LLM -based Agent for Reliable Docker Environment Configuration

【速读】: 该论文旨在解决软件开发过程中环境配置这一关键但耗时的任务,特别是在处理不熟悉的代码仓库时。现有方法通常依赖手动操作或脆弱的脚本,导致效率低下且结果不可靠。论文提出的关键解决方案是Repo2Run,这是一种基于大语言模型(LLM)的代理,用于全自动环境配置并生成任意Python代码仓库的可执行Docker文件。Repo2Run通过引入原子配置合成(Atomic Configuration Synthesis)解决了两个主要挑战:一是在隔离的Docker容器中配置环境;二是确保配置过程的成功记录并准确转移到Dockerfile中而不出错。该方案采用双环境架构(内部和外部环境)并具备回滚机制,以防止因失败命令导致的环境“污染”,保证原子执行,并包含一个Dockerfile生成器将成功的配置步骤转化为可运行的Docker文件。

链接: https://arxiv.org/abs/2502.13681
作者: Ruida Hu,Chao Peng,Xinchen Wang,Cuiyun Gao
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区); ByteDance(字节跳动)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Environment configuration is a critical yet time-consuming step in software development, especially when dealing with unfamiliar code repositories. While Large Language Models (LLMs) demonstrate the potential to accomplish software engineering tasks, existing methods for environment configuration often rely on manual efforts or fragile scripts, leading to inefficiencies and unreliable outcomes. We introduce Repo2Run, the first LLM-based agent designed to fully automate environment configuration and generate executable Dockerfiles for arbitrary Python repositories. We address two major challenges: (1) enabling the LLM agent to configure environments within isolated Docker containers, and (2) ensuring the successful configuration process is recorded and accurately transferred to a Dockerfile without error. To achieve this, we propose atomic configuration synthesis, featuring a dual-environment architecture (internal and external environment) with a rollback mechanism to prevent environment “pollution” from failed commands, guaranteeing atomic execution (execute fully or not at all) and a Dockerfile generator to transfer successful configuration steps into runnable Dockerfiles. We evaluate Repo2Run on our proposed benchmark of 420 recent Python repositories with unit tests, where it achieves an 86.0% success rate, outperforming the best baseline by 63.9%.
zh
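
原子配置合成的回滚思想可以用 Docker CLI 粗略示意:每条命令在一次性容器中执行,成功才 commit 成新镜像并追加对应的 RUN 行,失败则直接丢弃容器。该示意假设本机已安装 Docker;镜像命名方式与真实 Repo2Run 的双环境架构均为简化。

```python
import subprocess

def atomic_run(image: str, command: str, step: int):
    """Run one configuration command in a throwaway container; commit a
    new image only on success, otherwise discard it (rollback).
    Returns (new_image, dockerfile_line) or (image, None) on failure."""
    name = f"cfg-step-{step}"
    proc = subprocess.run(
        ["docker", "run", "--name", name, image, "/bin/sh", "-c", command],
        capture_output=True, text=True,
    )
    if proc.returncode == 0:
        new_image = f"{image.split(':')[0]}:step{step}"
        subprocess.run(["docker", "commit", name, new_image], check=True)
        subprocess.run(["docker", "rm", name], check=True)
        return new_image, f"RUN {command}"
    subprocess.run(["docker", "rm", name])   # rollback: drop the container
    return image, None

# Sketch of the outer loop: only successful steps reach the Dockerfile.
image, lines = "python:3.11", ["FROM python:3.11"]
for i, cmd in enumerate(["pip install requests", "pip install nonexistent-pkg-xyz"]):
    image, line = atomic_run(image, cmd, i)
    if line:
        lines.append(line)
print("\n".join(lines))
```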

[NLP-38] SCOPE: A Self-supervised Framework for Improving Faithfulness in Conditional Text Generation ICLR2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在条件文本生成过程中产生的幻觉问题,即生成的信息与输入上下文不一致或缺乏忠实度。论文的关键解决方案在于引入了一种新颖的自监督方法来生成不忠实样本的训练集,并通过偏好驱动的训练过程优化模型,鼓励生成更加忠实于给定上下文的输出,从而显著提升了文本生成的忠实度。这一方法在自动指标评估、基于LLM的评估及人工评估中均表现出色,优于现有的自监督技术。

链接: https://arxiv.org/abs/2502.13674
作者: Song Duong,Florian Le Bronnec,Alexandre Allauzen,Vincent Guigue,Alberto Lumbreras,Laure Soulier,Patrick Gallinari
机构: Sorbonne Université, CNRS, ISIR, F-75005 Paris, France (索邦大学, 法国国家科学研究中心, 国际视觉与机器人研究所, 巴黎);
Miles Team, LAMSADE, Université Paris-Dauphine, Université PSL, CNRS, 75016 Paris, France (巴黎第九大学, LAMSADE, 法国人民科技大学, 法国国家科学研究中心, 巴黎);
AgroParisTech, UMR MIA-PS, Palaiseau, France (阿格罗巴黎理工, 米亚-巴黎萨克研究所, 帕莱索);
Criteo AI Lab, Paris, France (Criteo人工智能实验室, 巴黎, 法国)
类目: Computation and Language (cs.CL)
备注: 10 pages, ICLR 2025 conference

点击查看摘要

Abstract:Large Language Models (LLMs), when used for conditional text generation, often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This issue arises in typical conditional text generation tasks, such as text summarization and data-to-text generation, where the goal is to produce fluent text based on contextual input. When fine-tuned on specific domains, LLMs struggle to provide faithful answers to a given context, often adding information or generating errors. One underlying cause of this issue is that LLMs rely on statistical patterns learned from their training data. This reliance can interfere with the model’s ability to stay faithful to a provided context, leading to the generation of ungrounded information. We build upon this observation and introduce a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones, drawing on preference-based training. Our approach leads to significantly more grounded text generation, outperforming existing self-supervised techniques in faithfulness, as evaluated through automatic metrics, LLM-based assessments, and human evaluations.
zh

[NLP-39] PeerQA: A Scientific Question Answering Dataset from Peer Reviews NAACL2025

【速读】: 该论文旨在解决科学文献级别的问题回答(Document-level Question Answering, QA)系统开发中的三个关键任务:证据检索(Evidence retrieval)、无法回答的问题分类(Unanswerable question classification)以及答案生成(Answer generation)。论文的关键解决方案在于引入了一个名为PeerQA的新数据集,该数据集包含从同行评审中提取的问题及其对应的答案。通过分析PeerQA数据集,论文展示了即使采用简单的去上下文化方法(decontextualization approaches),也能显著提升文档级别检索的表现。此外,针对长上下文建模挑战,PeerQA数据集平均每个论文包含12k个标记,为答案生成提供了具有挑战性的基准测试。

链接: https://arxiv.org/abs/2502.13668
作者: Tim Baumgärtner,Ted Briscoe,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; Mohamed bin Zayed University of Artificial Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at NAACL 2025

点击查看摘要

Abstract:We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset of other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data are available at this https URL.
zh

[NLP-40] Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models

【速读】: 该论文旨在解决句子嵌入(Sentence Embedding)任务中依赖手动标注数据限制可扩展性的问题。现有的对比学习方法虽然在像自然语言推理(NLI)这样的标注数据集上表现良好,但受限于手动标签。此外,利用大规模语言模型(LLMs)生成句子对的方法虽减少了标注依赖,但忽视了细粒度语义区分所需的排序信息。论文的关键解决方案在于通过控制潜在空间中LLMs的生成方向来引入排序信息,并将排序信息与语义信息整合到现有的句子嵌入模型中,从而实现更有效的语义区分,并取得新的最先进(SOTA)性能。

链接: https://arxiv.org/abs/2502.13656
作者: Liyang He,Chenglong Liu,Rui Li,Zhenya Huang,Shulan Ruan,Jun Zhou,Enhong Chen
机构: School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using annotated datasets like NLI. Yet, the reliance on manual labels limits scalability. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. However, they overlook ranking information crucial for fine-grained semantic distinctions. To tackle this challenge, we propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Then, we refine existing sentence embedding models by integrating ranking information and semantic information. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
zh

[NLP-41] C2T: A Classifier-Based Tree Construction Method in Speculative Decoding

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在推理延迟和计算成本增加的问题。为应对这些问题,现有推测解码方法在构建令牌树和验证候选令牌时效率低下。论文的关键解决方案在于提出了一种名为C2T的新方法,该方法采用轻量级分类器动态生成和剪枝令牌树。通过考虑除了常用的联合概率之外的额外特征变量,C2T能够预测每个候选令牌的信任分数,从而确定是否将其作为验证对象。这种方法在多个基准测试中超越了最先进的方法如EAGLE-2,减少了25%的候选令牌总数,同时保持或提高了接受长度。

链接: https://arxiv.org/abs/2502.13652
作者: Feiye Huo,Jianchao Tan,Kefeng Zhang,Xunliang Cai,Shengli Sun
机构: Peking University(北京大学); Meituan(美团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing scale of Large Language Models (LLMs) has exacerbated inference latency and computational costs. Speculative decoding methods, which aim to mitigate these issues, often face inefficiencies in the construction of token trees and the verification of candidate tokens. Existing strategies, including chain mode, static tree, and dynamic tree approaches, have limitations in accurately preparing candidate token trees for verification. We propose a novel method named C2T that adopts a lightweight classifier to generate and prune token trees dynamically. Our classifier considers additional feature variables beyond the commonly used joint probability to predict a confidence score for each draft token, which determines whether it is kept as a candidate for verification. This method outperforms state-of-the-art (SOTA) methods such as EAGLE-2 on multiple benchmarks, by reducing the total number of candidate tokens by 25% while maintaining or even improving the acceptance length.
zh
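
C2T 的轻量分类器可以示意如下:在联合概率之外加入若干特征(此处假设为树深度与草稿分布熵,仅作演示),训练一个逻辑回归来预测草稿 token 被验证接受的置信度,并据此剪枝。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per draft token: [log joint prob, tree depth, entropy].
# Depth and entropy are assumed extra features for illustration; the
# paper's exact feature set may differ.
X_train = np.array([
    [-0.2, 1, 0.4], [-1.5, 2, 1.9], [-0.4, 1, 0.6],
    [-2.3, 3, 2.2], [-0.1, 1, 0.3], [-1.8, 2, 2.0],
])
y_train = np.array([1, 0, 1, 0, 1, 0])  # 1 = token accepted at verification

clf = LogisticRegression().fit(X_train, y_train)

def keep_in_tree(features, threshold=0.5):
    """Prune a draft token unless the classifier's confidence that it
    will be accepted exceeds the threshold."""
    return clf.predict_proba(np.array([features]))[0, 1] >= threshold

print(keep_in_tree([-0.3, 1, 0.5]))   # likely True
print(keep_in_tree([-2.0, 3, 2.1]))   # likely False
```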

[NLP-42] Reliability Across Parametric and External Knowledge: Understanding Knowledge Handling in LLM s

【速读】: 该论文旨在解决大型语言模型(LLMs)在利用参数知识和外部知识时所面临的知识处理难题。具体而言,这些挑战包括解决知识源之间的冲突、避免受不具信息性的外部知识干扰以及在知识不足时正确拒绝作答。论文的关键解决方案在于提出了一种综合框架,通过分析参数知识的存在性和外部知识的信息性这两个维度来系统评估LLMs的知识处理能力,并通过针对性的数据训练提升其知识整合与应用的可靠性。

链接: https://arxiv.org/abs/2502.13648
作者: Youna Kim,Minjoon Choi,Sungmin Cho,Hyuhng Joon Kim,Sang-goo Lee,Taeuk Kim
机构: Seoul National University (首尔国立大学); IntelliSys, Korea (韩国智能系统研究所); Hanyang University (汉阳大学)
类目: Computation and Language (cs.CL)
备注: under-review

点击查看摘要

Abstract:Large Language Models (LLMs) enhance their problem-solving capability by leveraging both parametric and external knowledge. Beyond leveraging external knowledge to improve response accuracy, they require key capabilities for reliable knowledge-handling: resolving conflicts between knowledge sources, avoiding distraction from uninformative external knowledge, and abstaining when sufficient knowledge is unavailable. Prior studies have examined these scenarios in isolation or with limited scope. To systematically evaluate these capabilities, we introduce a comprehensive framework for analyzing knowledge-handling based on two key dimensions: the presence of parametric knowledge and the informativeness of external knowledge. Through analysis, we identify biases in knowledge utilization and examine how the ability to handle one scenario impacts performance in others. Furthermore, we demonstrate that training on data constructed based on the knowledge-handling scenarios improves LLMs’ reliability in integrating and utilizing knowledge.
zh

[NLP-43] Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

【速读】: 该论文旨在解决低资源语言中指令调优(Instruction Tuning)研究不足的问题,特别是在政府和文化领域由于文本数据有限所导致的挑战。论文的关键解决方案在于引入并开源了一个大规模(10,600样本)的指令遵循(Instruction-Following, IFT)数据集,该数据集涵盖了与哈萨克斯坦相关的机构和文化知识。通过使用大型语言模型(LLM)辅助的数据生成方法,并比较开放权重(open-weight)和封闭权重(closed-weight)模型进行数据集构建,最终选用GPT-4o作为基础模型。每个数据集实体都经过全面的人工验证以确保高质量。此外,论文展示了在Qwen、Falcon和Gemma模型上进行微调能够一致地提高多项选择和生成任务的表现,证明了LLM辅助指令调优在低资源语言中的潜力。

链接: https://arxiv.org/abs/2502.13647
作者: Nurkhan Laiyk,Daniil Orel,Rituraj Joshi,Maiya Goloburda,Yuxia Wang,Preslav Nakov,Fajri Koto
机构: Department of Natural Language Processing, MBZUAI (自然语言处理系, MBZUAI); Cerebras Systems (Cerebras系统)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs’ understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
zh

[NLP-44] D.Va: Validate Your Demonstration First Before You Use It

【速读】: 该论文旨在解决在情境学习(In-Context Learning, ICL)中,通过选择有效示例来增强大型语言模型(LLMs)能力时存在的局限性。具体而言,现有方法通常依赖于直观指标评估示例的有效性,这导致了较低的鲁棒性和跨模型泛化能力。论文的关键解决方案是引入了一种名为“示范验证”(Demonstration Validation)的新机制,该机制能够有效地识别出既有效又具有高度泛化能力的示例。这种方法在自然语言理解(NLU)和自然语言生成(NLG)任务中超越了所有现有的示例选择技术,并展示了其在不同语言模型中的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2502.13646
作者: Qi Zhang,Zhiqing Xiao,Ruixuan Xiao,Lirong Gao,Junbo Zhao
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:In-context learning (ICL) has demonstrated significant potential in enhancing the capabilities of large language models (LLMs) during inference. It’s well-established that ICL heavily relies on selecting effective demonstrations to generate outputs that better align with the expected results. As for demonstration selection, previous approaches have typically relied on intuitive metrics to evaluate the effectiveness of demonstrations, which often results in limited robustness and poor cross-model generalization capabilities. To tackle these challenges, we propose a novel method, Demonstration VAlidation (D.Va), which integrates a demonstration validation perspective into this field. By introducing the demonstration validation mechanism, our method effectively identifies demonstrations that are both effective and highly generalizable. D.Va surpasses all existing demonstration selection techniques across both natural language understanding (NLU) and natural language generation (NLG) tasks. Additionally, we demonstrate the robustness and generalizability of our approach across various language models with different retrieval models.
zh

[NLP-45] Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks

【速读】: 该论文旨在解决由自动语音识别(ASR)引入的转录噪声对下游自然语言处理(NLP)任务的影响问题。论文的关键解决方案在于提出了一种可配置的框架,用于在不同噪声强度和类型的情况下评估任务模型,并考察转录清洗技术的效果。该框架能够帮助研究任务模型的行为,从而支持生成有效的语音理解(SLU)解决方案。

链接: https://arxiv.org/abs/2502.13645
作者: Ori Shapira,Shlomo E. Chazan,Amir DN Cohen
机构: OriginAI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the increasing prevalence of recorded human speech, spoken language understanding (SLU) is essential for its efficient processing. In order to process the speech, it is commonly transcribed using automatic speech recognition technology. This speech-to-text transition introduces errors into the transcripts, which subsequently propagate to downstream NLP tasks, such as dialogue summarization. While it is known that transcript noise affects downstream tasks, a systematic approach to analyzing its effects across different noise severities and types has not been addressed. We propose a configurable framework for assessing task models in diverse noisy settings, and for examining the impact of transcript-cleaning techniques. The framework facilitates the investigation of task model behavior, which can in turn support the development of effective SLU solutions. We exemplify the utility of our framework on three SLU tasks and four task models, offering insights regarding the effect of transcript noise on tasks in general and models in particular. For instance, we find that task models can tolerate a certain level of noise, and are affected differently by the types of errors in the transcript.
zh
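
框架中可配置的噪声注入环节可以示意为:按给定强度随机对词做替换/删除/插入。真实框架支持更贴近 ASR 错误分布的注入方式,此处的破坏方式仅为占位演示。

```python
import random

def add_transcript_noise(words, severity=0.1, seed=0,
                         error_types=("substitute", "delete", "insert")):
    """Inject ASR-style errors into a gold transcript at a configurable
    rate; corruption rules below are placeholders, not real ASR errors."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < severity:
            err = rng.choice(error_types)
            if err == "substitute":
                out.append(w[::-1])        # placeholder corruption
            elif err == "insert":
                out.extend([w, "uh"])
            # "delete": append nothing
        else:
            out.append(w)
    return out

noisy = add_transcript_noise("the meeting starts at noon today".split(),
                             severity=0.4)
print(" ".join(noisy))
```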

[NLP-46] Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts

【速读】: 该论文旨在解决大型语言模型(LLMs)在双语环境中,尤其是涉及低资源语言(如哈萨克语)和高资源语言(如俄语)的地区,安全评估方面存在的不足。论文的关键解决方案是引入了一个名为Qorgau的新数据集,专门用于哈萨克语和俄语的安全评估,以反映哈萨克斯坦独特的双语环境。实验表明,多语言和语言特定的LLMs在安全性能上存在显著差异,强调了为特定区域定制数据集的重要性,以确保LLMs在类似哈萨克斯坦的国家中负责任且安全地部署。

链接: https://arxiv.org/abs/2502.13640
作者: Maiya Goloburda,Nurkhan Laiyk,Diana Turmakhan,Yuxia Wang,Mukhammed Togmanov,Jonibek Mansurov,Askhat Sametov,Nurdaulet Mukhituly,Minghan Wang,Daniil Orel,Zain Muhammad Mujahid,Fajri Koto,Timothy Baldwin,Preslav Nakov
机构: Department of Natural Language Processing, MBZUAI (自然语言处理系,MBZUAI); Monash University (蒙纳士大学); The University of Melbourne (墨尔本大学); LibrAI (LibrAI)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
zh

[NLP-47] Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)的可解释性难题,特别是在现有系统管道中集成可解释性和干预性方法的关键挑战。论文提出的关键解决方案是引入概念层(Concept Layers, CLs),通过将模型内部向量表示投影到一个概念性的、可解释的向量空间中,并在重新构建后反馈到模型中,从而实现模型的可解释性和干预性。此外,论文还通过算法自动搜索本体论中的概念集,消除了人工选择概念集的需求,使得这些概念既可以是任务特定的,也可以是任务无关的。这种方法不仅保持了原始模型的性能和一致性,还实现了有意义的干预,如动态调整模型行为以减轻偏见。

链接: https://arxiv.org/abs/2502.13632
作者: Or Raphael Bidusa,Shaul Markovitch
机构: The Henry and Marilyn Taub Faculty of Computer Science (塔布计算机科学学院); Technion – Israel Institute of Technology (以色列理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The opaque nature of Large Language Models (LLMs) has led to significant research efforts aimed at enhancing their interpretability, primarily through post-hoc methods. More recent in-hoc approaches, such as Concept Bottleneck Models (CBMs), offer both interpretability and intervenability by incorporating explicit concept representations. However, these methods suffer from key limitations, including reliance on labeled concept datasets and significant architectural modifications that challenge re-integration into existing system pipelines. In this work, we introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model’s internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. Furthermore, we eliminate the need for a human-selected concept set by algorithmically searching an ontology for a set of concepts that can be either task-specific or task-agnostic. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model’s performance and agreement while enabling meaningful interventions. Additionally, we present a proof of concept showcasing an intervenability interface, allowing users to adjust model behavior dynamically, such as mitigating biases during inference.
zh
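
概念层的“投影—干预—重建”循环可以示意如下:用概念嵌入矩阵的伪逆把隐藏状态映射到概念坐标,按需修改某些概念的取值,再映射回模型空间。最小二乘式的重建方式为本文假设,论文中的具体映射可能不同。

```python
import torch

def concept_layer(h, C, intervene=None):
    # h: (batch, d_model) hidden states; C: (n_concepts, d_model) concept
    # embeddings. Least-squares projection assumed: find concept scores s
    # with s @ C ~= h, optionally edit them, then map back to model space.
    scores = h @ torch.linalg.pinv(C)          # (batch, n_concepts)
    if intervene:
        for concept_idx, value in intervene.items():
            scores[:, concept_idx] = value     # e.g. suppress a bias concept
    return scores @ C

C = torch.randn(10, 32)                        # 10 concepts, d_model = 32
h = torch.randn(4, 32)
print(concept_layer(h, C, intervene={3: 0.0}).shape)   # torch.Size([4, 32])
```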

[NLP-48] Non-Euclidean Hierarchical Representational Learning using Hyperbolic Graph Neural Networks for Environmental Claim Detection

【速读】: 该论文旨在解决环境声明检测中的高计算需求和缺乏透明度的问题。解决方案的关键在于利用图神经网络(Graph Neural Networks, GNNs)和双曲图神经网络(Hyperbolic Graph Neural Networks, HGNNs),将环境声明检测重新定义为图分类问题,并通过构建依赖解析图显式建模句法结构,使用词嵌入(word2vec)作为节点特征,依赖关系作为边特征。实验结果表明,这些基于图的方法在性能上可与最先进的变换器模型媲美或更优,同时参数量减少30倍,从而突显了其在结构化、可解释性和计算效率方面的优势。

链接: https://arxiv.org/abs/2502.13628
作者: Darpan Aswal,Manjira Sinha
机构: Indian Institute of Technology, Kharagpur(印度理工学院,卡哈格普尔); TCS Research(塔塔咨询服务有限公司研究部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transformer-based models dominate NLP tasks like sentiment analysis, machine translation, and claim verification. However, their massive computational demands and lack of interpretability pose challenges for real-world applications requiring efficiency and transparency. In this work, we explore Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks (HGNNs) as lightweight yet effective alternatives for Environmental Claim Detection, reframing it as a graph classification problem. We construct dependency parsing graphs to explicitly model syntactic structures, using simple word embeddings (word2vec) for node features with dependency relations encoded as edge features. Our results demonstrate that these graph-based models achieve comparable or superior performance to state-of-the-art transformers while using 30x fewer parameters. This efficiency highlights the potential of structured, interpretable, and computationally efficient graph-based approaches.
zh

[NLP-49] REFIND: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models

【速读】: 该论文旨在解决大型语言模型(LLM)输出中的幻觉现象,这严重限制了它们在知识密集型任务(如问答)中的可靠性。论文的关键解决方案是引入REFIND框架,通过直接利用检索到的文档来检测LLM输出中的幻觉片段。REFIND的核心创新之处在于提出了上下文敏感性比率(CSR),这一新型指标量化了LLM输出对检索证据的敏感度。这种方法使得REFIND能够高效且准确地检测幻觉现象,从而显著优于现有的基线模型,并在九种不同语言的评估中表现出色。

链接: https://arxiv.org/abs/2502.13622
作者: DongGeon Lee,Hwanjo Yu
机构: Pohang University of Science and Technology (POSTECH)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering. To address this challenge, we introduce REFIND (Retrieval-augmented Factuality hallucINation Detection), a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents. As part of the REFIND, we propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. This innovative approach enables REFIND to efficiently and accurately detect hallucinations, setting it apart from existing methods. In the evaluation, REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models, achieving superior IoU scores in identifying hallucinated spans. This work highlights the effectiveness of quantifying context sensitivity for hallucination detection, thereby paving the way for more reliable and trustworthy LLM applications across diverse languages.
zh
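
摘要未给出 CSR 的精确公式,下面按“检索证据使 token 概率提升多少”的似然比思路给出一种可能的读法:比值接近 1(证据几乎不影响该 token)的位置更可能属于幻觉片段。阈值取 1.2 仅为演示。

```python
import torch

def context_sensitivity_ratio(logp_with_ctx: torch.Tensor,
                              logp_without_ctx: torch.Tensor) -> torch.Tensor:
    # One possible reading of CSR: the likelihood ratio
    # p(token | question, retrieved docs) / p(token | question).
    return torch.exp(logp_with_ctx - logp_without_ctx)

logp_ctx  = torch.log(torch.tensor([0.90, 0.40, 0.05]))
logp_free = torch.log(torch.tensor([0.10, 0.35, 0.06]))
csr = context_sensitivity_ratio(logp_ctx, logp_free)
flagged = csr < 1.2          # illustrative threshold for hallucinated spans
print(csr.tolist(), flagged.tolist())
```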

[NLP-50] Complex Ontology Matching with Large Language Model Embeddings

【速读】: 该论文旨在解决本体(Ontology)和知识图谱匹配领域中表达能力未充分展现的问题。尽管嵌入(embeddings)和语言模型(language models)在该任务中的应用日益增多,但生成具有表现力的对应关系的方法仍未充分利用这些模型,特别是大型语言模型(LLMs)。论文的关键解决方案在于将LLMs集成到基于对齐需求和基于ABox的关系发现方法中,通过匹配实例子图的相似环境来生成对应关系,并对标签相似性、子图匹配和实体匹配进行架构上的修改。实验结果表明,使用LLMs的集成方法显著优于其他模型,使基线方法的F测度提高了45%。

链接: https://arxiv.org/abs/2502.13619
作者: Guilherme Sousa,Rinaldo Lima,Cassia Trojahn
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ontology, and more broadly, Knowledge Graph Matching is a challenging task in which expressiveness has not been fully addressed. Despite the increasing use of embeddings and language models for this task, approaches for generating expressive correspondences still do not take full advantage of these models, in particular, large language models (LLMs). This paper proposes to integrate LLMs into an approach for generating expressive correspondences based on alignment need and ABox-based relation discovery. The generation of correspondences is performed by matching similar surroundings of instance sub-graphs. The integration of LLMs results in different architectural modifications, including label similarity, sub-graph matching, and entity matching. The performance of word embeddings, sentence embeddings, and LLM-based embeddings was compared. The results demonstrate that integrating LLMs surpasses all other models, enhancing the baseline version of the approach with a 45% increase in F-measure.
zh

[NLP-51] BeamLoRA: Beam-Constraint Low-Rank Adaptation

【速读】: 该论文旨在解决低秩适应(Low-Rank Adaptation, LoRA)在高效微调大型语言模型时准确性仍有提升空间的问题。关键在于提出BeamLoRA方法,将每个LoRA模块视为一个光束,其中每一秩自然对应一个潜在的子解,微调过程则成为寻找最优子解组合的过程。BeamLoRA通过动态消除表现不佳的子解并扩展有前景的子解参数空间,在固定秩的情况下提升了性能。

链接: https://arxiv.org/abs/2502.13604
作者: Naibin Gu,Zhenyu Zhang,Xiyu Liu,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所), Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院), Beijing, China; Baidu Inc. (百度公司), Beijing, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Due to the demand for efficient fine-tuning of large language models, Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves efficiency, there remains room for improvement in accuracy. Herein, we adopt a novel perspective to assess the characteristics of LoRA ranks. The results reveal that different ranks within the LoRA modules not only exhibit varying levels of importance but also evolve dynamically throughout the fine-tuning process, which may limit the performance of LoRA. Based on these findings, we propose BeamLoRA, which conceptualizes each LoRA module as a beam where each rank naturally corresponds to a potential sub-solution, and the fine-tuning process becomes a search for the optimal sub-solution combination. BeamLoRA dynamically eliminates underperforming sub-solutions while expanding the parameter space for promising ones, enhancing performance with a fixed rank. Extensive experiments across three base models and 12 datasets spanning math reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA consistently enhances the performance of LoRA, surpassing the other baseline methods.
zh

[NLP-52] Efficient Safety Retrofitting Against Jailbreaking for LLM s

【速读】: 该论文旨在解决大型语言模型(LLMs)在面对越狱攻击(jailbreaking attacks)时的安全性问题,同时力求最小化数据需求和训练成本。论文的关键解决方案是直接偏好优化(DPO, Direct Preference Optimization),它通过训练模型以偏好数据为导向,从而绕过显式的奖励模型,实现对LLMs输出的引导。DPO方法的简便性使其易于适应不同的领域和安全要求。研究引入了一个名为Egida的数据集,并验证了其在增强模型安全性方面的有效性,同时评估了模型在一般任务中的性能下降及其拒绝倾向。该方法通过小规模的训练努力(如使用2,000个样本)显著降低了攻击成功率(降低10%-30%),并且证明了增强后的模型能够泛化到未见过的主题和攻击样式。

链接: https://arxiv.org/abs/2502.13603
作者: Dario Garcia-Gasulla,Anna Arias-Duart,Adrian Tormos,Daniel Hinjos,Oscar Molina-Sedano,Ashwin Kumar Gururajan,Maria Eugenia Cardello
机构: Barcelona Supercomputing Center (BSC)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO’s effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general-purpose tasks, and their tendency to over-refuse. Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost ($3 for 8B models, $20 for 72B models). Safety-aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%. Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices. To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors and the associated dataset Egida-HSafe is released. Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO while outlining its current limitations. All datasets and models are released to enable reproducibility and further research.
zh
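
Egida 使用的 DPO 目标是标准形式,可直接写出:让策略相对冻结参考模型更偏好安全(chosen)回复、而非不安全(rejected)回复。

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO objective: widen the margin by which the policy
    # prefers the chosen response over the rejected one, measured
    # relative to a frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -10.5]), torch.tensor([-11.0, -11.5]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-10.5, -11.0]))
print(float(loss))
```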

[NLP-53] MMTEB: Massive Multilingual Text Embedding Benchmark ICLR

【速读】: 该论文旨在解决现有文本嵌入评估存在的语言、领域和任务多样性限制问题。为克服这些局限性,论文引入了大规模多语言文本嵌入基准测试(MMTEB),涵盖超过500个经过质量控制的评估任务,涉及250多种语言。解决方案的关键在于开发了一个包含多样化挑战性任务(如指令跟随、长文档检索和代码检索)的大型多语言基准测试集合,并通过一种基于任务间相关性的新颖降采样方法确保模型排名的相对一致性,同时优化了诸如检索等任务以提高效率。这些改进使得基准测试在大幅降低计算需求的同时仍能保持模型排名的稳定性。

链接: https://arxiv.org/abs/2502.13595
作者: Kenneth Enevoldsen,Isaac Chung,Imene Kerboua,Márton Kardos,Ashwin Mathur,David Stap,Jay Gala,Wissam Siblini,Dominik Krzemiński,Genta Indra Winata,Saba Sturua,Saiteja Utpala,Mathieu Ciancone,Marion Schaeffer,Gabriel Sequeira,Diganta Misra,Shreeya Dhakal,Jonathan Rystrøm,Roman Solomatin,Ömer Çağatan,Akash Kundu,Martin Bernstorff,Shitao Xiao,Akshita Sukhlecha,Bhavish Pahwa,Rafał Poświata,Kranthi Kiran GV,Shawon Ashraf,Daniel Auras,Björn Plüster,Jan Philipp Harries,Loïc Magne,Isabelle Mohr,Mariya Hendriksen,Dawei Zhu,Hippolyte Gisserot-Boukhlef,Tom Aarsen,Jan Kostkan,Konrad Wojtasik,Taemin Lee,Marek Šuppa,Crystina Zhang,Roberta Rocca,Mohammed Hamdy,Andrianos Michail,John Yang,Manuel Faysse,Aleksei Vatolin,Nandan Thakur,Manan Dey,Dipam Vasani,Pranjal Chitale,Simone Tedeschi,Nguyen Tai,Artem Snegirev,Michael Günther,Mengzhou Xia,Weijia Shi,Xing Han Lù,Jordan Clive,Gayatri Krishnakumar,Anna Maksimova,Silvan Wehrli,Maria Tikhonova,Henil Panchal,Aleksandr Abramov,Malte Ostendorff,Zheng Liu,Simon Clematide,Lester James Miranda,Alena Fenogenova,Guangyu Song,Ruqiya Bin Safi,Wen-Ding Li,Alessia Borghini,Federico Cassano,Hongjin Su,Jimmy Lin,Howard Yen,Lasse Hansen,Sara Hooker,Chenghao Xiao,Vaibhav Adlakha,Orion Weller,Siva Reddy,Niklas Muennighoff
机构: Aarhus University; Individual Contributor; Esker; INSA Lyon, LIRIS; University of Amsterdam; MBZUAI; Jina AI; Microsoft Research; Wikit; McGill University; University of Oxford; ITMO University; Koç University; Heritage Institute of Technology; Apart Research; BAAI; National Information Processing Institute; New York University; Ellamind; Peking University; CentraleSupélec; Artefact Research Center; Hugging Face; Wrocław University; Korea University; Illuin Technology; Comenius University Bratislava; Cisco Systems; University of Waterloo; Seoul National University; Salesforce; University of Zurich; Stanford University; FRC CSC RAS; IIT Madras; Sapienza University of Rome; University of Pennsylvania; University of Washington; Imperial College London; R. V. College of Engineering; Robert Koch Institute; HSE University; Nirma University; Occiglot; Allen Institute for AI; Northeastern University; Hong Kong University; Durham University; ServiceNow Research; Johns Hopkins University; Contextual AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted for ICLR: this https URL

点击查看摘要

Abstract:Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
zh

[NLP-54] Dont Stop the Multi-Party! On Generating Synthetic Multi-Party Conversations with Constraints

【速读】: 该论文旨在解决多党对话(MPCs)数据集在隐私问题及平台特定属性限制下的局限性。为克服这些限制,论文提出利用指令调优的大语言模型(LLMs)生成多样化的MPCs,并通过提供确定性约束(如对话结构和参与者的立场)来确保生成对话的质量和复杂度。关键在于探索两种利用LLMs生成MPCs的策略:一是让LLMs一次性生成整个对话,二是逐轮生成对话。研究发现,逐轮生成的方式更符合约束条件且具有更高的语言变异性,但两种策略均能生成高质量的MPCs。

链接: https://arxiv.org/abs/2502.13592
作者: Nicolò Penzo,Marco Guerini,Bruno Lepri,Goran Glavaš,Sara Tonelli
机构: Fondazione Bruno Kessler(布鲁诺·凯勒基金会), Italy; University of Trento(特伦托大学), Italy; Center For Artificial Intelligence and Data Science, University of Würzburg(伍珀塔尔大学人工智能与数据科学中心), Germany
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-Party Conversations (MPCs) are widely studied across disciplines, with social media as a primary data source due to their accessibility. However, these datasets raise privacy concerns and often reflect platform-specific properties. For example, interactions between speakers may be limited due to rigid platform structures (e.g., threads, tree-like discussions), which yield overly simplistic interaction patterns (e.g., as a consequence of ``reply-to’’ links). This work explores the feasibility of generating diverse MPCs with instruction-tuned Large Language Models (LLMs) by providing deterministic constraints such as dialogue structure and participants’ stance. We investigate two complementary strategies of leveraging LLMs in this context: (i.) LLMs as MPC generators, where we task the LLM to generate a whole MPC at once and (ii.) LLMs as MPC parties, where the LLM generates one turn of the conversation at a time, provided the conversation history. We next introduce an analytical framework to evaluate compliance with the constraints, content quality, and interaction complexity for both strategies. Finally, we assess the quality of obtained MPCs via human annotation and LLM-as-a-judge evaluations. We find stark differences among LLMs, with only some being able to generate high-quality MPCs. We also find that turn-by-turn generation yields better conformance to constraints and higher linguistic variability than generating MPCs in one pass. Nonetheless, our structural and qualitative evaluation indicates that both generation strategies can yield high-quality MPCs.
zh

[NLP-55] LSR-Adapt: Ultra-Efficient Parameter Tuning with Matrix Low Separation Rank Kernel Adaptation

【速读】: 该论文旨在解决大规模预训练模型在适应下游任务时参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的挑战,特别是低秩适应方法在现代大型语言模型中的局限性。论文的关键在于提出了一种有效的核化方法,即基于矩阵低分离秩(Low-Separation-Rank, LSR)表示的核,用于大型网络线性层的低秩适配器矩阵,称为LSR-Adapt核。通过这种高效的核表示,论文实现了超越现有低秩方法的最先进性能,且仅使用几乎一半的参数即可达到更高的精度。这一结构假设还由于Kronecker计算的高度并行化特性,开启了进一步GPU优化的可能性。

链接: https://arxiv.org/abs/2502.13568
作者: Xin Li,Anand Sarwate
机构: Rutgers, the State University of New Jersey(罗格斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Imposing an effective structural assumption on neural network weight matrices has been the major paradigm for designing Parameter-Efficient Fine-Tuning (PEFT) systems for adapting modern large pre-trained models to various downstream tasks. However, low rank based adaptation has become increasingly challenging due to the sheer scale of modern large language models. In this paper, we propose an effective kernelization to further reduce the number of parameters required for adaptation tasks. Specifically, from the classical idea in numerical analysis regarding matrix Low-Separation-Rank (LSR) representations, we develop a kernel using this representation for the low rank adapter matrices of the linear layers from large networks, named the Low Separation Rank Adaptation (LSR-Adapt) kernel. With the ultra-efficient kernel representation of the low rank adapter matrices, we manage to achieve state-of-the-art performance with even higher accuracy with almost half the number of parameters as compared to conventional low rank based methods. This structural assumption also opens the door to further GPU-side optimizations due to the highly parallelizable nature of Kronecker computations.
zh

[NLP-56] Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLM s

【速读】: 该论文旨在从89,339份简短的芬兰语访谈记录(受访者为二战后自芬兰东卡累利阿地区重新安置的难民家庭)中,通过零样本信息提取方法,分别针对每个家庭成员提取社会组织和兴趣爱好,以作为衡量难民在新环境中的社会融合程度的代理变量。研究的关键在于评估多种不同的方法来处理这一任务,包括若干生成模型(如GPT-4)和监督学习方法(如基于FinBERT的微调),从而更全面地了解这些不同方法的相对优势及其在类似研究中的适用性。研究表明,最优的生成模型(GPT-4)与人类表现相当,F分数达到88.8%,而开放模型(Llama-3-70B-Instruct)也达到了87.7%的F分数,证明了开放模型在非英语数据上的实用性。此外,通过使用GPT-4生成的训练数据对FinBERT进行微调的方法,实现了高达86.3%的F分数,表明这种方法在资源有限或数据量大的情况下特别有吸引力。

链接: https://arxiv.org/abs/2502.13566
作者: Joonatan Laato,Jenna Kanerva,John Loehr,Virpi Lummaa,Filip Ginter
机构: TurkuNLP, Department of Computing, University of Turku (图尔库大学), Finland; Lammi Biological Station, Faculty of Biological and Environmental Sciences, University of Helsinki (赫尔辛基大学), Finland; Department of Biology, University of Turku (图尔库大学), Finland
类目: Computation and Language (cs.CL)
备注: Published at Proceedings of Fifth Conference on Computational Humanities Research (CHR’2024), December 2024 this https URL

点击查看摘要

Abstract:We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain a broader insight into the relative merits of these different approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8%. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at 87.7% F-score, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, where we fine-tune a Finnish BERT model (FinBERT) using GPT-4 generated training data. By this method, we achieve an F-score of 84.1% with just 6K interviews, rising to 86.3% with 30K interviews. Such an approach would be particularly appealing in cases where the computational resources are limited, or there is a substantial mass of data to process.
zh

[NLP-57] PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models

【速读】: 该论文旨在解决在人与大型语言模型(Large Language Models, LLMs)交互过程中用户数据隐私保护的问题。论文的关键解决方案在于提出了一种多阶段策略,该策略能够在保护用户信息不被泄露的同时,保持云端LLMs响应的质量。此外,论文构建了一个名为SensitiveQA的隐私开放问答数据集,包含中英双语共计57k次交互,涵盖了广泛的用户敏感信息。

链接: https://arxiv.org/abs/2502.13564
作者: Guangwei Li,Yuansen Zhang,Yinggui Wang,Shoumeng Yan,Lei Wang,Tao Wei
机构: Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid development of large language models (LLMs) is redefining the landscape of human-computer interaction, and their integration into various user-service applications is becoming increasingly prevalent. However, transmitting user data to cloud-based LLMs presents significant risks of data breaches and unauthorized access to personal identification information. In this paper, we propose a privacy preservation pipeline for protecting privacy and sensitive information during interactions between users and LLMs in practical LLM usage scenarios. We construct SensitiveQA, the first privacy open-ended question-answering dataset. It comprises 57k interactions in Chinese and English, encompassing a diverse range of user-sensitive information within the conversations. Our proposed solution employs a multi-stage strategy aimed at preemptively securing user information while simultaneously preserving the response quality of cloud-based LLMs. Experimental validation underscores our method’s efficacy in balancing privacy protection with maintaining robust interaction quality. The code and dataset are available at this https URL.
zh

[NLP-58] STaR-SQL: Self-Taught Reason er for Text-to-SQL

【速读】: 该论文旨在解决在结构化任务如文本到SQL转换(text-to-SQL)中应用逐步“链式思维”(chain-of-thought)推理方法的问题。解决方案的关键在于引入Self-Taught Reasoner for text-to-SQL (STaR-SQL),一种将SQL查询生成重新定义为基于推理的过程的方法。STaR-SQL通过引导大型语言模型(LLM)生成详细的推理步骤,并通过正确的推理结果进行微调,从而改进性能。此外,该方法在推理阶段投入额外的计算资源,使LLM成为自发的推理者而非简单的基于提示的代理。为了进一步提升推理过程的效率,还引入了结果监督奖励模型(ORM)作为验证器,以增强SQL查询的准确性。实验结果表明,STaR-SQL显著提升了text-to-SQL任务的执行精度,达到了86.6%,超过了现有的基线方法。

链接: https://arxiv.org/abs/2502.13550
作者: Mingqian He,Yongliang Shen,Wenqi Zhang,Qiuying Peng,Jun Wang,Weiming Lu
机构: Zhejiang University(浙江大学); OPPO Research Institute(OPPO研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generating step-by-step “chain-of-thought” rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.
zh
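
STaR 式的自我提升循环可以示意为:对每个问题采样“推理 + SQL”,在数据库上执行,仅保留执行结果与标准答案一致的样本用于微调。其中 llm_generate 为模型调用的占位;论文中还引入了结果监督奖励模型(ORM)作为验证器,此处省略。

```python
import sqlite3

def star_sql_round(llm_generate, questions, gold_results, conn):
    """One STaR iteration: sample a rationale + SQL per question, execute
    it, and keep only examples whose execution result matches the gold
    answer as fine-tuning data. `llm_generate` stands in for the model
    call and returns (rationale, sql)."""
    keep = []
    for q, gold in zip(questions, gold_results):
        rationale, sql = llm_generate(q)
        try:
            result = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue                          # malformed SQL: discard
        if result == gold:
            keep.append({"question": q, "rationale": rationale, "sql": sql})
    return keep                               # fine-tune on these, then repeat

# Toy run against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(city TEXT, pop INT)")
conn.execute("INSERT INTO t VALUES ('a', 1), ('b', 2)")
fake_llm = lambda q: ("pop > 1 filters large cities",
                      "SELECT city FROM t WHERE pop > 1")
print(star_sql_round(fake_llm, ["Which cities have pop > 1?"],
                     [[("b",)]], conn))
```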

[NLP-59] Detecting Linguistic Bias in Government Documents Using Large language Models

【速读】: 该论文旨在解决政府文件中偏见检测的关键需求,这是一个具有重要意义但研究不足的领域,现有方法往往未能充分考虑政府文件的独特背景及其深远影响,从而可能忽略其中隐含的偏见。为了解决这一问题,论文提出了荷兰政府数据偏见检测(Dutch Government Data for Bias Detection, DGDB)数据集,该数据集来源于荷兰众议院,并由专家进行了标注。解决方案的关键在于使用细调后的BERT模型在DGDB数据集上进行训练,并通过与生成式语言模型的对比来验证其性能,最终证明这些细调模型在偏见检测任务上的表现显著优于生成式语言模型。

链接: https://arxiv.org/abs/2502.13548
作者: Milena de Swart,Floris den Hengst,Jieying Chen
机构: Vrije Universiteit Amsterdam (阿姆斯特丹自由大学); Vrije Universiteit Amsterdam (阿姆斯特丹自由大学)
类目: Computation and Language (cs.CL)
备注: to appear in Proceedings of the ACM Web Conference 2025

点击查看摘要

Abstract:This paper addresses the critical need for detecting bias in government documents, an underexplored area with significant implications for governance. Existing methodologies often overlook the unique context and far-reaching impacts of governmental documents, potentially obscuring embedded biases that shape public policy and citizen-government interactions. To bridge this gap, we introduce the Dutch Government Data for Bias Detection (DGDB), a dataset sourced from the Dutch House of Representatives and annotated for bias by experts. We fine-tune several BERT-based models on this dataset and compare their performance with that of generative language models. Additionally, we conduct a comprehensive error analysis that includes explanations of the models’ predictions. Our findings demonstrate that fine-tuned models achieve strong performance and significantly outperform generative language models, indicating the effectiveness of DGDB for bias detection. This work underscores the importance of labeled datasets for bias detection in various languages and contributes to more equitable governance practices.
zh

[NLP-60] From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN

【速读】: 该论文旨在解决大型语言模型(LLMs)在可控长度文本生成(LCTG)方面能力不足的问题,这一局限性阻碍了其实际应用。论文的关键解决方案是MarkerGen方法,它通过外部工具集成来缓解LLM的基本缺陷,使用动态插入的标记进行显式的长度建模,并采用三阶段生成方案以更好地满足长度约束同时保持内容一致性。
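
为便于理解"动态插入标记进行显式长度建模"这一思路,下面给出一个极简示意:在目标文本中每隔若干词插入一个剩余词数标记,使长度约束在序列内显式可见。标记格式 <REM:n> 为本文假设,并非论文原始设计:

```python
def insert_length_markers(text: str, interval: int = 10) -> str:
    """每隔 interval 个词插入一个剩余词数标记(示意)。"""
    words = text.split()
    total = len(words)
    out = []
    for i, w in enumerate(words):
        if i % interval == 0:
            out.append(f"<REM:{total - i}>")  # 假设的标记格式
        out.append(w)
    out.append("<REM:0>")  # 终止标记:剩余长度为零
    return " ".join(out)

print(insert_length_markers(
    "one two three four five six seven eight nine ten eleven twelve", interval=5))
```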

链接: https://arxiv.org/abs/2502.13544
作者: Peiwen Yuan,Chuyi Tan,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Yueqi Zhang,Jiayi Shi,Boyuan Pan,Yao Hu,Kan Li
机构: School of Computer Science and Technology, Beijing Institute of Technology (北京理工大学计算机科学与技术学院); Xiaohongshu Inc (小红书股份有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the rapid progress of large language models (LLMs), their length-controllable text generation (LCTG) ability remains below expectations, posing a major limitation for practical applications. Existing methods mainly focus on end-to-end training to reinforce adherence to length constraints. However, the lack of decomposition and targeted enhancement of LCTG sub-abilities restricts further progress. To bridge this gap, we conduct a bottom-up decomposition of LCTG sub-abilities with human patterns as reference and perform a detailed error analysis. On this basis, we propose MarkerGen, a simple-yet-effective plug-and-play approach that: (1) mitigates LLM fundamental deficiencies via external tool integration; (2) conducts explicit length modeling with dynamically inserted markers; (3) employs a three-stage generation scheme to better align length constraints while maintaining content quality. Extensive experiments demonstrate that MarkerGen significantly improves LCTG across various settings, exhibiting outstanding effectiveness and generalizability.
zh

[NLP-61] Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference

【速读】: 该论文旨在解决大型语言模型(LLMs)在长上下文任务中的高效推理挑战,特别是在有限GPU内存条件下。现有方法采用滑动窗口技术积累历史键值(KV)对以供重用,并通过选择性保留子集进一步改进。然而,由于长上下文中稀疏的注意力分布,难以识别和召回相关的KV对,注意力容易被大量候选对分散。为了解决这些问题,论文提出了一种无需训练的ActQKV方法,该方法能够动态确定探针查询(probe-Query),并通过此查询有效地检索相关的KV对进行推理。ActQKV的关键在于利用激活偏差(Activation Bias)作为指示器,在每个上下文窗口内构建合适的探针查询,以及设计了一个基于层间信息密度的动态KV截断机制,以准确召回相关KV对并最小化无关对。
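
下面以numpy草图示意"激活偏差选探针、探针打分召回KV"的核心流程。其中激活偏差取隐状态相对窗口均值的偏离幅度,仅为便于说明的假设定义,并非论文原式:

```python
import numpy as np

def activation_bias(hidden: np.ndarray) -> np.ndarray:
    """token级激活偏差:各token隐状态相对窗口均值的偏离幅度(示意定义)。"""
    mean = hidden.mean(axis=0, keepdims=True)
    return np.linalg.norm(hidden - mean, axis=-1)

def build_probe_query(hidden: np.ndarray, n_probe: int = 4) -> np.ndarray:
    """选取激活偏差最大的若干token,取均值聚合为探针查询。"""
    idx = np.argsort(-activation_bias(hidden))[:n_probe]
    return hidden[idx].mean(axis=0)

def retrieve_kv(probe: np.ndarray, keys: np.ndarray, top_k: int = 64) -> np.ndarray:
    """用探针查询给历史key打分,召回得分最高的 top_k 个KV对的下标。"""
    scores = keys @ probe
    return np.argsort(-scores)[:top_k]

hidden = np.random.randn(128, 64)   # 当前窗口128个token的隐状态(占位)
keys = np.random.randn(4096, 64)    # 历史KV缓存中的key(占位)
kept = retrieve_kv(build_probe_query(hidden), keys, top_k=64)
```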

链接: https://arxiv.org/abs/2502.13542
作者: Qingfa Xiao,Jiachuan Wang,Haoyang Li,Cheng Deng,Jiaqi Tang,Shuangyin Li,Yongqi Zhang,Jun Wang,Lei Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong Polytechnic University (香港理工大学); South China Normal University (华南师范大学); University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have showcased exceptional performance in long-context tasks, while facing significant inference efficiency challenges with limited GPU memory. Existing solutions first proposed the sliding-window approach to accumulate a set of historical key-value (KV) pairs for reuse, then further improvements selectively retain its subsets at each step. However, due to the sparse attention distribution across a long context, it is hard to identify and recall relevant KV pairs, as the attention is distracted by massive candidate pairs. Additionally, we found it promising to select representative tokens as probe-Query in each sliding window to effectively represent the entire context, which is an approach overlooked by existing methods. Thus, we propose ActQKV, a training-free, Activation-aware approach that dynamically determines probe-Query and leverages it to retrieve the relevant KV pairs for inference. Specifically, ActQKV monitors a token-level indicator, Activation Bias, within each context window, enabling the proper construction of probe-Query for retrieval at the pre-filling stage. To accurately recall the relevant KV pairs and minimize the irrelevant ones, we design a dynamic KV cut-off mechanism guided by information density across layers at the decoding stage. Experiments on the LongBench and ∞-Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
zh

[NLP-62] Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models ICLR2025

【速读】: 该论文旨在解决大规模语言模型(LLMs)在使用低秩适应(LoRA)方法进行微调时内存消耗过大的问题。论文的关键解决方案是提出了一种名为LoRAM的新颖训练方案,通过在修剪后的模型上训练以获得低秩矩阵,并在推理阶段恢复这些矩阵以与原始模型结合使用。此外,论文还介绍了低成本持续预训练的方法,以弥合修剪模型与原始模型之间的知识差距。这种方法显著减少了内存使用,使得具有700亿参数的模型仅需20GB的HBM内存即可完成训练,相比传统的LoRA和全量微调方法大幅降低了硬件需求。
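
"在剪枝小模型上训练、在原始大模型上推理"的恢复步骤可用下面的numpy草图说明:把小模型维度上学到的低秩矩阵按剪枝保留下标嵌回大模型维度,未保留的行列补零。剪枝下标与训练结果此处均用随机数占位,仅为示意:

```python
import numpy as np

d_full, d_pruned, r = 1024, 512, 8
keep = np.sort(np.random.choice(d_full, d_pruned, replace=False))  # 假设的剪枝保留下标

# 在剪枝后的小模型上训练得到的低秩矩阵(此处用随机数代替训练结果)
A_small = np.random.randn(d_pruned, r) * 0.01
B_small = np.random.randn(r, d_pruned) * 0.01

# 恢复:把小模型维度上的低秩增量嵌回原始大模型维度,未保留的行/列置零
A_full = np.zeros((d_full, r)); A_full[keep] = A_small
B_full = np.zeros((r, d_full)); B_full[:, keep] = B_small

W = np.random.randn(d_full, d_full)  # 原始(大)模型权重,推理时保持冻结
W_adapted = W + A_full @ B_full      # 推理时直接使用恢复后的低秩增量
```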

链接: https://arxiv.org/abs/2502.13533
作者: Jun Zhang,Jue Wang,Huan Li,Lidan Shou,Ke Chen,Yang You,Guiming Xie,Xuejian Gong,Kunlong Zhou
机构: The State Key Laboratory of Blockchain and Data Security, Zhejiang University (浙江大学); Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security (杭州高新技术区滨江区块链与数据安全研究所); College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系); AI Center, Guangdong OPPO Mobile Telecommunications Corp., Ltd. (广东OPPO移动通信有限公司AI中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaption (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM implemented by structured pruning combined with 4-bit quantization, for LLaMA-3.1-70B (LLaMA-2-70B), reduces the parameter storage cost that dominates the memory usage in low-rank matrix training by 15.81× (16.95×), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B).
zh

[NLP-63] A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment

【速读】: 该论文旨在解决阿拉伯语文本可读性评估的问题。解决方案的关键在于构建了一个大规模、细粒度的阿拉伯语数据集——平衡的阿拉伯语可读性评估语料库(Balanced Arabic Readability Evaluation Corpus, BAREC),包含19个不同的可读性级别,覆盖从幼儿园到研究生水平的理解能力。BAREC由一个大型标注团队进行完全手动标注,并通过高平均二次加权Kappa值(81.3%)确保了标注的一致性和可靠性。此外,论文还评估了不同粒度级别的自动可读性评估方法,展示了各种方法的竞争性能。
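
文中报告的标注一致性指标是二次加权Kappa(Quadratic Weighted Kappa, QWK),其计算可用下面的小例子说明(19个有序等级,惩罚随等级差的平方增长):

```python
import numpy as np

def quadratic_weighted_kappa(a: np.ndarray, b: np.ndarray, n_levels: int = 19) -> float:
    """二次加权Kappa:衡量两名标注者在有序等级上的一致性。"""
    O = np.zeros((n_levels, n_levels))          # 观测共现矩阵
    for i, j in zip(a, b):
        O[i, j] += 1
    O /= O.sum()
    E = np.outer(np.bincount(a, minlength=n_levels),
                 np.bincount(b, minlength=n_levels)).astype(float)
    E /= E.sum()                                 # 由边际分布得到的期望矩阵
    W = np.array([[(i - j) ** 2 for j in range(n_levels)]
                  for i in range(n_levels)], dtype=float)
    W /= (n_levels - 1) ** 2                     # 二次权重:等级差越大惩罚越重
    return 1 - (W * O).sum() / (W * E).sum()

rng = np.random.default_rng(0)
x = rng.integers(0, 19, 200)                     # 模拟标注者A
y = np.clip(x + rng.integers(-1, 2, 200), 0, 18) # 模拟标注者B:与A仅差一级以内
print(round(quadratic_weighted_kappa(x, y), 3))
```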

链接: https://arxiv.org/abs/2502.13520
作者: Khalid N. Elmadani,Nizar Habash,Hanada Taha-Thomure
机构: Computational Approaches to Modeling Language Lab, New York University Abu Dhabi (计算语言建模实验室,纽约大学阿布扎比分校); Zai Arabic Language Research Centre, Zayed University (扎耶德阿拉伯语研究中心,扎耶德大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces the Balanced Arabic Readability Evaluation Corpus BAREC, a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 68,182 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.3%, reflecting a high level of substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we will make BAREC openly available, along with detailed annotation guidelines and benchmark results.
zh

[NLP-64] Shall Your Data Strategy Work? Perform a Swift Study

【速读】: 该论文旨在开发一种快速评估特定类型指令调优数据有效性的方法,无需重新训练模型。关键在于利用基于梯度的数据影响估计思想,通过分析探针示例在选定策略上的梯度投影来评估其效果。论文通过三个快速研究验证了这种评估方法,并进一步通过对照实验确认了结果。
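
基于梯度的数据影响估计的核心一步,可以用下面的草图表达:把候选策略下探针样本的梯度投影到评估样本的平均梯度方向上,投影为正且越大,说明该类数据越可能有益。具体投影方式为本文的简化假设:

```python
import numpy as np

def influence_score(probe_grads: np.ndarray, eval_grads: np.ndarray) -> np.ndarray:
    """梯度投影影响分(示意)。
    probe_grads: (P, D) 候选策略下探针样本的梯度;
    eval_grads:  (E, D) 评估样本的梯度。"""
    eval_dir = eval_grads.mean(axis=0)
    eval_dir /= np.linalg.norm(eval_dir) + 1e-8   # 评估目标的平均下降方向
    return probe_grads @ eval_dir                  # 为正表示与评估目标方向一致

probe = np.random.randn(8, 1000)    # 占位:真实场景来自模型反向传播
evals = np.random.randn(32, 1000)
print("平均影响分:", influence_score(probe, evals).mean())
```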

链接: https://arxiv.org/abs/2502.13514
作者: Minlong Peng,Jingyi Yang,Zhongjun He,Hua Wu
机构: NLP, Baidu Research (百度研究), China
类目: Computation and Language (cs.CL)
备注: 8 pages 5 figures

点击查看摘要

Abstract:This work presents a swift method to assess the efficacy of particular types of instruction-tuning data, utilizing just a handful of probe examples and eliminating the need for model retraining. This method employs the idea of gradient-based data influence estimation, analyzing the gradient projections of probe examples from the chosen strategy onto evaluation examples to assess its advantages. Building upon this method, we conducted three swift studies to investigate the potential of Chain-of-thought (CoT) data, query clarification data, and response evaluation data in enhancing model generalization. Subsequently, we embarked on a validation study to corroborate the findings of these swift studies. In this validation study, we developed training datasets tailored to each studied strategy and compared model performance with and without the use of these datasets. The results of the validation study aligned with the findings of the swift studies, validating the efficacy of our proposed method.
zh

[NLP-65] Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion

【速读】: 该论文旨在解决大型语言模型(LLMs)在医学领域中整合结构化时间序列数据与非结构化临床笔记方面应用不足的问题。论文的关键解决方案是提出ProMedTS,这是一种新颖的自监督多模态框架,通过提示引导学习来统一异构数据类型。ProMedTS利用轻量级异常检测生成异常描述作为提示,指导原始时间序列数据编码为信息丰富的嵌入,并将其与文本表示对齐于共享潜在空间,同时保留细粒度的时间特征及语义洞察。此外,该框架结合定制的自监督目标以增强模态内的对齐。

链接: https://arxiv.org/abs/2502.13509
作者: Shuai Niu,Jing Ma,Hongzhan Lin,Liang Bai,Zhihua Wang,Wei Bi,Yida Xu,Guo Li,Xian Yang
机构: Hong Kong Baptist University(香港浸会大学); Shanxi University(山西大学); Shanghai Institute for Advanced Study of Zhejiang University(浙江大学上海高等研究院); Manchester Metropolitan University(曼彻斯特都会大学); Tencent AI Lab(腾讯AI实验室); The University of Manchester(曼彻斯特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data such as lab test results capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative embeddings. These embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.
zh

[NLP-66] PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

【速读】: 该论文旨在解决大型语言模型在推理阶段的效率与准确性问题。关键在于利用生成的能量-曲率张量 $\mathbf{G}_{LM}$ 替换原有的基于幂律图注意力机制(PLGA)的深度神经网络,以实现高效的推理。通过引入 $\mathbf{G}_{LM}$ 缓存(G-cache)和KV缓存,论文展示了如何显著提升推理速度,同时保持高保真的输出不变性,即推理结果在缓存后仍具有相同的均方根误差(RMSE)和行列式值,并且零样本基准分数保持不变。

链接: https://arxiv.org/abs/2502.13502
作者: Burc Gokden
机构: Fromthesky Research Labs LLC
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 1 figure, 12 tables

点击查看摘要

Abstract:We show that Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enables the once-inferred energy-curvature tensor $\mathbf{G}_{LM}$ to replace the deep neural network of power law graph attention (PLGA) generating the deductive outputs at inference. We demonstrate that a cache for $\mathbf{G}_{LM}$ (G-cache) and KV-cache can be implemented in a straightforward manner to improve the inference time. The invariance and generalizable nature of deductive outputs is at a very high fidelity where deductive outputs have the same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have distinct loss and accuracy characteristics from models pretrained with transferred, randomly initialized or identity tensors as a constant tensor operator, and that an LLM with scaled dot-product attention (SDPA) is a special case of PLDR-LLM where $\mathbf{G}_{LM}$ is predefined as identity. The observed invariance characteristic introduces a novel asymmetry between training and inference phases with caching. We outline observed common characteristics of the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.
zh

[NLP-67] Towards Geo-Culturally Grounded LLM Generations

【速读】: 该论文旨在解决生成式大语言模型(LLMs)在全球范围内文化知识多样性方面的不足。研究的关键在于通过检索增强生成(retrieval augmented generation)和基于搜索的锚定技术(search-grounding techniques),评估这些方法能否提升LLMs对多元国家文化的熟悉度。具体而言,论文比较了标准LLMs、从定制知识库中检索信息增强的LLMs(即KB锚定)以及从网络搜索中检索信息增强的LLMs(即搜索锚定)在一系列文化熟悉度基准测试中的表现。结果显示,搜索锚定显著提升了LLMs在命题知识多选题测试上的表现,但同时也增加了刻板印象的风险,并未能有效改善人类评估的文化熟悉度判断。

链接: https://arxiv.org/abs/2502.13497
作者: Piyawat Lertvittayakumjorn,David Kinney,Vinodkumar Prabhakaran,Donald Martin,Sunipa Dev
机构: Google(谷歌); Washington University in St. Louis(圣路易斯华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative large language models (LLMs) have been demonstrated to have gaps in diverse, cultural knowledge across the globe. We investigate the effect of retrieval augmented generation and search-grounding techniques on the ability of LLMs to display familiarity with a diverse range of national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on a series of cultural familiarity benchmarks. We find that search grounding significantly improves the LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., the norms, artifacts, and institutions of national cultures), while KB grounding’s effectiveness is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models, while failing to improve evaluators’ judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional knowledge about a culture and open-ended cultural fluency when it comes to evaluating the cultural familiarity of generative LLMs.
zh

[NLP-68] What are Models Thinking about? Understanding Large Language Model Hallucinations “Psychology” through Model Inner State Analysis

【速读】: 该论文旨在解决大型语言模型(LLM)在生成有效且符合事实内容时不稳定的问题,即幻觉生成(hallucination generation)。当前的幻觉检测方法严重依赖于模型外部的信息源,如RAG,这导致了额外的延迟。论文的关键解决方案在于利用LLM推理过程中的内部状态,这些状态具有可解释性且不需外部信息源。通过将LLM的推理过程分为理解、查询和生成三个阶段,并从这些阶段提取内部状态,论文系统分析了不同内部状态在幻觉检测中的揭示特征,并全面评估了它们的能力。
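
在拿到三个阶段的内部状态之后,幻觉检测可以退化为一个普通的监督分类问题。下面用随机占位特征和sklearn逻辑回归给出最简示意(真实场景中特征应来自LLM前向过程各阶段的隐状态,此处数据与标签均为占位):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 假设已从理解/查询/生成三个阶段各抽取一个256维隐状态向量并拼接为特征
X = np.random.randn(500, 3 * 256)     # 占位特征:真实场景来自LLM前向过程
y = np.random.randint(0, 2, 500)      # 占位标签:1 = 幻觉, 0 = 事实正确

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("检测准确率:", clf.score(X_te, y_te))  # 占位数据下约为随机水平
```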

链接: https://arxiv.org/abs/2502.13490
作者: Peiran Wang,Yang Liu,Yunfei Lu,Jue Hong,Ye Wu
机构: ByteDance Inc
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language model (LLM) systems suffer from the models’ unstable ability to generate valid and factual content, resulting in hallucination generation. Current hallucination detection methods heavily rely on out-of-model information sources, such as RAG to assist the detection, thus bringing heavy additional latency. Recently, internal states of LLMs’ inference have been widely used in numerous research works, such as prompt injection detection, etc. Considering the interpretability of LLM internal states and the fact that they do not require external information sources, we introduce such states into LLM hallucination detection. In this paper, we systematically analyze different internal states’ revealing features during inference forward and comprehensively evaluate their ability in hallucination detection. Specifically, we cut the forward process of a large language model into three stages: understanding, query, generation, and extracting the internal state from these stages. By analyzing these states, we provide a deep understanding of why the hallucinated content is generated and what happened in the internal state of the models. Then, we introduce these internal states into hallucination detection and conduct comprehensive experiments to discuss the advantages and limitations.
zh

[NLP-69] Transferring Textual Preferences to Vision-Language Understanding through Model Merging

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在评估生成内容方面能力有限的问题,并提出了一种无需训练(training-free)的方法来高效地将文本偏好融入LVLMs。关键在于将基于文本的奖励模型(Reward Models, RMs)与LVLMs合并,以创建视觉语言奖励模型(Vision-Language Reward Models, VLRMs),从而在不增加训练开销的情况下提升评估性能。
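
模型合并(model merging)在参数层面的最简形式是对共享结构的权重做线性插值。下面的草图假设两份state dict中同名同形状的语言主干参数可以直接插值,其余参数保留LVLM原值;这只是示意,并非论文的具体合并策略:

```python
import numpy as np

def merge_state_dicts(lvlm_sd: dict, rm_sd: dict, alpha: float = 0.5) -> dict:
    """对共享的语言主干参数做线性插值合并,其余参数保留LVLM原值(示意)。"""
    merged = dict(lvlm_sd)
    for name, w_rm in rm_sd.items():
        if name in merged and merged[name].shape == w_rm.shape:
            merged[name] = alpha * merged[name] + (1 - alpha) * w_rm
    return merged

# 玩具示例:仅语言层参与插值,视觉投影层保持不变
lvlm = {"lm.layer0.w": np.ones((2, 2)), "vision.proj": np.ones((2, 2))}
rm = {"lm.layer0.w": np.zeros((2, 2))}
print(merge_state_dicts(lvlm, rm, alpha=0.7)["lm.layer0.w"])
```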

链接: https://arxiv.org/abs/2502.13487
作者: Chen-An Li,Tzu-Han Lin,Yun-Nung Chen,Hung-yi Lee
机构: National Taiwan University (台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Under Review

点击查看摘要

Abstract:Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs’ scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
zh

[NLP-70] LLM should think and action as a human

【速读】: 该论文旨在解决多轮对话中聊天助手存在的问题:回复错误频繁、无法根据不同需求生成多样化的响应、工具使用效率低下且支持的工具调用数量有限。这些问题的核心在于大型语言模型缺乏人类的思维能力、推理能力和规划能力。为了解决这些问题,论文提出了一种基于内置链式思维的思考方法,使大型语言模型在多轮对话中能够针对每个用户提示,结合聊天历史、思维上下文、动作调用、记忆和知识进行详细推理和规划,并按计划采取行动。关键解决方案在于通过监督学习和强化学习的方法,收集训练数据集并微调大型语言模型,以增强其推理和规划能力,从而有效解决多轮对话中的问题。

链接: https://arxiv.org/abs/2502.13475
作者: Haun Leung,ZiNan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, 1 table

点击查看摘要

Abstract:It has become popular lately to train large language models to be used as chat assistants, but in conversations between the user and the chat assistant there are prompts that require multiple turns between the two. However, multi-turn conversations raise a number of issues: the chat assistant's responses are prone to errors and may fail to help users achieve their goals; it is difficult for the chat assistant to generate responses that follow different processes, based on actual needs, for the same command or request; chat assistants require the use of tools, but the current approach is neither elegant nor efficient, and the number of tool calls that can be supported is limited. The main reason for these issues is that large language models lack the thinking ability of a human: they lack reasoning and planning abilities, as well as the ability to execute plans. To solve these issues, we propose a thinking method based on a built-in chain of thought: in a multi-turn conversation, for each user prompt, the large language model thinks based on elements such as chat history, thinking context, action calls, memory and knowledge, makes detailed reasoning and planning, and acts according to the plan. We also explored how the large language model enhances its thinking ability through this thinking method: collect training datasets according to the thinking method and fine-tune the large language model through supervised learning; train a consistency reward model and use it as a reward function to fine-tune the large language model using reinforcement learning, so that the reinforced large language model outputs according to this way of thinking. Our experimental results show that the reasoning and planning abilities of the large language model are enhanced, and the issues in multi-turn conversations are solved.
zh

[NLP-71] Towards Lightweight Adaptive and Attribute-Aware Multi-Aspect Controllable Text Generation with Large Language Models

【速读】: 该论文旨在解决多方面可控文本生成中的若干挑战,包括低秩适应(Low Rank Adaptation, LoRA)控制效果不佳,全参数微调(Full Fine-Tuning, FFT)需要大量计算资源且易过拟合,以及现有方法在有限数据条件下难以准确生成具有特定属性的文本。论文的关键在于提出了一种轻量级、自适应且具备属性感知能力的框架,能够根据不同的数据方面动态调整模型参数,以实现多方面的可控文本生成,从而优化性能并提高对数据分布差异的适应性和属性感知的准确性。

链接: https://arxiv.org/abs/2502.13474
作者: Chenyu Zhu,Yefeng Liu,Chenyang Lyu,Xue Yang,Guanhua Chen,Longyue Wang,Weihua Luo,Kaifu Zhang
机构: Alibaba International Digital Commerce; Southern University of Science and Technology
类目: Computation and Language (cs.CL)
备注: 17 pages,9 figures

点击查看摘要

Abstract:Multi-aspect controllable text generation aims to control text generation in attributes from multiple aspects, making it a complex but powerful task in natural language processing. Supervised fine-tuning methods are often employed for this task due to their simplicity and effectiveness. However, they still have some limitations: low rank adaptation (LoRA) only fine-tunes a few parameters and has suboptimal control effects, while full fine-tuning (FFT) requires significant computational resources and is susceptible to overfitting, particularly when data is limited. Moreover, existing works typically train multi-aspect controllable text generation models using only single-aspect annotated data, which results in discrepancies in data distribution; at the same time, accurately generating text with specific attributes is a challenge that requires strong attribute-aware capabilities. To address these limitations, we propose a lightweight, adaptive and attribute-aware framework for multi-aspect controllable text generation. Our framework can dynamically adjust model parameters according to different aspects of data to achieve controllable text generation, aiming to optimize performance across multiple aspects. Experimental results show that our framework outperforms other strong baselines, achieves state-of-the-art performance, adapts well to data discrepancies, and is more accurate in attribute perception.
zh

[NLP-72] FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems

【速读】: 该论文旨在解决现有全双工语音对话系统(Full-Duplex SDS)在独立模块优化和上下文噪声干扰方面的挑战。这些挑战源于高度耦合的架构设计和过于简化的二元状态建模。为了解决这些问题,论文提出了一种名为FlexDuo的灵活全双工控制模块,通过即插即用的架构设计将全双工控制与语音对话系统解耦。关键创新在于引入了一个显式的空闲状态(Idle State),该状态不仅能过滤冗余噪声和无关音频以提升对话质量,还能基于语义完整性建立缓冲机制,从而减少相互中断的风险并确保准确的响应过渡。
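
引入显式空闲态后,双工控制可以看作一个小型状态机。下面的Python草图给出一组假设的转移规则(IDLE过滤纯噪声、LISTEN按语义完整性缓冲、SPEAK允许真实插话),仅用于说明思路,并非论文的原始控制策略:

```python
from enum import Enum, auto

class DuplexState(Enum):
    IDLE = auto()    # 显式空闲态:过滤纯噪声与无关音频
    LISTEN = auto()  # 用户在说话,按语义完整性缓冲
    SPEAK = auto()   # 系统在输出

def step(state, user_speaking, noise_only, semantically_complete, response_done):
    """极简的全双工状态转移规则(本文假设,非论文原始策略)。"""
    real_speech = user_speaking and not noise_only
    if state is DuplexState.IDLE:
        return DuplexState.LISTEN if real_speech else DuplexState.IDLE
    if state is DuplexState.LISTEN:
        # 语义完整(说完一句话)才切换到回复,降低相互打断的风险
        return DuplexState.SPEAK if semantically_complete else DuplexState.LISTEN
    # SPEAK 态:用户真实插话则让出话轮,说完则回到空闲
    if real_speech:
        return DuplexState.LISTEN
    return DuplexState.IDLE if response_done else DuplexState.SPEAK

s = DuplexState.IDLE
for frame in [(True, True, False, False), (True, False, False, False),
              (True, False, True, False), (False, False, False, True)]:
    s = step(s, *frame)
    print(s)  # 依次输出 IDLE -> LISTEN -> SPEAK -> IDLE
```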

链接: https://arxiv.org/abs/2502.13472
作者: Borui Liao,Yulong Xu,Jiao Ou,Kaiyuan Yang,Weihua Jian,Pengfei Wan,Di Zhang
机构: Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.
zh

[NLP-73] HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

【速读】: 该论文旨在解决现实世界信息检索场景中用户动态且多样化的需求,如何使检索增强生成(Retrieval-Augmented Generation, RAG)系统展现适应性的鲁棒性。论文的关键解决方案在于引入了一个名为HawkBench的新基准测试集,它是一个涵盖多领域的人工标注数据集,用于全面评估RAG系统的性能。HawkBench通过系统化地分层任务类型,覆盖了包括事实查询和推理查询在内的广泛查询类型,并整合了多领域的语料库以减少语料库偏差,从而确保高质量的标注和评估。这一方法为提升RAG系统的泛化能力和适应多样性用户需求提供了重要支持。

链接: https://arxiv.org/abs/2502.13465
作者: Hongjin Qian,Zheng Liu,Chao Gao,Yankai Wang,Defu Lian,Zhicheng Dou
机构: Beijing Academy of Artificial Intelligence(北京人工智能研究院); Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); University of Science and Technology of China(中国科学技术大学); The Hong Kong University of Science and Technology(香港科技大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs. Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation. HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.
zh

[NLP-74] Estimating Commonsense Plausibility through Semantic Shifts

【速读】: 该论文旨在解决常识合理性评估中细粒度区分能力不足的问题。现有基于生成的方法依赖于似然性或口头判断,在细粒度区分方面表现不佳。论文的关键解决方案是提出ComPaSS框架,这是一种新颖的判别式方法,通过衡量添加常识相关信息后语义的变化来量化常识合理性。合理增强导致语义变化最小,而不合理的增强则引起显著的语义偏离。这一方法在不同基础模型(包括大型语言模型LLMs和视觉-语言模型VLMs)上的细粒度常识合理性评估任务中表现出色,证明了判别方法在细粒度常识合理性评估中的优势。
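
ComPaSS的核心度量可以概括为:语义偏移 = 1 − cos(原句向量, 增强句向量),偏移越小的常识增强越合理。下面用一个词袋向量充当句嵌入的玩具例子演示这一判别式流程(embed 接口与词表均为本文假设):

```python
import numpy as np

def semantic_shift(embed, sentence: str, augmented: str) -> float:
    """语义偏移 = 1 - cos(原句向量, 增强句向量);偏移越小,增强越合理(示意)。"""
    a, b = embed(sentence), embed(augmented)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return 1.0 - float(cos)

def plausibility(embed, sentence: str, candidates: list[str]) -> str:
    """在若干常识增强候选中,选语义偏移最小者作为最合理的补全。"""
    return min(candidates, key=lambda c: semantic_shift(embed, sentence, c))

# 玩具示例:用词袋向量代替真实句嵌入
vocab = ["鸟", "会", "飞", "游泳"]
embed = lambda s: np.array([s.count(w) for w in vocab], dtype=float)
print(plausibility(embed, "鸟会飞", ["鸟通常会飞", "鸟通常会游泳"]))
```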

链接: https://arxiv.org/abs/2502.13464
作者: Wanqing Cui,Keping Bi,Jiafeng Guo,Xueqi Cheng
机构: CAS Key Lab of Network Data Science and Technology(网络数据科学与技术重点实验室), Institute of Computing Technology(计算技术研究所), Chinese Academy of Sciences(中国科学院), Beijing, China(中国北京); University of Chinese Academy of Sciences(中国科学院大学), Beijing, China(中国北京)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches–reliant on likelihoods or verbalized judgments–struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. It demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs yield superior performance to LMs, when integrated with ComPaSS, on vision-grounded commonsense tasks. (2) contrastive pre-training sharpens backbone models’ ability to capture semantic nuances, thereby further enhancing ComPaSS.
zh

[NLP-75] ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails

【速读】: 该论文旨在解决现有安全护栏(Safety Guardrails)在处理复杂安全违规行为时能力有限的问题。解决方案的关键在于提出了一种名为ThinkGuard的方法,通过生成结构化批评(structured critiques)来提炼大型语言模型(LLMs)的知识,并与安全标签一起使用,从而显著提升了安全护栏的谨慎性和可解释性。

链接: https://arxiv.org/abs/2502.13458
作者: Xiaofei Wen,Wenxuan Zhou,Wenjie Jacky Mo,Muhao Chen
机构: University of California, Davis (加州大学戴维斯分校); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail’s cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
zh

[NLP-76] Enhancing Chest X-ray Classification through Knowledge Injection in Cross-Modality Learning ICASSP’25

【速读】: 该论文旨在探讨在跨模态学习中,显式注入医学知识如何影响模型性能,特别是在胸部X射线(Chest X-ray, CXR)图像分类中的表现。研究的关键在于引入了一种基于集合论的知识注入框架,能够生成可控粒度医学知识的图像描述,并通过调整医学信息的详细程度来微调CLIP模型。实验结果表明,注入细粒度的医学知识显著提升了零样本分类的准确性,达到了72.5%相比仅使用人工生成描述的49.9%。此外,研究还探索了知识密度及领域专用的大语言模型(Domain-Specific Large Language Models, LLMs)对生成描述的影响,发现更密集的知识和专门化的LLMs有助于提升性能。这一研究通过证明知识注入的有效性,推进了医学影像分析领域,为开发更准确可靠的诊断工具铺平了道路。

链接: https://arxiv.org/abs/2502.13447
作者: Yang Yan,Bingqing Yue,Qiaxuan Li,Man Huang,Jingyu Chen,Zhenzhong Lan
机构: Zhejiang University (浙江大学), Hangzhou, Zhejiang, China; Westlake University (西湖大学), Hangzhou, Zhejiang, China; The Second Affiliated Hospital Zhejiang University School of Medicine (浙江大学医学院第二附属医院), Hangzhou, Zhejiang, China
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by ICASSP’25

点击查看摘要

Abstract:The integration of artificial intelligence in medical imaging has shown tremendous potential, yet the relationship between pre-trained knowledge and performance in cross-modality learning remains unclear. This study investigates how explicitly injecting medical knowledge into the learning process affects the performance of cross-modality classification, focusing on Chest X-ray (CXR) images. We introduce a novel Set Theory-based knowledge injection framework that generates captions for CXR images with controllable knowledge granularity. Using this framework, we fine-tune CLIP model on captions with varying levels of medical information. We evaluate the model’s performance through zero-shot classification on the CheXpert dataset, a benchmark for CXR classification. Our results demonstrate that injecting fine-grained medical knowledge substantially improves classification accuracy, achieving 72.5% compared to 49.9% when using human-generated captions. This highlights the crucial role of domain-specific knowledge in medical cross-modality learning. Furthermore, we explore the influence of knowledge density and the use of domain-specific Large Language Models (LLMs) for caption generation, finding that denser knowledge and specialized LLMs contribute to enhanced performance. This research advances medical image analysis by demonstrating the effectiveness of knowledge injection for improving automated CXR classification, paving the way for more accurate and reliable diagnostic tools.
zh

[NLP-77] TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理数学应用题时产生不实推理的问题。论文的关键解决方案是引入了一个名为TreeCut的合成数据集,通过将每个问题表示为树结构,并移除选定的必要条件,从而系统地生成无限的不可回答的数学应用题及其可回答的对应题目。实验表明,TreeCut能够有效地诱导包括GPT-4o和o3-mini在内的大型语言模型产生幻觉,其错误率在最坏情况下分别达到61%和42%。进一步分析揭示,更深层或复杂的树结构、组合项名称以及在路径中间移除必要条件均会增加幻觉产生的概率,这凸显了LLMs在识别不可回答的数学问题方面持续面临的挑战。
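
TreeCut的构造思想可以用一个极简草图体会:把一道题表示为"问题+必要条件集合",随机删去一个必要条件即可得到不可回答的对照题。以下示例中的题目内容均为本文虚构,仅用于说明数据构造方式:

```python
import random

def make_problem():
    """把一道应用题表示为必要条件的集合(示意结构,题目为虚构)。"""
    return {
        "question": "小明买了若干支铅笔和钢笔,共花多少钱?",
        "conditions": ["铅笔每支2元", "钢笔每支5元", "买了3支铅笔", "买了2支钢笔"],
        "answerable": True,
    }

def cut_condition(problem, rng=random):
    """删除一个必要条件,得到不可回答的对照题;原题保持可回答。"""
    removed = rng.choice(problem["conditions"])
    cut = dict(problem)
    cut["conditions"] = [c for c in problem["conditions"] if c != removed]
    cut["answerable"] = False
    cut["removed"] = removed
    return cut

print(cut_condition(make_problem()))
```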

链接: https://arxiv.org/abs/2502.13442
作者: Jialin Ouyang
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates infinite unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 61% and 42% in their respective worst-case scenarios. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems.
zh

[NLP-78] he Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?

【速读】: 该论文旨在解决大型语言模型(LLMs)自我提升过程中对外部监督信号依赖的问题。解决方案的关键在于提出了一种名为Crescent的框架,能够以完全自主的方式生成高质量的合成问答数据。Crescent通过诱饵提示使模型生成原始问题,利用基于拒绝采样的自去重方法多样化这些问题,并通过多数投票收集对应的答案,从而实现无需外部监督信号的真正自我提升。
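
Crescent流程中的"自去重"与"多数投票"两个环节可用下面的草图理解:去重在此简化为字符串相似度过滤(论文为基于拒绝采样的去重),llm_answer 为假设的采样接口:

```python
from collections import Counter
import difflib

def dedup(questions, threshold=0.9):
    """自去重的简化示意:与已保留问题过于相似者丢弃。"""
    kept = []
    for q in questions:
        if all(difflib.SequenceMatcher(None, q, k).ratio() < threshold for k in kept):
            kept.append(q)
    return kept

def majority_answer(llm_answer, question, n=8):
    """对同一问题采样n个回答,取多数票作为伪标签。
    llm_answer(question) -> str 为假设的LLM采样接口。"""
    votes = Counter(llm_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]

print(dedup(["1+1等于几?", "1+1等于几??", "2的平方是多少?"]))
```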

链接: https://arxiv.org/abs/2502.13441
作者: Yutao Sun,Mingshuai Chen,Tiancheng Zhao,Ruochen Xu,Zilun Zhang,Jianwei Yin
机构: Zhejiang University(浙江大学); Binjiang Institute of Zhejiang University(浙江大学滨江研究所); Om AI Research(奥米AI研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-improving large language models (LLMs) – i.e., to improve the performance of an LLM by fine-tuning it with synthetic data generated by itself – is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent – a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first elicits the LLM to generate raw questions via a bait prompt, then diversifies these questions leveraging a rejection sampling-based self-deduplication, and finally feeds the questions to the LLM and collects the corresponding answers by means of majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
zh

[NLP-79] MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering

【速读】: 该论文旨在解决知识库问答(KBQA)中基于语义解析方法所面临的挑战,特别是提高大型语言模型(LLMs)在这些任务中的推理能力。当前利用LLMs作为代理的方法虽展现出潜力,但受限于其线性决策过程。论文的关键解决方案在于提出了一种基于蒙特卡洛树搜索(MCTS)的框架,通过树搜索方法增强LLMs的推理能力。该框架采用了一种精心设计的逐步奖励机制,仅需对开源指令LLMs进行直接提示而无需额外微调。实验结果表明,该方法显著优于线性决策方法,尤其在低资源场景下表现出色。

链接: https://arxiv.org/abs/2502.13428
作者: Guanming Xiong,Haochen Li,Wen Zhao
机构: Peking University (北京大学); 01.AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study explores how to enhance the reasoning capabilities of large language models (LLMs) in knowledge base question answering (KBQA) by leveraging Monte Carlo Tree Search (MCTS). Semantic parsing-based KBQA methods are particularly challenging as these approaches require locating elements from knowledge bases and generating logical forms, demanding not only extensive annotated data but also strong reasoning capabilities. Although recent approaches leveraging LLMs as agents have demonstrated considerable potential, these studies are inherently constrained by their linear decision-making processes. To address this limitation, we propose a MCTS-based framework that enhances LLMs’ reasoning capabilities through tree search methodology. We design a carefully crafted step-wise reward mechanism that requires only direct prompting of open-source instruction LLMs without additional fine-tuning. Experimental results demonstrate that our approach significantly outperforms linear decision-making methods, particularly in low-resource scenarios. Additionally, we contribute new data resources to the KBQA community by annotating intermediate reasoning processes for existing question-SPARQL datasets using distant supervision. Experimental results on the extended dataset demonstrate that our method achieves comparable performance to fully supervised models while using significantly less training data.
zh

[NLP-80] TabSD: Large Free-Form Table Question Answering with SQL-Based Table Decomposition

【速读】: 该论文旨在解决自由形式表格上的问答(TableQA)挑战,特别是处理大型表格中的噪声和复杂性。为了解决这些问题,论文提出了一种基于SQL的分解模型TabSD(SQL-based decomposition model),其关键是通过生成SQL查询来指导表的分解、去除噪声,并处理子表以优化答案生成过程。此外,SQL验证器进一步精炼SQL输出以提高分解准确性。

链接: https://arxiv.org/abs/2502.13422
作者: Yuxiang Wang,Junhao Gan,Jianzhong Qi
机构: School of Computing and Information Systems, The University of Melbourne (计算与信息系统学院,墨尔本大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Question answering on free-form tables (TableQA) is challenging due to the absence of predefined schemas and the presence of noise in large tables. While Large Language Models (LLMs) have shown promise in TableQA, they struggle with large free-form tables and noise sensitivity. To address these challenges, we propose TabSD, a SQL-based decomposition model that enhances LLMs’ ability to process large free-form tables. TabSD generates SQL queries to guide the table decomposition, remove noise, and processes sub-tables for better answer generation. Additionally, SQL Verifier refines SQL outputs to enhance decomposition accuracy. We introduce two TableQA datasets with large free-form tables, SLQA and SEQA, which consist solely of large free-form tables and will be publicly available. Experimental results on four benchmark datasets demonstrate that TabSD outperforms the best-existing baseline models by 23.07%, 2.84%, 23.24% and 9.32% in accuracy, respectively, highlighting its effectiveness in handling large and noisy free-form tables.
zh

[NLP-81] RLTHF: Targeted Human Feedback for LLM Alignment

【速读】: 该论文旨在解决通过强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)微调大型语言模型(LLMs)以符合用户偏好时面临的高成本高质量人工标注以及AI反馈泛化能力有限的问题。关键解决方案是提出了一种名为RLTHF的人机混合框架,该框架结合了基于LLM的初步对齐和选择性的人工标注,从而以最小的努力实现与全人工标注相当的对齐效果。RLTHF通过奖励模型的奖励分布识别出LLMs难以标注且被错误标记的样本,并通过整合战略性的人类修正来迭代增强对齐,同时利用LLMs正确标注的样本。
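
RLTHF中"用奖励分布挑出难标样本"的一种直观实现:对每个偏好对计算奖励差距,差距越小说明奖励模型越拿不准、越可能被LLM标错,优先送人工复核。具体选取规则为本文假设的简化版本:

```python
import numpy as np

def select_for_human(reward_margins: np.ndarray, budget: float = 0.07) -> np.ndarray:
    """按奖励差距升序,选出最难判别的一小部分样本交给人工标注(示意)。
    reward_margins: |r(chosen) - r(rejected)|,差距越小越可能被LLM标错。"""
    k = max(1, int(len(reward_margins) * budget))
    return np.argsort(reward_margins)[:k]

margins = np.abs(np.random.randn(1000))          # 占位:真实场景来自奖励模型打分
hard_idx = select_for_human(margins, budget=0.07)  # 约对应论文中6-7%的人工标注量
print(len(hard_idx), "个样本送人工复核")
```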

链接: https://arxiv.org/abs/2502.13417
作者: Yifei Xu,Tusher Chakraborty,Emre Kıcıman,Bibek Aryal,Eduardo Rodrigues,Srinagesh Sharma,Roberto Estevao,Maria Angels de Luis Balaguer,Jessica Wolk,Rafael Padilha,Leonardo Nunes,Shobana Balakrishnan,Songwu Lu,Ranveer Chandra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM’s correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF’s curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF’s strategic data curation.
zh

[NLP-82] Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中产生的事实冲突幻觉(Fact-Conflicting Hallucinations, FCH)问题。为了解决这一挑战,论文提出的关键方案是Drowzee框架,它利用时态逻辑构建了一个全面的事实知识库,并通过自动化推理将这些知识转化为大规模可扩展的测试用例。Drowzee通过模板化的提示来测试LLMs,要求它们生成答案及推理步骤,并通过两个语义感知的验证器来验证生成内容的合理性。实验结果显示,Drowzee能够有效地识别出非时态相关和时态相关的幻觉现象。

链接: https://arxiv.org/abs/2502.13416
作者: Ningke Li,Yahui Song,Kailong Wang,Yuekang Li,Ling Shi,Yi Liu,Haoyu Wang
机构: Huazhong University of Science and Technology, China(华中科技大学,中国); National University of Singapore, Singapore(新加坡国立大学,新加坡); University of New South Wales, Australia(新南威尔士大学,澳大利亚); Nanyang Technological University, Singapore(南洋理工大学,新加坡)
类目: Computation and Language (cs.CL)
备注: 16 pages, under review. arXiv admin note: substantial text overlap with arXiv:2405.00648

点击查看摘要

Abstract:Large language models (LLMs) face the challenge of hallucinations – outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on – especially those involving intricate temporal features – is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model’s decision-making process. This paper presents Drowzee, an innovative end-to-end metamorphic testing framework that utilizes temporal logic to identify fact-conflicting hallucinations (FCH) in large language models (LLMs). Drowzee builds a comprehensive factual knowledge base by crawling sources like Wikipedia and uses automated temporal-logic reasoning to convert this knowledge into a large, extensible set of test cases with ground truth answers. LLMs are tested using these cases through template-based prompts, which require them to generate both answers and reasoning steps. To validate the reasoning, we propose two semantic-aware oracles that compare the semantic structure of LLM outputs to the ground truths. Across nine LLMs in nine different knowledge domains, experimental results show that Drowzee effectively identifies rates of non-temporal-related hallucinations ranging from 24.7% to 59.8%, and rates of temporal-related hallucinations ranging from 16.7% to 39.2%.
zh

[NLP-83] GeLLM^3O: Generalizing Large Language Models for Multi-property Molecule Optimization

【速读】: 该论文旨在解决现有分子优化计算方法局限于单一或双属性优化任务,并且在可扩展性和对新任务的泛化能力方面表现不佳的问题。解决方案的关键在于引入了$\mathtt{MoMUInstruct}$,这是一个专门针对复杂多属性分子优化任务的高质量指令调优数据集。基于此数据集,开发了$\mathtt{GeLLM^3O}$系列模型,这些模型通过指令调优展示了卓越的性能和出色的零样本泛化能力,显著优于现有的最先进基线和强大的闭源大型语言模型(LLMs),从而证明了其作为基础模型在处理新型优化任务方面的巨大潜力。

链接: https://arxiv.org/abs/2502.13398
作者: Vishal Dey,Xiao Hu,Xia Ning
机构: Department of Computer Science and Engineering, The Ohio State University(计算机科学与工程学院,俄亥俄州立大学); Translational Data Analytics Institute, The Ohio State University(转化数据分析研究所,俄亥俄州立大学); Department of Biomedical Informatics, The Ohio State University(生物医学信息学系,俄亥俄州立大学); College of Pharmacy, The Ohio State University(药学院,俄亥俄州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)
备注: Vishal Dey and Xiao Hu contributed equally to this paper

点击查看摘要

Abstract:Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs’ potential for molecule optimization, we introduce $\mathtt{MoMUInstruct}$, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging $\mathtt{MoMUInstruct}$, we develop $\mathtt{GeLLM^3O}$s, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that $\mathtt{GeLLM^3O}$s consistently outperform state-of-the-art baselines. $\mathtt{GeLLM^3O}$s also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of $\mathtt{GeLLM^3O}$s as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. $\mathtt{MoMUInstruct}$, models, and code are accessible through this https URL.
zh

[NLP-84] Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在评估自然语言生成(Natural Language Generation, NLG)任务时,因无法适当地权衡不同主题的重要性而导致评估效果受限的问题。论文的关键解决方案在于提出了一种高效的提示设计机制,通过引入明确的重要性加权机制进行策略性提示工程,从而有效提升LLM作为评判者的能力,使其能够更好地优先处理相关重要信息。实验结果显示,这种方法使人类对齐率(Human Alignment Rate, HAR)平均提高了6%。

链接: https://arxiv.org/abs/2502.13396
作者: Wenwen Xie,Gray Gwizdz,Dongji Feng
机构: Databricks; MCS department, Gustavus Adolphus College
类目: Computation and Language (cs.CL)
备注: 5 pages, 5 tables, 1 figure

点击查看摘要

Abstract:While Large Language Models (LLMs) have emerged as promising tools for evaluating Natural Language Generation (NLG) tasks, their effectiveness is limited by their inability to appropriately weigh the importance of different topics, often overemphasizing minor details while undervaluing critical information, leading to misleading assessments. Our work proposes an efficient prompt design mechanism to address this specific limitation and provide a case study. Through strategic prompt engineering that incorporates explicit importance weighting mechanisms, we enhance using LLM-as-a-Judge ability to prioritize relevant information effectively, as demonstrated by an average improvement of 6% in the Human Alignment Rate (HAR) metric.
zh

[NLP-85] MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification

【速读】: 该论文旨在解决多模态(Multimodal, MM)领域中缺乏强大MM验证器的问题。论文的关键解决方案在于引入了MM-Verifier和MM-Reasoner,通过两步MM验证数据合成方法结合基于模拟的树搜索与验证,并采用拒绝采样生成高质量的Chain-of-Thought (COT) 数据,用于微调MM-Verifier模型。此外,还提出了一种更高效的方法来合成MMCOT数据,以增强文本推理到多模态推理之间的联系。这些合成数据被用来进一步微调MM-Reasoner模型。这一系列措施显著提升了多轮推理能力和验证的鲁棒性。

链接: https://arxiv.org/abs/2502.13383
作者: Linzhuang Sun,Hao Liang,Jingxuan Wei,Bihui Yu,Tianpeng Li,Fan Yang,Zenan Zhou,Wentao Zhang
机构: University of Chinese Academy of Sciences(中国科学院大学); Peking University(北京大学); Baichuan Inc.
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:According to the Test-Time Scaling, the integration of External Slow-Thinking with the Verify mechanism has been demonstrated to enhance multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8) with 12 rollouts.
zh

[NLP-86] Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor

【速读】: 该论文旨在解决现有提示压缩方法(Prompt Compression)依赖于显式问题或手工设计模板的问题,从而限制其通用性。论文提出了一种名为任务无关提示压缩(Task-Agnostic Prompt Compression, TPC)的新框架,无需输入问题或模板即可实现跨任务和跨领域的提示压缩。TPC的关键在于使用在精心策划的上下文和查询对数据集上训练的任务描述符(task descriptor),并通过强化学习进行微调,以生成与任务相关的描述,并计算提示中每个句子的相关性,最终生成压缩后的提示。
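
TPC最后一步"按句子与任务描述的相关度压缩提示"可以用下面的草图复现其骨架:给每个句子打相关分,保留得分最高的一部分并按原顺序拼回。embed 为假设的句向量接口,task_desc 由任务描述器产生,整体仅为示意:

```python
import numpy as np

def compress_prompt(embed, task_desc: str, sentences: list[str],
                    keep_ratio: float = 0.5) -> str:
    """按句子与任务描述的余弦相关度排序,保留最相关的一部分并按原顺序拼回(示意)。"""
    t = embed(task_desc)

    def rel(s: str) -> float:
        v = embed(s)
        return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8))

    k = max(1, int(len(sentences) * keep_ratio))
    top = set(sorted(range(len(sentences)), key=lambda i: -rel(sentences[i]))[:k])
    return " ".join(s for i, s in enumerate(sentences) if i in top)
```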

链接: https://arxiv.org/abs/2502.13374
作者: Barys Liskavets,Shuvendu Roy,Maxim Ushakov,Mark Klibanov,Ali Etemad,Shane Luke
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has led to significant interest in prompt compression, a technique aimed at reducing the length of input prompts while preserving critical information. However, the prominent approaches in prompt compression often require explicit questions or handcrafted templates for compression, limiting their generalizability. We propose Task-agnostic Prompt Compression (TPC), a novel framework that generalizes compression across tasks and domains without requiring input questions or templates. TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs, and fine-tuned via reinforcement learning with a reward function designed to capture the most relevant information. The task descriptor is then utilized to compute the relevance of each sentence in the prompt to generate the compressed prompt. We introduce 3 model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on LongBench and ZeroSCROLLS benchmarks, and our smallest model performs comparable to the existing solutions while being considerably smaller.
zh

[NLP-87] Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval

【速读】: 该论文旨在解决利用大型语言模型(LLMs)生成SPARQL查询时,由于内部参数化知识导致的Uniform Resource Identifiers (URIs)等知识图谱(KG)元素生成错误的问题。这些问题通常表现为生成的内容看似合理但事实不准确,严重影响了其在实际信息检索(IR)应用中的可靠性。为了解决这一问题,论文提出了一种名为PGMR(后生成记忆检索)的模块化框架,通过引入非参数化记忆模块来检索KG元素并增强基于LLMs的SPARQL查询生成。关键在于PGMR能够显著减轻URI幻觉现象,在多个场景下几乎消除了该问题。
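
PGMR的后生成记忆检索可以用一个极小的示例体会:查询生成完毕后,把其中每个URI替换为KG记忆中最相近的真实URI,从而消除凭空编造的URI。此处用difflib的字符串相似度近似论文中的检索模块,仅为示意:

```python
import difflib

def correct_uris(generated_sparql: str, candidate_uris: list[str]) -> str:
    """生成后记忆检索的简化示意:把查询中每个URI替换为记忆里最相近的真实URI。"""
    fixed = []
    for tok in generated_sparql.split():
        if tok.startswith("<") and tok.endswith(">"):  # 粗略的URI判定
            match = difflib.get_close_matches(tok, candidate_uris, n=1, cutoff=0.0)
            fixed.append(match[0] if match else tok)
        else:
            fixed.append(tok)
    return " ".join(fixed)

kg = ["<http://dbpedia.org/resource/Paris>", "<http://dbpedia.org/resource/France>"]
q = "SELECT ?x WHERE { <http://dbpedia.org/resource/Pariss> ?p ?x }"
print(correct_uris(q, kg))  # 编造的 Pariss 被纠正为记忆中的 Paris
```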

链接: https://arxiv.org/abs/2502.13369
作者: Aditya Sharma,Luis Lara,Amal Zouaq,Christopher J. Pal
机构: Mila; Polytechnique Montréal; Canada CIFAR AI Chair
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability to generate SPARQL queries from natural language questions is crucial for ensuring efficient and accurate retrieval of structured data from knowledge graphs (KG). While large language models (LLMs) have been widely adopted for SPARQL query generation, they are often susceptible to hallucinations and out-of-distribution errors when producing KG elements like Uniform Resource Identifiers (URIs) based on internal parametric knowledge. This often results in content that appears plausible but is factually incorrect, posing significant challenges for their use in real-world information retrieval (IR) applications. This has led to increased research aimed at detecting and mitigating such errors. In this paper, we introduce PGMR (Post-Generation Memory Retrieval), a modular framework that incorporates a non-parametric memory module to retrieve KG elements and enhance LLM-based SPARQL query generation. Our experimental results indicate that PGMR consistently delivers strong performance across diverse datasets, data distributions, and LLMs. Notably, PGMR significantly mitigates URI hallucinations, nearly eliminating the problem in several scenarios.
zh

[NLP-88] RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering

【速读】: 该论文旨在解决医疗问答系统在获取和应用事实知识方面存在的不足,现有方法如检索增强生成(RAG)范式往往忽视了事实性知识的重要性,导致检索到的概念性知识相关性和实际应用场景中的适用性受限。解决方案的关键在于提出了一种名为RGAR(循环生成增强检索)的框架,该框架能够从电子健康记录(EHRs)和大规模语料库中同时检索相关的事实性和概念性知识,并使两者相互作用和精炼。通过在三个包含事实感知的医疗问答基准上的广泛评估,RGAR展示了其在医疗RAG系统中的新最先进性能,尤其是在使用Llama-3.1-8B-Instruct模型时,其表现超越了更大规模的RAG增强GPT-3.5模型。研究结果表明,提取事实性知识对于检索具有显著的益处,能够持续提升生成质量。

链接: https://arxiv.org/abs/2502.13361
作者: Sichu Liang,Linhai Zhang,Hongyu Zhu,Wenwen Wang,Yulan He,Deyu Zhou
机构: School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, China(东南大学,教育部新一代人工智能技术及其交叉应用重点实验室,计算机科学与工程学院); Department of Informatics, King’s College London, UK(英国伦敦国王学院,信息学系); School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China(上海交通大学,电子信息与电气工程学院); School of Electrical and Computer Engineering, Carnegie Mellon University, USA(美国卡内基梅隆大学,电气与计算机工程学院); The Alan Turing Institute, UK(英国图灵研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical question answering requires extensive access to specialized conceptual knowledge. The current paradigm, Retrieval-Augmented Generation (RAG), acquires expert medical knowledge through large-scale corpus retrieval and uses this knowledge to guide a general-purpose large language model (LLM) for generating answers. However, existing retrieval approaches often overlook the importance of factual knowledge, which limits the relevance of retrieved conceptual knowledge and restricts its applicability in real-world scenarios, such as clinical decision-making based on Electronic Health Records (EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval framework that retrieves both relevant factual and conceptual knowledge from dual sources (i.e., EHRs and the corpus), allowing them to interact and refine one another. Through extensive evaluation across three factual-aware medical question answering benchmarks, RGAR establishes a new state-of-the-art performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings demonstrate the benefit of extracting factual knowledge for retrieval, which consistently yields improved generation quality.
zh

[NLP-89] Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications

【速读】: 该论文旨在解决大型语言模型(LLMs)在直接文本编辑任务中的不足,这些任务需要精确且上下文感知的修改。尽管如ChatGPT等模型在文本生成和分析方面表现出色,但它们在编辑能力上仍显不足,仅能处理表面问题而非深层次的结构或逻辑不一致。为了解决这一问题,论文提出了双重方法:首先,引入InstrEditBench,一个包含超过20,000个结构化编辑任务的高质量基准数据集,涵盖维基文章、LaTeX文档、代码和领域特定语言(DSL)。其次,提出FineEdit模型,该模型基于此精心设计的数据集进行训练。实验结果表明,FineEdit在直接编辑任务上的表现比Gemini提升了约10%,有效验证了其有效性。关键在于InstrEditBench的创新自动化工作流程和FineEdit模型的针对性训练。

链接: https://arxiv.org/abs/2502.13358
作者: Yiming Zeng,Wanhao Yu,Zexin Li,Tao Ren,Yu Ma,Jinghan Cao,Xiyan Chen,Tingting Yu
机构: University of Connecticut(康涅狄格大学); University of North Carolina at Charlotte(夏洛特北卡罗来纳大学); University of California, Riverside(加州大学河滨分校); University of Pittsburgh(匹兹堡大学); Carnegie Mellon University(卡内基梅隆大学); San Francisco State University(旧金山州立大学); University of Pittsburgh(匹兹堡大学); University of Connecticut(康涅狄格大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed natural language processing, yet they still struggle with direct text editing tasks that demand precise, context-aware modifications. While models like ChatGPT excel in text generation and analysis, their editing abilities often fall short, addressing only superficial issues rather than deeper structural or logical inconsistencies. In this work, we introduce a dual approach to enhance LLMs' editing performance. First, we present InstrEditBench, a high-quality benchmark dataset comprising over 20,000 structured editing tasks spanning Wiki articles, LaTeX documents, code, and database Domain-specific Languages (DSL). InstrEditBench is generated using an innovative automated workflow that accurately identifies and evaluates targeted edits, ensuring that modifications adhere strictly to specified instructions without altering unrelated content. Second, we propose FineEdit, a specialized model trained on this curated benchmark. Experimental results demonstrate that FineEdit achieves significant improvements of around 10% compared with Gemini on direct editing tasks, convincingly validating its effectiveness.
zh

[NLP-90] Event Segmentation Applications in Large Language Model Enabled Automated Recall Assessments

【速读】: 该论文旨在解决如何客观且高效地评估事件分割(Event Segmentation)模式及回忆能力的问题。当前研究方法主要依赖于人类判断,这不仅主观而且耗时。为了解决这些问题,论文的关键在于利用大规模语言模型(Large Language Models, LLMs)来自动化事件分割和回忆评分。通过聊天完成(chat completion)和文本嵌入(text-embedding)模型,论文验证了LLMs能够准确识别事件边界,且人类的事件分割与LLMs结果的一致性甚至高于人类彼此之间的一致性。基于此框架,研究提出了一种自动化的回忆评估方法,揭示了分段叙述事件与参与者回忆之间的语义相似性可以估计回忆表现。论文表明,LLMs能够有效模拟人类的分割模式并提供可扩展的回忆评估,从而替代手动评分。

链接: https://arxiv.org/abs/2502.13349
作者: Ryan A. Panela(1,2),Alex J. Barnett(2,3),Morgan D. Barense(1,2),Björn Herrmann(1,2) ((1) Rotman Research Institute, Baycrest Academy for Research and Education, (2) Department of Psychology, University of Toronto, (3) Department of Neurology and Neurosurgery, Montreal Neurological Institute and Hospital, McGill University)
机构: 未知
类目: Computation and Language (cs.CL)
备注: 33 pages, 7 figures

点击查看摘要

Abstract:Understanding how individuals perceive and recall information in their natural environments is critical to understanding potential failures in perception (e.g., sensory loss) and memory (e.g., dementia). Event segmentation, the process of identifying distinct events within dynamic environments, is central to how we perceive, encode, and recall experiences. This cognitive process not only influences moment-to-moment comprehension but also shapes event specific memory. Despite the importance of event segmentation and event memory, current research methodologies rely heavily on human judgements for assessing segmentation patterns and recall ability, which are subjective and time-consuming. A few approaches have been introduced to automate event segmentation and recall scoring, but validity with human responses and ease of implementation require further advancements. To address these concerns, we leverage Large Language Models (LLMs) to automate event segmentation and assess recall, employing chat completion and text-embedding models, respectively. We validated these models against human annotations and determined that LLMs can accurately identify event boundaries, and that human event segmentation is more consistent with LLMs than among humans themselves. Using this framework, we advanced an automated approach for recall assessments which revealed semantic similarity between segmented narrative events and participant recall can estimate recall performance. Our findings demonstrate that LLMs can effectively simulate human segmentation patterns and provide recall evaluations that are a scalable alternative to manual scoring. This research opens novel avenues for studying the intersection between perception, memory, and cognitive impairment using methodologies driven by artificial intelligence.
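
回忆评分环节的核心计算是"分段事件与被试回忆之间的语义相似度"。下面用开源的 sentence-transformers 模型给出一个可运行的最小示意(论文使用的是 OpenAI 的 text-embedding 模型,此处换用开源模型仅作替代假设):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def recall_scores(events, recall_text):
    """对每个分段叙述事件,返回其与回忆文本各句的最高余弦相似度。"""
    recall_sents = [s.strip() for s in recall_text.split(".") if s.strip()]
    ev = model.encode(events, normalize_embeddings=True)
    re_ = model.encode(recall_sents, normalize_embeddings=True)
    sims = ev @ re_.T            # 归一化后点积即余弦相似度矩阵
    return sims.max(axis=1)      # 每个事件取最佳匹配得分

scores = recall_scores(
    ["They entered the kitchen", "An argument broke out"],
    "I remember two people arguing loudly. It happened indoors.",
)
print(scores)  # 得分越高,该事件越可能被成功回忆
```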
zh

[NLP-91] Craw4LLM: Efficient Web Crawling for LLM Pretraining

【速读】: 该论文旨在解决大规模语言模型(LLMs)预训练数据质量低的问题。当前,网络爬虫获取的大部分网页因数据质量问题在预训练阶段被丢弃。论文的关键解决方案是提出了一种名为Crawl4LLM的高效网络爬虫方法,它基于LLM预训练的数据偏好探索网络图谱,并将网页在LLM预训练中的影响力作为网络爬虫调度的优先级评分标准,取代了传统的基于图连通性的优先级设定。实验结果表明,仅爬取了21%的URL,基于Crawl4LLM的数据预训练的LLMs就能达到先前爬取数据的下游任务性能,显著减少了爬取浪费并减轻了网站负担。

链接: https://arxiv.org/abs/2502.13347
作者: Shi Yu,Zhiyuan Liu,Chenyan Xiong
机构: Tsinghua University(清华大学); Carnegie Mellon University(卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Web crawl is a main source of large language models’ (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler’s scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine’s index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at this https URL.
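
Crawl4LLM 对爬虫的改动集中在调度器的优先级函数上:用"网页对 LLM 预训练的影响力得分"替换基于图连通性的得分。下面用标准库 heapq 勾勒一个按该得分排序的爬虫骨架;influence_score 与 fetch_outlinks 均为占位实现,并非论文代码:

```python
import heapq

def influence_score(url):
    # 占位:真实实现应为作用在页面文本上的预训练影响力/质量打分器
    return hash(url) % 100 / 100.0

def fetch_outlinks(url):
    # 占位:真实爬虫会抓取页面并解析其中的外链
    return []

def crawl(seeds, budget=1000):
    # 取负号把 heapq 的最小堆当作最大堆:影响力最高者优先出队
    frontier = [(-influence_score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen, collected = set(seeds), []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        collected.append(url)
        for link in fetch_outlinks(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-influence_score(link), link))
    return collected
```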
zh

[NLP-92] K-Paths: Reasoning over Graph Paths for Drug Repurposing and Drug Interaction Prediction

【速读】: 该论文旨在解决从大规模生物医学知识图谱(Knowledge Graphs, KGs)中提取有意义洞见的挑战,特别是针对药物重定位和药物-疾病相互作用预测中的未观察到交互。现有基于子图的方法主要适用于图神经网络(Graph Neural Networks, GNNs),而无法与大型语言模型(Large Language Models, LLMs)等其他模型兼容。论文的关键解决方案是K-Paths框架,它能够从KGs中提取结构化、多样化且具有生物学意义的路径。通过将这些路径转化为可以直接被LLMs处理的格式,K-Paths不仅提高了LLMs在零样本学习下的表现,还提升了GNNs的监督训练效率。K-Paths的独特之处在于其多样性感知的路径检索算法,以及在KGs和LLMs之间建立桥梁的能力,从而提供可解释的预测理由。

链接: https://arxiv.org/abs/2502.13344
作者: Tassallah Abdullahi,Ioanna Gemou,Nihal V. Nayak,Ghulam Murtaza,Stephen H. Bach,Carsten Eickhoff,Ritambhara Singh
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Drug discovery is a complex and time-intensive process that requires identifying and validating new therapeutic candidates. Computational approaches using large-scale biomedical knowledge graphs (KGs) offer a promising solution to accelerate this process. However, extracting meaningful insights from large-scale KGs remains challenging due to the complexity of graph traversal. Existing subgraph-based methods are tailored to graph neural networks (GNNs), making them incompatible with other models, such as large language models (LLMs). We introduce K-Paths, a retrieval framework that extracts structured, diverse, and biologically meaningful paths from KGs. Integrating these paths enables LLMs and GNNs to effectively predict unobserved drug-drug and drug-disease interactions. Unlike traditional path-ranking approaches, K-Paths retrieves and transforms paths into a structured format that LLMs can directly process, facilitating explainable reasoning. K-Paths employs a diversity-aware adaptation of Yen’s algorithm to retrieve the K shortest loopless paths between entities in an interaction query, prioritizing biologically relevant and diverse relationships. Our experiments on benchmark datasets show that K-Paths improves the zero-shot performance of Llama 8.1B’s F1-score by 12.45 points on drug repurposing and 13.42 points on interaction severity prediction. We also show that Llama 70B achieves F1-score gains of 6.18 and 8.46 points, respectively. K-Paths also improves the supervised training efficiency of EmerGNN, a state-of-the-art GNN, by reducing KG size by 90% while maintaining strong predictive performance. Beyond its scalability and efficiency, K-Paths uniquely bridges the gap between KGs and LLMs, providing explainable rationales for predicted interactions. These capabilities show that K-Paths is a valuable tool for efficient data-driven drug discovery.
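
K-Paths 的路径检索可以理解为"带多样性约束的 Yen K 最短路"。networkx 的 shortest_simple_paths 即基于 Yen 算法枚举无环路径;下面的多样性过滤(要求每条新路径至少引入一个未见过的中间实体)只是对论文"多样性感知改造"的简化示意:

```python
import networkx as nx
from itertools import islice

def k_diverse_paths(G, source, target, k=3, max_candidates=50):
    """从最短的无环路径开始枚举,仅保留带来新中间实体的路径。"""
    seen_nodes, kept = set(), []
    candidates = nx.shortest_simple_paths(G, source, target)
    for path in islice(candidates, max_candidates):
        inner = set(path[1:-1])
        if not inner or not inner.issubset(seen_nodes):  # 引入了新实体
            kept.append(path)
            seen_nodes |= inner
        if len(kept) == k:
            break
    return kept

G = nx.Graph()
G.add_edges_from([("drugA", "geneX"), ("geneX", "disease"),
                  ("drugA", "pathwayY"), ("pathwayY", "disease")])
for p in k_diverse_paths(G, "drugA", "disease", k=2):
    print(" -> ".join(p))
```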
zh

[NLP-93] Beyond De-Identification: A Structured Approach for Defining and Detecting Indirect Identifiers in Medical Texts

【速读】: 该论文旨在解决在科学研究中共享敏感文本时保护患者及医疗人员隐私的问题。论文的关键在于引入了一种由九类间接标识符构成的分类方案,以应对不同的潜在对手,包括熟人、家庭成员和医务人员。通过这一方案,论文标注了100份MIMIC-III出院摘要(共6,199处标注),并提出了识别间接标识符的基准模型。

链接: https://arxiv.org/abs/2502.13342
作者: Ibrahim Baroud,Lisa Raithel,Sebastian Möller,Roland Roller
机构: Technische Universität Berlin (柏林工业大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心), Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data (BIFOLD – 柏林学习与数据基础研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sharing sensitive texts for scientific purposes requires appropriate techniques to protect the privacy of patients and healthcare personnel. Anonymizing textual data is particularly challenging due to the presence of diverse unstructured direct and indirect identifiers. To mitigate the risk of re-identification, this work introduces a schema of nine categories of indirect identifiers designed to account for different potential adversaries, including acquaintances, family members and medical staff. Using this schema, we annotate 100 MIMIC-III discharge summaries and propose baseline models for identifying indirect identifiers. We will release the annotation guidelines, annotation spans (6,199 annotations in total) and the corresponding MIMIC-III document IDs to support further research in this area.
zh

[NLP-94] Language Models are Few-Shot Graders

【速读】: 该论文旨在解决自动短答案评分(Automatic Short Answer Grading, ASAG)系统在评估学生开放式回答时的工作负载问题。解决方案的关键在于利用最新的大型语言模型(Large Language Models, LLMs)构建了一个新的ASAG流水线,该流水线在相同数据集上的表现优于现有的定制模型。此外,研究还探讨了不同OpenAI模型(GPT-4、GPT-4o和o1-preview)的表现,并发现GPT-4o在准确性和成本效益之间取得了最佳平衡。进一步的研究表明,通过在提示中加入经过教师评分的例子以及采用基于检索增强生成(Retrieval-Augmented Generation, RAG)的选择策略,可以显著提高评分准确性。

链接: https://arxiv.org/abs/2502.13337
作者: Chenyan Zhao,Mariana Silva,Seth Poulsen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Providing evaluations to student work is a critical component of effective student learning, and automating its process can significantly reduce the workload on human graders. Automatic Short Answer Grading (ASAG) systems, enabled by advancements in Large Language Models (LLMs), offer a promising solution for assessing and providing instant feedback for open-ended student responses. In this paper, we present an ASAG pipeline leveraging state-of-the-art LLMs. Our new LLM-based ASAG pipeline achieves better performances than existing custom-built models on the same datasets. We also compare the grading performance of three OpenAI models: GPT-4, GPT-4o, and o1-preview. Our results demonstrate that GPT-4o achieves the best balance between accuracy and cost-effectiveness. On the other hand, o1-preview, despite higher accuracy, exhibits a larger variance in error that makes it less practical for classroom use. We investigate the effects of incorporating instructor-graded examples into prompts using no examples, random selection, and Retrieval-Augmented Generation (RAG)-based selection strategies. Our findings indicate that providing graded examples enhances grading accuracy, with RAG-based selection outperforming random selection. Additionally, integrating grading rubrics improves accuracy by offering a structured standard for evaluation.
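
文中"基于 RAG 的示例选择"可概括为:用学生答案的嵌入检索最相似的教师已评分示例并拼入提示。下面是一个自包含的示意,其中 embed 为刻意简化的占位嵌入函数(真实系统应换成句向量模型),提示模板也仅为假设:

```python
import numpy as np

def embed(text):
    # 占位嵌入:字符计数向量,仅为让示例可独立运行
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1
    return v / (np.linalg.norm(v) + 1e-9)

def select_graded_examples(student_answer, graded_pool, k=3):
    """graded_pool 为 (答案文本, 教师评分) 列表,按余弦相似度取 top-k。"""
    q = embed(student_answer)
    sims = [float(q @ embed(a)) for a, _ in graded_pool]
    top = np.argsort(sims)[::-1][:k]
    return [graded_pool[i] for i in top]

def build_prompt(question, student_answer, examples):
    shots = "\n".join(f"Answer: {a}\nScore: {s}" for a, s in examples)
    return (f"Grade the answer to: {question}\n{shots}\n"
            f"Answer: {student_answer}\nScore:")
```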
zh

[NLP-95] Language Models Can Predict Their Own Behavior

【速读】: 该论文旨在解决如何在早期预测语言模型(Language Model, LM)的行为,从而避免不必要的文本生成。关键在于利用内部表示探针(internal state probes),这些探针能够通过分析输入标记的内部表示来精确预测模型在整个输出序列中的行为,包括是否需要回答问题或遵循特定格式。通过这种方法,论文展示了可以将推理成本平均降低65%,同时最坏情况下仅损失1.4%的准确性。

链接: https://arxiv.org/abs/2502.13329
作者: Dhananjay Ashok,Jonathan May
机构: Information Sciences Institute (信息科学研究所); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive Language Models output text by sequentially predicting the next token to generate, with modern methods like Chain-of-Thought (CoT) prompting achieving state-of-the-art reasoning capabilities by scaling the number of generated tokens. However, are there times when we can infer how the model will behave (e.g. abstain from answering a question) early in the computation, making generation unnecessary? We show that internal representation of input tokens alone can often precisely predict, not just the next token, but eventual behavior over the entire output sequence. We leverage this capacity and learn probes on internal states to create early warning (and exit) systems. Specifically, if the probes can confidently estimate the way the LM is going to behave, then the system will avoid generating tokens altogether and return the estimated behavior instead. On 27 text classification datasets spanning five different tasks, we apply this method to estimate the eventual answer of an LM under CoT prompting, reducing inference costs by 65% (average) while suffering an accuracy loss of no more than 1.4% (worst case). We demonstrate the potential of this method to pre-emptively identify when a model will abstain from answering a question, fail to follow output format specifications, or give a low-confidence response. We explore the limits of this capability, showing that probes generalize to unseen datasets, but perform worse when LM outputs are longer and struggle to predict properties that require access to knowledge that the models themselves lack. Encouragingly, performance scales with model size, suggesting applicability to the largest of models.
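
这里的探针本质上是一个作用在隐藏状态上的轻量分类器,置信度超过阈值则提前退出、跳过生成。下面用 scikit-learn 在随机合成的"隐藏状态"上示意这一判断逻辑(0.9 的阈值为假设值,真实探针应在模型真实激活上训练):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 合成数据代替真实的 LM 隐藏状态及行为标签
H = rng.normal(size=(500, 64))
labels = rng.integers(0, 2, 500)
probe = LogisticRegression(max_iter=1000).fit(H, labels)

def maybe_early_exit(hidden_state, threshold=0.9):
    """探针足够自信则直接返回预测的行为,否则返回 None(回退到完整生成)。"""
    p = probe.predict_proba(hidden_state.reshape(1, -1))[0]
    if p.max() >= threshold:
        return int(p.argmax())   # 跳过生成,直接给出估计的行为
    return None                  # 照常生成

print(maybe_early_exit(H[0]))
```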
zh

[NLP-96] Capturing Human Cognitive Styles with Language: Towards an Experimental Evaluation Paradigm

【速读】: 该论文旨在解决通过语言特征评估个体认知风格与决策行为之间关系的问题。关键在于引入了一种基于实验的框架,通过比较语言特征与经典决策实验中的认知风格,验证语言特征是否能够有效预测个体的决策风格。研究表明,所提取的语言特征能够以中等到高的准确度(AUC约0.8)预测参与者的决策风格,从而证明认知风格可以在一定程度上通过话语模式捕捉和揭示。

链接: https://arxiv.org/abs/2502.13326
作者: Vasudha Varadarajan,Syeda Mahwish,Xiaoran Liu,Julia Buffolino,Christian C. Luhmann,Ryan L. Boyd,H. Andrew Schwartz
机构: Department of Computer Science, Stony Brook University (计算机科学系,石溪大学); Department of Psychology, Stony Brook University (心理学系,石溪大学); Department of Psychology, University of Texas at Dallas (心理学系,达拉斯德克萨斯大学)
类目: Computation and Language (cs.CL)
备注: 14 pages

点击查看摘要

Abstract:While NLP models often seek to capture cognitive states via language, the validity of predicted states is determined by comparing them to annotations created without access to the cognitive states of the authors. In behavioral sciences, cognitive states are instead measured via experiments. Here, we introduce an experiment-based framework for evaluating language-based cognitive style models against human behavior. We explore the phenomenon of decision making, and its relationship to the linguistic style of an individual talking about a recent decision they made. Participants then complete a classical decision-making experiment that captures their cognitive style, determined by how preferences change during a decision exercise. We find that language features, intended to capture cognitive style, can predict participants' decision style with moderate-to-high accuracy (AUC ~ 0.8), demonstrating that cognitive style can be partly captured and revealed by discourse patterns.
zh

[NLP-97] Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance

【速读】: 该论文旨在解决在人工智能辅助决策任务中,用户对人工智能(AI)的信任偏差导致的不当依赖问题。论文的关键解决方案在于通过信任适应性干预(trust-adaptive interventions),使AI助手根据用户的信任水平调整其行为。例如,在用户信任度低的情况下提供解释可以促使用户更仔细地考虑AI的建议。研究发现,在两个决策场景中——非专业人士回答科学问题和医生进行医学诊断——这种策略可减少多达38%的不当依赖,并提高20%的决策准确性。此外,通过适时插入强制暂停以促进深思熟虑,同样能够减少过度依赖。这些结果表明,AI根据用户信任水平进行适应性调整有助于实现适当的依赖,为提升人机协作开辟了新的途径。

链接: https://arxiv.org/abs/2502.13321
作者: Tejas Srinivasan,Jesse Thomason
机构: University of Southern California (南加州大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Trust biases how users rely on AI recommendations in AI-assisted decision-making tasks, with low and high levels of trust resulting in increased under- and over-reliance, respectively. We propose that AI assistants should adapt their behavior through trust-adaptive interventions to mitigate such inappropriate reliance. For instance, when user trust is low, providing an explanation can elicit more careful consideration of the assistant’s advice by the user. In two decision-making scenarios – laypeople answering science questions and doctors making medical diagnoses – we find that providing supporting and counter-explanations during moments of low and high trust, respectively, yields up to 38% reduction in inappropriate reliance and 20% improvement in decision accuracy. We are similarly able to reduce over-reliance by adaptively inserting forced pauses to promote deliberation. Our results highlight how AI adaptation to user trust facilitates appropriate reliance, presenting exciting avenues for improving human-AI collaboration.
zh

[NLP-98] Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare

【速读】: 该论文旨在揭示大型语言模型(LLMs)在医疗保健背景下对社会人口统计学信息(如性别、种族)的编码及其潜在偏差。研究的关键在于采用机械性可解释性工具,识别并验证LLMs内部特定激活层是否包含社会人口统计学信息,并探究这些信息如何影响临床预测任务。研究表明,性别信息主要集中在中间的多层感知器(MLP)层,且可以在推理阶段通过干预可靠地操控;而种族信息的表示则较为分散,但同样可以被一定程度地干预。这些发现首次将机械性可解释性方法应用于医疗保健领域的LLMs研究。

链接: https://arxiv.org/abs/2502.13319
作者: Hiba Ahsan,Arnab Sen Sharma,Silvio Amir,David Bau,Byron C. Wallace
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in middle MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
zh

[NLP-99] Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

【速读】: 该论文旨在解决智能辅导代理在引导用户完成复杂真实世界任务方面的不足,特别是在编码辅导这一具有挑战性的问题上,需要辅导者主动引导学生完成预定义的编程任务。论文的关键解决方案是提出了一种名为Trace-and-Verify (TRAVER)的新颖代理工作流程,该方法结合了知识追踪(Knowledge Tracing)以估计学生的知识状态,并通过逐回合验证(turn-by-turn verification)确保有效的任务完成指导。

链接: https://arxiv.org/abs/2502.13311
作者: Jian Wang,Yinpei Dai,Yichi Zhang,Ziqiao Ma,Wenjie Li,Joyce Chai
机构: The Hong Kong Polytechnic University(香港理工大学); University of Michigan(密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized guidance in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students toward completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student’s knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents holistically using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our results and findings can be extended beyond coding, providing valuable insights into advancing tutoring agents for a variety of tasks.
zh

[NLP-100] Evaluating and Enhancing Out-of-Domain Generalization of Task-Oriented Dialog Systems for Task Completion without Turn-level Dialog Annotations

【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)是否可以仅通过自然语言对话进行微调来执行任务型对话(Task-oriented Dialog, ToD)任务,而无需传统的逐轮标注(如对话状态和策略标签)。论文的关键解决方案是提出了一种名为ZeroToD的框架,该框架引入了一个模式增强机制以提升API调用的准确性,并且整体提高了任务完成率,特别是在跨领域设置中。此外,ZeroToD使得较小的微调模型在任务完成方面优于大规模专有LLM。

链接: https://arxiv.org/abs/2502.13310
作者: Adib Mosharrof,Moghis Fereidouni,A.B. Siddique
机构: University of Kentucky (肯塔基大学)
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:Traditional task-oriented dialog (ToD) systems rely heavily on labor-intensive turn-level annotations, such as dialogue states and policy labels, for training. This work explores whether large language models (LLMs) can be fine-tuned solely on natural language dialogs to perform ToD tasks, without requiring such annotations. We evaluate their ability to generalize to unseen domains and compare their performance with models trained on fully annotated data. Through extensive experiments with three open-source LLMs of varying sizes and two diverse ToD datasets, we find that models fine-tuned without turn-level annotations generate coherent and contextually appropriate responses. However, their task completion performance - measured by accurate execution of API calls - remains suboptimal, with the best models achieving only around 53% success in unseen domains. To improve task completion, we propose ZeroToD, a framework that incorporates a schema augmentation mechanism to enhance API call accuracy and overall task completion rates, particularly in out-of-domain settings. We also compare ZeroToD with fine-tuning-free alternatives, such as prompting off-the-shelf LLMs, and find that our framework enables smaller, fine-tuned models that outperform large-scale proprietary LLMs in task completion. Additionally, a human study evaluating informativeness, fluency, and task completion confirms our empirical findings. These findings suggest the feasibility of developing cost-effective, scalable, and zero-shot generalizable ToD systems for real-world applications.
zh

[NLP-101] Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback

【速读】: 该论文旨在解决多轮任务型对话(Task-oriented Dialog, TOD)系统在复杂任务完成中的挑战,特别是在无需大量标注数据的情况下实现跨领域适应及准确生成API调用的问题。论文的关键解决方案是提出RealTOD框架,通过引入提示链(prompt chaining)和细粒度反馈机制(fine-grained feedback mechanism),实现在零样本条件下的领域适应,并通过验证API调用与领域模式的一致性来提高任务完成的准确性。

链接: https://arxiv.org/abs/2502.13298
作者: Moghis Fereidouni,Md Sajid Ahmed,Adib Mosharrof,A.B. Siddique
机构: University of Kentucky(肯塔基大学); Independent Researcher(独立研究员)
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:Task-oriented dialog (TOD) systems facilitate users in accomplishing complex, multi-turn tasks through natural language. While traditional approaches rely on extensive fine-tuning and annotated data for each domain, instruction-tuned large language models (LLMs) offer a more flexible alternative. However, LLMs struggle to reliably handle multi-turn task completion, particularly with accurately generating API calls and adapting to new domains without explicit demonstrations. To address these challenges, we propose RealTOD, a novel framework that enhances TOD systems through prompt chaining and fine-grained feedback mechanisms. Prompt chaining enables zero-shot domain adaptation via a two-stage prompting strategy, eliminating the need for human-curated demonstrations. Meanwhile, the fine-grained feedback mechanism improves task completion by verifying API calls against domain schemas and providing precise corrective feedback when errors are detected. We conduct extensive experiments on the SGD and BiTOD benchmarks using four LLMs. RealTOD improves API accuracy, surpassing AutoTOD by 37.74% on SGD and SimpleTOD by 11.26% on BiTOD. Human evaluations further confirm that LLMs integrated with RealTOD achieve superior task completion, fluency, and informativeness compared to existing methods.
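
细粒度反馈机制中可以直接落地的一步,是将 LLM 生成的 API 调用对照领域 schema 校验并产出精确的纠错信息。下面给出一个纯 Python 的校验器示意(call 与 schema 的数据结构均为本文假设,并非论文接口):

```python
def verify_api_call(call, schema):
    """call 形如 {"api": str, "args": dict};schema 把 API 名映射到
    {"required": [...], "optional": [...]}。返回纠错信息列表,为空表示合法。"""
    feedback = []
    spec = schema.get(call["api"])
    if spec is None:
        return [f"Unknown API '{call['api']}'. Valid: {sorted(schema)}"]
    allowed = set(spec["required"]) | set(spec.get("optional", []))
    for arg in spec["required"]:
        if arg not in call["args"]:
            feedback.append(f"Missing required argument '{arg}'.")
    for arg in call["args"]:
        if arg not in allowed:
            feedback.append(f"Unexpected argument '{arg}'.")
    return feedback

schema = {"book_hotel": {"required": ["city", "date"], "optional": ["stars"]}}
print(verify_api_call({"api": "book_hotel", "args": {"city": "SF"}}, schema))
# -> ["Missing required argument 'date'."]
```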
zh

[NLP-102] Understanding and Tackling Label Errors in Individual-Level Natural Language Understanding

【速读】: 该论文旨在解决自然语言理解(Natural Language Understanding, NLU)任务中忽略个体因素的问题,导致推理困难、不可解释以及标签错误率高的现象。解决方案的关键在于提出一种基于个体层面因素的新标注指南,通过整合同一个体的其他帖子,并在考虑所有个体帖子后标注个体主观视角。此方法显著提高了数据集的准确性,使得大型语言模型(如GPT-4o和Llama3-70B)在重标注的数据集上的准确率超过了87%。

链接: https://arxiv.org/abs/2502.13297
作者: Yunpeng Xiao,Youpeng Zhao,Kai Shu
机构: Emory University (埃默里大学); University of Central Florida (中佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives, thus termed individual-level NLU. Previously, these tasks are often simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unexplainable but often results in a large number of label errors when creating datasets. To address the above limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and then annotate individual subjective perspectives after considering all individual posts. We use this guideline to expand and re-annotate the stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7% and 23.3%. We further use large language models to conduct experiments on the re-annotation datasets and find that the large language models perform well on both datasets after adding individual factors. Both GPT-4o and Llama3-70B can achieve an accuracy greater than 87% on the re-annotation datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to add individual factors when creating such datasets. Our re-annotation dataset can be found at this https URL
zh

[NLP-103] Performance Evaluation of Sentiment Analysis on Text and Emoji Data Using End-to-End Transfer Learning Distributed and Explainable AI Models

【速读】: 该论文旨在解决在情感分析中处理emoji的问题,并评估其对文本分类准确性的影响。关键解决方案在于使用Universal Sentence Encoder (USE) 和Sentence Bidirectional Encoder Representations from Transformers (SBERT) 进行端到端的句子嵌入,并采用分布式训练方法以提高模型的可扩展性。研究发现,当验证集中包含训练集中未出现的emoji时,模型的准确性显著下降至70%,这凸显了模型在处理未知emoji时的局限性。此外,通过分布式训练方法,论文成功将运行时间减少了约15%,同时保持了较高的分类精度。

链接: https://arxiv.org/abs/2502.13278
作者: Sirisha Velampalli,Chandrashekar Muniyappa,Ashutosh Saxena
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emojis are frequently used in today's digital world to express everything from simple to complex thoughts, more than ever before. Hence, they are also used in sentiment analysis and targeted marketing campaigns. In this work, we performed sentiment analysis on Tweets as well as on an emoji dataset from Kaggle. Since tweets are sentences, we used the Universal Sentence Encoder (USE) and Sentence Bidirectional Encoder Representations from Transformers (SBERT) end-to-end sentence embedding models to generate the embeddings, which are used to train standard fully connected Neural Network (NN) and LSTM NN models. We observe that the text classification accuracy was almost the same for both models, around 98 percent. On the contrary, when the validation set was built using emojis that were not present in the training set, the accuracy of both models dropped drastically to 70 percent. In addition, the models were also trained using a distributed training approach instead of a traditional single-threaded model for better scalability. Using the distributed training approach, we were able to reduce the run-time by roughly 15% without compromising accuracy. Finally, as part of explainable AI, the SHAP algorithm was used to explain the model behaviour and check for model biases for the given feature set.
zh

[NLP-104] REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation

【速读】: 该论文旨在解决长周期开放域对话能力中真实世界情感智能(Emotional Intelligence, EI)和角色一致性的问题。当前大多数研究依赖于合成数据,忽略了实际对话中的复杂性。为此,作者引入了REALTALK数据集,包含21天的真实消息应用对话记录,以提供与真实人类互动直接对比的标准。论文的关键解决方案在于提出两个基准任务:一是角色模拟,即模型需基于先前对话上下文继续代表特定用户进行对话;二是记忆探查,即模型需回答需要长期记忆过去交互细节的问题。研究表明,现有模型在仅依靠对话历史模拟用户时存在困难,而针对特定用户聊天进行微调可以改善角色模拟效果,同时在真实世界对话中回忆和利用长期上下文信息也面临显著挑战。

链接: https://arxiv.org/abs/2502.13270
作者: Dong-Ho Lee,Adyasha Maharana,Jay Pujara,Xiang Ren,Francesco Barbieri
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Long-term, open-domain dialogue capabilities are essential for chatbots aiming to recall past interactions and demonstrate emotional intelligence (EI). Yet, most existing research relies on synthetic, LLM-generated data, leaving open questions about real-world conversational patterns. To address this gap, we introduce REALTALK, a 21-day corpus of authentic messaging app dialogues, providing a direct benchmark against genuine human interactions. We first conduct a dataset analysis, focusing on EI attributes and persona consistency to understand the unique challenges posed by real-world dialogues. By comparing with LLM-generated conversations, we highlight key differences, including diverse emotional expressions and variations in persona stability that synthetic dialogues often fail to capture. Building on these insights, we introduce two benchmark tasks: (1) persona simulation where a model continues a conversation on behalf of a specific user given prior dialogue context; and (2) memory probing where a model answers targeted questions requiring long-term memory of past interactions. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation. Additionally, existing models face significant challenges in recalling and leveraging long-term context within real-world conversations.
zh

[NLP-105] Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)推理在提升大型语言模型(LLMs)性能的同时,因包含不必要的步骤而导致推理过程耗时长且计算成本高的问题。关键解决方案在于通过困惑度(perplexity)衡量推理步骤的重要性,并识别出那些当其被移除时会导致显著困惑度增加的必要推理步骤。论文提出通过优化少量演示示例或微调模型来仅生成这些关键步骤,从而实现更优的推理准确性和效率平衡。

链接: https://arxiv.org/abs/2502.13260
作者: Yingqian Cui,Pengfei He,Jingying Zeng,Hui Liu,Xianfeng Tang,Zhenwei Dai,Yan Han,Chen Luo,Jing Huang,Zhen Li,Suhang Wang,Yue Xing,Jiliang Tang,Qi He
机构: Amazon(亚马逊); Michigan State University (密歇根州立大学); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning, which breaks down complex tasks into intermediate reasoning steps, has significantly enhanced the performance of large language models (LLMs) on challenging tasks. However, the detailed reasoning process in CoT often incurs long generation times and high computational costs, partly due to the inclusion of unnecessary steps. To address this, we propose a method to identify critical reasoning steps using perplexity as a measure of their importance: a step is deemed critical if its removal causes a significant increase in perplexity. Our method enables models to focus solely on generating these critical steps. This can be achieved through two approaches: refining demonstration examples in few-shot CoT or fine-tuning the model using selected examples that include only critical steps. Comprehensive experiments validate the effectiveness of our method, which achieves a better balance between the reasoning accuracy and efficiency of CoT.
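
论文的判据是"移除某一步导致困惑度显著上升,则该步关键",这可以直接写成一个留一法循环。下面的示意把困惑度计算抽象为可注入的 ppl_fn(真实实现可用语言模型对答案部分求困惑度),1.2 倍的阈值为假设值:

```python
def find_critical_steps(question, steps, answer, ppl_fn, ratio=1.2):
    """删除某一步使答案困惑度超过完整链的 ratio 倍,则该步判为关键。"""
    full = ppl_fn(question, steps, answer)
    critical = []
    for i in range(len(steps)):
        pruned = steps[:i] + steps[i + 1:]
        if ppl_fn(question, pruned, answer) > ratio * full:
            critical.append(i)
    return critical

# 玩具 ppl_fn:假装链越长对答案解释得越好,仅用于演示接口
toy_ppl = lambda q, s, a: 10.0 / (1 + len(s))
print(find_critical_steps("q", ["s1", "s2", "s3"], "a", toy_ppl))  # [0, 1, 2]
```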
zh

[NLP-106] HumT DumT: Measuring and controlling human-like language in LLMs

【速读】: 该论文旨在解决大型语言模型(LLMs)生成的人类相似语言所带来的潜在影响,包括用户体验提升与过度依赖及刻板印象之间的权衡。为系统性评估这些影响,论文引入了HumT和SocioT两种度量标准,用于衡量文本数据中人类相似语调及其他社会感知维度。研究发现用户更偏好较少人类相似输出的LLMs。论文的关键解决方案是DumT方法,它利用HumT来系统地控制并减少人类相似语调的程度,同时保持模型性能,从而提供了一种减轻拟人化语言生成相关风险的实际途径。

链接: https://arxiv.org/abs/2502.13259
作者: Myra Cheng,Sunny Yu,Dan Jurafsky
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Should LLMs generate language that makes them seem human? Human-like language might improve user experience, but might also lead to overreliance and stereotyping. Assessing these potential impacts requires a systematic way to measure human-like tone in LLM outputs. We introduce HumT and SocioT, metrics for human-like tone and other dimensions of social perceptions in text data based on relative probabilities from an LLM. By measuring HumT across preference and usage datasets, we find that users prefer less human-like outputs from LLMs. HumT also offers insights into the impacts of anthropomorphism: human-like LLM outputs are highly correlated with warmth, social closeness, femininity, and low status, which are closely linked to the aforementioned harms. We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance. DumT offers a practical approach for mitigating risks associated with anthropomorphic language generation.
zh

[NLP-107] Multilingual Language Model Pretraining using Machine-translated Data

【速读】: 该论文旨在解决高质量大规模多语言大语言模型(Multilingual Large Language Models, LLMs)在非英语语言上的性能不足问题。论文的关键在于发现从单一高质量源语言机器翻译文本可以显著提升多语言LLMs的预训练质量。通过将一个高质量的英语网络数据集FineWeb-Edu翻译成九种其他语言,构建了一个包含1.7万亿token的TransWebEdu数据集,并在此基础上从零开始预训练了一个参数量为13亿的模型TransWebLLM。实验结果表明,尽管使用的数据量少了一个数量级,TransWebLLM在九种非英语推理任务上与使用封闭数据训练的最先进模型(如Llama3.2、Qwen2.5和Gemma)相比仍能匹配或超越其性能。此外,添加少于5%的TransWebEdu作为领域特定预训练数据,即可在阿拉伯语、意大利语、印尼语、斯瓦希里语和威尔士语的理解和常识推理任务中达到新的最佳水平。

链接: https://arxiv.org/abs/2502.13252
作者: Jiayi Wang,Yao Lu,Maurice Weber,Max Ryabinin,David Adelani,Yihong Chen,Raphael Tang,Pontus Stenetorp
机构: Centre for Artificial Intelligence, University College London (伦敦大学学院人工智能中心); Together AI (Together AI); Mila, McGill University, Canada CIFAR AI Chair (麦吉尔大学加拿大 CIFAR 人工智能主席, Mila); Research and Development Center for Large Language Models, National Institute of Informatics (国家信息学研究所大型语言模型研发中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state-of-the-art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
zh

[NLP-108] Neural Attention Search

【速读】: 该论文旨在解决Transformer模型在推理过程中KV缓存占用大量资源的问题。论文的关键解决方案是提出了一种名为Neural Attention Search (NAtS)的框架,通过学习每个token的重要性并决定是否可以在几个步骤后丢弃这些token,从而有效地减少所需的KV缓存大小,同时保持模型性能。

链接: https://arxiv.org/abs/2502.13251
作者: Difan Deng,Marius Lindauer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:We present Neural Attention Search (NAtS), a framework that automatically evaluates the importance of each token within a sequence and determines if the corresponding token can be dropped after several steps. This approach can efficiently reduce the KV cache sizes required by transformer-based models during inference and thus reduce inference costs. In this paper, we design a search space that contains three token types: (i) Global Tokens will be preserved and queried by all the following tokens. (ii) Local Tokens survive until the next global token appears. (iii) Sliding Window Tokens have an impact on the inference of a fixed size of the next following tokens. Similar to the One-Shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache size required for the models while maintaining the models’ performance.
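
NAtS 的搜索空间可以归结为:给每个 token 指派三种类型之一,并由类型决定因果注意力掩码。下面用 numpy 静态地构造这一掩码,仅示意三类 token 的可见性规则;论文中与架构权重联合学习的可微掩码不在此列:

```python
import numpy as np

GLOBAL, LOCAL, WINDOW = 0, 1, 2

def nats_mask(token_types, window=2):
    """mask[i, j] 为 True 表示查询 i 可以关注键 j(因果方向)。"""
    n = len(token_types)
    mask = np.zeros((n, n), dtype=bool)
    for j, t in enumerate(token_types):
        for i in range(j, n):
            if t == GLOBAL:
                mask[i, j] = True                # 全局 token:永久可见
            elif t == WINDOW:
                mask[i, j] = (i - j) <= window   # 滑窗 token:固定影响范围
            elif t == LOCAL:
                # 局部 token:存活到下一个全局 token 出现为止
                nxt = next((k for k in range(j + 1, i + 1)
                            if token_types[k] == GLOBAL), None)
                mask[i, j] = nxt is None
    return mask

print(nats_mask([GLOBAL, LOCAL, WINDOW, GLOBAL, LOCAL]).astype(int))
```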
zh

[NLP-109] Grounding LLM Reasoning with Knowledge Graphs

【速读】: 该论文旨在解决在知识图谱(Knowledge Graphs, KGs)上进行自然语言问句回答(Question-Answering, QA)的挑战。这些挑战源于自然语言处理的复杂性以及知识图谱的规模。论文的关键解决方案在于将大型语言模型(Large Language Models, LLMs)的推理策略与知识图谱数据相结合,通过在推理链的每一步或“思考”步骤中锚定于知识图谱数据,以提升LLMs的性能和可靠性。具体而言,论文评估了多种推理策略,包括Chain-of-Thought (CoT)、Tree-of-Thought (ToT) 和 Graph-of-Thought (GoT),并在GRBench基准数据集上验证了这种方法的有效性,结果表明该方法显著优于基线模型。

链接: https://arxiv.org/abs/2502.13247
作者: Alfonso Amayuelas,Joy Sain,Simerjot Kaur,Charese Smiley
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge Graphs (KGs) are valuable tools for representing relationships between entities in a structured format. Traditionally, these knowledge bases are queried to extract specific information. However, question-answering (QA) over such KGs poses a challenge due to the intrinsic complexity of natural language compared to the structured format and the size of these graphs. Despite these challenges, the structured nature of KGs can provide a solid foundation for grounding the outputs of Large Language Models (LLMs), offering organizations increased reliability and control. Recent advancements in LLMs have introduced reasoning methods at inference time to improve their performance and maximize their capabilities. In this work, we propose integrating these reasoning strategies with KGs to anchor every step or “thought” of the reasoning chains in KG data. Specifically, we evaluate both agentic and automated search methods across several reasoning strategies, including Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT), using GRBench, a benchmark dataset for graph reasoning with domain-specific graphs. Our experiments demonstrate that this approach consistently outperforms baseline models, highlighting the benefits of grounding LLM reasoning processes in structured KG data.
zh

[NLP-110] When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models

【速读】: 该论文旨在解决如何量化政治话语中的隐喻语言,特别是移民议题在社交媒体上的表达。论文的关键解决方案在于提出了一种结合词级(word-level)和文档级(document-level)信号的新技术,以衡量与特定概念相关的隐喻使用情况。通过这种方法,研究者分析了隐喻、政治意识形态和用户互动之间的关系,并揭示了不同政治倾向的用户在使用隐喻时的差异及其对用户参与度的影响。

链接: https://arxiv.org/abs/2502.13246
作者: Julia Mendelsohn,Ceren Budak
机构: University of Chicago(芝加哥大学); University of Michigan(密歇根大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. “water” or “vermin”). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.
zh

[NLP-111] SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理需要专门知识的任务时表现不佳的问题,特别是在医学问答任务中,常规的检索增强生成(RAG)技术依赖于静态知识库,这些知识库可能过时或不完整,缺乏精确的临床细节。论文的关键解决方案是提出了一种名为SearchRAG的新框架,通过利用实时搜索引擎克服上述限制。SearchRAG方法采用合成查询生成将复杂的医学问题转换为适合搜索引擎的查询,并使用基于不确定性的知识选择来筛选和整合最相关和信息量最大的医学知识到LLM的输入中。

链接: https://arxiv.org/abs/2502.13233
作者: Yucheng Shi,Tianze Yang,Canyu Chen,Quanzheng Li,Tianming Liu,Xiang Li,Ninghao Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Information Theory (cs.IT)
备注: 8 pages, three figures

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM’s input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.
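
"基于不确定性的知识选择"的一种直观实现:对每条候选证据,测量模型在该证据条件下答案分布的熵,保留最能降低不确定性的几条。下面的 answer_dist 为占位函数(例如可由多次采样 LLM 得到答案分布),并非论文官方接口:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def select_evidence(question, snippets, answer_dist, keep=2):
    """answer_dist(question, snippet) 返回候选答案上的概率分布;
    按条件熵从低到高排序,保留最使模型确定的 keep 条证据。"""
    scored = [(entropy(answer_dist(question, s)), s) for s in snippets]
    scored.sort(key=lambda x: x[0])          # 不确定性最低者优先
    return [s for _, s in scored[:keep]]

# 玩具分布:证据 "b" 让模型变得自信
toy = lambda q, s: [0.9, 0.1] if s == "b" else [0.5, 0.5]
print(select_evidence("q", ["a", "b", "c"], toy))  # -> ['b', 'a']
```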
zh

[NLP-112] Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation

【速读】: 该论文旨在解决大型语言模型在创意任务中的输出缺乏多样性的问题,同时避免因采用高温采样等常见方法而牺牲结果质量。关键解决方案在于提出一个基于上下文的评分机制(context-based score),该机制结合信息论,用于定量评估价值和原创性。此评分机制不仅激励准确性与请求的一致性,还促进输出与学习分布的差异。通过将此评分作为奖励,在强化学习框架中微调大型语言模型,以实现最佳性能。论文通过诗歌生成和数学问题求解实验验证了该策略的有效性,证明其能够提升生成解决方案的价值和原创性。

链接: https://arxiv.org/abs/2502.13207
作者: Giorgio Franceschelli,Mirco Musolesi
机构: University of Bologna (博洛尼亚大学); University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We propose using our score as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments in poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated solutions.
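
该上下文评分的一种直观实例化:奖励 = 条件对数似然(贴合请求、体现价值)减去无条件对数似然(偏离已学分布、体现原创性)。以下权重与具体形式均为本文假设的示意,并非论文的精确定义:

```python
def context_score(logp_cond, logp_uncond, alpha=1.0, beta=0.5):
    """假设性实例化:条件对数似然高(价值)而无条件对数似然低(原创)
    的续写获得更高奖励,可直接作为 RL 微调的 reward。"""
    return alpha * logp_cond - beta * logp_uncond

# 给定提示后模型认为很可能、但先验上很"意外"的续写得分最高
print(context_score(logp_cond=-12.0, logp_uncond=-40.0))  # -12 + 20 = 8.0
```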
zh

[NLP-113] Linguistic Generalizations are not Rules: Impacts on Evaluation of LMs

【速读】: 该论文旨在解决现有语言模型(Language Models, LMs)在评估其生成或理解新文本的能力时,过分依赖于基于严格规则的符号主义方法的问题。论文的关键在于提出自然语言并非基于严格的规则系统,而是通过灵活的、相互关联且依赖于具体情境的图式或构式来生成和理解新表达。因此,研究应重新设计基准测试和分析方法,以反映自然语言中的丰富而灵活的泛化能力。

链接: https://arxiv.org/abs/2502.13195
作者: Leonie Weissweiler,Kyle Mahowald,Adele Goldberg
机构: The University of Texas at Austin; Princeton University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Linguistic evaluations of how well LMs generalize to produce or understand novel text often implicitly take for granted that natural languages are generated by symbolic rules. Grammaticality is thought to be determined by whether or not sentences obey such rules. Interpretation is believed to be compositionally generated by syntactic rules operating on meaningful words. Semantic parsing is intended to map sentences into formal logic. Failures of LMs to obey strict rules have been taken to reveal that LMs do not produce or understand language like humans. Here we suggest that LMs’ failures to obey symbolic rules may be a feature rather than a bug, because natural languages are not based on rules. New utterances are produced and understood by a combination of flexible interrelated and context-dependent schemata or constructions. We encourage researchers to reimagine appropriate benchmarks and analyses that acknowledge the rich flexible generalizations that comprise natural languages.
zh

[NLP-114] Private Text Generation by Seeding Large Language Model Prompts

【速读】: 该论文旨在解决组织机构(如医院)在不泄露敏感信息的前提下分享文本数据以训练机器学习模型的问题。解决方案的关键在于提出了一种名为差分隐私关键词提示播种(Differentially Private Keyphrase Prompt Seeding, DP-KPS)的方法。该方法通过使用私有样本对提示进行播种,从敏感输入语料库生成私有合成文本语料库,仅通过私有化提示访问大型语言模型(LLM),从而在保持差异隐私的同时捕捉输入语料库的特征并实现所需的输出多样性。

链接: https://arxiv.org/abs/2502.13193
作者: Supriya Nagesh,Justin Y. Chen,Nina Mishra,Tal Wagner
机构: Amazon; MIT (麻省理工学院); Tel-Aviv University (特拉维夫大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus while achieving requisite output diversity and maintaining differential privacy. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones. Our findings offer hope that institutions can reap ML insights by privately sharing data with simple prompts and little compute.
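
"以短语嵌入分布中的私有样本为提示播种"可借助指数机制一类的差分隐私采样来理解。下面是一个高度简化的示意:在公开候选短语上,按与私有语料质心的接近程度做 ε 加权采样;效用函数、敏感度等设定均为本文假设,并非论文算法本身:

```python
import numpy as np

def dp_seed_keyphrases(candidate_embs, private_centroid, epsilon=1.0,
                       n_seeds=5, sensitivity=1.0, rng=None):
    """指数机制风格的采样:效用取与私有语料质心的负距离。"""
    rng = rng or np.random.default_rng(0)
    util = -np.linalg.norm(candidate_embs - private_centroid, axis=1)
    logits = epsilon * util / (2 * sensitivity)
    probs = np.exp(logits - logits.max())   # 数值稳定的 softmax
    probs /= probs.sum()
    return rng.choice(len(candidate_embs), size=n_seeds, p=probs)

cands = np.random.default_rng(1).normal(size=(100, 16))
print(dp_seed_keyphrases(cands, cands[:10].mean(axis=0)))
```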
zh

[NLP-115] MoBA: Mixture of Block Attention for Long-Context LLMs

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)扩展有效上下文长度的问题,以推进其向通用人工智能(Artificial General Intelligence, AGI)的发展。然而,传统注意力机制带来的二次计算复杂度增加构成了一种难以承受的负担。现有的方法要么引入强偏结构,如Sink或窗口注意力,这些方法任务特定性较强;要么大幅修改注意力机制为线性近似,但其在复杂推理任务中的表现尚未得到充分探索。论文提出的关键解决方案是“少结构”原则,使模型能够自主决定注意力分配,而不是引入预定义的偏见。具体而言,作者引入了块注意力混合(Mixture of Block Attention, MoBA),这是一种将专家混合(Mixture of Experts, MoE)原理应用于注意力机制的新颖架构。MoBA的关键优势在于能够在全注意力与稀疏注意力之间平滑过渡,从而提高效率而不牺牲性能。

链接: https://arxiv.org/abs/2502.13189
作者: Enzhe Lu,Zhejun Jiang,Jingyuan Liu,Yulun Du,Tao Jiang,Chao Hong,Shaowei Liu,Weiran He,Enming Yuan,Yuzhi Wang,Zhiqi Huang,Huan Yuan,Suting Xu,Xinran Xu,Guokun Lai,Yanru Chen,Huabin Zheng,Junjie Yan,Jianlin Su,Yuxin Wu,Neo Y. Zhang,Zhilin Yang,Xinyu Zhou,Mingxing Zhang,Jiezhong Qiu
机构: Moonshot AI; Tsinghua University; Zhejiang Lab/Zhejiang University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the ``less structure'' principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at this https URL.
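
MoBA 将 MoE 的 top-k 路由思想用于注意力:把 K/V 按块划分、以块均值作代表,每个查询只在被路由到的块内做注意力。下面是单头、省略因果掩码的 PyTorch 简化示意(块代表与选块方式均为对论文思想的简化,并非官方实现):

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """q, k, v 形状为 (seq_len, dim) 的单头张量;每个查询只对
    top-k 个最相关的 K/V 块做注意力(此处省略因果掩码)。"""
    S, D = k.shape
    nb = S // block_size                       # 假设 S 可被块大小整除
    kb = k.view(nb, block_size, D)
    vb = v.view(nb, block_size, D)
    reps = kb.mean(dim=1)                      # (nb, D) 各块的均值代表
    gates = q @ reps.T                         # (S, nb) 查询-块相关性
    sel = gates.topk(top_k, dim=-1).indices    # (S, top_k) 被路由的块
    out = torch.empty_like(q)
    for i in range(S):
        keys = kb[sel[i]].reshape(-1, D)       # 拼接选中块的键/值
        vals = vb[sel[i]].reshape(-1, D)
        attn = F.softmax(q[i] @ keys.T / D ** 0.5, dim=-1)
        out[i] = attn @ vals
    return out

q = k = v = torch.randn(8, 16)
print(moba_attention(q, k, v).shape)  # torch.Size([8, 16])
```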
zh

[NLP-116] ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs

【速读】: 该论文旨在解决大型语言模型(LLMs)在面对对抗性越狱攻击时所面临的防御策略局限性问题。现有防御方法,包括参数修改和无参数方法,在适应性、可解释性和定制化方面存在不足,限制了其应对不断演化的威胁的有效性。论文提出的关键解决方案是ShieldLearner,这是一种模仿人类学习过程的新范式。通过试错机制,ShieldLearner自主提炼攻击模式到模式图谱,并将防御启发式综合到元分析框架中,从而实现系统的、可解释的威胁检测。此外,引入自适应对抗增强以生成成功防御提示的对抗变体,使系统能够持续自我改进而无需重新训练模型。这些措施共同构成了一个实用且高效的实时对抗防御方案。

链接: https://arxiv.org/abs/2502.13162
作者: Ziyi Ni,Hao Wang,Huacan Wang
机构: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院认知与智能决策系统重点实验室); Institute of Artificial Intelligence, Beijing University of Aeronautics and Astronautics(北京航空航天大学人工智能学院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in various domains but remain vulnerable to adversarial jailbreak attacks. Existing prompt-defense strategies, including parameter-modifying and parameter-free approaches, face limitations in adaptability, interpretability, and customization, constraining their effectiveness against evolving threats. To address these challenges, we propose ShieldLearner, a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas and synthesizes defense heuristics into a Meta-analysis Framework, enabling systematic and interpretable threat detection. Furthermore, we introduce Adaptive Adversarial Augmentation to generate adversarial variations of successfully defended prompts, enabling continuous self-improvement without model retraining. In addition to standard benchmarks, we create a hard test set by curating adversarial prompts from the Wildjailbreak dataset, emphasizing more concealed malicious intent. Experimental results show that ShieldLearner achieves a significantly higher defense success rate than existing baselines on both conventional and hard test sets, while also operating with lower computational overhead, making it a practical and efficient solution for real-world adversarial defense.
zh

[NLP-117] Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?

【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)在测试时扩展(test-time scaling)的能力,并评估其是否真正提升了推理能力。研究发现,较长的思维链条(Chain-of-Thought, CoT)并不总能提升准确性,且正确解往往比错误解更短。论文进一步指出,这一现象与模型的自我修正能力有关,长CoT包含更多自我修订,这通常导致性能下降。为了解决这些问题,论文比较了串行和并行扩展策略,并提出了一种结合并行扩展策略和CoT长度特性的最短多数投票法(Shortest Majority Vote),显著改善了模型的测试时可扩展性。关键在于通过优化扩展策略和控制CoT长度来提升模型性能。

链接: https://arxiv.org/abs/2502.12215
作者: Zhiyuan Zeng,Qinyuan Cheng,Zhangyue Yin,Yunhua Zhou,Xipeng Qiu
机构: School of Computer Science, Fudan University (复旦大学), Shanghai, China; Shanghai AI Laboratory (上海人工智能实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI’s o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models’ self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models’ test-time scalability compared to conventional majority voting approaches.
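
"最短多数投票"的一种可能实现:并行采样多条 CoT,投票时结合长度特征,例如按链长的倒数给票加权,从而利用"正确解往往更短"这一观察。具体权重形式为本文假设,仅作示意:

```python
from collections import defaultdict

def shortest_majority_vote(samples):
    """samples 为并行采样得到的 (answer, cot_text) 列表;
    每票按其 CoT 词数的倒数加权,短链群体胜过同等数量的长链。"""
    weight = defaultdict(float)
    for answer, cot in samples:
        weight[answer] += 1.0 / max(len(cot.split()), 1)
    return max(weight, key=weight.get)

samples = [
    ("42", "short chain with few words"),
    ("42", "another brief chain"),
    ("17", " ".join(["step"] * 200)),   # 一条极长的链
    ("17", " ".join(["step"] * 180)),
]
print(shortest_majority_vote(samples))  # -> "42"
```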
zh

[NLP-118] LaVCa: LLM-assisted Visual Cortex Captioning

【速读】: 该论文旨在解决通过深度神经网络(DNNs)理解人脑神经元群体特性时所面临的黑箱问题,即如何准确解释这些模型预测的体素响应属性。论文的关键解决方案是提出了LaVCa(LLM辅助视觉皮层描述),这是一种数据驱动的方法,利用大规模语言模型(LLMs)为体素选择性图像生成自然语言描述。这种方法能够更准确地生成描述体素选择性的标题,并且在体素间和体素内层面捕捉到更详细的属性,从而揭示视觉皮层感兴趣区域内的细粒度功能分化及同时代表多个不同概念的体素。

链接: https://arxiv.org/abs/2502.13606
作者: Takuya Matsuyama,Shinji Nishimoto,Yu Takagi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 33 pages

点击查看摘要

Abstract:Understanding the property of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that uses large language models (LLMs) to generate natural-language captions for images to which voxels are selective. By applying LaVCa for image-evoked brain activity, we demonstrate that LaVCa generates captions that describe voxel selectivity more accurately than the previously proposed method. Furthermore, the captions generated by LaVCa quantitatively capture more detailed properties than the existing method at both the inter-voxel and intra-voxel levels. Furthermore, a more detailed analysis of the voxel-specific properties generated by LaVCa reveals fine-grained functional differentiation within regions of interest (ROIs) in the visual cortex and voxels that simultaneously represent multiple distinct concepts. These findings offer profound insights into human visual representations by assigning detailed captions throughout the visual cortex while highlighting the potential of LLM-based methods in understanding brain representations. Please check out our webpage at this https URL
zh

[NLP-119] Large Language Models Can Help Mitigate Barren Plateaus

【Quick Read】: This paper addresses the barren-plateau problem that hinders the training of Quantum Neural Networks (QNNs) in the noisy intermediate-scale quantum (NISQ) era, where gradient variance vanishes exponentially as model size grows. The key solution is AdaInit, a novel Large Language Model (LLM)-driven search framework that iteratively searches for QNN initial parameters that maximize gradient variance and thereby mitigate barren plateaus. Unlike conventional one-time initialization methods, AdaInit dynamically refines the QNN's initialization using LLMs with adaptive prompting. A theoretical analysis proves a supremum for the search process, ensuring it can eventually identify the optimal initial parameters of the QNN.

Link: https://arxiv.org/abs/2502.13166
Authors: Jun Zhuang, Chaowen Guan
Affiliations: Boise State University; University of Cincinnati
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: TL;DR: We propose a new LLM-driven framework designed for mitigating barren plateaus

Abstract:In the era of noisy intermediate-scale quantum (NISQ) computing, Quantum Neural Networks (QNNs) have emerged as a promising approach for various applications, yet their training is often hindered by barren plateaus (BPs), where gradient variance vanishes exponentially as the model size increases. To address this challenge, we propose a new Large Language Model (LLM)-driven search framework, AdaInit, that iteratively searches for optimal initial parameters of QNNs to maximize gradient variance and therefore mitigate BPs. Unlike conventional one-time initialization methods, AdaInit dynamically refines QNN’s initialization using LLMs with adaptive prompting. Theoretical analysis of the Expected Improvement (EI) proves a supremum for the search, ensuring this process can eventually identify the optimal initial parameter of the QNN. Extensive experiments across four public datasets demonstrate that AdaInit significantly enhances QNN’s trainability compared to classic initialization methods, validating its effectiveness in mitigating BPs.
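
AdaInit's outer loop, propose an initialization, score it by empirical gradient variance, keep the best, can be sketched generically. Everything below is illustrative: `llm_propose_init` stands in for the adaptively prompted LLM call and `grad_variance` for a quantum-simulator-based estimate; neither reflects the paper's actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)

def llm_propose_init(history, dim):
    # Hypothetical stand-in for the LLM call: AdaInit would use an
    # adaptively prompted model; here we just perturb the best params so far.
    if not history:
        return rng.uniform(-np.pi, np.pi, size=dim)
    best_params, _ = max(history, key=lambda h: h[1])
    return best_params + rng.normal(scale=0.1, size=dim)

def grad_variance(params, n_samples=64, eps=1e-3):
    # Hypothetical proxy: variance of finite-difference directional
    # derivatives of a toy loss over random directions (a QNN simulator
    # would replace this in practice).
    def loss(p, x):
        return np.cos(p @ x) ** 2
    grads = []
    for _ in range(n_samples):
        x = rng.normal(size=params.shape)
        g = (loss(params + eps * x, x) - loss(params - eps * x, x)) / (2 * eps)
        grads.append(g)
    return float(np.var(grads))

history = []
for step in range(20):                       # iterative search
    params = llm_propose_init(history, dim=8)
    history.append((params, grad_variance(params)))

best_init, best_var = max(history, key=lambda h: h[1])
print(f"best gradient variance after search: {best_var:.4f}")
```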

Computer Vision

[CV-0] Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects

【Quick Read】: This paper targets separable 3D reconstruction of two objects from multi-view RGB images, particularly under severe mutual occlusion and ambiguous interaction boundaries. The key to the solution is a new neural implicit method that reconstructs the geometry and appearance of two interacting objects in 3D while avoiding surface inter-penetration and enabling novel-view synthesis of the observed scene. The method is trained end-to-end with a novel alpha-blending regularization that keeps the two geometries cleanly separated even under extreme occlusion.

Link: https://arxiv.org/abs/2502.13968
Authors: Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt
Affiliations: Saarland University; Max Planck Institute for Informatics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 20 figures and 6 tables; International Conference on 3D Vision (3DV) 2025; Project page: this https URL

Abstract:Separable 3D reconstruction of multiple objects from multi-view RGB images – resulting in two different 3D shapes for the two objects with a clear separation between them – remains a sparsely researched problem. It is challenging due to severe mutual occlusions and ambiguities along the objects’ interaction boundaries. This paper investigates the setting and introduces a new neuro-implicit method that can reconstruct the geometry and appearance of two objects undergoing close interactions while disjoining both in 3D, avoiding surface inter-penetrations and enabling novel-view synthesis of the observed scene. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation that ensures that the two geometries are well separated even under extreme occlusions. Our reconstruction method is markerless and can be applied to rigid as well as articulated objects. We introduce a new dataset consisting of close interactions between a human and an object and also evaluate on two scenes of humans performing martial arts. The experiments confirm the effectiveness of our framework and substantial improvements using 3D and novel view synthesis metrics compared to several existing approaches applicable in our setting.

[CV-1] FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

【Quick Read】: This paper addresses a limitation of conventional image tokenization: fixed-length token sequences cannot adapt to an image's inherent complexity. The key is FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. By training a rectified flow model as the decoder and applying nested dropout, FlexTok produces plausible reconstructions for token sequences of different lengths, enabling more efficient and flexible autoregressive image generation.

Link: https://arxiv.org/abs/2502.13967
Authors: Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page at this https URL

Abstract:Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image’s inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID ≤ 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine “visual vocabulary”, and that the number of tokens to generate depends on the complexity of the generation task.
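
Nested dropout is what lets a single decoder handle any prefix length. Here is a minimal PyTorch-style sketch of the idea; the uniform distribution over keep-lengths is an assumption, not necessarily FlexTok's choice:

```python
import torch

def nested_dropout(tokens: torch.Tensor) -> torch.Tensor:
    """Keep a random prefix of an ordered 1D token sequence during training.

    tokens: (batch, seq_len, dim). For each sample, draw a keep-length k
    uniformly in [1, seq_len] and zero out tokens k..seq_len-1, so every
    prefix must carry a usable coarse-to-fine description of the image.
    """
    batch, seq_len, _ = tokens.shape
    k = torch.randint(1, seq_len + 1, (batch, 1))           # keep-lengths
    positions = torch.arange(seq_len).unsqueeze(0)          # (1, seq_len)
    mask = (positions < k).unsqueeze(-1).to(tokens.dtype)   # (batch, seq_len, 1)
    return tokens * mask

x = torch.randn(4, 256, 16)
print(nested_dropout(x).abs().sum(dim=(1, 2)))  # varies with sampled prefix length
```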

[CV-2] A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects

【Quick Read】: This paper addresses precise manipulation of small objects by mobile manipulators in complex environments, especially under occlusion. The key is Servoing with Vision Models (SVM), a framework that pairs an RGB-D wrist camera with visual servoing control and uses state-of-the-art vision models to reliably compute 3D targets from wrist images across diverse tasks. To mitigate occlusion effects, SVM uses vision models to out-paint the end-effector, substantially improving target localization. Open-vocabulary object detectors and point-tracking methods serve as drop-in modules to identify semantic targets and reliable interaction sites, yielding an 85% zero-shot success rate without any training.

Link: https://arxiv.org/abs/2502.13964
Authors: Arjun Gupta, Rishik Sathua, Saurabh Gupta
Affiliations: University of Illinois at Urbana-Champaign
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract:Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop training-free framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM employs an RGB-D wrist camera and uses visual servoing for control. Our novelty lies in the use of state-of-the-art vision models to reliably compute 3D targets from the wrist image for diverse tasks and under occlusion due to the end-effector. To mitigate occlusion artifacts, we employ vision models to out-paint the end-effector thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module to identify semantic targets (e.g. knobs) and point tracking methods can reliably track interaction sites indicated by user clicks. This training-free method obtains an 85% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method and an imitation learning baseline trained on 1000+ demonstrations by 50% in absolute success rate.

[CV-3] IP-Composer: Semantic Composition of Visual Concepts

【Quick Read】: This paper tackles precise control and concept diversity in compositional image generation from multiple visual sources. Existing methods are typically limited to a narrow range of concepts and require expensive training procedures or specialized data. The key is IP-Composer, a novel training-free approach that leverages multiple image references simultaneously while using natural language to describe the concept to extract from each image, enabling more precise control over visual concept composition. Built on IP-Adapter, it synthesizes novel images by stitching together projections of multiple input images onto concept-specific CLIP subspaces identified through text.

Link: https://arxiv.org/abs/2502.13951
Authors: Sara Dorfman, Dana Cohen-Bar, Rinon Gal, Daniel Cohen-Or
Affiliations: Tel Aviv University; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image’s CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.
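
The core operation, projecting image embeddings onto text-derived concept subspaces and stitching the projections, can be sketched in a few lines. The SVD-based subspace construction and the random stand-in embeddings below are assumptions for illustration, not IP-Composer's exact procedure:

```python
import numpy as np

def concept_subspace(text_embs: np.ndarray, rank: int = 4) -> np.ndarray:
    """Orthonormal basis (rank x dim) spanning a concept, built via SVD from
    CLIP text embeddings of several descriptions of that concept."""
    _, _, vt = np.linalg.svd(text_embs, full_matrices=False)
    return vt[:rank]

def compose(image_embs, bases):
    """Stitch a composite embedding: project each reference image's CLIP
    embedding onto its concept subspace and sum the projections."""
    out = np.zeros_like(image_embs[0])
    for emb, basis in zip(image_embs, bases):
        out += basis.T @ (basis @ emb)   # projection onto the subspace
    return out

dim = 768
rng = np.random.default_rng(0)
img_embs = [rng.normal(size=dim) for _ in range(2)]        # two reference images
txt_sets = [rng.normal(size=(16, dim)) for _ in range(2)]  # 16 descriptions per concept
bases = [concept_subspace(t) for t in txt_sets]
composite = compose(img_embs, bases)   # could then condition an IP-Adapter-style decoder
print(composite.shape)                 # (768,)
```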

[CV-4] GPU-Friendly Laplacian Texture Blending

【Quick Read】: This paper addresses the visible seams and loss of contrast that texture and material blending can introduce when rendering virtual worlds, which lead to unnatural results. The key is a new approach based on image-processing insights and Laplacian pyramid blending that needs no precomputation or extra memory, runs in real time on the GPU, preserves sharp local features, and avoids ghosting.

Link: https://arxiv.org/abs/2502.13945
Authors: Bartlomiej Wronski
Affiliations: NVIDIA
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 13 figures, Journal of Computer Graphics Techniques (JCGT)

Abstract:Texture and material blending is one of the leading methods for adding variety to rendered virtual worlds, creating composite materials, and generating procedural content. When done naively, it can introduce either visible seams or contrast loss, leading to an unnatural look not representative of blended textures. Earlier work proposed addressing this problem through careful manual parameter tuning, lengthy per-texture statistics precomputation, look-up tables, or training deep neural networks. In this work, we propose an alternative approach based on insights from image processing and Laplacian pyramid blending. Our approach does not require any precomputation or increased memory usage (other than the presence of a regular, non-Laplacian, texture mipmap chain), does not produce ghosting, preserves sharp local features, and can run in real time on the GPU at the cost of a few additional lower mipmap texture taps.
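
Classic Laplacian pyramid blending, the building block this method starts from, looks roughly as follows with OpenCV; the paper's contribution is a GPU-friendly, precomputation-free real-time variant of this idea, which this CPU sketch does not attempt to reproduce:

```python
import cv2
import numpy as np

def laplacian_blend(a, b, mask, levels=5):
    """Blend float32 images a and b (HxWx3) with a per-pixel weight mask
    (float32, HxW, values in [0, 1]) via Laplacian pyramid blending."""
    ga, gb, gm = [a], [b], [mask]
    for _ in range(levels):                        # Gaussian pyramids
        ga.append(cv2.pyrDown(ga[-1]))
        gb.append(cv2.pyrDown(gb[-1]))
        gm.append(cv2.pyrDown(gm[-1]))
    blended = None
    for i in range(levels, -1, -1):                # collapse coarse to fine
        if i == levels:
            la, lb = ga[i], gb[i]                  # coarsest level: Gaussian
        else:
            size = (ga[i].shape[1], ga[i].shape[0])
            la = ga[i] - cv2.pyrUp(ga[i + 1], dstsize=size)   # Laplacian bands
            lb = gb[i] - cv2.pyrUp(gb[i + 1], dstsize=size)
        m = gm[i][..., None]
        level = m * la + (1.0 - m) * lb            # blend this frequency band
        if blended is None:
            blended = level
        else:
            size = (level.shape[1], level.shape[0])
            blended = cv2.pyrUp(blended, dstsize=size) + level
    return np.clip(blended, 0.0, 1.0)
```

Blending each frequency band separately is what removes both the hard seams of naive masking and the contrast loss of wide-feathered alpha blending.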

[CV-5] A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models

【Quick Read】: This paper addresses the cross-modal domain gap that persists in few-shot image captioning with large vision and language models. The key solution is a chain-of-thought (CoT) meta-learning strategy that treats captioning as a multi-step procedure, better imitating how humans describe images. The paper further proposes learning distinct meta-parameters for each CoT step in separate subspaces to avoid interference. The method is evaluated under few-shot settings on MSCOCO, Flickr8k, and Flickr30k and outperforms the baselines.

Link: https://arxiv.org/abs/2502.13942
Authors: Hao Huang, Shuaihang Yuan, Yu Hao, Congcong Wen, Yi Fang
Affiliations: NYU Embodied AI and Robotics Lab, New York University; New York University Abu Dhabi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 5 tables

Abstract:A large-scale vision and language model that has been pretrained on massive data encodes visual and linguistic prior, which makes it easier to generate images and language that are more natural and realistic. Despite this, there is still a significant domain gap between the modalities of vision and language, especially when training data is scarce in few-shot settings, where only very limited data are available for training. In order to mitigate this issue, a multi-modal meta-learning framework has been proposed to bridge the gap between two frozen pretrained large vision and language models by introducing a tunable prompt connecting these two large models. For few-shot image captioning, the existing multi-modal meta-learning framework utilizes a one-step prompting scheme to accumulate the visual features of input images to guide the language model, which struggles to generate accurate image descriptions with only a few training samples. Instead, we propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images. In addition, we further propose to learn different meta-parameters of the model corresponding to each CoT step in distinct subspaces to avoid interference. We evaluated our method on three commonly used image captioning datasets, i.e., MSCOCO, Flickr8k, and Flickr30k, under few-shot settings. The results of our experiments indicate that our chain-of-thought subspace meta-learning strategy is superior to the baselines in terms of performance across different datasets measured by different metrics.

[CV-6] Image compositing is all you need for data augmentation

【Quick Read】: This paper aims to improve the robustness and detection accuracy of object detection models when annotated data is limited. The key finding from comparing different data augmentation techniques is that image compositing delivers the largest gains in detection performance, measured by precision, recall, and mean average precision (mAP@0.50). Advanced generative models such as Stable Diffusion XL and ControlNet also show significant improvements, underscoring the potential of advanced data augmentation for object detection tasks.

Link: https://arxiv.org/abs/2502.13936
Authors: Ang Jia Ning Shermaine, Michalis Lazarou, Tania Stathaki
Affiliations: Imperial College London; University of Surrey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in VISAPP 2025

Abstract:This paper investigates the impact of various data augmentation techniques on the performance of object detection models. Specifically, we explore classical augmentation methods, image compositing, and advanced generative models such as Stable Diffusion XL and ControlNet. The objective of this work is to enhance model robustness and improve detection accuracy, particularly when working with limited annotated data. Using YOLOv8, we fine-tune the model on a custom dataset consisting of commercial and military aircraft, applying different augmentation strategies. Our experiments show that image compositing offers the highest improvement in detection performance, as measured by precision, recall, and mean Average Precision (mAP@0.50). Other methods, including Stable Diffusion XL and ControlNet, also demonstrate significant gains, highlighting the potential of advanced data augmentation techniques for object detection tasks. The results underline the importance of dataset diversity and augmentation in achieving better generalization and performance in real-world applications. Future work will explore the integration of semi-supervised learning methods and further optimizations to enhance model performance across larger and more complex datasets.
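
The generic cut-and-paste form of image compositing is easy to reproduce with PIL; the scale range and placement policy below are assumptions, since the abstract does not specify the paper's exact recipe. The pasted box would supply the new bounding-box label for YOLO fine-tuning:

```python
import random
from PIL import Image

def composite_augment(background: Image.Image, cutout: Image.Image) -> Image.Image:
    """Paste an RGBA object cutout onto a background at a random scale and
    position, the basic image-compositing augmentation."""
    bg = background.convert("RGB").copy()
    scale = random.uniform(0.3, 0.9)                 # assumed scale range
    w = max(1, int(cutout.width * scale))
    h = max(1, int(cutout.height * scale))
    obj = cutout.convert("RGBA").resize((w, h))
    x = random.randint(0, max(0, bg.width - w))
    y = random.randint(0, max(0, bg.height - h))
    bg.paste(obj, (x, y), mask=obj)                  # alpha channel controls blending
    return bg       # the (x, y, w, h) box would become the new detection label
```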

[CV-7] Continually Learning Structured Visual Representations via Network Refinement with Rerelation

【Quick Read】: This paper addresses the information loss and opacity of traditional neural networks in visual processing. The key is a structured, continual-learning approach that refines networks to capture the core structure of objects and efficiently represent significant structural variants, enabling incremental learning without overwriting prior knowledge and yielding compact, comprehensible representations.

Link: https://arxiv.org/abs/2502.13935
Authors: Zeki Doruk Erden, Boi Faltings
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The current machine learning paradigm relies on continuous representations like neural networks, which iteratively adjust parameters to approximate outcomes rather than directly learning the structure of the problem. This spreads information across the network, causing issues like information loss and incomprehensibility. Building on prior work in environment dynamics modeling, we propose a method that learns visual space in a structured, continual manner. Our approach refines networks to capture the core structure of objects while representing significant subvariants in structure efficiently. We demonstrate this with 2D shape detection, showing incremental learning on MNIST without overwriting knowledge and creating compact, comprehensible representations. These results offer a promising step toward a transparent, continually learning alternative to traditional neural networks for visual processing.

[CV-8] NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants ICRA2025

【Quick Read】: This paper addresses the challenge household robots face when navigating unfamiliar environments, in particular recognizing and reasoning about novel decoration and layout. Existing reinforcement learning methods do not transfer directly to new environments because they rely on extensive mapping and exploration, wasting time and resources. The proposed method, NavigateDiff, transfers the logical knowledge and generalization ability of pretrained foundation models to zero-shot navigation. The key is a visual predictor, built by integrating a large vision-language model with a diffusion network, that continuously predicts what the agent may observe at the next step, helping the robot generate robust actions. To suit the temporal nature of navigation, historical information is incorporated so that predicted images stay aligned with the navigation scene, and an information-fusion framework embeds the predicted future frames as guidance into the goal-reaching policy, improving navigation control and generalization across simulated and real-world environments.

Link: https://arxiv.org/abs/2502.13894
Authors: Yiran Qin, Ao Sun, Yuze Hong, Benyou Wang, Ruimao Zhang
Affiliations: Sun Yat-sen University; The Chinese University of Hong Kong, Shenzhen
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICRA2025

Abstract:Navigating unfamiliar environments presents significant challenges for household robots, requiring the ability to recognize and reason about novel decoration and layout. Existing reinforcement learning methods cannot be directly transferred to new environments, as they typically rely on extensive mapping and exploration, making them time-consuming and inefficient. To address these challenges, we try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation. By integrating a large vision-language model with a diffusion network, our approach, named NavigateDiff, constructs a visual predictor that continuously predicts the agent’s potential observations in the next step, which can assist robots in generating robust actions. Furthermore, to adapt the temporal property of navigation, we introduce temporal historical information to ensure that the predicted image is aligned with the navigation scene. We then carefully designed an information fusion framework that embeds the predicted future frames as guidance into goal-reaching policy to solve downstream image navigation tasks. This approach enhances navigation control and generalization across both simulated and real-world environments. Through extensive experimentation, we demonstrate the robustness and versatility of our method, showcasing its potential to improve the efficiency and effectiveness of robotic navigation in diverse settings.

[CV-9] Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition

【Quick Read】: This paper addresses multi-view surgical activity recognition (SAR) in the operating room, where capturing fine-grained clinician movements and multi-view knowledge is difficult and existing methods often require calibrated multi-view camera setups and advanced point-cloud processing for better results. The key is PreViPS, a novel calibration-free multi-view multi-modal pretraining framework that aligns 2D pose and visual embeddings across camera views. It introduces a tokenized discrete representation for continuous 2D human pose coordinates, along with cross- and in-modality geometric constraints and a masked pose-token prediction strategy to strengthen representation learning, improving data efficiency and delivering superior performance on several operating-room datasets.

Link: https://arxiv.org/abs/2502.13883
Authors: Idris Hamoud, Vinkle Srivastav, Muhammad Abdullah Jamal, Didier Mutter, Omid Mohareri, Nicolas Padoy
Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, France; IHU Strasbourg, Strasbourg 67000, France; Intuitive Surgical Inc., Sunnyvale, USA; University Hospital of Strasbourg, Strasbourg 67000, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding the workflow of surgical procedures in complex operating rooms requires a deep understanding of the interactions between clinicians and their environment. Surgical activity recognition (SAR) is a key computer vision task that detects activities or phases from multi-view camera recordings. Existing SAR models often fail to account for fine-grained clinician movements and multi-view knowledge, or they require calibrated multi-view camera setups and advanced point-cloud processing to obtain better results. In this work, we propose a novel calibration-free multi-view multi-modal pretraining framework called Multiview Pretraining for Video-Pose Surgical Activity Recognition (PreViPS), which aligns 2D pose and vision embeddings across camera views. Our model follows a CLIP-style dual-encoder architecture: one encoder processes visual features, while the other encodes human pose embeddings. To handle the continuous 2D human pose coordinates, we introduce a tokenized discrete representation to convert the continuous 2D pose coordinates into discrete pose embeddings, thereby enabling efficient integration within the dual-encoder framework. To bridge the gap between these two modalities, we propose several pretraining objectives using cross- and in-modality geometric constraints within the embedding space and incorporating a masked pose token prediction strategy to enhance representation learning. Extensive experiments and ablation studies demonstrate improvements over the strong baselines, while data-efficiency experiments on two distinct operating room datasets further highlight the effectiveness of our approach. We highlight the benefits of our approach for surgical activity recognition in both multi-view and single-view settings, showcasing its practical applicability in complex surgical environments. Code will be made available at: this https URL.
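
One common way to turn continuous 2D pose coordinates into discrete tokens is uniform per-axis binning, sketched below; the bin count and token-id layout are assumptions for illustration, not PreViPS's published scheme:

```python
import numpy as np

def tokenize_pose(keypoints: np.ndarray, bins: int = 128) -> np.ndarray:
    """Map normalized 2D keypoints (J, 2) in [0, 1] to discrete token ids.

    Each coordinate is quantized into one of `bins` uniform bins; x and y
    use disjoint id ranges so a shared embedding table can serve both.
    """
    q = np.clip((keypoints * bins).astype(int), 0, bins - 1)   # (J, 2)
    x_tokens = q[:, 0]                 # ids in [0, bins)
    y_tokens = q[:, 1] + bins          # ids in [bins, 2*bins)
    return np.stack([x_tokens, y_tokens], axis=1)

pose = np.random.rand(17, 2)           # e.g., 17 COCO-style keypoints
print(tokenize_pose(pose)[:3])
```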

[CV-10] MEX: Memory-efficient Approach to Referring Multi-Object Tracking ATC

【Quick Read】: This paper addresses Referring Multi-Object Tracking (RMOT), an emerging research direction at the intersection of computer vision and natural language processing. Unlike traditional multi-object tracking, RMOT not only identifies and tracks objects but also incorporates textual descriptions to improve understanding of object classes. The key is a new module, Memory-Efficient Cross-modality (MEX), which can be applied directly to off-the-shelf trackers such as iKUN and significantly improves their performance. Inference tests on a single GPU demonstrate its effectiveness and efficiency, notably on the HOTA tracking score, while substantially improving memory allocation and processing speed.

Link: https://arxiv.org/abs/2502.13875
Authors: Huu-Thien Tran, Phuoc-Sang Pham, Thai-Son Tran, Khoa Luu
Affiliations: University of Science, Vietnam National University, Ho Chi Minh City, Vietnam; Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures, 2024 International Conference on Advanced Technologies for Communications (ATC), Signal Processing Track

Abstract:Referring Multi-Object Tracking (RMOT) is a relatively new concept that has rapidly gained traction as a promising research direction at the intersection of computer vision and natural language processing. Unlike traditional multi-object tracking, RMOT identifies and tracks objects and incorporates textual descriptions for object class names, making the approach more intuitive. Various techniques have been proposed to address this challenging problem; however, most require the training of the entire network due to their end-to-end nature. Among these methods, iKUN has emerged as a particularly promising solution. Therefore, we further explore its pipeline and enhance its performance. In this paper, we introduce a practical module dubbed Memory-Efficient Cross-modality – MEX. This memory-efficient technique can be directly applied to off-the-shelf trackers like iKUN, resulting in significant architectural improvements. Our method proves effective during inference on a single GPU with 4 GB of memory. Among the various benchmarks, the Refer-KITTI dataset, which offers diverse autonomous driving scenes with relevant language expressions, is particularly useful for studying this problem. Empirically, our method demonstrates effectiveness and efficiency regarding HOTA tracking scores, substantially improving memory allocation and processing speed.

[CV-11] MSVCOD: A Large-Scale Multi-Scene Dataset for Video Camouflage Object Detection

【Quick Read】: This paper addresses video camouflaged object detection (VCOD) beyond wildlife scenarios. The key contributions are MSVCOD, a large-scale multi-domain VCOD dataset built with a semi-automatic iterative annotation pipeline that keeps annotation quality high while reducing cost, and a one-stream video camouflaged object detection model that performs feature extraction and information fusion without additional motion-feature fusion modules. The framework achieves state-of-the-art results on the existing animal VCOD dataset and on the proposed MSVCOD.

Link: https://arxiv.org/abs/2502.13859
Authors: Shuyong Gao, Yu’ang Feng, Qishan Wang, Lingyi Hong, Xinyu Zhou, Liu Fei, Yan Wang, Wenqiang Zhang
Affiliations: Fudan University; Keenon Robotics Co. Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Video Camouflaged Object Detection (VCOD) is a challenging task which aims to identify objects that are seamlessly concealed within the background in videos. The dynamic properties of video enable detection of camouflaged objects through motion cues or varied perspectives. Previous VCOD datasets primarily contain animal objects, limiting the scope of research to wildlife scenarios. However, the applications of VCOD extend beyond wildlife and have significant implications in security, art, and medical fields. Addressing this problem, we construct a new large-scale multi-domain VCOD dataset MSVCOD. To achieve high-quality annotations, we design a semi-automatic iterative annotation pipeline that reduces costs while maintaining annotation accuracy. Our MSVCOD is the largest VCOD dataset to date, introducing multiple object categories including human, animal, medical, and vehicle objects for the first time, while also expanding background diversity across various environments. This expanded scope increases the practical applicability of the VCOD task in camouflaged object detection. Alongside this dataset, we introduce a one-stream video camouflaged object detection model that performs both feature extraction and information fusion without additional motion feature fusion modules. Our framework achieves state-of-the-art results on the existing VCOD animal dataset and the proposed MSVCOD. The dataset and code will be made publicly available.

[CV-12] MagicGeo: Training-Free Text-Guided Geometric Diagram Generation

【Quick Read】: This paper addresses the difficulty of generating geometric diagrams: traditional methods are resource-intensive and struggle to guarantee precise spatial relationships. The key of the proposed MagicGeo framework is to cast diagram generation as a coordinate optimization problem, ensure geometric correctness with a formal-language solver, and exploit the strong translation ability of large language models for coordinate-aware generation, providing an accurate and scalable solution for automated diagram generation.

Link: https://arxiv.org/abs/2502.13855
Authors: Junxiao Wang, Ting Zhang, Heng Yu, Jingdong Wang, Hua Huang
Affiliations: Beijing Normal University; Baidu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Geometric diagrams are critical in conveying mathematical and scientific concepts, yet traditional diagram generation methods are often manual and resource-intensive. While text-to-image generation has made strides in photorealistic imagery, creating accurate geometric diagrams remains a challenge due to the need for precise spatial relationships and the scarcity of geometry-specific datasets. This paper presents MagicGeo, a training-free framework for generating geometric diagrams from textual descriptions. MagicGeo formulates the diagram generation process as a coordinate optimization problem, ensuring geometric correctness through a formal language solver, and then employs coordinate-aware generation. The framework leverages the strong language translation capability of large language models, while formal mathematical solving ensures geometric correctness. We further introduce MagicGeoBench, a benchmark dataset of 220 geometric diagram descriptions, and demonstrate that MagicGeo outperforms current methods in both qualitative and quantitative evaluations. This work provides a scalable, accurate solution for automated diagram generation, with significant implications for educational and academic applications.
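
Casting a diagram as a coordinate optimization problem can be illustrated with a toy constraint set and scipy; MagicGeo's actual pipeline uses an LLM plus a formal-language solver, so everything here, including the constraints, is a simplified stand-in:

```python
import numpy as np
from scipy.optimize import minimize

def residuals(flat):
    """Toy diagram spec: A, B, C form an isosceles triangle (|AB| = |AC|)
    and M is the midpoint of BC, with |BC| fixed to set the scale."""
    a, b, c, m = flat.reshape(4, 2)
    r1 = np.linalg.norm(a - b) - np.linalg.norm(a - c)   # |AB| = |AC|
    r2 = m - (b + c) / 2.0                               # M = midpoint(B, C)
    r3 = np.linalg.norm(b - c) - 2.0                     # |BC| = 2
    return np.concatenate([[r1], r2, [r3]])

def objective(flat):
    # Sum of squared constraint violations: zero means a valid diagram.
    return float(np.sum(residuals(flat) ** 2))

x0 = np.random.default_rng(0).normal(size=8)             # random initial layout
sol = minimize(objective, x0, method="BFGS")
print(sol.x.reshape(4, 2).round(3), objective(sol.x))    # near-zero residual
```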

[CV-13] Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge

【Quick Read】: This paper addresses building construction-year estimation, which matters for sustainability: sustainable buildings minimize energy consumption and are a key part of responsible, sustainable urban planning and development to effectively combat climate change. The key is using Artificial Intelligence (AI), specifically Transformer models, to estimate a building's construction epoch from a multi-modal dataset, the Map your City Dataset (MyCD), which contains top-view Very High Resolution (VHR) imagery, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 constellation, and street-view images across many European cities, co-localized with the target building and labeled with the construction epoch. The paper evaluates EO generalization to previously unseen cities held out from training, presents the Top-4 models and main evaluation results of the community data challenge organized on MyCD, and shows that models perform well on this difficult real-world task even in unseen cities and even when using only the two top-view modalities (VHR and Sentinel-2) at inference.

Link: https://arxiv.org/abs/2502.13818
Authors: Nikolaos Dionelis, Nicolas Longépé, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold
Affiliations: European Space Agency (ESA), Φ-lab, ESRIN, Italy; MindEarth, Switzerland; Sinergise/Planet, Slovenia; TelePIX, Seoul, South Korea; Axelspace Corporation, Tokyo, Japan; Helmholtz Institute Hereon, Germany; German Climate Computing Center DKRZ, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 12 figures

Abstract:Estimating the construction year of buildings is of great importance for sustainability. Sustainable buildings minimize energy consumption and are a key part of responsible and sustainable urban planning and development to effectively combat climate change. By using Artificial Intelligence (AI) and recently proposed Transformer models, we are able to estimate the construction epoch of buildings from a multi-modal dataset. In this paper, we introduce a new benchmark multi-modal dataset, i.e. the Map your City Dataset (MyCD), containing top-view Very High Resolution (VHR) images, Earth Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite constellation, and street-view images in many different cities in Europe, co-localized with respect to the building under study and labelled with the construction epoch. We assess EO generalization performance on new/ previously unseen cities that have been held-out from training and appear only during inference. In this work, we present the community-based data challenge we organized based on MyCD. The ESA AI4EO Challenge MapYourCity was opened in 2024 for 4 months. Here, we present the Top-4 performing models, and the main evaluation results. During inference, the performance of the models using both all three input modalities and only the two top-view modalities, i.e. without the street-view images, is examined. The evaluation results show that the models are effective and can achieve good performance on this difficult real-world task of estimating the age of buildings, even on previously unseen cities, as well as even using only the two top-view modalities (i.e. VHR and Sentinel-2) during inference.

[CV-14] 3D Gaussian Splatting aided Localization for Large and Complex Indoor-Environments

【Quick Read】: This paper addresses the limited accuracy and reliability of visual localization in challenging scenes. The key is to augment established methods with rendered images: a modern visual SLAM approach produces a 3D Gaussian Splatting (3DGS) based map, and the reference data are enriched with images rendered from the 3DGS at randomly sampled poses, which significantly improves both geometry-based visual localization and Scene Coordinate Regression (SCR) methods.

Link: https://arxiv.org/abs/2502.13803
Authors: Vincent Ress, Jonas Meyer, Wei Zhang, David Skuddis, Uwe Soergel, Norbert Haala
Affiliations: Institute for Photogrammetry and Geoinformatics, University of Stuttgart, Germany; Institute of Geomatics, University of Applied Sciences and Arts Northwestern Switzerland, Switzerland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:The field of visual localization has been researched for several decades and has meanwhile found many practical applications. Despite the strong progress in this field, there are still challenging situations in which established methods fail. We present an approach to significantly improve the accuracy and reliability of established visual localization methods by adding rendered images. In detail, we first use a modern visual SLAM approach that provides a 3D Gaussian Splatting (3DGS) based map to create reference data. We demonstrate that enriching reference data with images rendered from 3DGS at randomly sampled poses significantly improves the performance of both geometry-based visual localization and Scene Coordinate Regression (SCR) methods. Through comprehensive evaluation in a large industrial environment, we analyze the performance impact of incorporating these additional rendered views.

[CV-15] From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

【Quick Read】: This paper addresses the limits of large language models (LLMs) in personalized education, which stem from an overemphasis on answer correctness at the expense of error diagnosis and feedback generation. It makes three key contributions: MathCCS (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark for systematic error analysis and tailored feedback; a sequential error-analysis framework that uses historical data to track trends and improve diagnostic precision; and a multi-agent collaborative framework that combines a time-series agent with an LLM agent to strengthen error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education and narrowing the gap between current AI capabilities and real-world teaching needs.

Link: https://arxiv.org/abs/2502.13789
Authors: Yi-Fan Zhang, Hang Li, Dingjie Song, Lichao Sun, Tianlong Xu, Qingsong Wen
Affiliations: National Laboratory of Pattern Recognition, University of Chinese Academy of Sciences; Computer Science Department, Michigan State University; CUHK-Shenzhen Natural Language Processing Group; Computer Science and Engineering, Lehigh University; Squirrel Ai Learning Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Language Models (LLMs), such as GPT-4, have demonstrated impressive mathematical reasoning capabilities, achieving near-perfect performance on benchmarks like GSM8K. However, their application in personalized education remains limited due to an overemphasis on correctness over error diagnosis and feedback generation. Current models fail to provide meaningful insights into the causes of student mistakes, limiting their utility in educational contexts. To address these challenges, we present three key contributions. First, we introduce MathCCS (Mathematical Classification and Constructive Suggestions), a multi-modal benchmark designed for systematic error analysis and tailored feedback. MathCCS includes real-world problems, expert-annotated error categories, and longitudinal student data. Evaluations of state-of-the-art models, including Qwen2-VL, LLaVA-OV, Claude-3.5-Sonnet and GPT-4o, reveal that none achieved classification accuracy above 30% or generated high-quality suggestions (average scores below 4/10), highlighting a significant gap from human-level performance. Second, we develop a sequential error analysis framework that leverages historical data to track trends and improve diagnostic precision. Finally, we propose a multi-agent collaborative framework that combines a Time Series Agent for historical analysis and an MLLM Agent for real-time refinement, enhancing error classification and feedback generation. Together, these contributions provide a robust platform for advancing personalized education, bridging the gap between current AI capabilities and the demands of real-world teaching.

[CV-16] An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

【Quick Read】: This paper addresses the time-consuming and error-prone manual visual inspection traditionally used for rice classification and quality assessment. The key is a real-time, comprehensive rice-grain evaluation mechanism that integrates a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques to perform rice variety identification, grain-completeness grading, and grain-chalkiness evaluation.

Link: https://arxiv.org/abs/2502.13764
Authors: Wanke Xia, Ruxin Peng, Haoqi Chu, Xinlei Zhu, Zhiyu Yang, Yaojun Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rice is one of the most widely cultivated crops globally and has been developed into numerous varieties. The quality of rice during cultivation is primarily determined by its cultivar and characteristics. Traditionally, rice classification and quality assessment rely on manual visual inspection, a process that is both time-consuming and prone to errors. However, with advancements in machine vision technology, automating rice classification and quality evaluation based on its cultivar and characteristics has become increasingly feasible, enhancing both accuracy and efficiency. This study proposes a real-time evaluation mechanism for comprehensive rice grain assessment, integrating a one-stage object detection approach, a deep convolutional neural network, and traditional machine learning techniques. The proposed framework enables rice variety identification, grain completeness grading, and grain chalkiness evaluation. The rice grain dataset used in this study comprises approximately 20,000 images from six widely cultivated rice varieties in China. Experimental results demonstrate that the proposed mechanism achieves a mean average precision (mAP) of 99.14% in the object detection task and an accuracy of 97.89% in the classification task. Furthermore, the framework attains an average accuracy of 97.56% in grain completeness grading within the same rice variety, contributing to an effective quality evaluation system.

[CV-17] Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework

【Quick Read】: This paper addresses the coarse, imprecise, and hard-to-interpret outputs of existing geolocation methods, a problem rooted in the quality and scale of current geolocation datasets, which are typically small, automatically constructed, noisy, and inconsistent in difficulty, with images that either give away the answer or lack sufficient clues for reliable inference. The key is a comprehensive geolocation framework with three components: GeoComp (Geolocation Competition Dataset), a large-scale dataset collected over two years from a geolocation game platform with 740K users, comprising 25 million metadata entries and 3 million geo-tagged locations, each annotated thousands to tens of thousands of times by human users and spanning diverse difficulty levels; GeoCoT (Geographical Chain-of-Thought), a novel multi-step reasoning method that integrates contextual and spatial cues to mimic human geolocation reasoning, boosting accuracy by up to 25% while improving interpretability; and GeoEval, a new evaluation metric.

Link: https://arxiv.org/abs/2502.13759
Authors: Zirui Song, Jingpu Yang, Yuan Huang, Jonathan Tonglet, Zeyu Zhang, Tao Cheng, Meng Fang, Iryna Gurevych, Xiuying Chen
Affiliations: MBZUAI, United Arab Emirates; Northeastern University, China; TU Darmstadt and KU Leuven, Germany; Australian National University, Australia; University College London, United Kingdom; University of Liverpool, United Kingdom
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Access dataset: this https URL

Abstract:Geolocation, the task of identifying an image’s location, requires complex reasoning and is crucial for navigation, monitoring, and cultural preservation. However, current methods often produce coarse, imprecise, and non-interpretable localization. A major challenge lies in the quality and scale of existing geolocation datasets. These datasets are typically small-scale and automatically constructed, leading to noisy data and inconsistent task difficulty, with images that either reveal answers too easily or lack sufficient clues for reliable inference. To address these challenges, we introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric, collectively designed to address critical challenges and drive advancements in geolocation research. At the core of this framework is GeoComp (Geolocation Competition Dataset), a large-scale dataset collected from a geolocation game platform involving 740K users over two years. It comprises 25 million entries of metadata and 3 million geo-tagged locations spanning much of the globe, with each location annotated thousands to tens of thousands of times by human users. The dataset offers diverse difficulty levels for detailed analysis and highlights key gaps in current models. Building on this dataset, we propose Geographical Chain-of-Thought (GeoCoT), a novel multi-step reasoning framework designed to enhance the reasoning capabilities of Large Vision Models (LVMs) in geolocation tasks. GeoCoT improves performance by integrating contextual and spatial cues through a multi-step process that mimics human geolocation reasoning. Finally, using the GeoEval metric, we demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.

[CV-18] Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning ICASSP

【Quick Read】: This paper addresses the shallow, simplified representations of object behavior in existing video captioning methods, which lead to superficial and ambiguous descriptions. To capture the essence of object behavior comprehensively, it proposes a dynamic action semantic-aware graph transformer. The key components are a multi-scale temporal modeling module that flexibly learns long- and short-term latent action features, attending to both cross-scale and local action details, and a visual-action semantic-aware module that adaptively captures semantic representations related to object behavior, improving the richness and accuracy of action representations. Together, the two modules yield rich behavior representations for generating human-like natural descriptions; these representations, combined with object representations, form a temporal objects-action graph that is fed into a graph transformer to model the complex temporal dependencies between objects and actions.

Link: https://arxiv.org/abs/2502.13754
Authors: Caihua Liu, Xu Li, Wenjing Xue, Wei Tang, Xia Feng
Affiliations: College of Computer Science and Technology, Civil Aviation University of China, Tianjin, China; Key Laboratory of Smart Airport Theory and System, CAAC, Tianjin, China; Science and Technology Innovation Research Institute, Civil Aviation University of China, Tianjin, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, published ICASSP

Abstract:Existing video captioning methods merely provide shallow or simplistic representations of object behaviors, resulting in superficial and ambiguous descriptions. However, object behavior is dynamic and complex. To comprehensively capture the essence of object behavior, we propose a dynamic action semantic-aware graph transformer. Firstly, a multi-scale temporal modeling module is designed to flexibly learn long and short-term latent action features. It not only acquires latent action features across time scales, but also considers local latent action details, enhancing the coherence and sensitivity of latent action representations. Secondly, a visual-action semantic aware module is proposed to adaptively capture semantic representations related to object behavior, enhancing the richness and accuracy of action representations. By harnessing the collaborative efforts of these two modules, we can acquire rich behavior representations to generate human-like natural descriptions. Finally, these rich behavior representations and object representations are used to construct a temporal objects-action graph, which is fed into the graph transformer to model the complex temporal dependencies between objects and actions. To avoid adding complexity in the inference phase, the behavioral knowledge of the objects will be distilled into a simple network through knowledge distillation. The experimental results on MSVD and MSR-VTT datasets demonstrate that the proposed method achieves significant performance improvements across multiple metrics.

[CV-19] Benchmarking of Different YOLO Models for CAPTCHAs Detection and Classification

【Quick Read】: This paper analyzes and compares YOLOv5, YOLOv8, and YOLOv10 for webpage CAPTCHA detection, using datasets collected from the web and the darknet plus synthesized webpage data. It evaluates the nano (n), small (s), and medium (m) variants of each YOLO architecture with Precision, Recall, F1 score, mAP@50, and inference speed to gauge real-life utility, and examines how well a trained model can be tuned to recognize new CAPTCHA patterns efficiently, a crucial requirement for real-life applications. The key lies in benchmarking YOLO variants of different complexity and in an image-slicing method proposed to improve detection on oversized input images.

Link: https://arxiv.org/abs/2502.13740
Authors: Mikołaj Wysocki, Henryk Gierszal, Piotr Tyczka, Sophia Karagiorgou, George Pantelis
Affiliations: ITTI Sp.z o.o.; UBITECH; Adam Mickiewicz University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper provides an analysis and comparison of the YOLOv5, YOLOv8 and YOLOv10 models for webpage CAPTCHA detection using datasets collected from the web and darknet as well as synthesized data of webpages. The study examines the nano (n), small (s), and medium (m) variants of YOLO architectures and uses metrics such as Precision, Recall, F1 score, mAP@50 and inference speed to determine the real-life utility. Additionally, the possibility of tuning the trained model to detect new CAPTCHA patterns efficiently was examined as it is a crucial part of real-life applications. The image slicing method was proposed as a way to improve the metrics of detection on oversized input images, which can be a common scenario in webpage analysis. The nano variants achieved the best results in terms of speed, while more complex architectures scored better in terms of other metrics.
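
An image-slicing scheme for oversized inputs reduces to overlapping tiling plus box offsetting. In this sketch, `detect` is a placeholder for any YOLO-style model, and the tile size and overlap are assumptions; the paper's exact parameters are not given in the abstract:

```python
def sliced_detection(image, detect, tile=640, overlap=128):
    """Run a detector on overlapping tiles of an oversized image and map
    boxes back to full-image coordinates.

    image: HxWxC array; detect(tile) -> list of (x1, y1, x2, y2, score, cls).
    A final NMS over `results` is still needed to merge duplicate boxes
    from the overlap regions.
    """
    h, w = image.shape[:2]
    step = tile - overlap
    results = []
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            for (bx1, by1, bx2, by2, score, cls) in detect(image[y0:y1, x0:x1]):
                # Offset tile-local boxes into full-image coordinates.
                results.append((bx1 + x0, by1 + y0, bx2 + x0, by2 + y0, score, cls))
    return results
```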

[CV-20] CARE: Confidence-Aware Regression Estimation of building density fine-tuning EO Foundation Models

【Quick Read】: This paper addresses confidence quantification and assessment for deep neural networks on pixel-wise regression tasks, which is needed in practice to predict model failures, improve performance, and extend capabilities in real-world applications. Unlike classification tasks such as semantic segmentation, pixel-wise regression typically has no softmax output layer with which to express uncertainty. The key is the proposed Confidence-Aware Regression Estimation (CARE) model, which computes and assigns confidence to regression outputs. Experiments on Copernicus Sentinel-2 data for building-density estimation show that the method applies successfully to regression problems and outperforms other approaches.

Link: https://arxiv.org/abs/2502.13734
Authors: Nikolaos Dionelis, Jente Bosmans, Nicolas Longépé
Affiliations: European Space Agency (ESA), Φ-lab, ESRIN, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, Submitted

Abstract:Performing accurate confidence quantification and assessment is important for deep neural networks to predict their failures, improve their performance and enhance their capabilities in real-world applications, and for their practical deployment in real life. For pixel-wise regression tasks, confidence quantification and assessment have not been well addressed in the literature, in contrast to classification tasks like semantic segmentation. The softmax output layer is not used in deep neural networks that solve pixel-wise regression problems. In this paper, to address these problems, we develop, train and evaluate the proposed model Confidence-Aware Regression Estimation (CARE). Our model CARE computes and assigns confidence to regression output results. We focus on solving regression problems as downstream tasks of an AI Foundation Model for Earth Observation (EO). We evaluate the proposed model CARE, and experimental results on data from the Copernicus Sentinel-2 satellite constellation for estimating the density of buildings show that the proposed method can be successfully applied to regression problems. We also show that our approach outperforms other methods.
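
The abstract does not spell out how CARE computes confidence, so the sketch below shows one standard way to attach confidence to pixel-wise regression: a heteroscedastic head trained with the Gaussian negative log-likelihood, whose predicted variance yields an inverse-confidence map. This is an assumption for illustration, not CARE's published loss:

```python
import torch

def heteroscedastic_loss(mean, log_var, target):
    """Gaussian NLL for pixel-wise regression.

    mean, log_var, target: (B, 1, H, W). The network predicts a per-pixel
    mean and log-variance; exp(-log_var) then acts as a confidence map.
    """
    return torch.mean(0.5 * torch.exp(-log_var) * (mean - target) ** 2
                      + 0.5 * log_var)

# Usage sketch: a model with two output channels (mean, log-variance).
pred = torch.randn(2, 2, 64, 64, requires_grad=True)     # stand-in network output
mean, log_var = pred[:, :1], pred[:, 1:]
target = torch.rand(2, 1, 64, 64)                        # e.g., building density
loss = heteroscedastic_loss(mean, log_var, target)
loss.backward()
confidence = torch.exp(-log_var).detach()                # high value = confident
```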

[CV-21] Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields CVPR2023

【Quick Read】: This paper addresses a weakness of event-assisted video frame interpolation (VFI): estimating bidirectional inter-frame motion fields from events alone, or via approximations, fails to account for the complex motion of real-world scenes. The key is EIF-BiOFNet, an event-based VFI framework that exploits the respective strengths of events and images to estimate inter-frame motion fields directly, without any approximation methods. An interactive attention-based frame synthesis network then efficiently fuses the complementary warping-based and synthesis-based features. Finally, the paper builds ERF-X170FPS, a large-scale event-based VFI dataset, to overcome the limitations of existing datasets.

Link: https://arxiv.org/abs/2502.13716
Authors: Taewoo Kim, Yujeong Chae, Hyun-Kurl Jang, Kuk-Jin Yoon
Affiliations: Korea Advanced Institute of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR2023 (Highlight)

Abstract:Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since the event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or approximations, which can not consider the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. In detail, our EIF-BiOFNet utilizes each valuable characteristic of the events and images for direct estimation of inter-frame motion fields without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets. Our project pages are available at: this https URL

[CV-22] Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention

【Quick Read】: This paper addresses the image corruptions in medical image classification that arise in multi-center studies from differences in imaging equipment across manufacturers. The key is the Medical Vision Transformer (MedViTV2), which integrates Kolmogorov-Arnold Network (KAN) layers into a transformer architecture for the first time, with an efficient KAN block that reduces computational load while improving the accuracy of the original MedViT. An enhanced Dilated Neighborhood Attention (DiNA) counters feature collapse and lets the model scale effectively, and a hierarchical hybrid strategy stacks Local Feature Perception and Global Feature Perception blocks efficiently to balance local and global feature perception and further boost performance.

Link: https://arxiv.org/abs/2502.13693
Authors: Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Affiliations: Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada; Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.

[CV-23] Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation

【Quick Read】: This paper addresses human affordance prediction in 2D scenes, i.e., predicting contextually valid novel poses that represent plausible actions, where existing methods struggle with the enormous number of possible pose and action variations. The key is a novel cross-attention mechanism that encodes scene context by mutually attending spatial feature maps from two different modalities, together with a disentanglement of the task into subtasks to reduce problem complexity: a conditional variational autoencoder (VAE) samples a probable location for a person, a classifier predicts a pose template from the local context, and two further VAEs sample the pose scale and deformation parameters, significantly improving human affordance injection into complex 2D scenes.

Link: https://arxiv.org/abs/2502.13637
Authors: Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein
Affiliations: University of Technology Sydney; Indian Institute of Technology, Kharagpur; Indian Statistical Institute, Kolkata
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages

Abstract:Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations make the problem challenging and non-trivial. However, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.

[CV-24] CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement

【Quick Read】: This paper addresses the robustness and accuracy limits of single-modality heart-rate (HR) estimation (RGB-only or radio-frequency-only), which suffers from lighting variation, motion artifacts, and skin-tone bias. The key is CardiacMamba, a multimodal RGB-RF fusion framework that exploits the complementary strengths of both modalities. It introduces a Temporal Difference Mamba Module (TDMM) that uses timing differences between frames to capture dynamic changes in RF signals, a bidirectional SSM for cross-modal alignment, and a Channel-wise Fast Fourier Transform (CFFT) that effectively extracts and refines the frequency characteristics of RGB and RF signals, improving HR estimation accuracy and periodicity detection.

Link: https://arxiv.org/abs/2502.13624
Authors: Zheng Wu, Yiping Xie, Bo Zhao, Jiguang He, Fei Luo, Ning Deng, Zitong Yu
Affiliations: School of Computing and Information Technology, Great Bay University, Dongguan, China; Dongguan Key Laboratory for Intelligence and Information Technology, Dongguan, China; Computer Vision Institute, School of Computer Science & Software Engineering, Shenzhen University, Shenzhen, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Heart rate (HR) estimation via remote photoplethysmography (rPPG) offers a non-invasive solution for health monitoring. However, traditional single-modality approaches (RGB or Radio Frequency (RF)) face challenges in balancing robustness and accuracy due to lighting variations, motion artifacts, and skin tone bias. In this paper, we propose CardiacMamba, a multimodal RGB-RF fusion framework that leverages the complementary strengths of both modalities. It introduces the Temporal Difference Mamba Module (TDMM) to capture dynamic changes in RF signals using timing differences between frames, enhancing the extraction of local and global features. Additionally, CardiacMamba employs a Bidirectional SSM for cross-modal alignment and a Channel-wise Fast Fourier Transform (CFFT) to effectively capture and refine the frequency domain characteristics of RGB and RF signals, ultimately improving heart rate estimation accuracy and periodicity detection. Extensive experiments on the EquiPleth dataset demonstrate state-of-the-art performance, achieving marked improvements in accuracy and robustness. CardiacMamba significantly mitigates skin tone bias, reducing performance disparities across demographic groups, and maintains resilience under missing-modality scenarios. By addressing critical challenges in fairness, adaptability, and precision, the framework advances rPPG technology toward reliable real-world deployment in healthcare. The codes are available at: this https URL.
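
Reading periodicity off a channel-wise FFT takes only a few lines; the sampling rate and heart-rate band below are assumptions for illustration, not CardiacMamba's exact configuration:

```python
import numpy as np

def estimate_hr(signal: np.ndarray, fs: float = 30.0) -> float:
    """Estimate heart rate (bpm) from a (channels, T) temporal signal by a
    channel-wise FFT: average spectra across channels and pick the peak
    inside a plausible heart-rate band (0.7-3 Hz, i.e. 42-180 bpm)."""
    signal = signal - signal.mean(axis=1, keepdims=True)       # remove DC
    spectrum = np.abs(np.fft.rfft(signal, axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(signal.shape[1], d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 3.0)
    return float(freqs[band][np.argmax(spectrum[band])] * 60.0)

t = np.arange(0, 10, 1 / 30.0)                 # 10 s at 30 Hz
x = np.stack([np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)
              for _ in range(3)])              # 1.2 Hz pulse -> 72 bpm
print(round(estimate_hr(x)))                   # ~72
```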

[CV-25] Toward Robust Non-Transferable Learning: A Survey and Benchmark

【Quick Read】: This paper addresses the risk that a deep learning model's ability to generalize to unintended data can be exploited by malicious adversaries. The key solution is non-transferable learning (NTL), which reshapes a model's generalization abilities, together with NTLBench, the first comprehensive benchmark for evaluating NTL performance and robustness in a unified framework. The paper also emphasizes the robustness of NTL mechanisms against various attacks and shows experimentally that existing NTL methods fall short on robustness.

Link: https://arxiv.org/abs/2502.13593
Authors: Ziming Hong, Yongli Xiang, Tongliang Liu
Affiliations: Sydney AI Centre, The University of Sydney
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Over the past decades, researchers have primarily focused on improving the generalization abilities of models, with limited attention given to regulating such generalization. However, the ability of models to generalize to unintended data (e.g., harmful or unauthorized data) can be exploited by malicious adversaries in unforeseen ways, potentially resulting in violations of model ethics. Non-transferable learning (NTL), a task aimed at reshaping the generalization abilities of deep learning models, was proposed to address these challenges. While numerous methods have been proposed in this field, a comprehensive review of existing progress and a thorough analysis of current limitations remain lacking. In this paper, we bridge this gap by presenting the first comprehensive survey on NTL and introducing NTLBench, the first benchmark to evaluate NTL performance and robustness within a unified framework. Specifically, we first introduce the task settings, general framework, and criteria of NTL, followed by a summary of NTL approaches. Furthermore, we emphasize the often-overlooked issue of robustness against various attacks that can destroy the non-transferable mechanism established by NTL. Experiments conducted via NTLBench verify the limitations of existing NTL methods in robustness. Finally, we discuss the practical applications of NTL, along with its future directions and associated challenges.

[CV-26] MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis

【Quick Read】: This paper aims at efficient segmentation of three-dimensional (3D) medical images. Traditional methods such as convolutional neural networks (CNNs) and vision transformers (ViTs) face significant computational challenges, motivating architectural improvements. The key to the solution is the MobileViM architecture, which introduces a new dimension-independent mechanism and a dual-direction traversing approach within a vision-Mamba-based framework, and adds a cross-scale bridging technique to improve efficiency and accuracy across different medical imaging modalities. With these enhancements, MobileViM achieves segmentation speeds of over 90 frames per second (FPS) on a single GPU (NVIDIA RTX 4090), more than 24 FPS faster than existing state-of-the-art deep learning models. Evaluations show that MobileViM delivers excellent performance on multiple datasets, with Dice similarity scores significantly surpassing existing models.

Link: https://arxiv.org/abs/2502.13524
Authors: Wei Dai, Steven Wang, Jun Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: The code is accessible through: this https URL

Abstract:Efficient evaluation of three-dimensional (3D) medical images is crucial for diagnostic and therapeutic practices in healthcare. Recent years have seen a substantial uptake in applying deep learning and computer vision to analyse and interpret medical images. Traditional approaches, such as convolutional neural networks (CNNs) and vision transformers (ViTs), face significant computational challenges, prompting the need for architectural advancements. Recent efforts have led to the introduction of novel architectures like the "Mamba" model as alternative solutions to traditional CNNs or ViTs. The Mamba model excels in the linear processing of one-dimensional data with low computational demands. However, Mamba's potential for 3D medical image analysis remains underexplored and could face significant computational challenges as the dimension increases. This manuscript presents MobileViM, a streamlined architecture for efficient segmentation of 3D medical images. In the MobileViM network, we invent a new dimension-independent mechanism and a dual-direction traversing approach to incorporate with a vision-Mamba-based framework. MobileViM also features a cross-scale bridging technique to improve efficiency and accuracy across various medical imaging modalities. With these enhancements, MobileViM achieves segmentation speeds exceeding 90 frames per second (FPS) on a single graphics processing unit (i.e., NVIDIA RTX 4090). This performance is over 24 FPS faster than the state-of-the-art deep learning models for processing 3D images with the same computational resources. In addition, experimental evaluations demonstrate that MobileViM delivers superior performance, with Dice similarity scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for PENGWIN, BraTS2024, ATLAS, and Toothfairy2 datasets, respectively, which significantly surpasses existing models.

[CV-27] Improving Collision-Free Success Rate For Object Goal Visual Navigation Via Two-Stage Training With Collision Prediction

【Quick Read】: This paper addresses the fact that collisions during object-goal visual navigation remain unresolved in end-to-end navigation models based on deep reinforcement learning. The key to the solution is the new notion of collision-free success, which evaluates a navigation model's ability to find a collision-free path to the target, together with a two-stage training method that incorporates a collision prediction module. In the first stage, the collision prediction module supervises the agent's collision states so it learns to predict possible collisions; in the second stage, the trained collision predictor lets the agent learn to navigate to the target object without collisions. Experiments show that the proposed method greatly improves the collision-free success rate of different navigation models and outperforms other comparable collision-avoidance methods.

Link: https://arxiv.org/abs/2502.13498
Authors: Shiwei Lian, Feitian Zhang
Affiliations: Peking University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The object goal visual navigation is the task of navigating to a specific target object using egocentric visual observations. Recent end-to-end navigation models based on deep reinforcement learning have achieved remarkable performance in finding and reaching target objects. However, the collision problem of these models during navigation remains unresolved, since the collision is typically neglected when evaluating the success. Although incorporating a negative reward for collision during training appears straightforward, it results in a more conservative policy, thereby limiting the agent’s ability to reach targets. In addition, many of these models utilize only RGB observations, further increasing the difficulty of collision avoidance without depth information. To address these limitations, a new concept – collision-free success is introduced to evaluate the ability of navigation models to find a collision-free path towards the target object. A two-stage training method with collision prediction is proposed to improve the collision-free success rate of the existing navigation models using RGB observations. In the first training stage, the collision prediction module supervises the agent’s collision states during exploration to learn to predict the possible collision. In the second stage, leveraging the trained collision prediction, the agent learns to navigate to the target without collision. The experimental results in the AI2-THOR environment demonstrate that the proposed method greatly improves the collision-free success rate of different navigation models and outperforms other comparable collision-avoidance methods.
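
For intuition, the two-stage recipe can be sketched as below: a collision-prediction head is first supervised on observed collision states, and the trained predictor is then used to steer the policy away from risky actions. All module names, shapes, and the action-masking rule are hypothetical stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical modules: a tiny policy and a collision-prediction head
# sharing a visual encoder. Shapes and losses are illustrative only.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
policy_head = nn.Linear(128, 6)       # 6 discrete navigation actions
collision_head = nn.Linear(128, 1)    # predicts collision probability

bce = nn.BCEWithLogitsLoss()
opt_stage1 = torch.optim.Adam(
    list(encoder.parameters()) + list(collision_head.parameters()), lr=1e-4)

def stage1_step(rgb_batch, collided):
    """Stage 1: supervise the collision predictor with observed collision states."""
    feats = encoder(rgb_batch)
    loss = bce(collision_head(feats).squeeze(-1), collided)
    opt_stage1.zero_grad(); loss.backward(); opt_stage1.step()
    return loss.item()

def stage2_action(rgb_obs, threshold=0.5):
    """Stage 2: use the trained predictor to avoid risky actions. Here we
    simply mask the 'move forward' action (index 0) when a collision is
    predicted, a stand-in for the paper's learned avoidance behaviour."""
    with torch.no_grad():
        feats = encoder(rgb_obs)
        logits = policy_head(feats)
        if torch.sigmoid(collision_head(feats)).item() > threshold:
            logits[0, 0] = float("-inf")
        return int(logits.argmax(dim=-1))

# Example: one supervised step on random data, then an action query.
rgb = torch.rand(8, 3, 64, 64)
labels = torch.randint(0, 2, (8,)).float()
stage1_step(rgb, labels)
print(stage2_action(torch.rand(1, 3, 64, 64)))
```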

[CV-28] 2.5D U-Net with Depth Reduction for 3D CryoET Object Identification

【Quick Read】: This paper tackles the automatic identification of protein complexes in tomograms captured by cryo-electron tomography (cryoET). The key to the solution is a heatmap-based keypoint detection approach that uses an ensemble of two different types of 2.5D U-Net models with depth reduction. Despite its highly unified and simple architecture, the method placed 4th in the CZII - CryoET Object Identification competition, demonstrating its effectiveness.

Link: https://arxiv.org/abs/2502.13484
Authors: Yusuke Uchida, Takaaki Fukui
Affiliations: GO Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cryo-electron tomography (cryoET) is a crucial technique for unveiling the structure of protein complexes. Automatically analyzing tomograms captured by cryoET is an essential step toward understanding cellular structures. In this paper, we introduce the 4th place solution from the CZII - CryoET Object Identification competition, which was organized to advance the development of automated tomogram analysis techniques. Our solution adopted a heatmap-based keypoint detection approach, utilizing an ensemble of two different types of 2.5D U-Net models with depth reduction. Despite its highly unified and simple architecture, our method achieved 4th place, demonstrating its effectiveness.
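
The "2.5D" idea, feeding a 2D U-Net a small stack of neighbouring tomogram slices as input channels, can be sketched as follows; the window size and the downstream network are placeholders rather than the authors' exact configuration.

```python
import numpy as np

def make_25d_inputs(volume: np.ndarray, context: int = 2) -> np.ndarray:
    """Turn a (D, H, W) tomogram into per-slice 2.5D samples of shape
    (D, 2*context+1, H, W): each slice is stacked with its neighbours,
    clamping at the volume borders. A 2D U-Net then treats the depth
    window as input channels and predicts a keypoint heatmap per slice."""
    depth = volume.shape[0]
    samples = []
    for z in range(depth):
        idx = np.clip(np.arange(z - context, z + context + 1), 0, depth - 1)
        samples.append(volume[idx])
    return np.stack(samples)

vol = np.random.rand(32, 128, 128).astype(np.float32)
x = make_25d_inputs(vol)          # (32, 5, 128, 128)
print(x.shape)
```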

[CV-29] Semi-supervised classification of bird vocalizations

【Quick Read】: This paper addresses two main challenges in bird population monitoring: the need for large expert-labeled training datasets and the difficulty of detecting time-overlapping calls. The key to the solution is a semi-supervised acoustic bird detector that can identify time-overlapping calls when they are separated in frequency and that can be trained with only a few labeled samples. The detector achieves an F0.5 score of 0.701 across 315 classes from 110 bird species, clearly outperforming the existing BirdNET classifier, especially when labeled data is limited.

Link: https://arxiv.org/abs/2502.13440
Authors: Simen Hexeberg, Mandar Chitre, Matthias Hoffmann-Kuhnt, Bing Wen Low
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Changes in bird populations can indicate broader changes in ecosystems, making birds one of the most important animal groups to monitor. Combining machine learning and passive acoustics enables continuous monitoring over extended periods without direct human involvement. However, most existing techniques require extensive expert-labeled datasets for training and cannot easily detect time-overlapping calls in busy soundscapes. We propose a semi-supervised acoustic bird detector designed to allow both the detection of time-overlapping calls (when separated in frequency) and the use of few labeled training samples. The classifier is trained and evaluated on a combination of community-recorded open-source data and long-duration soundscape recordings from Singapore. It achieves a mean F0.5 score of 0.701 across 315 classes from 110 bird species on a hold-out test set, with an average of 11 labeled training samples per class. It outperforms the state-of-the-art BirdNET classifier on a test set of 103 bird species despite significantly fewer labeled training samples. The detector is further tested on 144 microphone-hours of continuous soundscape data. The rich soundscape in Singapore makes suppression of false positives a challenge on raw, continuous data streams. Nevertheless, we demonstrate that achieving high precision in such environments with minimal labeled training data is possible.

[CV-30] JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework

【Quick Read】: This paper addresses two main problems: the scarcity of high-quality remote sensing change detection (CD) datasets, especially sub-meter, all-inclusive ones, and the difficulty of achieving consistent, satisfactory detection results across images with varying change areas. The key to the solution is the JL1-CD dataset together with a multi-teacher knowledge distillation (MTKD) framework; MTKD significantly improves the performance of CD models across different network architectures and parameter sizes, achieving new state-of-the-art results.

Link: https://arxiv.org/abs/2502.13407
Authors: Ziyuan Liu, Ruifei Zhu, Long Gao, Yuanxiu Zhou, Jingyu Ma, Yuantao Gu
Affiliations: Department of Electronic Engineering, Beijing National Research Center for Information Science and Technology, Tsinghua University, China; Chang Guang Satellite Technology Co., Ltd. (CGSTL), China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures. Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS)

Abstract:Deep learning has achieved significant success in the field of remote sensing image change detection (CD), yet two major challenges remain: the scarcity of sub-meter, all-inclusive open-source CD datasets, and the difficulty of achieving consistent and satisfactory detection results across images with varying change areas. To address these issues, we introduce the JL1-CD dataset, which contains 5,000 pairs of 512 x 512 pixel images with a resolution of 0.5 to 0.75 meters. Additionally, we propose a multi-teacher knowledge distillation (MTKD) framework for CD. Experimental results on the JL1-CD and SYSU-CD datasets demonstrate that the MTKD framework significantly improves the performance of CD models with various network architectures and parameter sizes, achieving new state-of-the-art results. The code is available at this https URL.
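
A generic multi-teacher distillation objective for change detection might look like the sketch below, pulling the student toward a weighted mixture of teacher distributions on top of the usual supervised loss. The weighting and loss choices are assumptions; the paper's MTKD design may differ.

```python
import torch
import torch.nn.functional as F

def mtkd_loss(student_logits, teacher_logits_list, target, alpha=0.5, weights=None):
    """Supervised cross-entropy plus KL divergence to a weighted average of
    teacher distributions. student_logits: (B, C, H, W); teacher_logits_list:
    list of same-shape tensors; target: (B, H, W) integer change labels."""
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
    ce = F.cross_entropy(student_logits, target)
    teacher_prob = sum(w * F.softmax(t, dim=1)
                       for w, t in zip(weights, teacher_logits_list))
    kl = F.kl_div(F.log_softmax(student_logits, dim=1), teacher_prob,
                  reduction="batchmean")
    return (1 - alpha) * ce + alpha * kl

# Toy example with two teachers on a 2-class (change / no-change) map.
s = torch.randn(2, 2, 16, 16)
t1, t2 = torch.randn_like(s), torch.randn_like(s)
y = torch.randint(0, 2, (2, 16, 16))
print(mtkd_loss(s, [t1, t2], y).item())
```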

[CV-31] MaizeEar-SAM: Zero-Shot Maize Ear Phenotyping

【Quick Read】: This paper addresses the problem of quantifying variation in maize yield component traits for plant genetics research, plant breeding, and improved farming practices. Traditional manual measurement is time-consuming and limits large-scale data collection, while automation attempts based on image processing and deep learning face high annotation costs and uncertain generalizability. The key to the solution is zero-shot, annotation-free maize kernel segmentation with a large vision model: the open-source Segment Anything Model (SAM) segments kernels in RGB images of maize ears, and a graph-based algorithm computes the number of kernels per row. The approach successfully identifies kernels per row across diverse maize ears, showing the potential of foundation vision models combined with image processing to improve automation and reduce subjectivity in agronomic data collection. All code is open-sourced to make these affordable phenotyping methods accessible to everyone.

Link: https://arxiv.org/abs/2502.13399
Authors: Hossein Zaremehrjerdi, Lisa Coffey, Talukder Jubery, Huyu Liu, Jon Turkus, Kyle Linders, James C. Schnable, Patrick S. Schnable, Baskar Ganapathysubramanian
Affiliations: Iowa State University; Translational AI Research and Education Center, Iowa State University; Department of Electrical and Computer Engineering, Iowa State University; Department of Agronomy, University of Nebraska-Lincoln; Department of Mechanical Engineering, Iowa State University; Plant Science Institute, Iowa State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quantifying the variation in yield component traits of maize (Zea mays L.), which together determine the overall productivity of this globally important crop, plays a critical role in plant genetics research, plant breeding, and the development of improved farming practices. Grain yield per acre is calculated by multiplying the number of plants per acre, ears per plant, number of kernels per ear, and the average kernel weight. The number of kernels per ear is determined by the number of kernel rows per ear multiplied by the number of kernels per row. Traditional manual methods for measuring these two traits are time-consuming, limiting large-scale data collection. Recent automation efforts using image processing and deep learning encounter challenges such as high annotation costs and uncertain generalizability. We tackle these issues by exploring Large Vision Models for zero-shot, annotation-free maize kernel segmentation. By using an open-source large vision model, the Segment Anything Model (SAM), we segment individual kernels in RGB images of maize ears and apply a graph-based algorithm to calculate the number of kernels per row. Our approach successfully identifies the number of kernels per row across a wide range of maize ears, showing the potential of zero-shot learning with foundation vision models combined with image processing techniques to improve automation and reduce subjectivity in agronomic data collection. All our code is open-sourced to make these affordable phenotyping methods accessible to everyone.
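
One plausible reading of the graph-based row-counting step: treat each SAM kernel mask's centroid as a node, connect centroids that lie close together in the across-ear direction, and count connected components as rows. The adjacency rule below is a guessed stand-in for the paper's algorithm.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def kernels_per_row(centroids: np.ndarray, row_tol: float = 6.0):
    """centroids: (N, 2) array of (x_along_ear, y_across_ear) kernel centres.
    Two kernels join the same row when their across-ear coordinates are
    within row_tol pixels; rows are the connected components of that graph."""
    n = len(centroids)
    rows_i, rows_j = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(centroids[i, 1] - centroids[j, 1]) < row_tol:
                rows_i.append(i); rows_j.append(j)
    adj = csr_matrix((np.ones(len(rows_i)), (rows_i, rows_j)), shape=(n, n))
    n_rows, labels = connected_components(adj, directed=False)
    counts = np.bincount(labels)
    return n_rows, counts          # number of rows, kernels in each row

pts = np.array([[10, 5], [30, 6], [50, 4], [10, 20], [30, 21]], float)
print(kernels_per_row(pts))        # 2 rows: 3 and 2 kernels
```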

[CV-32] SNN-Driven Multimodal Human Action Recognition via Event Camera and Skeleton Data Fusion

【Quick Read】: This paper addresses the limits of multimodal human action recognition based on RGB-skeleton fusion in resource-constrained settings, namely high computational complexity, excessive memory consumption, and substantial energy demands when implemented with artificial neural networks (ANNs). The key to the solution is a novel Spiking Neural Network (SNN)-driven framework for multimodal human action recognition using event camera and skeleton data, built on two innovations: (1) a multimodal SNN architecture with a distinct backbone per modality, an SNN-based Mamba for event camera data and a Spiking Graph Convolutional Network (SGN) for skeleton data, combined with a spiking semantic extraction module to capture deep semantic representations; and (2) an SNN-based discretized information bottleneck mechanism for modality fusion that balances preserving modality-specific semantics with efficient information compression. Experiments show superior recognition accuracy and energy efficiency, offering a promising solution for practical applications.

Link: https://arxiv.org/abs/2502.13385
Authors: Naichuan Zheng, Hailun Xia
Affiliations: Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal human action recognition based on RGB and skeleton data fusion, while effective, is constrained by significant limitations such as high computational complexity, excessive memory consumption, and substantial energy demands, particularly when implemented with Artificial Neural Networks (ANN). These limitations restrict its applicability in resource-constrained scenarios. To address these challenges, we propose a novel Spiking Neural Network (SNN)-driven framework for multimodal human action recognition, utilizing event camera and skeleton data. Our framework is centered on two key innovations: (1) a novel multimodal SNN architecture that employs distinct backbone networks for each modality-an SNN-based Mamba for event camera data and a Spiking Graph Convolutional Network (SGN) for skeleton data-combined with a spiking semantic extraction module to capture deep semantic representations; and (2) a pioneering SNN-based discretized information bottleneck mechanism for modality fusion, which effectively balances the preservation of modality-specific semantics with efficient information compression. To validate our approach, we propose a novel method for constructing a multimodal dataset that integrates event camera and skeleton data, enabling comprehensive evaluation. Extensive experiments demonstrate that our method achieves superior performance in both recognition accuracy and energy efficiency, offering a promising solution for practical applications.

[CV-33] MoVer: Motion Verification for Motion Graphics Animations

【Quick Read】: This paper addresses the fact that large vision-language models, when generating motion graphics animations from text prompts, regularly fail to include all the spatio-temporal properties described in the prompt. The key to the solution is MoVer, a motion verification domain-specific language (DSL) based on first-order logic that checks spatio-temporal properties of a motion graphics animation. The properties are implemented as predicates, with an execution engine that applies a MoVer program to any SVG-based motion graphics animation. The paper further shows how MoVer fits into an LLM-based synthesis-and-verification pipeline that iteratively refines animations: without iteration the pipeline generates a correct animation for 58.8% of test prompts, rising to 93.6% with up to 50 correction iterations.

Link: https://arxiv.org/abs/2502.13372
Authors: Jiaju Ma, Maneesh Agrawala
Affiliations: Stanford University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While large vision-language models can generate motion graphics animations from text prompts, they regularly fail to include all of spatio-temporal properties described in the prompt. We introduce MoVer, a motion verification DSL based on first-order logic that can check spatio-temporal properties of a motion graphics animation. We identify a general set of such properties that people commonly use to describe animations (e.g., the direction and timing of motions, the relative positioning of objects, etc.). We implement these properties as predicates in MoVer and provide an execution engine that can apply a MoVer program to any input SVG-based motion graphics animation. We then demonstrate how MoVer can be used in an LLM-based synthesis and verification pipeline for iteratively refining motion graphics animations. Given a text prompt, our pipeline synthesizes a motion graphics animation and a corresponding MoVer program. Executing the verification program on the animation yields a report of the predicates that failed and the report can be automatically fed back to LLM to iteratively correct the animation. To evaluate our pipeline, we build a synthetic dataset of 5600 text prompts paired with ground truth MoVer verification programs. We find that while our LLM-based pipeline is able to automatically generate a correct motion graphics animation for 58.8% of the test prompts without any iteration, this number raises to 93.6% with up to 50 correction iterations. Project website: this https URL
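
The synthesis-and-verification loop reads roughly like the sketch below: generate an animation and a MoVer program from the prompt, run the verifier, and feed failed predicates back to the LLM until all checks pass or the iteration budget runs out. Every helper name here (generate_animation, synthesize_mover_program, run_mover) is a placeholder, not the pipeline's actual API.

```python
# A minimal sketch of the iterative refinement loop, assuming hypothetical
# helpers: generate_animation(prompt, feedback) -> SVG animation,
# synthesize_mover_program(prompt) -> verification program, and
# run_mover(program, animation) -> list of failed predicate descriptions.

def refine_animation(prompt, generate_animation, synthesize_mover_program,
                     run_mover, max_iters=50):
    program = synthesize_mover_program(prompt)   # MoVer predicates for the prompt
    animation = generate_animation(prompt, feedback=None)
    for _ in range(max_iters):
        failed = run_mover(program, animation)   # e.g. ["circle never moves right"]
        if not failed:
            return animation                     # all spatio-temporal checks pass
        # Feed the verifier's report back so the LLM can correct the animation.
        animation = generate_animation(prompt, feedback=failed)
    return animation                             # best effort after the budget
```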

[CV-34] Pretrained Image-Text Models are Secretly Video Captioners NAACL2025

【Quick Read】: This paper addresses the high computational cost and complexity of building video captioning models. The key to the solution is to post-train an image-based model into a video captioner using only 6,000 video-text pairs and simple frame concatenation, a fraction of the data required by other methods, while maintaining high performance and ranking near the top of major benchmarks. The study highlights that, from a resource-optimization perspective, a lightweight image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.

Link: https://arxiv.org/abs/2502.13363
Authors: Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, Soroush Vosoughi
Affiliations: Department of Computer Science, Dartmouth College
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). The first two authors contributed equally and were listed in random order

Abstract:Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top tier performance on major benchmarks, ranking 2nd on MSRVTT and MSVD, and 3rd on VATEX. We transform it into a competitive video captioner by post training a typical image captioning model BLIP2 with only 6,000 video text pairs and simply concatenating frames (significantly fewer data than other methods), which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.
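
"Simply concatenating frames" can be as small as tiling sampled frames into one image that the image captioner already accepts. The grid layout below is one plausible reading of that idea, not necessarily the paper's exact arrangement.

```python
from PIL import Image

def concat_frames(frames, cols=3):
    """Tile equally sized PIL frames into a single grid image that an
    image captioning model such as BLIP-2 can consume unchanged."""
    w, h = frames[0].size
    rows = (len(frames) + cols - 1) // cols
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        grid.paste(frame, ((i % cols) * w, (i // cols) * h))
    return grid

# Example: six uniformly sampled 224x224 frames -> one 672x448 image.
frames = [Image.new("RGB", (224, 224), (i * 40, 0, 0)) for i in range(6)]
print(concat_frames(frames).size)
```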

[CV-35] Geometry-Aware Diffusion Models for Multiview Scene Inpainting

【Quick Read】: This paper addresses 3D scene inpainting, namely generating plausible, geometrically consistent cross-view completions when parts of an input image set are masked out. Existing methods typically combine generative models with a 3D radiance field to fuse information across viewpoints, but fusing inconsistent cross-view images often produces blurry results. The key to the solution is a geometry-aware conditional generative model that fuses cross-view information in a learned space, inpainting multi-view consistent images from both geometric and appearance cues in reference images and thereby avoiding blurry inpaintings. A unique advantage over prior methods is few-view inpainting: the approach works with a limited number of views, whereas earlier methods need relatively large image sets for their 3D model fitting step.

Link: https://arxiv.org/abs/2502.13335
Authors: Ahmad Salimi, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Konstantinos G. Derpanis
Affiliations: York University; Vector Institute for AI; Samsung AI Centre Toronto; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page is available at this https URL

Abstract:In this paper, we focus on 3D scene inpainting, where parts of an input image set, captured from different viewpoints, are masked out. The main challenge lies in generating plausible image completions that are geometrically consistent across views. Most recent work addresses this challenge by combining generative models with a 3D radiance field to fuse information across viewpoints. However, a major drawback of these methods is that they often produce blurry images due to the fusion of inconsistent cross-view images. To avoid blurry inpaintings, we eschew the use of an explicit or implicit radiance field altogether and instead fuse cross-view information in a learned space. In particular, we introduce a geometry-aware conditional generative model, capable of inpainting multi-view consistent images based on both geometric and appearance cues from reference images. A key advantage of our approach over existing methods is its unique ability to inpaint masked scenes with a limited number of views (i.e., few-view inpainting), whereas previous methods require relatively large image sets for their 3D model fitting step. Empirically, we evaluate and compare our scene-centric inpainting method on two datasets, SPIn-NeRF and NeRFiller, which contain images captured at narrow and wide baselines, respectively, and achieve state-of-the-art 3D inpainting performance on both. Additionally, we demonstrate the efficacy of our approach in the few-view setting compared to prior methods.

[CV-36] MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching WWW ATC

【Quick Read】: This paper addresses the limited control that input text alone provides over precise object movements and camera framing when generating videos with text-to-video (T2V) diffusion models. The key to the solution is MotionMatcher, a motion customization framework that fine-tunes a pre-trained T2V diffusion model at the feature level: instead of pixel-level objectives, it compares high-level spatio-temporal motion features, enabling precise motion learning while avoiding the content leakage and inaccurate capture of complex motion seen in existing methods.

Link: https://arxiv.org/abs/2502.13234
Authors: Yen-Siang Wu, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
Affiliations: National Taiwan University; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over the precise objects movements and camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods choose to fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such strategy suffer from content leakage from the reference video, and they cannot capture complex motion accurately. To address this issue, we propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level. Instead of using pixel-level objectives, MotionMatcher compares high-level, spatio-temporal motion features to fine-tune diffusion models, ensuring precise motion learning. For the sake of memory efficiency and accessibility, we utilize a pre-trained T2V diffusion model, which contains considerable prior knowledge about video motion, to compute these motion features. In our experiments, we demonstrate state-of-the-art motion customization performances, validating the design of our framework.

[CV-37] GS-QA: Comprehensive Quality Assessment Benchmark for Gaussian Splatting View Synthesis

【Quick Read】: This paper evaluates the subjective quality of static content generated with Gaussian Splatting (GS) and analyzes the performance of 18 objective quality metrics for GS view synthesis. The key contribution is a subjective quality assessment study comparing synthesized videos from several state-of-the-art static GS methods, combined with an analysis of how well the objective metrics align with human perception, yielding insights into their strengths, limitations, and applicability.

Link: https://arxiv.org/abs/2502.13196
Authors: Pedro Martin, António Rodrigues, João Ascenso, Maria Paula Queluz
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gaussian Splatting (GS) offers a promising alternative to Neural Radiance Fields (NeRF) for real-time 3D scene rendering. Using a set of 3D Gaussians to represent complex geometry and appearance, GS achieves faster rendering times and reduced memory consumption compared to the neural network approach used in NeRF. However, quality assessment of GS-generated static content is not yet explored in-depth. This paper describes a subjective quality assessment study that aims to evaluate synthesized videos obtained with several static GS state-of-the-art methods. The methods were applied to diverse visual scenes, covering both 360-degree and forward-facing (FF) camera trajectories. Moreover, the performance of 18 objective quality metrics was analyzed using the scores resulting from the subjective study, providing insights into their strengths, limitations, and alignment with human perception. All videos and scores are made available providing a comprehensive database that can be used as benchmark on GS view synthesis and objective quality metrics.

[CV-38] Generative Topology Optimization: Exploring Diverse Solutions in Structural Design

【Quick Read】: This paper addresses the limitation that traditional topology optimization (TO) produces only a single near-optimal solution, restricting the exploration of alternative designs. The key to the solution is Generative Topology Optimization (GenTO), a data-free method that trains a neural network to generate structurally compliant shapes and explores diverse solutions through an explicit diversity constraint. The network is trained with a solver in the loop, optimizing the material distribution in each iteration, and the trained model produces diverse shapes that closely adhere to the design requirements.

Link: https://arxiv.org/abs/2502.13174
Authors: Andreas Radler, Eric Volkmann, Johannes Brandstetter, Arturs Berzins
Affiliations: LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; Emmi AI GmbH, Linz, Austria
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures

Abstract:Topology optimization (TO) is a family of computational methods that derive near-optimal geometries from formal problem descriptions. Despite their success, established TO methods are limited to generating single solutions, restricting the exploration of alternative designs. To address this limitation, we introduce Generative Topology Optimization (GenTO) - a data-free method that trains a neural network to generate structurally compliant shapes and explores diverse solutions through an explicit diversity constraint. The network is trained with a solver-in-the-loop, optimizing the material distribution in each iteration. The trained model produces diverse shapes that closely adhere to the design requirements. We validate GenTO on 2D and 3D TO problems. Our results demonstrate that GenTO produces more diverse solutions than any prior method while maintaining near-optimality and being an order of magnitude faster due to inherent parallelism. These findings open new avenues for engineering and design, offering enhanced flexibility and innovation in structural optimization.
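
The explicit diversity constraint can be pictured as a penalty that pushes the density fields produced for different latent codes apart while each shape still minimizes compliance. The pairwise-distance penalty below is only an illustration; GenTO's actual formulation and its solver-in-the-loop coupling are more involved.

```python
import torch

def diversity_penalty(densities: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """densities: (K, N) material densities for K candidate shapes over N cells.
    Penalize any pair of shapes whose mean absolute difference falls below
    `margin`, encouraging the generator to spread its solutions apart."""
    k = densities.shape[0]
    penalty = densities.new_zeros(())
    for i in range(k):
        for j in range(i + 1, k):
            dist = (densities[i] - densities[j]).abs().mean()
            penalty = penalty + torch.clamp(margin - dist, min=0.0)
    return penalty

# Toy use: total loss = compliance (from the physics solver) + weighted diversity.
dens = torch.rand(4, 1000, requires_grad=True)
compliance = dens.sum()                      # stand-in for solver feedback
loss = compliance + 10.0 * diversity_penalty(dens)
loss.backward()
```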

[CV-39] Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model

【Quick Read】: This paper addresses the difficulty that traditional syntactic communication methods based on Shannon's theory have in meeting the requirements of 6G immersive communication, especially under challenging transmission conditions. The key to the solution is a scalable generative video semantic communication framework that extracts and transmits high-level semantic information for high-quality video reconstruction: the transmitter extracts a description and other condition signals (e.g., first frame, sketches) as text and structural semantics, and the receiver uses diffusion-based generative AI large models to fuse the multimodal semantics and reconstruct the video. Simulations show that at an ultra-low channel bandwidth ratio the scheme effectively captures semantic information and reconstructs videos aligned with human perception across different signal-to-noise ratios.

Link: https://arxiv.org/abs/2502.13838
Authors: Hang Yin, Li Qiao, Yu Ma, Shuo Sun, Kan Li, Zhen Gao, Dusit Niyato
Affiliations: Beijing Institute of Technology; Nanyang Technological University
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Image and Video Processing (eess.IV)
Comments:

Abstract:Despite significant advancements in traditional syntactic communications based on Shannon's theory, these methods struggle to meet the requirements of 6G immersive communications, especially under challenging transmission conditions. With the development of generative artificial intelligence (GenAI), progress has been made in reconstructing videos using high-level semantic information. In this paper, we propose a scalable generative video semantic communication framework that extracts and transmits semantic information to achieve high-quality video reconstruction. Specifically, at the transmitter, description and other condition signals (e.g., first frame, sketches, etc.) are extracted from the source video, functioning as text and structural semantics, respectively. At the receiver, the diffusion-based GenAI large models are utilized to fuse the semantics of the multiple modalities for reconstructing the video. Simulation results demonstrate that, at an ultra-low channel bandwidth ratio (CBR), our scheme effectively captures semantic information to reconstruct videos aligned with human perception under different signal-to-noise ratios. Notably, the proposed "First Frame+Desc." scheme consistently achieves CLIP score exceeding 0.92 at CBR = 0.0057 for SNR > 0 dB. This demonstrates its robust performance even under low SNR conditions.

[CV-40] MGFI-Net: A Multi-Grained Feature Integration Network for Enhanced Medical Image Segmentation

【Quick Read】: This paper addresses accuracy challenges in medical image segmentation, particularly in the presence of noise, low contrast, or complex anatomical structures. The key to the solution is the Multi-Grained Feature Integration Network (MGFI-Net), which contains two dedicated modules: a multi-grained feature extraction module that exploits hierarchical relationships between feature scales to selectively focus on the most relevant information, and an edge enhancement module that effectively retains and integrates boundary information to refine segmentation results. These improvements significantly raise segmentation accuracy while achieving superior time efficiency.

Link: https://arxiv.org/abs/2502.13808
Authors: Yucheng Zeng
Affiliations: Shanghai University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical image segmentation plays a crucial role in various clinical applications. A major challenge in medical image segmentation is achieving accurate delineation of regions of interest in the presence of noise, low contrast, or complex anatomical structures. Existing segmentation models often neglect the integration of multi-grained information and fail to preserve edge details, which are critical for precise segmentation. To address these challenges, we propose a novel image semantic segmentation model called the Multi-Grained Feature Integration Network (MGFI-Net). Our MGFI-Net is designed with two dedicated modules to tackle these issues. First, to enhance segmentation accuracy, we introduce a Multi-Grained Feature Extraction Module, which leverages hierarchical relationships between different feature scales to selectively focus on the most relevant information. Second, to preserve edge details, we incorporate an Edge Enhancement Module that effectively retains and integrates boundary information to refine segmentation results. Extensive experiments demonstrate that MGFI-Net not only outperforms state-of-the-art methods in terms of segmentation accuracy but also achieves superior time efficiency, establishing it as a leading solution for real-time medical image segmentation.

[CV-41] Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management

【Quick Read】: This paper addresses how to identify and quantify eye-shape abnormalities in pathological myopia with low-cost, routinely available methods. Current eye-shape-based biomarkers require magnetic resonance imaging (MRI), which is costly and unrealistic in everyday ophthalmology clinics. The key to the solution is Fundus2Globe, an AI framework that synthesizes patient-specific 3D eye globe models from ubiquitous 2D color fundus photographs (CFPs) and routine metadata (axial length, spherical equivalent), bypassing the dependency on MRI. By integrating a 3D morphable eye model (encoding biomechanical shape priors) with a latent diffusion model, the approach efficiently reconstructs posterior ocular anatomy with submillimeter accuracy.

Link: https://arxiv.org/abs/2502.13182
Authors: Danli Shi, Bowen Liu, Zhen Tian, Yue Wu, Jiancheng Yang, Ruoyu Chen, Bo Yang, Ou Xiao, Mingguang He
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 24 pages, 6 figures

Abstract:Myopia, projected to affect 50% population globally by 2050, is a leading cause of vision loss. Eyes with pathological myopia exhibit distinctive shape distributions, which are closely linked to the progression of vision-threatening complications. Recent understanding of eye-shape-based biomarkers requires magnetic resonance imaging (MRI), however, it is costly and unrealistic in routine ophthalmology clinics. We present Fundus2Globe, the first AI framework that synthesizes patient-specific 3D eye globes from ubiquitous 2D color fundus photographs (CFPs) and routine metadata (axial length, spherical equivalent), bypassing MRI dependency. By integrating a 3D morphable eye model (encoding biomechanical shape priors) with a latent diffusion model, our approach achieves submillimeter accuracy in reconstructing posterior ocular anatomy efficiently. Fundus2Globe uniquely quantifies how vision-threatening lesions (e.g., staphylomas) in CFPs correlate with MRI-validated 3D shape abnormalities, enabling clinicians to simulate posterior segment changes in response to refractive shifts. External validation demonstrates its robust generation performance, ensuring fairness across underrepresented groups. By transforming 2D fundus imaging into 3D digital replicas of ocular structures, Fundus2Globe is a gateway for precision ophthalmology, laying the foundation for AI-driven, personalized myopia management.

Artificial Intelligence

[AI-0] Autellix: An Efficient Serving Engine for LLM Agents as General Programs

Link: https://arxiv.org/abs/2502.13965
Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms-for single-threaded and distributed programs-that preempt and prioritize LLM calls based on their programs’ previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.
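
The core scheduling idea, prioritizing an LLM call by its program's history instead of treating calls independently, can be caricatured with a priority queue keyed on the service a program has already received (a least-attained-service flavour). This is a sketch of the principle only, not Autellix's actual algorithms.

```python
import heapq
import itertools

class ProgramAwareScheduler:
    """Queue LLM calls with priority = tokens already consumed by the
    owning program, so programs that have received little service run
    first (an approximation of least-attained-service scheduling)."""
    def __init__(self):
        self._heap = []
        self._count = itertools.count()   # tie-breaker for stable ordering
        self.program_service = {}         # program_id -> tokens served so far

    def submit(self, program_id, call):
        served = self.program_service.get(program_id, 0)
        heapq.heappush(self._heap, (served, next(self._count), program_id, call))

    def next_call(self):
        _, _, program_id, call = heapq.heappop(self._heap)
        return program_id, call

    def record(self, program_id, tokens):
        self.program_service[program_id] = \
            self.program_service.get(program_id, 0) + tokens

sched = ProgramAwareScheduler()
sched.record("agent-A", 5000)             # agent-A already used many tokens
sched.submit("agent-A", "call-3")
sched.submit("agent-B", "call-1")         # fresh program jumps the queue
print(sched.next_call())                  # ('agent-B', 'call-1')
```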

[AI-1] Neurosymbolic artificial intelligence via large language models and coherence-driven inference

Link: https://arxiv.org/abs/2502.13953
Authors: Steve Huntsman, Jewell Thomas
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We devise an algorithm to generate sets of propositions that objectively instantiate graphs that support coherence-driven inference. We then benchmark the ability of large language models (LLMs) to reconstruct coherence graphs from (a straightforward transformation of) propositions expressed in natural language, with promising results from a single prompt to models optimized for reasoning. Combining coherence-driven inference with consistency evaluations by neural models may advance the state of the art in machine cognition.

[AI-2] Lost in Sequence: Do Large Language Models Understand Sequential Recommendation?

Link: https://arxiv.org/abs/2502.13909
Authors: Sein Kim, Hongseok Kang, Kibum Kim, Jiwan Kim, Donghyun Kim, Minchul Yang, Kwangjin Oh, Julian McAuley, Chanyoung Park
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Large Language Models (LLMs) have recently emerged as promising tools for recommendation thanks to their advanced textual understanding ability and context-awareness. Despite the current practice of training and evaluating LLM-based recommendation (LLM4Rec) models under a sequential recommendation scenario, we found that whether these models understand the sequential information inherent in users’ item interaction sequences has been largely overlooked. In this paper, we first demonstrate through a series of experiments that existing LLM4Rec models do not fully capture sequential information both during training and inference. Then, we propose a simple yet effective LLM-based sequential recommender, called LLM-SRec, a method that enhances the integration of sequential information into LLMs by distilling the user representations extracted from a pre-trained CF-SRec model into LLMs. Our extensive experiments show that LLM-SRec enhances LLMs’ ability to understand users’ item interaction sequences, ultimately leading to improved recommendation performance. Furthermore, unlike existing LLM4Rec models that require fine-tuning of LLMs, LLM-SRec achieves state-of-the-art performance by training only a few lightweight MLPs, highlighting its practicality in real-world applications. Our code is available at this https URL.
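
Since the abstract says only a few lightweight MLPs are trained, one way to read LLM-SRec is as a frozen CF-SRec encoder plus a trainable projection into the LLM's hidden space. The adapter below is a minimal sketch under that assumption; the dimensions and the distillation objective are guesses.

```python
import torch
import torch.nn as nn

class UserRepProjector(nn.Module):
    """Maps a frozen CF-SRec user embedding (d_cf) into an LLM's hidden
    space (d_llm) so it can be prepended as a soft token. Only this MLP
    is trained; the sequential recommender and the LLM stay frozen."""
    def __init__(self, d_cf: int = 64, d_llm: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_cf, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, user_emb: torch.Tensor) -> torch.Tensor:
        return self.net(user_emb)

proj = UserRepProjector()
cf_user = torch.randn(8, 64)               # from the frozen CF-SRec model
soft_tokens = proj(cf_user).unsqueeze(1)   # (8, 1, 4096), prepended to LLM inputs
print(soft_tokens.shape)
```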

[AI-3] Partially Observable Gaussian Process Network and Doubly Stochastic Variational Inference

Link: https://arxiv.org/abs/2502.13905
Authors: Saksham Kiroriwal, Julius Pfrommer, Jürgen Beyerer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures

Abstract:To reduce the curse of dimensionality for Gaussian processes (GP), they can be decomposed into a Gaussian Process Network (GPN) of coupled subprocesses with lower dimensionality. In some cases, intermediate observations are available within the GPN. However, intermediate observations are often indirect, noisy, and incomplete in most real-world systems. This work introduces the Partially Observable Gaussian Process Network (POGPN) to model real-world process networks. We model a joint distribution of latent functions of subprocesses and make inferences using observations from all subprocesses. POGPN incorporates observation lenses (observation likelihoods) into the well-established inference method of deep Gaussian processes. We also introduce two training methods for POPGN to make inferences on the whole network using node observations. The application to benchmark problems demonstrates how incorporating partial observations during training and inference can improve the predictive performance of the overall network, offering a promising outlook for its practical application.

[AI-4] NVR: Vector Runahead on NPUs for Sparse Memory Access

Link: https://arxiv.org/abs/2502.13873
Authors: Hui Wang, Zhengpeng Zhao, Jing Wang, Yushu Du, Yuan Cheng, Bing Guo, He Xiao, Chenhao Ma, Xiaomeng Han, Dean You, Jiapeng Guan, Ran Wei, Dawei Yang, Zhe Jiang
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep Neural Networks are increasingly leveraging sparsity to reduce the scaling up of model parameter size. However, reducing wall-clock time through sparsity and pruning remains challenging due to irregular memory access patterns, leading to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache miss problems in sparse DNN workloads. Rather than optimising memory patterns with high overhead and poor portability, NVR adapts runahead execution to the unique architecture of NPUs. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors, delivering 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount.

[AI-5] Enhancing LLM-Based Recommendations Through Personalized Reasoning

Link: https://arxiv.org/abs/2502.13845
Authors: Jiahao Liu, Xueshuo Yan, Dongsheng Li, Guangping Zhang, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 7 pages, under review

Abstract:Current recommendation systems powered by large language models (LLMs) often underutilize their reasoning capabilities due to a lack of explicit logical structuring. To address this limitation, we introduce CoT-Rec, a framework that integrates Chain-of-Thought (CoT) reasoning into LLM-driven recommendations by incorporating two crucial processes: user preference analysis and item perception evaluation. CoT-Rec operates in two key phases: (1) personalized data extraction, where user preferences and item perceptions are identified, and (2) personalized data application, where this information is leveraged to refine recommendations. Our experimental analysis demonstrates that CoT-Rec improves recommendation accuracy by making better use of LLMs’ reasoning potential. The implementation is publicly available at this https URL.

[AI-6] Enhancing Cross-Domain Recommendations with Memory-Optimized LLM-Based User Agents

Link: https://arxiv.org/abs/2502.13843
Authors: Jiahao Liu, Shengkang Gu, Dongsheng Li, Guangping Zhang, Mingzhe Han, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 6 pages, under review

Abstract:Large Language Model (LLM)-based user agents have emerged as a powerful tool for improving recommender systems by simulating user interactions. However, existing methods struggle with cross-domain scenarios due to inefficient memory structures, leading to irrelevant information retention and failure to account for social influence factors such as popularity. To address these limitations, we introduce AgentCF++, a novel framework featuring a dual-layer memory architecture and a two-step fusion mechanism to filter domain-specific preferences effectively. Additionally, we propose interest groups with shared memory, allowing the model to capture the impact of popularity trends on users with similar interests. Through extensive experiments on multiple cross-domain datasets, AgentCF++ demonstrates superior performance over baseline models, highlighting its effectiveness in refining user behavior simulation for recommender systems. Our code is available at this https URL.

[AI-7] Mitigating Popularity Bias in Collaborative Filtering through Fair Sampling

Link: https://arxiv.org/abs/2502.13840
Authors: Jiahao Liu, Dongsheng Li, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 6 pages, under review

Abstract:Recommender systems often suffer from popularity bias, where frequently interacted items are overrepresented in recommendations. This bias stems from propensity factors influencing training data, leading to imbalanced exposure. In this paper, we introduce a Fair Sampling (FS) approach to address this issue by ensuring that both users and items are selected with equal probability as positive and negative instances. Unlike traditional inverse propensity score (IPS) methods, FS does not require propensity estimation, eliminating errors associated with inaccurate calculations. Our theoretical analysis demonstrates that FS effectively neutralizes the influence of propensity factors, achieving unbiased learning. Experimental results validate that FS outperforms state-of-the-art methods in both point-wise and pair-wise recommendation tasks, enhancing recommendation fairness without sacrificing accuracy. The implementation is available at this https URL.
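
The Fair Sampling rule is simple enough to state in code: rather than drawing training instances in proportion to interaction frequency, draw users (and their items) uniformly when forming positive and negative pairs. The toy sampler below follows the spirit of the stated idea, not the paper's exact implementation.

```python
import random
from collections import defaultdict

interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"),
                ("u3", "i1"), ("u3", "i3")]          # i1 is "popular"

by_user = defaultdict(list)
for u, i in interactions:
    by_user[u].append(i)
all_items = sorted({i for _, i in interactions})

def popularity_sample():
    """Standard sampling: a random interaction, so popular items dominate."""
    return random.choice(interactions)

def fair_sample():
    """Fair Sampling sketch: pick the user uniformly, then one of their
    items uniformly as the positive and a non-interacted item as the
    negative, so every user is equally represented."""
    u = random.choice(list(by_user))
    pos = random.choice(by_user[u])
    neg = random.choice([i for i in all_items if i not in by_user[u]])
    return u, pos, neg                              # (user, positive, negative)

print(fair_sample())
```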

[AI-8] Quantifying Memorization and Retriever Performance in Retrieval-Augmented Vision-Language Models

Link: https://arxiv.org/abs/2502.13836
Authors: Peter Carragher, Abhinand Jha, R Raghav, Kathleen M. Carley
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in question answering (QA), but metrics for assessing their reliance on memorization versus retrieval remain underdeveloped. Moreover, while finetuned models are state-of-the-art on closed-domain tasks, general-purpose models like GPT-4o exhibit strong zero-shot performance. This raises questions about the trade-offs between memorization, generalization, and retrieval. In this work, we analyze the extent to which multimodal retrieval-augmented VLMs memorize training data compared to baseline VLMs. Using the WebQA benchmark, we contrast finetuned models with baseline VLMs on multihop retrieval and question answering, examining the impact of finetuning on data memorization. To quantify memorization in end-to-end retrieval and QA systems, we propose several proxy metrics by investigating instances where QA succeeds despite retrieval failing. Our results reveal the extent to which finetuned models rely on memorization. In contrast, retrieval-augmented VLMs have lower memorization scores, at the cost of accuracy (72% vs 52% on WebQA test set). As such, our measures pose a challenge for future work to reconcile memorization and generalization in both Open-Domain QA and joint Retrieval-QA tasks.

[AI-9] Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning ICLR2025

Link: https://arxiv.org/abs/2502.13834
Authors: Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma
Subjects: Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2025. Code is available at this https URL

Abstract:Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (\textita.k.a. tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure~1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.
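
The division of labour, symbolic methods for scaling tactics, an LLM for rewriting tactics, and symbolic tools for pruning and ranking goals, amounts to a best-first proof search along the lines sketched below. Every helper (apply_scaling_tactics, llm_rewrites, rank_goals, is_proved) is a named placeholder, not the released code's API.

```python
# Best-first search sketch over proof goals for an inequality, assuming
# hypothetical helpers: apply_scaling_tactics(goal) and llm_rewrites(goal)
# each return successor goals, rank_goals scores goals (most promising
# first) with symbolic tools, and is_proved(goal) closes trivial goals.

def prove(goal, apply_scaling_tactics, llm_rewrites, rank_goals, is_proved,
          budget=200):
    frontier = [goal]
    for _ in range(budget):
        if not frontier:
            return False
        frontier = rank_goals(frontier)          # symbolic pruning + ranking
        current = frontier.pop(0)
        if is_proved(current):
            return True
        # Scaling steps come from symbolic methods, rewriting from the LLM.
        successors = apply_scaling_tactics(current) + llm_rewrites(current)
        frontier.extend(successors)
    return False
```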

[AI-10] AnDB: Breaking Boundaries with an AI-Native Database for Universal Semantic Analysis

Link: https://arxiv.org/abs/2502.13805
Authors: Tianqing Wang, Xun Xue, Guoliang Li, Yong Wang
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 5 figures, conference

Abstract:In this demonstration, we present AnDB, an AI-native database that supports traditional OLTP workloads and innovative AI-driven tasks, enabling unified semantic analysis across structured and unstructured data. While structured data analytics is mature, challenges remain in bridging the semantic gap between user queries and unstructured data. AnDB addresses these issues by leveraging cutting-edge AI-native technologies, allowing users to perform semantic queries using intuitive SQL-like statements without requiring AI expertise. This approach eliminates the ambiguity of traditional text-to-SQL systems and provides a seamless end-to-end optimization for analyzing all data types. AnDB automates query processing by generating multiple execution plans and selecting the optimal one through its optimizer, which balances accuracy, execution time, and financial cost based on user policies and internal optimizing mechanisms. AnDB future-proofs data management infrastructure, empowering users to effectively and efficiently harness the full potential of all kinds of data without starting from scratch.

[AI-11] Poster: SpiderSim: Multi-Agent Driven Theoretical Cybersecurity Simulation for Industrial Digitalization

Link: https://arxiv.org/abs/2502.13778
Authors: Jiaqi Li, Xizhong Guo, Yang Zhao, Lvyang Zhang, Lidong Zhai
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Rapid industrial digitalization has created intricate cybersecurity demands that necessitate effective validation methods. While cyber ranges and simulation platforms are widely deployed, they frequently face limitations in scenario diversity and creation efficiency. In this paper, we present SpiderSim, a theoretical cybersecurity simulation platform enabling rapid and lightweight scenario generation for industrial digitalization security research. At its core, our platform introduces three key innovations: a structured framework for unified scenario modeling, a multi-agent collaboration mechanism for automated generation, and modular atomic security capabilities for flexible scenario composition. Extensive implementation trials across multiple industrial digitalization contexts, including marine ranch monitoring systems, validate our platform’s capacity for broad scenario coverage with efficient generation processes. Built on solid theoretical foundations and released as open-source software, SpiderSim facilitates broader research and development in automated security testing for industrial digitalization.

[AI-12] A consensus set for the aggregation of partial rankings: the case of the Optimal Set of Bucket Orders Problem

Link: https://arxiv.org/abs/2502.13769
Authors: Juan A. Aledo, José A. Gámez, Alejandro Rosete
Subjects: Artificial Intelligence (cs.AI)
Comments: 26 pages, 2 figures

Abstract:In rank aggregation problems (RAP), the solution is usually a consensus ranking that generalizes a set of input orderings. There are different variants that differ not only in terms of the type of rankings that are used as input and output, but also in terms of the objective function employed to evaluate the quality of the desired output ranking. In contrast, in some machine learning tasks (e.g. subgroup discovery) or multimodal optimization tasks, attention is devoted to obtaining several models/results to account for the diversity in the input data or across the search landscape. Thus, in this paper we propose to provide, as the solution to an RAP, a set of rankings to better explain the preferences expressed in the input orderings. We exemplify our proposal through the Optimal Bucket Order Problem (OBOP), an RAP which consists in finding a single consensus ranking (with ties) that generalizes a set of input rankings codified as a precedence matrix. To address this, we introduce the Optimal Set of Bucket Orders Problem (OSBOP), a generalization of the OBOP that aims to produce not a single ranking as output but a set of consensus rankings. Experimental results are presented to illustrate this proposal, showing how, by providing a set of consensus rankings, the fitness of the solution significantly improves with respect to the one of the original OBOP, without losing comprehensibility.

[AI-13] AI Software Engineer: Programming with Trust

Link: https://arxiv.org/abs/2502.13767
Authors: Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, Baishakhi Ray
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages

Abstract:Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human-driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust.

[AI-14] RobustX: Robust Counterfactual Explanations Made Easy

Link: https://arxiv.org/abs/2502.13751
Authors: Junqi Jiang, Luca Marzari, Aaryan Purohit, Francesco Leofante
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The increasing use of Machine Learning (ML) models to aid decision-making in high-stakes industries demands explainability to facilitate trust. Counterfactual Explanations (CEs) are ideally suited for this, as they can offer insights into the predictions of an ML model by illustrating how changes in its input data may lead to different outcomes. However, for CEs to realise their explanatory potential, significant challenges remain in ensuring their robustness under slight changes in the scenario being explained. Despite the widespread recognition of CEs’ robustness as a fundamental requirement, a lack of standardised tools and benchmarks hinders a comprehensive and effective comparison of robust CE generation methods. In this paper, we introduce RobustX, an open-source Python library implementing a collection of CE generation and evaluation methods, with a focus on the robustness property. RobustX provides interfaces to several existing methods from the literature, enabling streamlined access to state-of-the-art techniques. The library is also easily extensible, allowing fast prototyping of novel robust CE generation and evaluation methods.

[AI-15] Inference of Abstraction for Grounded Predicate Logic

Link: https://arxiv.org/abs/2502.13743
Authors: Hiroyuki Kido
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:An important open question in AI is what simple and natural principle enables a machine to reason logically for meaningful abstraction with grounded symbols. This paper explores a conceptually new approach to combining probabilistic reasoning and predicative symbolic reasoning over data. We return to the era of reasoning with a full joint distribution before the advent of Bayesian networks. We then discuss that a full joint distribution over models of exponential size in propositional logic and of infinite size in predicate logic should be simply derived from a full joint distribution over data of linear size. We show that the same process is not only enough to generalise the logical consequence relation of predicate logic but also to provide a new perspective to rethink well-known limitations such as the undecidability of predicate logic, the symbol grounding problem and the principle of explosion. The reproducibility of this theoretical work is fully demonstrated by the included proofs.

[AI-16] Robust Counterfactual Inference in Markov Decision Processes

Link: https://arxiv.org/abs/2502.13731
Authors: Jessica Lally, Milad Kazemi, Nicola Paoletti
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.

[AI-17] Secure Federated Data Distillation

Link: https://arxiv.org/abs/2502.13728
Authors: Marco Arazzi, Mert Cihangiroglu, Serena Nicolazzo, Antonino Nocera
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dataset Distillation (DD) is a powerful technique for reducing large datasets into compact, representative synthetic datasets, accelerating Machine Learning training. However, traditional DD methods operate in a centralized manner, which poses significant privacy threats and reduces its applicability. To mitigate these risks, we propose a Secure Federated Data Distillation framework (SFDD) to decentralize the distillation process while preserving privacy. Unlike existing Federated Distillation techniques that focus on training global models with distilled knowledge, our approach aims to produce a distilled dataset without exposing local contributions. We leverage the gradient-matching-based distillation method, adapting it for a distributed setting where clients contribute to the distillation process without sharing raw data. The central aggregator iteratively refines a synthetic dataset by integrating client-side updates while ensuring data confidentiality. To make our approach resilient to inference attacks perpetrated by the server that could exploit gradient updates to reconstruct private data, we create an optimized Local Differential Privacy approach, called LDPO-RLD (Label Differential Privacy Obfuscation via Randomized Linear Dispersion). Furthermore, we assess the framework's resilience against malicious clients executing backdoor attacks and demonstrate robustness under the assumption of a sufficient number of participating clients. Our experimental results demonstrate the effectiveness of SFDD and that the proposed defense concretely mitigates the identified vulnerabilities, with minimal impact on the performance of the distilled dataset. By addressing the interplay between privacy and federation in dataset distillation, this work advances the field of privacy-preserving Machine Learning making our SFDD framework a viable solution for sensitive data-sharing applications.

[AI-18] TrustRAG: An Information Assistant with Retrieval Augmented Generation

链接: https://arxiv.org/abs/2502.13719
作者: Yixing Fan,Qiang Yan,Wenshan Wang,Jiafeng Guo,Ruqing Zhang,Xueqi Cheng
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a crucial technique for enhancing large models with real-time and domain-specific knowledge. While numerous improvements and open-source tools have been proposed to refine the RAG framework for accuracy, relatively little attention has been given to improving the trustworthiness of generated results. To address this gap, we introduce TrustRAG, a novel framework that enhances RAG from three perspectives: indexing, retrieval, and generation. Specifically, in the indexing stage, we propose a semantic-enhanced chunking strategy that incorporates hierarchical indexing to supplement each chunk with contextual information, ensuring semantic completeness. In the retrieval stage, we introduce a utility-based filtering mechanism to identify high-quality information, supporting answer generation while reducing input length. In the generation stage, we propose fine-grained citation enhancement, which detects opinion-bearing sentences in responses and infers citation relationships at the sentence level, thereby improving citation accuracy. We open-source the TrustRAG framework and provide a demonstration studio designed for excerpt-based question answering tasks (this https URL). Based on these, we aim to help researchers: (1) systematically enhance the trustworthiness of RAG systems and (2) develop their own RAG systems with more reliable outputs.
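就摘要中的“语义增强分块 + 层级索引”给一个极简示意:把文档标题与章节路径前置到每个分块,使分块语义自足。定长切分纯属演示假设,论文中的切分与上下文补全策略更精细。

```python
def contextual_chunks(doc_title, sections, max_chars=800):
    """Hierarchical-indexing sketch: prepend the title/section path to every
    chunk so each piece stays semantically self-contained.
    `sections` is a list of (section_path, text) pairs."""
    chunks = []
    for path, text in sections:
        for i in range(0, len(text), max_chars):
            chunks.append(f"[{doc_title} > {path}]\n{text[i:i + max_chars]}")
    return chunks
```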

[AI-19] Causes and Strategies in Multiagent Systems AAMAS2025

链接: https://arxiv.org/abs/2502.13701
作者: Sylvia S. Kerkhove,Natasha Alechina,Mehdi Dastani
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: Accepted at AAMAS 2025

点击查看摘要

Abstract:Causality plays an important role in daily processes, human reasoning, and artificial intelligence. There has however not been much research on causality in multi-agent strategic settings. In this work, we introduce a systematic way to build a multi-agent system model, represented as a concurrent game structure, for a given structural causal model. In the obtained so-called causal concurrent game structure, transitions correspond to interventions on agent variables of the given causal model. The Halpern and Pearl framework of causality is used to determine the effects of a certain value for an agent variable on other variables. The causal concurrent game structure allows us to analyse and reason about causal effects of agents’ strategic decisions. We formally investigate the relation between causal concurrent game structures and the original structural causal models.

[AI-20] Integrating Inverse and Forward Modeling for Sparse Temporal Data from Sensor Networks

链接: https://arxiv.org/abs/2502.13638
作者: Julian Vexler,Björn Vieten,Martin Nelke,Stefan Kramer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present CavePerception, a framework for the analysis of sparse data from sensor networks that incorporates elements of inverse modeling and forward modeling. By integrating machine learning with physical modeling in a hypotheses space, we aim to improve the interpretability of sparse, noisy, and potentially incomplete sensor data. The framework assumes data from a two-dimensional sensor network laid out in a graph structure that detects certain objects, with certain motion patterns. Examples of such sensors are magnetometers. Given knowledge about the objects and the way they act on the sensors, one can develop a data generator that produces data from simulated motions of the objects across the sensor field. The framework uses the simulated data to infer object behaviors across the sensor network. The approach is experimentally tested on real-world data, where magnetometers are used at an airport to detect and identify aircraft motions. Experiments demonstrate the value of integrating inverse and forward modeling, enabling intelligent systems to better understand and predict complex, sensor-driven events.

[AI-21] Decentralized Planning Using Probabilistic Hyperproperties AAMAS2025

链接: https://arxiv.org/abs/2502.13621
作者: Francesco Pontiggia,Filip Macák,Roman Andriushchenko,Michele Chiari,Milan Češka
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: 11 pages, 1 figure, 2 tables. Accepted at AAMAS 2025: the 24th International Conference on Autonomous Agents and Multiagent Systems

点击查看摘要

Abstract:Multi-agent planning under stochastic dynamics is usually formalised using decentralized (partially observable) Markov decision processes (Dec-(PO)MDPs) and reachability or expected reward specifications. In this paper, we propose a different approach: we use an MDP describing how a single agent operates in an environment and probabilistic hyperproperties to capture desired temporal objectives for a set of decentralized agents operating in the environment. We extend existing approaches for model checking probabilistic hyperproperties to handle temporal formulae relating paths of different agents, thus requiring the self-composition of multiple MDPs. Using several case studies, we demonstrate that our approach provides a flexible and expressive framework to broaden the specification capabilities with respect to existing planning techniques. Additionally, we establish a close connection between a subclass of probabilistic hyperproperties and planning for a particular type of Dec-MDPs, for both of which we show undecidability. This lays the ground for the use of existing decentralized planning tools in the field of probabilistic hyperproperty verification.

[AI-22] Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation

链接: https://arxiv.org/abs/2502.13576
作者: Peiwen Yuan,Yueqi Zhang,Shaoxiong Feng,Yiwei Li,Xinglin Wang,Jiayi Shi,Chuyi Tan,Boyuan Pan,Yao Hu,Kan Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Evaluating models on large benchmarks is very resource-intensive, especially during a period of rapid model evolution. Existing efficient evaluation methods estimate the performance of target models by testing them only on a small, static coreset of the benchmark, which is derived from the publicly available evaluation results of source models. These methods rely on the assumption that target models have high prediction consistency with source models. However, we demonstrate that this assumption doesn’t generalize well in practice. To alleviate the inconsistency issue, we present TailoredBench, a method that conducts customized evaluation tailored to each target model. Specifically, a Global-coreset is first constructed as a probe to identify the most consistent source models for each target model with an adaptive source model selection strategy. Afterwards, a scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model. According to the predictions on Native-coresets, we obtain the performance of target models on the whole benchmark with a calibrated estimation strategy. Comprehensive experiments on 5 benchmarks across over 300 models demonstrate that, compared to the best-performing baselines, TailoredBench achieves an average reduction of 31.4% in the MAE of accuracy estimates under the same inference budgets, showcasing strong effectiveness and generalizability.
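以下用 NumPy 粗略示意“聚类选锚点(medoid)+ 按簇大小加权外推目标模型在整个基准上的准确率”这一骨架;论文中的自适应源模型选择与校准估计从略,贪心 k-medoids 也只是可行实现之一,接口均为假设。

```python
import numpy as np

def greedy_medoids(X, k):
    """Greedily pick k medoid examples from X (rows are per-example
    correctness patterns across source models)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = [int(D.sum(axis=1).argmin())]                     # best single medoid
    while len(medoids) < k:
        costs = [(D[:, medoids + [c]].min(axis=1).sum(), c)
                 for c in range(len(X)) if c not in medoids]
        medoids.append(min(costs)[1])
    return medoids, D

def estimate_accuracy(X, target_correct_on_medoids, k):
    """Estimate target-model accuracy from its answers on the k medoids,
    weighting each medoid by the size of its cluster."""
    medoids, D = greedy_medoids(X, k)
    assign = D[:, medoids].argmin(axis=1)        # nearest medoid per example
    weights = np.bincount(assign, minlength=k) / len(X)
    return float(weights @ np.asarray(target_correct_on_medoids, dtype=float))
```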

[AI-23] Model Evolution Framework with Genetic Algorithm for Multi-Task Reinforcement Learning

链接: https://arxiv.org/abs/2502.13569
作者: Yan Yu,Wengang Zhou,Yaodong Yang,Wanxuan Lu,Yingyan Hou,Houqiang Li
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-task reinforcement learning employs a single policy to complete various tasks, aiming to develop an agent with generalizability across different scenarios. Given the shared characteristics of tasks, the agent’s learning efficiency can be enhanced through parameter sharing. Existing approaches typically use a routing network to generate specific routes for each task and reconstruct a set of modules into diverse models to complete multiple tasks simultaneously. However, due to the inherent difference between tasks, it is crucial to allocate resources based on task difficulty, which is constrained by the model’s structure. To this end, we propose a Model Evolution framework with Genetic Algorithm (MEGA), which enables the model to evolve during training according to the difficulty of the tasks. When the current model is insufficient for certain tasks, the framework will automatically incorporate additional modules, enhancing the model’s capabilities. Moreover, to adapt to our model evolution framework, we introduce a genotype module-level model, using binary sequences as genotype policies for model reconstruction, while leveraging a non-gradient genetic algorithm to optimize these genotype policies. Unlike routing networks with fixed output dimensions, our approach allows for the dynamic adjustment of the genotype policy length, enabling it to accommodate models with a varying number of modules. We conducted experiments on various robotics manipulation tasks in the Meta-World benchmark. Our state-of-the-art performance demonstrated the effectiveness of the MEGA framework. We will release our source code to the public.

[AI-24] Are Large Language Models In-Context Graph Learners?

链接: https://arxiv.org/abs/2502.13562
作者: Jintang Li,Ruofan Wu,Yuchang Zhu,Huizhe Zhang,Liang Chen,Zibin Zheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint, under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable in-context reasoning capabilities across a wide range of tasks, particularly with unstructured inputs such as language or images. However, LLMs struggle to handle structured data, such as graphs, due to their lack of understanding of non-Euclidean structures. As a result, without additional fine-tuning, their performance significantly lags behind that of graph neural networks (GNNs) in graph learning tasks. In this paper, we show that learning on graph data can be conceptualized as a retrieval-augmented generation (RAG) process, where specific instances (e.g., nodes or edges) act as queries, and the graph itself serves as the retrieved context. Building on this insight, we propose a series of RAG frameworks to enhance the in-context learning capabilities of LLMs for graph learning tasks. Comprehensive evaluations demonstrate that our proposed RAG frameworks significantly improve LLM performance on graph-based tasks, particularly in scenarios where a pretrained LLM must be used without modification or accessed via an API.
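把查询节点当作 RAG 中的 query、其邻域当作检索到的上下文拼进提示词,是摘要这一视角最直接的落地方式。下面的草图假设了一个简单的图数据结构(node_id -> (text, label, neighbors)),与论文的具体框架无关。

```python
def node_classification_prompt(graph, node, max_neighbors=10):
    """Build an in-context prompt: the target node is the query and its
    1-hop neighborhood is the retrieved context."""
    text, _, neighbors = graph[node]
    lines = [f"Target node: {text}", "Retrieved neighbors:"]
    for nb in list(neighbors)[:max_neighbors]:
        nb_text, nb_label, _ = graph[nb]
        lines.append(f"- {nb_text} (label: {nb_label if nb_label is not None else 'unknown'})")
    lines.append("Question: what is the label of the target node? Answer with the label only.")
    return "\n".join(lines)
```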

[AI-25] Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs

链接: https://arxiv.org/abs/2502.13555
作者: Yushi Feng,Tsai Hor Chan,Guosheng Yin,Lequan Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data augmentation is necessary for graph representation learning due to the scarcity and noise present in graph data. Most existing augmentation methods overlook the context information inherited from the dataset, as they rely solely on the graph structure for augmentation. Despite the success of some large language model-based (LLM) graph learning methods, they are mostly white-box approaches, which require access to the weights or latent features of open-access LLMs, making them difficult to democratize for everyone, as existing LLMs are mostly closed-source for commercial considerations. To overcome these limitations, we propose DemoGraph, a black-box, context-driven graph data augmentation approach guided by LLMs. Leveraging the text prompt as context-related information, we task the LLM with generating knowledge graphs (KGs), which allow us to capture the structural interactions from the text outputs. We then design a dynamic merging schema to stochastically integrate the LLM-generated KGs into the original graph during training. To control the sparsity of the augmented graph, we further devise a granularity-aware prompting strategy and an instruction fine-tuning module, which seamlessly generates text prompts according to different granularity levels of the dataset. Extensive experiments on various graph learning tasks validate the effectiveness of our method over existing graph data augmentation methods. Notably, our approach excels in scenarios involving electronic health records (EHRs), which validates its maximal utilization of contextual knowledge, leading to enhanced predictive performance and interpretability.

[AI-26] Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking

链接: https://arxiv.org/abs/2502.13527
作者: Yanzeng Li,Yunfan Xiong,Jialun Zhong,Jinchao Zhang,Jie Zhou,Lei Zou
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has led to significant applications but also introduced serious security threats, particularly from jailbreak attacks that manipulate output generation. These attacks utilize prompt engineering and logit manipulation to steer models toward harmful content, prompting LLM providers to implement filtering and safety alignment strategies. We investigate LLMs’ safety mechanisms and their recent applications, revealing a new threat model targeting structured output interfaces, which enable attackers to manipulate the inner logit during LLM generation, requiring only API access permissions. To demonstrate this threat model, we introduce a black-box attack framework called AttackPrefixTree (APT). APT exploits structured output interfaces to dynamically construct attack patterns. By leveraging prefixes of models’ safety refusal response and latent harmful outputs, APT effectively bypasses safety measures. Experiments on benchmark datasets indicate that this approach achieves higher attack success rate than existing methods. This work highlights the urgent need for LLM providers to enhance security protocols to address vulnerabilities arising from the interaction between safety patterns and structured outputs.

[AI-27] MILE: Model-based Intervention Learning ICRA

链接: https://arxiv.org/abs/2502.13519
作者: Yigit Korkmaz,Erdem Bıyık
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: International Conference on Robotics and Automation (ICRA)

点击查看摘要

Abstract:Imitation learning techniques have been shown to be highly effective in real-world control scenarios, such as robotics. However, these approaches not only suffer from compounding error issues but also require human experts to provide complete trajectories. Although there exist interactive methods where an expert oversees the robot and intervenes if needed, these extensions usually only utilize the data collected during intervention periods and ignore the feedback signal hidden in non-intervention timesteps. In this work, we create a model to formulate how the interventions occur in such cases, and show that it is possible to learn a policy with just a handful of expert interventions. Our key insight is that it is possible to get crucial information about the quality of the current state and the optimality of the chosen action from expert feedback, regardless of the presence or the absence of intervention. We evaluate our method on various discrete and continuous simulation environments, a real-world robotic manipulation task, as well as a human subject study. Videos and the code can be found at this https URL .

[AI-28] SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

链接: https://arxiv.org/abs/2502.13516
作者: Hao Yi,Qingyang Li,Yulan Hu,Fuzheng Zhang,Di Zhang,Yong Liu
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thoughts) rely on prompt selection and the pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on stronger-model distillation or human annotations; while Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these, we propose a Self-training framework integrating Process Preference learning using Dynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization, and employs tree-based self-sampling on model responses without any distillation from other models. Furthermore, we theoretically prove that SPPD is equivalent to on-policy policy gradient methods under reward constraints. Experiments on 7B-scale models demonstrate superior performance across in-domain and out-of-domain mathematical benchmarks. We open-source our code at this https URL.
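摘要中的“动态价值边际”可以粗略理解为在 DPO 式偏好损失里加一项随步骤变化的边际。下面仅示意这种损失形态(PyTorch);论文中 value_margin 由过程 MDP 与 Bellman 最优方程导出,这里只作为输入张量假设,beta 取值亦为假设。

```python
import torch.nn.functional as F

def step_preference_loss(logp_chosen, logp_rejected, value_margin, beta=0.1):
    """DPO-style step-level preference loss with an additive value margin;
    all three tensor arguments share the same shape."""
    return -F.logsigmoid(beta * (logp_chosen - logp_rejected) - value_margin).mean()
```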

[AI-29] Hidden Darkness in LLM -Generated Designs: Exploring Dark Patterns in Ecommerce Web Components Generated by LLM s

链接: https://arxiv.org/abs/2502.13499
作者: Ziwei Chen,Jiawen Shen,Luna,Kristen Vaccaro
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:Recent work has highlighted the risks of LLM-generated content for a wide range of harmful behaviors, including incorrect and harmful code. In this work, we extend this by studying whether LLM-generated web design contains dark patterns. This work evaluated designs of ecommerce web components generated by four popular LLMs: Claude, GPT, Gemini, and Llama. We tested 13 commonly used ecommerce components (e.g., search, product reviews) and used them as prompts to generate a total of 312 components across all models. Over one-third of generated components contain at least one dark pattern. The majority of dark pattern strategies involve hiding crucial information, limiting users’ actions, and manipulating them into making decisions through a sense of urgency. Dark patterns are also more frequently produced in components that are related to company interests. These findings highlight the need for interventions to prevent dark patterns during front-end code generation with LLMs and emphasize the importance of expanding ethical design education to a broader audience.

[AI-30] Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs

链接: https://arxiv.org/abs/2502.13480
作者: Peiran Wang,Haibing Li,Fu Haohan,Shiyong Li,Yanpeng Wang,Dou Shen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we introduce Astra, an efficient and money-saving framework for automatic parallel strategy search on heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy over both the GPU configuration search space (GPU types and counts) and the parallel-parameter search space. Second, Astra handles heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to bring the money-saving objective into automatic parallel strategy search. Experimental results demonstrate that Astra can achieve better throughput than expert-designed strategies. On average, Astra's search completes within 1.27 seconds in a single-GPU setting and in less than 1.35 minutes in a heterogeneous-GPU setting, with an accuracy of over 95%.
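Astra 的搜索空间定义与耗时/费用建模是论文主体,这里只给一个玩具级的穷举示意:在 GPU 型号、数量与并行参数(TP/PP)上枚举,并按虚构的成本模型挑选预算内最省钱的方案。所有价格、吞吐与并行效率系数均为编造的假设值。

```python
from itertools import product

# Hypothetical catalogue: GPU type -> (hourly price per GPU, tokens/sec per GPU)
GPUS = {"A100": (2.0, 1900.0), "H100": (4.5, 3800.0), "L40S": (1.1, 900.0)}

def search_strategies(workload_tokens, budget_per_hour):
    """Enumerate (GPU type, count, tensor-parallel, pipeline-parallel) combos
    and keep the cheapest feasible plan under a toy throughput model."""
    best = None
    for gpu, n, tp, pp in product(GPUS, (8, 16, 32, 64), (1, 2, 4, 8), (1, 2, 4)):
        if tp * pp > n:
            continue
        price, tput = GPUS[gpu]
        eff = 0.9 ** (tp - 1) * 0.95 ** (pp - 1)   # toy parallel-efficiency penalty
        hours = workload_tokens / (tput * n * eff * 3600)
        cost = hours * price * n
        if price * n <= budget_per_hour and (best is None or cost < best[0]):
            best = (cost, gpu, n, tp, pp)
    return best  # (total cost, gpu, count, tp, pp) or None
```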

[AI-31] Integration of Agentic AI with 6G Networks for Mission-Critical Applications: Use-case and Challenges

链接: https://arxiv.org/abs/2502.13476
作者: Sunder Ali Khowaja,Kapal Dev,Muhammad Salman Pathan,Engin Zeydan,Merouane Debbah
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: FEMA [this https URL]; National Oceanic and Atmospheric Administration [this https URL]; packages: PyTorch [this https URL], RLlib [this https URL], Neo4j [this https URL], Apache Kafka [this https URL]

点击查看摘要

Abstract:We are in a transformative era, and advances in Artificial Intelligence (AI), especially the foundational models, are constantly in the news. AI has been an integral part of many applications that rely on automation for service delivery, and one of them is mission-critical public safety applications. The problem with AI-oriented mission-critical applications is the human-in-the-loop system and the lack of adaptability to dynamic conditions while maintaining situational awareness. Agentic AI (AAI) has gained a lot of attention recently due to its ability to analyze textual data through a contextual lens while quickly adapting to conditions. In this context, this paper proposes an AAI framework for mission-critical applications. We propose a novel framework with a multi-layer architecture to realize the AAI. We also present a detailed implementation of the AAI layer that bridges the gap between network infrastructure and mission-critical applications. Our preliminary analysis shows that the AAI reduces initial response time by 5.6 minutes on average, while alert generation time is reduced by 15.6 seconds on average and resource allocation is improved by up to 13.4%. We also show that the AAI methods improve the number of concurrent operations by 40, which reduces the recovery time by up to 5.2 minutes. Finally, we highlight some of the issues and challenges that need to be considered when implementing AAI frameworks.

[AI-32] Some Insights of Construction of Feature Graph to Learn Pairwise Feature Interactions with Graph Neural Networks

链接: https://arxiv.org/abs/2502.13471
作者: Phaphontee Yamchote,Saw Nay Htet Win,Chainarong Amornbunchornvej,Thanapon Noraset
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: This is the draft before submitting to any journal

点击查看摘要

Abstract:Feature interaction is crucial in predictive machine learning models, as it captures the relationships between features that influence model performance. In this work, we focus on pairwise interactions and investigate their importance in constructing feature graphs for Graph Neural Networks (GNNs). Rather than proposing new methods, we leverage existing GNN models and tools to explore the relationship between feature graph structures and their effectiveness in modeling interactions. Through experiments on synthesized datasets, we uncover that edges between interacting features are important for enabling GNNs to model feature interactions effectively. We also observe that including non-interaction edges can act as noise, degrading model performance. Furthermore, we provide theoretical support for sparse feature graph selection using the Minimum Description Length (MDL) principle. We prove that feature graphs retaining only necessary interaction edges yield a more efficient and interpretable representation than complete graphs, aligning with Occam’s Razor. Our findings offer both theoretical insights and practical guidelines for designing feature graphs that improve the performance and interpretability of GNN models.

[AI-33] Interleaved Gibbs Diffusion for Constrained Generation

链接: https://arxiv.org/abs/2502.13450
作者: Gautham Govind Anil,Sachin Yadav,Dheeraj Nagaraj,Karthikeyan Shanmugam,Prateek Jain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for mixed continuous-discrete data, focusing on constrained generation problems. Prior works on discrete and continuous-discrete diffusion models assume a factorized denoising distribution for fast generation, which can hinder the modeling of strong dependencies between random variables encountered in constrained generation. IGD moves beyond this by interleaving continuous and discrete denoising algorithms via a discrete-time Gibbs-sampling-type Markov chain. IGD provides flexibility in the choice of denoisers, allows conditional generation via state-space doubling, and supports inference-time scaling via the ReDeNoise method. Empirical evaluations on three challenging tasks (solving 3-SAT, generating molecule structures, and generating layouts) demonstrate state-of-the-art performance. Notably, IGD achieves a 7% improvement on 3-SAT out of the box and achieves state-of-the-art results in molecule generation without relying on equivariant diffusion or domain-specific architectures. We explore a wide range of modeling and interleaving strategies, along with hyperparameters, in each of these problems.

[AI-34] Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2502.13430
作者: Hao Ma,Shijie Wang,Zhiqiang Pu,Siyao Zhao,Xiaolin Ai
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, to enhance policy alignment. The existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapts to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select suitable potential function from a pre-designed pool. Besides, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
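摘要所说“理论上保持最优策略”对应经典的基于势函数的奖励塑形结论(Ng et al., 1999):只要塑形项形如 gamma*phi(s') - phi(s),最优策略不变。下面用一个假设的 vlm_score 可调用对象充当 VLM 势函数来示意这一形式。

```python
def shaped_reward(env_reward, state, next_state, vlm_score, gamma=0.99):
    """Potential-based shaping F(s, s') = gamma * phi(s') - phi(s), the
    standard form that provably preserves the optimal policy; `vlm_score`
    stands in for the VLM potential function described in the abstract."""
    return env_reward + gamma * vlm_score(next_state) - vlm_score(state)
```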

[AI-35] Explore-Construct-Filter: An Automated Framework for Rich and Reliable API Knowledge Graph Construction

链接: https://arxiv.org/abs/2502.13412
作者: Yanbang Sun,Qing Huang,Xiaoxue Ren,Zhenchang Xing,Xiaohong Li,Junjie Wang
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The API Knowledge Graph (API KG) is a structured network that models API entities and their relations, providing essential semantic insights for tasks such as API recommendation, code generation, and API misuse detection. However, constructing a knowledge-rich and reliable API KG presents several challenges. Existing schema-based methods rely heavily on manual annotations to design KG schemas, leading to excessive manual overhead. On the other hand, schema-free methods, due to the lack of schema guidance, are prone to introducing noise, reducing the KG’s reliability. To address these issues, we propose the Explore-Construct-Filter framework, an automated approach for API KG construction based on large language models (LLMs). This framework consists of three key modules: 1) KG exploration: LLMs simulate the workflow of annotators to automatically design a schema with comprehensive type triples, minimizing human intervention; 2) KG construction: Guided by the schema, LLMs extract instance triples to construct a rich yet unreliable API KG; 3) KG filtering: Removing invalid type triples and suspicious instance triples to construct a rich and reliable API KG. Experimental results demonstrate that our method surpasses the state-of-the-art method, achieving a 25.2% improvement in F1 score. Moreover, the Explore-Construct-Filter framework proves effective, with the KG exploration module increasing KG richness by 133.6% and the KG filtering module improving reliability by 26.6%. Finally, cross-model experiments confirm the generalizability of our framework.

[AI-36] Tell Me Why: Incentivizing Explanations

链接: https://arxiv.org/abs/2502.13410
作者: Siddarth Srinivasan,Ezra Karger,Michiel Bakker,Yiling Chen
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
*备注:

点击查看摘要

Abstract:Common sense suggests that when individuals explain why they believe something, we can arrive at more accurate conclusions than when they simply state what they believe. Yet, there is no known mechanism that provides incentives to elicit explanations for beliefs from agents. This likely stems from the fact that standard Bayesian models make assumptions (like conditional independence of signals) that preempt the need for explanations, in order to show efficient information aggregation. A natural justification for the value of explanations is that agents’ beliefs tend to be drawn from overlapping sources of information, so agents’ belief reports do not reveal all that needs to be known. Indeed, this work argues that rationales (explanations of an agent’s private information) lead to more efficient aggregation by allowing agents to efficiently identify what information they share and what information is new. Building on this model of rationales, we present a novel ‘deliberation mechanism’ to elicit rationales from agents, in which truthful reporting of beliefs and rationales is a perfect Bayesian equilibrium.

[AI-37] Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks

链接: https://arxiv.org/abs/2502.13406
作者: Vince Kurtz,Joel W. Burdick
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Generative control policies have recently unlocked major progress in robotics. These methods produce action sequences via diffusion or flow matching, with training data provided by demonstrations. But despite enjoying considerable success on difficult manipulation problems, generative policies come with two key limitations. First, behavior cloning requires expert demonstrations, which can be time-consuming and expensive to obtain. Second, existing methods are limited to relatively slow, quasi-static tasks. In this paper, we leverage a tight connection between sampling-based predictive control and generative modeling to address each of these issues. In particular, we introduce generative predictive control, a supervised learning framework for tasks with fast dynamics that are easy to simulate but difficult to demonstrate. We then show how trained flow-matching policies can be warm-started at run-time, maintaining temporal consistency and enabling fast feedback rates. We believe that generative predictive control offers a complementary approach to existing behavior cloning methods, and hope that it paves the way toward generalist policies that extend beyond quasi-static demonstration-oriented tasks.
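就“流匹配策略在运行时热启动”给一个可能的实现思路(并非论文原方法):把上一时刻的动作序列前移一步、沿线性插值路径重新加噪到中间流时间 t0,然后只积分剩余区间,从而保持相邻规划之间的时间一致性。t0 与噪声形式均为假设。

```python
import numpy as np

def sample_actions(v, horizon, dim, steps=10, prev_plan=None, t0=0.6):
    """Euler integration of a flow-matching policy v(x, t) -> velocity.
    With a previous plan, re-noise its one-step-shifted version to flow
    time t0 and only integrate the remaining (1 - t0) of the flow."""
    if prev_plan is None:
        x, t = np.random.randn(horizon, dim), 0.0
    else:
        shifted = np.vstack([prev_plan[1:], prev_plan[-1:]])   # recede one step
        x = t0 * shifted + (1.0 - t0) * np.random.randn(horizon, dim)
        t = t0
    dt = (1.0 - t) / steps
    for _ in range(steps):
        x = x + v(x, t) * dt
        t += dt
    return x
```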

[AI-38] Atomic Proximal Policy Optimization for Electric Robo-Taxi Dispatch and Charger Allocation

链接: https://arxiv.org/abs/2502.13392
作者: Jim Dai,Manxi Wu,Zhanhao Zhang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pioneering companies such as Waymo have deployed robo-taxi services in several U.S. cities. These robo-taxis are electric vehicles, and their operations require the joint optimization of ride matching, vehicle repositioning, and charging scheduling in a stochastic environment. We model the operations of the ride-hailing system with robo-taxis as a discrete-time, average reward Markov Decision Process with infinite horizon. As the fleet size grows, dispatching becomes challenging because the system state space and the fleet dispatching action set grow exponentially with the number of vehicles. To address this, we introduce a scalable deep reinforcement learning algorithm, called Atomic Proximal Policy Optimization (Atomic-PPO), that reduces the action space using atomic action decomposition. We evaluate our algorithm using real-world NYC for-hire vehicle data, and we measure performance by the long-run average reward achieved by the dispatching policy relative to a fluid-based reward upper bound. Our experiments demonstrate the superior performance of our Atomic-PPO compared to benchmarks. Furthermore, we conduct extensive numerical experiments to analyze the efficient allocation of charging facilities and assess the impact of vehicle range and charger speed on fleet performance.

[AI-39] Reasoning with Reinforced Functional Token Tuning

链接: https://arxiv.org/abs/2502.13389
作者: Kongcheng Zhang,Qi Yao,Baisheng Lai,Jiaxing Huang,Wenkai Fang,Dacheng Tao,Mingli Song,Shunyu Liu
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., analyze, verify, refine) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at this https URL.

[AI-40] Reflection of Episodes: Learning to Play Game from Expert and Self Experiences

链接: https://arxiv.org/abs/2502.13388
作者: Xiaojie Xu,Zongyuan Li,Chang Lu,Runnan Qi,Yanan Ni,Lumin Jiang,Xiangbei Liu,Xuebo Zhang,Yongchun Fang,Kuihua Huang,Xian Guo,Zhanghua Wu,Zhenya Li
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:StarCraft II is a complex and dynamic real-time strategy (RTS) game environment, which is very suitable for artificial intelligence and reinforcement learning research. To address the problem of Large Language Model (LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes (ROE) framework based on expert experience and self-experience. This framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. Finally, in the experiment, our method beat the robot under the Very Hard difficulty in TextStarCraft II. We analyze in detail the data generated by the LLM over the course of the game, verifying the method's effectiveness.

[AI-41] Learning Symbolic Task Decompositions for Multi-Agent Teams AAMAS2025

链接: https://arxiv.org/abs/2502.13376
作者: Ameesh Shah,Niklas Lauffer,Thomas Chen,Nikhil Pitta,Sanjit A. Seshia
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, main track full paper at AAMAS 2025

点击查看摘要

Abstract:One approach for improving sample efficiency in cooperative multi-agent learning is to decompose overall tasks into sub-tasks that can be assigned to individual agents. We study this problem in the context of reward machines: symbolic tasks that can be formally decomposed into sub-tasks. In order to handle settings without a priori knowledge of the environment, we introduce a framework that can learn the optimal decomposition from model-free interactions with the environment. Our method uses a task-conditioned architecture to simultaneously learn an optimal decomposition and the corresponding agents’ policies for each sub-task. In doing so, we remove the need for a human to manually design the optimal decomposition while maintaining the sample-efficiency benefits of improved credit assignment. We provide experimental results in several deep reinforcement learning settings, demonstrating the efficacy of our approach. Our results indicate that our approach succeeds even in environments with codependent agent dynamics, enabling synchronous multi-agent learning not achievable in previous works.

[AI-42] Fighter Jet Navigation and Combat using Deep Reinforcement Learning with Explainable AI

链接: https://arxiv.org/abs/2502.13373
作者: Swati Kar,Soumyabrata Dey,Mahesh K Banavar,Shahnewaz Karim Sakib
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents the development of an Artificial Intelligence (AI) based fighter jet agent within a customized Pygame simulation environment, designed to solve multi-objective tasks via deep reinforcement learning (DRL). The jet’s primary objectives include efficiently navigating the environment, reaching a target, and selectively engaging or evading an enemy. A reward function balances these goals while optimized hyperparameters enhance learning efficiency. Results show a task completion rate of more than 80%, demonstrating effective decision-making. To enhance transparency, the jet’s action choices are analyzed by comparing the rewards of the actual chosen action (factual action) with those of alternate actions (counterfactual actions), providing insights into the decision-making rationale. This study illustrates DRL’s potential for multi-objective problem-solving with explainable AI. Project page is available at: this https URL.

[AI-43] Secure and Efficient Watermarking for Latent Diffusion Models in Model Distribution Scenarios

链接: https://arxiv.org/abs/2502.13345
作者: Liangqi Lei,Keke Gai,Jing Yu,Liehuang Zhu,Qi Wu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Latent diffusion models have exhibited considerable potential in generative tasks. Watermarking is considered to be an alternative to safeguard the copyright of generative models and prevent their misuse. However, in model distribution scenarios, the accessibility of models to a large number of users brings new challenges to the security, efficiency, and robustness of existing watermark solutions. To address these issues, we propose a secure and efficient watermarking solution. A new security mechanism is designed to prevent watermark leakage and watermark escape, which considers watermark randomness and watermark-model association as two constraints for mandatory watermark injection. To reduce the time cost of training the security module, watermark injection and the security mechanism are decoupled, ensuring that fine-tuning the VAE only accomplishes the security mechanism without the burden of learning watermark patterns. A watermark distribution-based verification strategy is proposed to enhance robustness against diverse attacks in model distribution scenarios. Experimental results show that our watermarking consistently outperforms six existing baselines in effectiveness and robustness against ten image processing attacks and adversarial attacks, while enhancing security in distribution scenarios.

[AI-44] How Expressive are Knowledge Graph Foundation Models?

链接: https://arxiv.org/abs/2502.13339
作者: Xingyue Huang,Pablo Barceló,Michael M. Bronstein,İsmail İlkan Ceylan,Mikhail Galkin,Juan L Reutter,Miguel Romero Orth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Graph Foundation Models (KGFMs) are at the frontier for deep learning on knowledge graphs (KGs), as they can generalize to completely novel knowledge graphs with different relational vocabularies. Despite their empirical success, our theoretical understanding of KGFMs remains very limited. In this paper, we conduct a rigorous study of the expressive power of KGFMs. Specifically, we show that the expressive power of KGFMs directly depends on the motifs that are used to learn the relation representations. We then observe that the most typical motifs used in the existing literature are binary, as the representations are learned based on how pairs of relations interact, which limits the model’s expressiveness. As part of our study, we design more expressive KGFMs using richer motifs, which necessitate learning relation representations based on, e.g., how triples of relations interact with each other. Finally, we empirically validate our theoretical findings, showing that the use of richer motifs results in better performance on a wide range of datasets drawn from different domains.

[AI-45] Revisiting Privacy Utility and Efficiency Trade-offs when Fine-Tuning Large Language Models

链接: https://arxiv.org/abs/2502.13313
作者: Soumi Das,Camila Kolling,Mohammad Aflah Khan,Mahsa Amani,Bishwamittra Ghosh,Qinyuan Wu,Till Speicher,Krishna P. Gummadi
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This is a work in progress. The draft may change in future

点击查看摘要

Abstract:We study the inherent trade-offs in minimizing privacy risks and maximizing utility, while maintaining high computational efficiency, when fine-tuning large language models (LLMs). A number of recent works in privacy research have attempted to mitigate the privacy risks posed by memorizing fine-tuning data through differentially private training methods (e.g., DP), albeit at a significantly higher computational cost (inefficiency). In parallel, several works in systems research have focused on developing (parameter-)efficient fine-tuning methods (e.g., LoRA), but few works, if any, have investigated whether such efficient methods enhance or diminish privacy risks. In this paper, we investigate this gap and arrive at a surprising conclusion: efficient fine-tuning methods like LoRA mitigate privacy risks similarly to private fine-tuning methods like DP. Our empirical finding directly contradicts the prevailing wisdom that privacy and efficiency objectives are at odds during fine-tuning. The finding is established by (a) carefully defining measures of privacy and utility that distinguish between memorizing sensitive and non-sensitive tokens in the training and test datasets used in fine-tuning, and (b) extensive evaluations using multiple open-source language models from the Pythia, Gemma, and Llama families and different domain-specific datasets.

[AI-46] Demonstrating specification gaming in reasoning models

链接: https://arxiv.org/abs/2502.13295
作者: Alexander Bondarenko,Denis Volk,Dmitrii Volkov,Jeffrey Ladish
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1 preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won’t work to hack. We improve upon prior work like (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI (2024)'s o1 Docker escape during cyber capabilities testing.

[AI-47] Prediction of Clinical Complication Onset using Neural Point Processes

链接: https://arxiv.org/abs/2502.13290
作者: Sachini Weerasekara,Sagar Kamarthi,Jacqueline Isaacs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predicting medical events in advance within critical care settings is paramount for patient outcomes and resource management. Utilizing predictive models, healthcare providers can anticipate issues such as cardiac arrest, sepsis, or respiratory failure before they manifest. Recently, there has been a surge in research focusing on forecasting adverse medical event onsets prior to clinical manifestation using machine learning. However, while these models provide temporal prognostic predictions for the occurrence of a specific adverse event of interest within defined time intervals, their interpretability often remains a challenge. In this work, we explore the applicability of neural temporal point processes in the context of adverse event onset prediction, with the aim of explaining clinical pathways and providing interpretable insights. Our experiments span six state-of-the-art neural point processes and six critical care datasets, each focusing on the onset of distinct adverse events. This work represents a novel application class of neural temporal point processes in event prediction.

[AI-48] HyperGCL: Multi-Modal Graph Contrastive Learning via Learnable Hypergraph Views

链接: https://arxiv.org/abs/2502.13277
作者: Khaled Mohammed Saifuddin,Jonathan Shihao Ji,Esra Akbas
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:Recent advancements in Graph Contrastive Learning (GCL) have demonstrated remarkable effectiveness in improving graph representations. However, relying on predefined augmentations (e.g., node dropping, edge perturbation, attribute masking) may result in the loss of task-relevant information and a lack of adaptability to diverse input data. Furthermore, the selection of negative samples remains rarely explored. In this paper, we introduce HyperGCL, a novel multimodal GCL framework from a hypergraph perspective. HyperGCL constructs three distinct hypergraph views by jointly utilizing the input graph’s structure and attributes, enabling a comprehensive integration of multiple modalities in contrastive learning. A learnable adaptive topology augmentation technique enhances these views by preserving important relations and filtering out noise. View-specific encoders capture essential characteristics from each view, while a network-aware contrastive loss leverages the underlying topology to define positive and negative samples effectively. Extensive experiments on benchmark datasets demonstrate that HyperGCL achieves state-of-the-art node classification performance.

[AI-49] A Survey of Anomaly Detection in Cyber-Physical Systems

链接: https://arxiv.org/abs/2502.13256
作者: Danial Abshari,Meera Sridhar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In our increasingly interconnected world, Cyber-Physical Systems (CPS) play a crucial role in industries like healthcare, transportation, and manufacturing by combining physical processes with computing power. These systems, however, face many challenges, especially regarding security and system faults. Anomalies in CPS may indicate unexpected problems, from sensor malfunctions to cyber-attacks, and must be detected to prevent failures that can cause harm or disrupt services. This paper provides an overview of the different ways researchers have approached anomaly detection in CPS. We categorize and compare methods like machine learning, deep learning, mathematical models, invariant, and hybrid techniques. Our goal is to help readers understand the strengths and weaknesses of these methods and how they can be used to create safer, more reliable CPS. By identifying the gaps in current solutions, we aim to encourage future research that will make CPS more secure and adaptive in our increasingly automated world.

[AI-50] Communication Strategy on Macro-and-Micro Traffic State in Cooperative Deep Reinforcement Learning for Regional Traffic Signal Control

链接: https://arxiv.org/abs/2502.13248
作者: Hankang Gu,Shangbo Wang,Dongyao Jia,Yuli Zhang,Yanrong Luo,Guoqiang Mao,Jianping Wang,Eng Gee Lim
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive Traffic Signal Control (ATSC) has become a popular research topic in intelligent transportation systems. Regional Traffic Signal Control (RTSC) using the Multi-agent Deep Reinforcement Learning (MADRL) technique has become a promising approach for ATSC due to its ability to achieve the optimum trade-off between scalability and optimality. Most existing RTSC approaches partition a traffic network into several disjoint regions, followed by applying centralized reinforcement learning techniques to each region. However, the pursuit of cooperation among RTSC agents still remains an open issue, and no communication strategy for RTSC agents has been investigated. In this paper, we propose communication strategies to capture the correlation of micro-traffic states among lanes and the correlation of macro-traffic states among intersections. We first justify that the evolution equation of the RTSC process is Markovian via a system of store-and-forward queues. Next, based on the evolution equation, we propose two GAT-Aggregated (GA2) communication modules, GA2-Naive and GA2-Aug, to extract both intra-region and inter-region correlations between macro and micro traffic states. While GA2-Naive only considers the movements at each intersection, GA2-Aug also considers the lane-changing behavior of vehicles. The two proposed communication modules are then integrated into two existing RTSC frameworks, RegionLight and Regional-DRL. Experimental results demonstrate that both GA2-Naive and GA2-Aug effectively improve the performance of existing RTSC frameworks under both real and synthetic scenarios. Hyperparameter testing also reveals the robustness and potential of our communication modules in large-scale traffic networks.
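GA2 模块以 GAT 聚合为基本构件。下面是单头图注意力聚合(Velickovic et al., 2018)的标准草图,用于说明相邻车道/路口状态如何按注意力权重汇聚;它只是通用 GAT,并非论文中 GA2-Naive / GA2-Aug 的完整实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATAggregator(nn.Module):
    """Single-head graph attention aggregation over node states."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        # h: (N, in_dim) node states; adj: (N, N) 0/1 adjacency with self-loops
        z = self.W(h)
        pairs = torch.cat([z.unsqueeze(1).expand(-1, z.size(0), -1),
                           z.unsqueeze(0).expand(z.size(0), -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)       # (N, N) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(e, dim=-1) @ z                    # attention-weighted sum
```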

[AI-51] Conformal Prediction as Bayesian Quadrature

链接: https://arxiv.org/abs/2502.13228
作者: Jake C. Snell,Thomas L. Griffiths
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:As machine learning-based prediction systems are increasingly used in high-stakes situations, it is important to understand how such predictive models will perform upon deployment. Distribution-free uncertainty quantification techniques such as conformal prediction provide guarantees about the loss black-box models will incur even when the details of the models are hidden. However, such methods are based on frequentist probability, which unduly limits their applicability. We revisit the central aspects of conformal prediction from a Bayesian perspective and thereby illuminate the shortcomings of frequentist guarantees. We propose a practical alternative based on Bayesian quadrature that provides interpretable guarantees and offers a richer representation of the likely range of losses to be observed at test time.
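作为对照,下面给出论文所重新审视的标准频率派 split conformal 阈值计算;这是共形预测的经典结论,并非论文提出的贝叶斯求积方法。

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """With n calibration nonconformity scores, the ceil((n+1)(1-alpha))/n
    empirical quantile yields (1 - alpha) marginal coverage for a fresh
    exchangeable test point."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

# Usage: include every candidate label y with score(x_test, y) <= threshold.
```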

[AI-52] Two Tickets are Better than One: Fair and Accurate Hiring Under Strategic LLM Manipulations

链接: https://arxiv.org/abs/2502.13221
作者: Lee Cohen,Jack Hsieh,Connie Hong,Judy Hanwen Shen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:In an era of increasingly capable foundation models, job seekers are turning to generative AI tools to enhance their application materials. However, unequal access to and knowledge about generative AI tools can harm both employers and candidates by reducing the accuracy of hiring decisions and giving some candidates an unfair advantage. To address these challenges, we introduce a new variant of the strategic classification framework tailored to manipulations performed using large language models, accommodating varying levels of manipulations and stochastic outcomes. We propose a “two-ticket” scheme, where the hiring algorithm applies an additional manipulation to each submitted resume and considers this manipulated version together with the original submitted resume. We establish theoretical guarantees for this scheme, showing improvements for both the fairness and accuracy of hiring decisions when the true positive rate is maximized subject to a no-false-positives constraint. We further generalize this approach to an n-ticket scheme and prove that hiring outcomes converge to a fixed, group-independent decision, eliminating disparities arising from differential LLM access. Finally, we empirically validate our framework and the performance of our two-ticket scheme on real resumes using an open-source resume screening tool.

[AI-53] Learning To Explore With Predictive World Model Via Self-Supervised Learning

链接: https://arxiv.org/abs/2502.13200
作者: Alana Santana,Paula P. Costa,Esther L. Colombini
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous artificial agents must be able to learn behaviors in complex environments without humans to design tasks and rewards. Designing these functions for each environment is not feasible, thus, motivating the development of intrinsic reward functions. In this paper, we propose using several cognitive elements that have been neglected for a long time to build an internal world model for an intrinsically motivated agent. Our agent performs satisfactory iterations with the environment, learning complex behaviors without needing previously designed reward functions. We used 18 Atari games to evaluate what cognitive skills emerge in games that require reactive and deliberative behaviors. Our results show superior performance compared to the state-of-the-art in many test cases with dense and sparse rewards.

[AI-54] The Role of GitHub Copilot on Software Development: A Perspective on Productivity, Security, Best Practices and Future Directions

链接: https://arxiv.org/abs/2502.13199
作者: Suresh Babu Nettur,Shanthi Karpurapu,Unnati Nettur,Likhit Sagar Gajja,Sravanthy Myneni,Akhil Dusi
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Correspondence and co-first authors: nettursuresh@gmail.com, this http URL @gmail.com

点击查看摘要

Abstract:GitHub Copilot is transforming software development by automating tasks and boosting productivity through AI-driven code generation. In this paper, we conduct a literature survey to synthesize insights on Copilot’s impact on productivity and security. We review academic journal databases, industry reports, and official documentation to highlight key findings and challenges. While Copilot accelerates coding and prototyping, concerns over security vulnerabilities and intellectual property risks persist. Drawing from the literature, we provide a perspective on best practices and future directions for responsible AI adoption in software engineering, offering actionable insights for developers and organizations to integrate Copilot effectively while maintaining high standards of quality and security.

[AI-55] Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

链接: https://arxiv.org/abs/2502.13198
作者: Manal Rahal,Bestoun S. Ahmed,Gergely Szabados,Torgny Fornstedt,Jorgen Samuelsson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 42 pages

点击查看摘要

Abstract:Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore, tedious and time-consuming work goes into data preparation and improvement before moving further in the ML pipeline. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements and unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it is deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it is tested on three datasets of anti-sense oligonucleotides. A domain expert is consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data that guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.

[AI-56] Conditional Max-Sum for Asynchronous Multiagent Decision Making AAMAS2025

链接: https://arxiv.org/abs/2502.13194
作者: Dimitrios Troullinos,Georgios Chalkiadakis,Ioannis Papamichail,Markos Papageorgiou
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: Accepted Full Paper (Main Technical Track) - 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025). This extended version includes the Appendix at the end

点击查看摘要

Abstract:In this paper we present a novel approach for multiagent decision making in dynamic environments based on Factor Graphs and the Max-Sum algorithm, considering asynchronous variable reassignments and distributed message-passing among agents. Motivated by the challenging domain of lane-free traffic where automated vehicles can communicate and coordinate as agents, we propose a more realistic communication framework for Factor Graph formulations that satisfies the above-mentioned restrictions, along with Conditional Max-Sum: an extension of Max-Sum with a revised message-passing process that is better suited for asynchronous settings. The overall application in lane-free traffic can be viewed as a hybrid system where the Factor Graph formulation undertakes the strategic decision making of vehicles, that of desired lateral alignment in a coordinated manner; and acts on top of a rule-based method we devise that provides a structured representation of the lane-free environment for the factors, while also handling the underlying control of vehicles regarding core operations and safety. Our experimental evaluation showcases the capabilities of the proposed framework in problems with intense coordination needs when compared to a domain-specific baseline without communication, and an increased adeptness of Conditional Max-Sum with respect to the standard algorithm.
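
To make the Max-Sum machinery above concrete, here is a minimal, synchronous Max-Sum computation on a toy two-variable factor graph with hypothetical utilities. This illustrates only the standard algorithm the paper extends; the paper's Conditional Max-Sum revises the message-passing process for asynchronous, distributed settings and is not reproduced here.

```python
# A minimal, synchronous Max-Sum sketch on a two-variable factor graph.
# Utilities are made up for illustration.
domain = [0, 1]                      # hypothetical binary decisions per agent
unary = {"x1": {0: 1.0, 1: 0.0},     # hypothetical local utilities
         "x2": {0: 0.0, 1: 2.0}}

def pairwise(a, b):                  # hypothetical coordination utility
    return 1.5 if a == b else 0.0

# Factor-to-variable message: maximize over the other variable's value,
# adding that variable's incoming (here: unary) message.
msg_g_to_x1 = {a: max(pairwise(a, b) + unary["x2"][b] for b in domain) for a in domain}
msg_g_to_x2 = {b: max(pairwise(a, b) + unary["x1"][a] for a in domain) for b in domain}

# Each variable picks the assignment maximizing unary + incoming message.
x1 = max(domain, key=lambda a: unary["x1"][a] + msg_g_to_x1[a])
x2 = max(domain, key=lambda b: unary["x2"][b] + msg_g_to_x2[b])
print(x1, x2)  # coordinated assignment, here (1, 1)
```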

[AI-57] On the Privacy Risks of Spiking Neural Networks: A Membership Inference Analysis

链接: https://arxiv.org/abs/2502.13191
作者: Junyi Guan,Abhijith Sharma,Chong Tian,Salem Lahlou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are increasingly explored for their energy efficiency and robustness in real-world applications, yet their privacy risks remain largely unexamined. In this work, we investigate the susceptibility of SNNs to Membership Inference Attacks (MIAs) – a major privacy threat where an adversary attempts to determine whether a given sample was part of the training dataset. While prior work suggests that SNNs may offer inherent robustness due to their discrete, event-driven nature, we find that this resilience diminishes as latency (T) increases. Furthermore, we introduce an input dropout strategy under a black-box setting that significantly enhances membership inference in SNNs. Our findings challenge the assumption that SNNs are inherently more secure: even though they are expected to be more robust, our results reveal that SNNs exhibit privacy vulnerabilities comparable to those of Artificial Neural Networks (ANNs). Our code is available at this https URL.
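
As an illustration of the input-dropout attack idea in the abstract, here is a hedged sketch in which `query_model`, the dropout rate, and the mean-confidence score are all stand-in assumptions rather than the paper's exact attack design:

```python
# A hedged sketch of input-dropout membership inference: perturb an input with
# random dropout several times, query the black-box model, and use the mean
# confidence as the membership score (members tend to stay confidently
# classified under perturbation). The score design is an assumption.
import numpy as np

rng = np.random.default_rng(0)

def query_model(x):                      # stand-in for the black-box SNN
    return 1 / (1 + np.exp(-x.sum()))    # toy confidence for the true class

def membership_score(x, drop_p=0.2, n_trials=32):
    confs = []
    for _ in range(n_trials):
        mask = rng.random(x.shape) > drop_p      # input dropout
        confs.append(query_model(x * mask))
    return float(np.mean(confs))                 # high mean => likely member

member, non_member = np.full(16, 0.5), np.full(16, 0.05)
print(membership_score(member), membership_score(non_member))
```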

[AI-58] A Survey of Sim-to-Real Methods in RL: Progress, Prospects and Challenges with Foundation Models

链接: https://arxiv.org/abs/2502.13187
作者: Longchao Da,Justin Turnau,Thirulogasankar Pranav Kutralingam,Alvaro Velasquez,Paulo Shakarian,Hua Wei
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 19 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Deep Reinforcement Learning (RL) has been explored and verified to be effective in solving decision-making tasks in various domains, such as robotics, transportation, recommender systems, etc. It learns from the interaction with environments and updates the policy using the collected experience. However, due to the limited real-world data and unbearable consequences of taking detrimental actions, the learning of RL policy is mainly restricted within the simulators. This practice guarantees safety in learning but introduces an inevitable sim-to-real gap in terms of deployment, thus causing degraded performance and risks in execution. There are attempts to solve the sim-to-real problems from different domains with various techniques, especially in an era with emerging techniques such as large foundation or language models that have cast light on sim-to-real. This survey paper, to the best of our knowledge, is the first taxonomy that formally frames the sim-to-real techniques from key elements of the Markov Decision Process (State, Action, Transition, and Reward). Based on the framework, we cover comprehensive literature from the classic to the most advanced methods, including the sim-to-real techniques empowered by foundation models, and we also discuss the specialties that are worth attention in different domains of sim-to-real problems. Then we summarize the formal evaluation process of sim-to-real performance with accessible code or benchmarks. The challenges and opportunities are also presented to encourage future exploration of this direction. We are actively maintaining a repository to include the most up-to-date sim-to-real research outcomes to help researchers in their work.

[AI-59] RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals

链接: https://arxiv.org/abs/2502.13181
作者: Jaemu Heo,Eldor Fozilov,Hyunmin Song,Taehwan Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformers have achieved great success in effectively processing sequential data such as text. Their architecture, consisting of several attention and feedforward blocks, can model relations between elements of a sequence in a parallel manner, which makes them very efficient to train and effective in sequence modeling. Even though they have shown strong performance in processing sequential data, the size of their parameters is considerably larger when compared to other architectures such as RNN- and CNN-based models. Therefore, several approaches have explored parameter sharing and recurrence in Transformer models to address their computational demands. However, such methods struggle to maintain high performance compared to the original transformer model. To address this challenge, we propose our novel approach, RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner, while utilizing low-rank matrices to generate input-dependent level signals. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification, as validated in the experiments.

[AI-60] Uncertain Multi-Objective Recommendation via Orthogonal Meta-Learning Enhanced Bayesian Optimization

链接: https://arxiv.org/abs/2502.13180
作者: Hongxu Wang,Zhu Sun,Yingpeng Du,Lu Zhang,Tiantian He,Yew-Soon Ong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recommender systems (RSs) play a crucial role in shaping our digital interactions, influencing how we access and engage with information across various domains. Traditional research has predominantly centered on maximizing recommendation accuracy, often leading to unintended side effects such as echo chambers and constrained user experiences. Drawing inspiration from autonomous driving, we introduce a novel framework that categorizes RS autonomy into five distinct levels, ranging from basic rule-based accuracy-driven systems to behavior-aware, uncertain multi-objective RSs - where users may have varying needs, such as accuracy, diversity, and fairness. In response, we propose an approach that dynamically identifies and optimizes multiple objectives based on individual user preferences, fostering more ethical and intelligent user-centric recommendations. To navigate the uncertainty inherent in multi-objective RSs, we develop a Bayesian optimization (BO) framework that captures personalized trade-offs between different objectives while accounting for their uncertain interdependencies. Furthermore, we introduce an orthogonal meta-learning paradigm to enhance BO efficiency and effectiveness by leveraging shared knowledge across similar tasks and mitigating conflicts among objectives through the discovery of orthogonal information. Finally, extensive empirical evaluations demonstrate the effectiveness of our method in optimizing uncertain multi-objectives for individual users, paving the way for more adaptive and user-focused RSs.

[AI-61] PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

链接: https://arxiv.org/abs/2502.13179
作者: Jiaqi Zhao,Miao Zhang,Ming Wang,Yuzhang Shang,Kaihao Zhang,Weili Guan,Yaowei Wang,Min Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 11 figures

点击查看摘要

Abstract:Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub-2-bit post-training quantization (PTQ) methods utilize a mixed-precision scheme by leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, which introduces an extra 1 bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask with a negligible additional 0.0002 bits per weight, based on input activations from the perspective of reducing the upper bound of quantization error, to allocate the corresponding salient weight channels to 4-bit. For non-salient channel binarization, an efficient block-wise scaling factors optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty of per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Codes are available at this https URL.

[AI-62] Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation and Comparative Analysis

链接: https://arxiv.org/abs/2502.13178
作者: Jiaqi Zhao,Ming Wang,Miao Zhang,Yuzhang Shang,Xuebo Liu,Yaowei Wang,Min Zhang,Liqiang Nie
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:The Post-training Quantization (PTQ) technique has been extensively adopted for large language model (LLM) compression owing to its efficiency and low resource requirement. However, current research lacks an in-depth analysis of the superior and applicable scenarios of each PTQ strategy. In addition, existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. To mitigate this confusion, we provide a novel benchmark for LLM PTQ in this paper. Firstly, in order to support our benchmark, we propose a comprehensive taxonomy for existing mainstream methods by scrutinizing their computational strategies (e.g., optimization-based, compensation-based, etc.). Then, we conduct extensive experiments with the baseline within each class, covering models with various sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba) and modalities (LLaVA1.5 and VILA1.5) on a wide range of evaluation datasets. Through comparative analysis of the results, we summarize the superiority of each PTQ strategy and the trade-off between model size, bitwidth, and performance. For example, our benchmark reveals that the compensation-based technique demonstrates outstanding cross-architecture robustness, and that extremely low-bit PTQ for ultra-large models should be reexamined. Finally, we accordingly claim that a practical combination of compensation-based and other PTQ strategies can achieve state-of-the-art performance with robustness across various settings. We believe that our benchmark will provide valuable recommendations for the deployment of LLMs and future research on PTQ approaches.

[AI-63] KL Penalty Control via Perturbation for Direct Preference Optimization

链接: https://arxiv.org/abs/2502.13177
作者: Sangkyu Lee,Janghoon Han,Hosung Song,Stanley Jungkyu Choi,Honglak Lee,Youngjae Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint; Under review

点击查看摘要

Abstract:Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods try to turn this static KL penalty into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose \varepsilon -Direct Preference Optimization ( \varepsilon -DPO), which allows adaptive control of the KL penalty strength \beta for each preference pair. Specifically, \varepsilon -DPO adaptively controls \beta for each preference pair based on the monotonicity of logits as a preference model under the perturbation of \beta during training by simply reusing the logit of the current policy and the reference policy. Experimental results show that \varepsilon -DPO outperforms existing direct alignment algorithms and KL penalty relaxation methods on general chatbot benchmarks, highlighting the significance of adaptive KL penalty relaxation at the instance-level in DPO.
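
For orientation, here is a sketch of where an instance-level \beta enters the standard DPO objective; how \varepsilon-DPO actually chooses each beta (via logit monotonicity under \beta perturbation) is the paper's contribution and is not reproduced here. The function and tensor names are assumptions.

```python
# A hedged sketch of a DPO loss with a per-pair KL penalty strength beta_i,
# the knob that epsilon-DPO adapts at the instance level.
import torch
import torch.nn.functional as F

def dpo_loss_per_pair(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, betas):
    """All inputs are length-N tensors of sequence log-probabilities;
    `betas` holds one KL penalty strength per preference pair."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(betas * margin).mean()

loss = dpo_loss_per_pair(
    torch.tensor([-10.0, -12.0]), torch.tensor([-11.0, -11.5]),
    torch.tensor([-10.5, -12.2]), torch.tensor([-10.8, -11.4]),
    betas=torch.tensor([0.1, 0.3]),   # hypothetical per-pair strengths
)
print(loss.item())
```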

[AI-64] BaKlaVa – Budgeted Allocation of KV cache for Long-context Inference

链接: https://arxiv.org/abs/2502.13176
作者: Ahmed Burak Gulhan,Krishna Teja Chitty-Venkata,Murali Emani,Mahmut Kandemir,Venkatram Vishwanath
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In Large Language Model (LLM) inference, Key-Value (KV) caches (KV-caches) are essential for reducing time complexity. However, they result in a linear increase in GPU memory as the context length grows. While recent work explores KV-cache eviction and compression policies to reduce memory usage, they often consider uniform KV-caches across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
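
The abstract's core idea, converting per-head importance estimates from a one-time profiling pass into per-head memory budgets, might look roughly like the sketch below; the proportional rule, the floor, and all names are assumptions rather than BaKlaVa's exact allocator.

```python
# A hedged sketch: turn per-head importance scores (from a one-time profiling
# pass, not shown) into per-head KV-cache budgets under a fixed total budget.
import numpy as np

def allocate_kv_budgets(importance, total_tokens, min_tokens=16):
    """Split `total_tokens` of KV-cache capacity across heads,
    proportionally to importance, with a floor per head."""
    importance = np.asarray(importance, dtype=float)
    weights = importance / importance.sum()
    budgets = np.maximum(min_tokens, np.floor(weights * total_tokens)).astype(int)
    # Trim any overshoot caused by the floor, taking from the least important heads.
    overshoot = budgets.sum() - total_tokens
    for h in np.argsort(importance):
        if overshoot <= 0:
            break
        take = min(overshoot, budgets[h] - min_tokens)
        budgets[h] -= take
        overshoot -= take
    return budgets

print(allocate_kv_budgets([0.9, 0.5, 0.1, 0.05], total_tokens=1024))
```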

[AI-65] Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks

链接: https://arxiv.org/abs/2502.13175
作者: Wenpeng Xing,Minghao Li,Mohan Li,Meng Han
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Embodied AI systems, including robots and autonomous vehicles, are increasingly integrated into real-world applications, where they encounter a range of vulnerabilities stemming from both environmental and system-level factors. These vulnerabilities manifest through sensor spoofing, adversarial attacks, and failures in task and motion planning, posing significant challenges to robustness and safety. Despite the growing body of research, existing reviews rarely focus specifically on the unique safety and security challenges of embodied AI systems. Most prior work either addresses general AI vulnerabilities or focuses on isolated aspects, lacking a dedicated and unified framework tailored to embodied AI. This survey fills this critical gap by: (1) categorizing vulnerabilities specific to embodied AI into exogenous (e.g., physical attacks, cybersecurity threats) and endogenous (e.g., sensor failures, software flaws) origins; (2) systematically analyzing adversarial attack paradigms unique to embodied AI, with a focus on their impact on perception, decision-making, and embodied interaction; (3) investigating attack vectors targeting large vision-language models (LVLMs) and large language models (LLMs) within embodied systems, such as jailbreak attacks and instruction misinterpretation; (4) evaluating robustness challenges in algorithms for embodied perception, decision-making, and task planning; and (5) proposing targeted strategies to enhance the safety and reliability of embodied AI systems. By integrating these dimensions, we provide a comprehensive framework for understanding the interplay between vulnerabilities and safety in embodied AI.

[AI-66] Thinking Preference Optimization

链接: https://arxiv.org/abs/2502.13173
作者: Wang Yang,Hongye Jin,Jingfeng Yang,Vipin Chaudhary,Xiaotian Han
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning abilities, we can either collect new high-quality long CoT reasoning SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost the performance with the SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs. Experiments show that ThinkPO further improves the reasoning performance of SFT-ed models, e.g. it increases math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%. Notably, ThinkPO is capable of continually boosting the performance of the publicly distilled SFT model, e.g., increasing the official DeepSeek-R1-Distill-Qwen-7B’s performance on MATH500 from 87.4% to 91.2%.
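
Constructing ThinkPO-style training data as described is straightforward; the sketch below pairs long CoT responses (chosen) against short ones (rejected) in a layout modeled on common preference-tuning datasets (e.g., the fields expected by TRL's DPOTrainer). The field names are assumptions.

```python
# A minimal sketch of ThinkPO-style data construction: for each question, the
# long chain-of-thought answer is the chosen response and a short one is the
# rejected response; the resulting pairs feed a standard DPO trainer.
def build_thinkpo_pairs(questions, long_cot_answers, short_cot_answers):
    return [
        {"prompt": q, "chosen": long, "rejected": short}
        for q, long, short in zip(questions, long_cot_answers, short_cot_answers)
    ]

pairs = build_thinkpo_pairs(
    ["What is 17 * 6?"],
    ["Let's break it down: 17*6 = 10*6 + 7*6 = 60 + 42 = 102. Answer: 102."],
    ["102."],
)
print(pairs[0]["chosen"])
```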

[AI-67] Unveiling Privacy Risks in LLM Agent Memory

链接: https://arxiv.org/abs/2502.13172
作者: Bo Wang,Weiyi He,Pengfei He,Shenglai Zeng,Zhen Xiang,Yue Xing,Jiliang Tang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent’s and the attacker’s perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.

[AI-68] Web Phishing Net (WPN): A scalable machine learning approach for real-time phishing campaign detection

链接: https://arxiv.org/abs/2502.13171
作者: Muhammad Fahad Zia,Sri Harish Kalidass
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: IEEE Intelligent Cybersecurity Conference (ICSC2024)

点击查看摘要

Abstract:Phishing is the most prevalent type of cyber-attack today and is recognized as the leading source of data breaches, with significant consequences for both individuals and corporations. Web-based phishing attacks are the most frequent, with vectors such as social media posts and emails containing links to phishing URLs that, once clicked on, render host systems vulnerable to more sinister attacks. Research efforts to detect phishing URLs have involved the use of supervised learning techniques that use large amounts of data to train models and have high computational requirements. They also involve analysis of features derived from vectors including email contents, thus affecting user privacy. Additionally, they suffer from a lack of resilience against the evolution of threats, especially with the advent of generative AI techniques to bypass these systems, as with AI-generated phishing URLs. Unsupervised methods such as clustering techniques have also been used in phishing detection in the past; however, they are at times unscalable due to the use of pair-wise comparisons. They also lack high detection rates when detecting phishing campaigns. In this paper, we propose an unsupervised learning approach that is not only fast but scalable, as it does not involve pair-wise comparisons. It is able to detect entire campaigns at a time with a high detection rate while preserving user privacy; this includes the recent surge of campaigns with targeted phishing URLs generated by malicious entities using generative AI techniques.

[AI-69] Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment ICLR2025

链接: https://arxiv.org/abs/2502.13170
作者: Yuze Zhao,Tianyun Ji,Wenjun Feng,Zhenya Huang,Qi Liu,Zhiding Liu,Yixiao Ma,Kai Zhang,Enhong Chen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ICLR 2025 Poster;23 pages, 7 figures

点击查看摘要

Abstract:Reasoning abilities are among the most enigmatic and captivating aspects of large language models (LLMs). Numerous studies are dedicated to exploring and expanding the boundaries of this reasoning capability. However, tasks that embody both reasoning and recall characteristics are often overlooked. In this paper, we introduce such a novel task, code reasoning, to provide a new perspective on the reasoning abilities of LLMs. We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks. Our testing on these benchmarks reveals that LLMs continue to struggle with identifying satisfactory reasoning pathways. Additionally, we present a new pathway exploration pipeline inspired by human intricate problem-solving methods. This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) Proposing potential hypotheses based on observations and decomposing them; (2) Utilizing tools to validate hypotheses and reflection outcomes; (3) Revising hypotheses in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi-step reasoning, resulting in performance gains of up to 3×. Finally, we expanded this pipeline by applying it to simulate complex household tasks in real-world scenarios, specifically in VirtualHome, enhancing the handling of failure cases. We release our code and all of our results at this https URL.

[AI-70] SmartLLM: Smart Contract Auditing using Custom Generative AI

链接: https://arxiv.org/abs/2502.13167
作者: Jun Kevin,Pujianto Yugopuspito
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Smart contracts are essential to decentralized finance (DeFi) and blockchain ecosystems but are increasingly vulnerable to exploits due to coding errors and complex attack vectors. Traditional static analysis tools and existing vulnerability detection methods often fail to address these challenges comprehensively, leading to high false-positive rates and an inability to detect dynamic vulnerabilities. This paper introduces SmartLLM, a novel approach leveraging fine-tuned LLaMA 3.1 models with Retrieval-Augmented Generation (RAG) to enhance the accuracy and efficiency of smart contract auditing. By integrating domain-specific knowledge from ERC standards and employing advanced techniques such as QLoRA for efficient fine-tuning, SmartLLM achieves superior performance compared to static analysis tools like Mythril and Slither, as well as zero-shot large language model (LLM) prompting methods such as GPT-3.5 and GPT-4. Experimental results demonstrate a perfect recall of 100% and an accuracy score of 70%, highlighting the model’s robustness in identifying vulnerabilities, including reentrancy and access control issues. This research advances smart contract security by offering a scalable and effective auditing solution, supporting the secure adoption of decentralized applications.

[AI-71] HedgeAgents: A Balanced-aware Multi-agent Financial Trading System WWW2025

链接: https://arxiv.org/abs/2502.13165
作者: Xiangyu Li,Yawen Zeng,Xiaofen Xing,Jin Xu,Xiangmin Xu
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
*备注: This paper has been accepted by The Web Conference 2025 (WWW 2025) and selected for an oral presentation

点击查看摘要

Abstract:As automated trading gains traction in the financial market, algorithmic investment strategies are increasingly prominent. While Large Language Models (LLMs) and Agent-based models exhibit promising potential in real-time market analysis and trading decisions, they still experience a significant -20% loss when confronted with rapid declines or frequent fluctuations, impeding their practical application. Hence, there is an imperative to explore a more robust and resilient framework. This paper introduces an innovative multi-agent system, HedgeAgents, aimed at bolstering system robustness via “hedging” strategies. In this well-balanced system, an array of hedging agents has been tailored, where HedgeAgents consist of a central fund manager and multiple hedging experts specializing in various financial asset classes. These agents leverage LLMs’ cognitive capabilities to make decisions and coordinate through three types of conferences. Benefiting from the powerful understanding of LLMs, our HedgeAgents attained a 70% annualized return and a 400% total return over a period of 3 years. Moreover, we have observed with delight that HedgeAgents can even formulate investment experience comparable to those of human experts (this https URL).

[AI-72] Multi-Agent Actor-Critic Generative AI for Query Resolution and Analysis

链接: https://arxiv.org/abs/2502.13164
作者: Mohammad Wali Ur Rahman,Ric Nevarez,Lamia Tasnim Mim,Salim Hariri
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: Accepted for publication in IEEE Transactions on Artificial Intelligence

点击查看摘要

Abstract:In this paper, we introduce MASQRAD (Multi-Agent Strategic Query Resolution and Diagnostic tool), a transformative framework for query resolution based on the actor-critic model, which utilizes multiple generative AI agents. MASQRAD is excellent at translating imprecise or ambiguous user inquiries into precise and actionable requests. This framework generates pertinent visualizations and responses to these focused queries, as well as thorough analyses and insightful interpretations for users. MASQRAD addresses the common shortcomings of existing solutions in domains that demand fast and precise data interpretation, such as their incapacity to successfully apply AI for generating actionable insights and their challenges with the inherent ambiguity of user queries. MASQRAD functions as a sophisticated multi-agent system but “masquerades” to users as a single AI entity, which lowers errors and enhances data interaction. This approach makes use of three primary AI agents: Actor Generative AI, Critic Generative AI, and Expert Analysis Generative AI. Each is crucial for creating, enhancing, and evaluating data interactions. The Actor AI generates Python scripts to generate data visualizations from large datasets within operational constraints, and the Critic AI rigorously refines these scripts through multi-agent debate. Finally, the Expert Analysis AI contextualizes the outcomes to aid in decision-making. With an accuracy rate of 87% when handling tasks related to natural language visualization, MASQRAD establishes new benchmarks for automated data interpretation and showcases a noteworthy advancement that has the potential to revolutionize AI-driven applications.

[AI-73] Understanding Dynamic Diffusion Process of LLM-based Agents under Information Asymmetry

链接: https://arxiv.org/abs/2502.13160
作者: Yiwen Zhang,Yifu Wu,Wenyue Hua,Xuming Hu
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Large language models have been used to simulate human society using multi-agent systems. Most current social simulation research emphasizes interactive behaviors in fixed environments, ignoring information opacity, relationship variability and diffusion diversity. In this paper, we study the dynamics of information diffusion in 12 asymmetric open environments defined by information content and distribution mechanisms. We first present a general framework to capture the features of information diffusion. Then, we designed a dynamic attention mechanism to help agents allocate attention to different information, addressing the limitations of LLM-based attention. Agents start by responding to external information stimuli within a five-agent group, increasing group size and forming information circles while developing relationships and sharing information. Additionally, we observe the emergence of information cocoons, the evolution of information gaps, and the accumulation of social capital, which are closely linked to psychological, sociological, and communication theories.

[AI-74] Bi-Fact: A Bidirectional Factorization-based Evaluation of Intent Extraction from UI Trajectories

链接: https://arxiv.org/abs/2502.13149
作者: Sapir Caduri
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Bi-Fact, a novel approach to automatic evaluation for Intent Understanding, is presented. Drawing inspiration from FactScore, Bi-Fact enables fine-grained intent comparison by splitting both gold and predicted intents into facts and calculating precision and recall, considering the UI trajectory. This paper outlines a comprehensive evaluation of Bi-Fact, assessing its performance and comparing it to existing metrics.

[AI-75] NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

链接: https://arxiv.org/abs/2502.09720
作者: Semyon Savkin,Eitan Porat,Or Ordentlich,Yury Polyanskiy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 16 pages

点击查看摘要

Abstract:Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent work has mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This represents more than a 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to state-of-the-art Meta’s SpinQuant (perplexity 7.3). Comparisons on various LLM evaluation benchmarks also show a reduction in performance degradation induced by quantization.

[AI-76] Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics

链接: https://arxiv.org/abs/2502.13785
作者: Matthew Wood,Mathieu Klop,Maxime Allard
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures, 3 tables

点击查看摘要

Abstract:mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine’s effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a structured state-space-based and attention hybrid model to address these challenges. In addition to a first pre-training stage, a second pre-training stage allows us to specialise the model with high-quality data. We employ single nucleotide tokenization of mRNA sequences with codon separation, ensuring prior biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTRs and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source the model (this https URL) and model weights (this https URL).
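
A sketch of the tokenization described above: single-nucleotide tokens with an explicit codon separator so the reading frame is preserved. The separator symbol is an assumption.

```python
# A hedged sketch of single-nucleotide tokenization with codon separation for
# a coding sequence; any trailing bases that do not fill a codon are dropped.
def tokenize_cds(seq, sep="<c>"):
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    tokens = []
    for codon in codons:
        tokens.extend(list(codon))   # one token per nucleotide
        tokens.append(sep)           # hypothetical codon-boundary token
    return tokens

print(tokenize_cds("AUGGCCUAA"))
# ['A','U','G','<c>','G','C','C','<c>','U','A','A','<c>']
```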

[AI-77] GPA: Grover Policy Agent for Generating Optimal Quantum Sensor Circuits

链接: https://arxiv.org/abs/2502.13755
作者: Ahmad Alomari,Sathish A. P. Kumar
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:This study proposes a GPA for designing optimal Quantum Sensor Circuits (QSCs) to address complex quantum physics problems. The GPA consists of two parts: the Quantum Policy Evaluation (QPE) and the Quantum Policy Improvement (QPI). The QPE performs phase estimation to generate the search space, while the QPI utilizes Grover search and amplitude amplification techniques to efficiently identify an optimal policy that generates optimal QSCs. The GPA generates QSCs by selecting sequences of gates that maximize the Quantum Fisher Information (QFI) while minimizing the number of gates. The QSCs generated by the GPA are capable of producing entangled quantum states, specifically the squeezed states. High QFI indicates increased sensitivity to parameter changes, making the circuit useful for quantum state estimation and control tasks. Evaluation of the GPA on a QSC that consists of two qubits and a sequence of R_x, R_y, and S gates demonstrates its efficiency in generating optimal QSCs with a QFI of 1. Compared to existing quantum agents, the GPA achieves higher QFI with fewer gates, demonstrating a more efficient and scalable approach to the design of QSCs. This work illustrates the potential computational power of quantum agents for solving quantum physics problems

[AI-78] Solving the Encoding Bottleneck: Of the HHL Algorithm, By the HHL Algorithm

链接: https://arxiv.org/abs/2502.13534
作者: Guang Ping He
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages

点击查看摘要

Abstract:The Harrow-Hassidim-Lloyd (HHL) algorithm offers exponential speedup for solving the quantum linear-system problem. But some caveats for the speedup can be hard to meet. One of the difficulties is the encoding bottleneck, i.e., the efficient preparation of the initial quantum state. To prepare an arbitrary N-dimensional state exactly, existing state-preparation approaches generally require a runtime of O(N), which will ruin the speedup of the HHL algorithm. Here we show that the states can be prepared approximately with a runtime of O(poly(\log N)) by employing a slightly modified version of the HHL algorithm itself. Thus, applying this approach to prepare the initial state of the original HHL algorithm can preserve the exponential speedup advantage. It can also serve as a standalone solution for other applications demanding rapid state preparation.

[AI-79] CondensNet: Enabling stable long-term climate simulations via hybrid deep learning models with adaptive physical constraints

链接: https://arxiv.org/abs/2502.13185
作者: Xin Wang,Juntao Yang,Jeff Adie,Simon See,Kalli Furtado,Chen Chen,Troy Arcomano,Romit Maulik,Gianmarco Mengaldo
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and efficient climate simulations are crucial for understanding Earth’s evolving climate. However, current general circulation models (GCMs) face challenges in capturing unresolved physical processes, such as cloud and convection. A common solution is to adopt cloud resolving models, which provide more accurate results than the standard subgrid parametrisation schemes typically used in GCMs. However, cloud resolving models, also referred to as super parametrizations, remain computationally prohibitive. Hybrid modeling, which integrates deep learning with equation-based GCMs, offers a promising alternative but often struggles with long-term stability and accuracy issues. In this work, we find that water vapor oversaturation during condensation is a key factor compromising the stability of hybrid models. To address this, we introduce CondensNet, a novel neural network architecture that embeds a self-adaptive physical constraint to correct unphysical condensation processes. CondensNet effectively mitigates water vapor oversaturation, enhancing simulation stability while maintaining accuracy and improving computational efficiency compared to super parameterization schemes. We integrate CondensNet into a GCM to form PCNN-GCM (Physics-Constrained Neural Network GCM), a hybrid deep learning framework designed for long-term stable climate simulations in real-world conditions, including ocean and land. PCNN-GCM represents a significant milestone in hybrid climate modeling, as it shows a novel way to incorporate physical constraints adaptively, paving the way for accurate, lightweight, and stable long-term climate simulations.
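
To illustrate the kind of physical constraint involved, here is a hedged sketch that clamps a network's predicted specific humidity at saturation, using the standard Tetens approximation; CondensNet embeds a self-adaptive constraint in the network itself rather than applying a post-hoc clamp like this.

```python
# A hedged sketch of an oversaturation correction: after a neural-network
# update, clamp predicted water vapor at the saturation value so condensation
# cannot leave the air oversaturated. Inputs are illustrative.
import numpy as np

def q_sat(T_kelvin, p_pa):
    """Saturation specific humidity (kg/kg) via the Tetens formula."""
    T_c = T_kelvin - 273.15
    e_s = 610.78 * np.exp(17.27 * T_c / (T_c + 237.3))   # sat. vapor pressure (Pa)
    return 0.622 * e_s / (p_pa - 0.378 * e_s)

qv_pred = np.array([0.025, 0.012, 0.004])     # NN-predicted specific humidity
T = np.array([300.0, 290.0, 260.0])           # temperature (K)
p = np.array([1.0e5, 9.0e4, 5.0e4])           # pressure (Pa)
qv_corrected = np.minimum(qv_pred, q_sat(T, p))  # clamp oversaturated cells
print(np.round(qv_corrected, 4))
```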

[AI-80] Noumenal Labs White Paper: How To Build A Brain

链接: https://arxiv.org/abs/2502.13161
作者: Maxwell J. D. Ramstead,Candice Pattisapu,Jason Fox,Jeff Beck
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This white paper describes some of the design principles for artificial or machine intelligence that guide efforts at Noumenal Labs. These principles are drawn from both nature and from the means by which we come to represent and understand it. The end goal of research and development in this field should be to design machine intelligences that augment our understanding of the world and enhance our ability to act in it, without replacing us. In the first two sections, we examine the core motivation for our approach: resolving the grounding problem. We argue that the solution to the grounding problem rests in the design of models grounded in the world that we inhabit, not mere word models. A machine super intelligence that is capable of significantly enhancing our understanding of the human world must represent the world as we do and be capable of generating new knowledge, building on what we already know. In other words, it must be properly grounded and explicitly designed for rational, empirical inquiry, modeled after the scientific method. A primary implication of this design principle is that agents must be capable of engaging autonomously in causal physics discovery. We discuss the pragmatic implications of this approach, and in particular, the use cases in realistic 3D world modeling and multimodal, multidimensional time series analysis.

机器学习

[LG-0] Where's the Bug? Attention Probing for Scalable Fault Localization

链接: https://arxiv.org/abs/2502.13966
作者: Adam Stein,Arthur Wayne,Aaditya Naik,Mayur Naik,Eric Wong
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Ensuring code correctness remains a challenging problem even as large language models (LLMs) become increasingly capable at code-related tasks. While LLM-based program repair systems can propose bug fixes using only a user’s bug report, their effectiveness is fundamentally limited by their ability to perform fault localization (FL), a challenging problem for both humans and LLMs. Existing FL approaches rely on executable test cases, require training on costly and often noisy line-level annotations, or demand resource-intensive LLMs. In this paper, we present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels, outperforming traditional FL baselines and prompting of large-scale LLMs. We evaluate our approach across a variety of code settings, including real-world Java bugs from the standard Defects4J dataset as well as seven other datasets which span a diverse set of bug types and languages. Averaged across all eight datasets, BAP improves by 34.6% top-1 accuracy compared to the strongest baseline and 93.4% over zero-shot prompting GPT-4o. BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.
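
A hedged sketch of attention probing in the spirit of BAP: pool the attention mass each code line receives from a frozen model into features, then train a tiny logistic probe to rank lines by bugginess. The pooling and probe design here are assumptions, with synthetic data standing in for extracted attention.

```python
# A hedged sketch of an attention probe for fault localization.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for attention extracted from an LLM:
# (num_examples, num_lines) attention mass per code line, plus 0/1 bug labels.
X = rng.random((200, 8))
y = (X[:, 3] > 0.7).astype(float)        # pretend line 3's attention signals bugs

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

print("top-ranked feature (line):", int(np.argmax(np.abs(w))))  # expect 3
```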

[LG-1] Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis

链接: https://arxiv.org/abs/2502.13921
作者: Jiahao Gai,Hao (Mark) Chen,Zhican Wang,Hongyu Zhou,Wanru Zhao,Nicholas Lane,Hongxiang Fan
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Software Engineering (cs.SE)
*备注: Paper accepted by ASP-DAC’25

点击查看摘要

Abstract:Recent advances in code generation have illuminated the potential of employing large language models (LLMs) for general-purpose programming languages such as Python and C++, opening new opportunities for automating software development and enhancing programmer productivity. The potential of LLMs in software programming has sparked significant interest in exploring automated hardware generation and automation. Although preliminary endeavors have been made to adopt LLMs in generating hardware description languages (HDLs), several challenges persist in this direction. First, the volume of available HDL training data is substantially smaller compared to that for software programming languages. Second, the pre-trained LLMs, mainly tailored for software code, tend to produce HDL designs that are more error-prone. Third, the generation of HDL requires a significantly higher number of tokens compared to software programming, leading to inefficiencies in cost and energy consumption. To tackle these challenges, this paper explores leveraging LLMs to generate High-Level Synthesis (HLS)-based hardware design. Although code generation for domain-specific programming languages is not new in the literature, we aim to provide experimental results, insights, benchmarks, and evaluation infrastructure to investigate the suitability of HLS over low-level HDLs for LLM-assisted hardware design generation. To achieve this, we first finetune pre-trained models for HLS-based hardware generation, using a collected dataset with text prompts and corresponding reference HLS designs. An LLM-assisted framework is then proposed to automate end-to-end hardware code generation, which also investigates the impact of chain-of-thought and feedback loops promoting techniques on HLS-design generation. Limited by the timeframe of this research, we plan to evaluate more advanced reasoning models in the future.

[LG-2] Playing Hex and Counter Wargames using Reinforcement Learning and Recurrent Neural Networks

链接: https://arxiv.org/abs/2502.13918
作者: Guilherme Palma,Pedro A. Santos,João Dias
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hex and Counter Wargames are adversarial two-player simulations of real military conflicts requiring complex strategic decision-making. Unlike classical board games, these games feature intricate terrain/unit interactions, unit stacking, large maps of varying sizes, and simultaneous move and combat decisions involving hundreds of units. This paper introduces a novel system designed to address the strategic complexity of Hex and Counter Wargames by integrating cutting-edge advancements in Recurrent Neural Networks with AlphaZero, a reliable modern Reinforcement Learning algorithm. The system utilizes a new Neural Network architecture developed from existing research, incorporating innovative state and action representations tailored to these specific game environments. With minimal training, our solution has shown promising results in typical scenarios, demonstrating the ability to generalize across different terrain and tactical situations. Additionally, we explore the system’s potential to scale to larger map sizes. The developed system is openly accessible, facilitating continued research and exploration within this challenging domain.

[LG-3] Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

链接: https://arxiv.org/abs/2502.13900
作者: Antoine Moulin,Gergely Neu,Luca Viano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving near-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order \tilde{\mathcal{O}}(\sqrt{d^3 (1 - \gamma)^{-7/2} T}), where T is the total number of sample transitions, \gamma \in (0,1) is the discount factor, and d is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.

[LG-4] Geometric Principles for Machine Learning of Dynamical Systems

链接: https://arxiv.org/abs/2502.13895
作者: Zack Xuereb Conti,David J Wagg,Nick Pepper
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mathematical descriptions of dynamical systems are deeply rooted in topological spaces defined by non-Euclidean geometry. This paper proposes leveraging structure-rich geometric spaces for machine learning to achieve structural generalization when modeling physical systems from data, in contrast to embedding physics bias within model-free architectures. We consider model generalization to be a function of symmetry, invariance and uniqueness, defined as a topological mapping from state space dynamics to the parameter space. We illustrate this view through the machine learning of linear time-invariant dynamical systems, whose dynamics reside on the symmetric positive definite manifold.

[LG-5] Highly Dynamic and Flexible Spatio-Temporal Spectrum Management with AI-Driven O-RAN: A Multi-Granularity Marketplace Framework

链接: https://arxiv.org/abs/2502.13891
作者: Mehdi Rasti,Elaheh Ataeebojd,Shiva Kazemi Taskooh,Mehdi Monemi,Siavash Razmi,Matti Latva-aho
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current spectrum-sharing frameworks struggle with adaptability, often being either static or insufficiently dynamic. They primarily emphasize temporal sharing while overlooking spatial and spectral dimensions. We propose an adaptive, AI-driven spectrum-sharing framework within the O-RAN architecture, integrating discriminative and generative AI (GenAI) to forecast spectrum needs across multiple timescales and spatial granularities. A marketplace model, managed by an authorized spectrum broker, enables operators to trade spectrum dynamically, balancing static assignments with real-time trading. GenAI enhances traffic prediction, spectrum estimation, and allocation, optimizing utilization while reducing costs. This modular, flexible approach fosters operator collaboration, maximizing efficiency and revenue. A key research challenge is refining allocation granularity and spatio-temporal dynamics beyond existing models.

[LG-6] Refining embeddings with fill-tuning: data-efficient generalised performance improvements for materials foundation models

链接: https://arxiv.org/abs/2502.13886
作者: Matthew P. Wilson,Edward O. Pyzer-Knapp,Nicolas Galichet,Luke Dicks
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Pretrained foundation models learn embeddings that can be used for a wide range of downstream tasks. These embeddings optimise general performance, and if insufficiently accurate at a specific task the model can be fine-tuned to improve performance. For all current methodologies this operation necessarily degrades performance on all out-of-distribution tasks. In this work we present ‘fill-tuning’, a novel methodology to generate datasets for continued pretraining of foundation models that are not suited to a particular downstream task, but instead aim to correct poor regions of the embedding. We present the application of roughness analysis to latent space topologies and illustrate how it can be used to propose data that will be most valuable to improving the embedding. We apply fill-tuning to a set of state-of-the-art materials foundation models trained on O(10^9) data points and show model improvement of almost 1% in all downstream tasks with the addition of only 100 data points. This method provides a route to the general improvement of foundation models at the computational cost of fine-tuning.

[LG-7] Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets

链接: https://arxiv.org/abs/2502.13833
作者: Milton Nicolás Plasencia Palacios,Sebastiano Saccani,Gabriele Sgroi,Alexander Boudewijn,Luca Bortolussi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Synthetic data has garnered attention as a Privacy Enhancing Technology (PET) in sectors such as healthcare and finance. When using synthetic data in practical applications, it is important to provide protection guarantees. In the literature, two families of approaches are proposed for tabular data: on the one hand, similarity-based methods aim at finding the level of similarity between training and synthetic data. Indeed, a privacy breach can occur if the generated data is consistently too similar or even identical to the train data. On the other hand, attack-based methods conduct deliberate attacks on synthetic datasets. The success rates of these attacks reveal how secure the synthetic datasets are. In this paper, we introduce a contrastive method that improves privacy assessment of synthetic datasets by embedding the data in a more representative space. This overcomes obstacles surrounding the multitude of data types and attributes. It also makes the use of intuitive distance metrics possible for similarity measurements and as an attack vector. In a series of experiments with publicly available datasets, we compare the performances of similarity-based and attack-based methods, both with and without use of the contrastive learning-based embeddings. Our results show that relatively efficient, easy-to-implement privacy metrics can perform equally well as more advanced metrics explicitly modeling conditions for privacy referred to by the GDPR.
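
A minimal sketch of the similarity-based side of such evaluations, distance-to-closest-record in an embedding space, where `embed` is a stand-in for the contrastive encoder the paper proposes:

```python
# A hedged sketch of a similarity-based privacy check: for each synthetic
# record, measure the distance to its closest training record; suspiciously
# small distances suggest memorized or copied records.
import numpy as np

rng = np.random.default_rng(1)

def embed(X):
    return X  # stand-in for a contrastive encoder

train = embed(rng.normal(size=(500, 8)))
synth = embed(rng.normal(size=(500, 8)))

def nn_dist(A, B):
    """Distance from each row of A to its nearest row in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1)

dcr_synth = nn_dist(synth, train)   # synthetic-to-train nearest-neighbor distances
print("median DCR:", round(float(np.median(dcr_synth)), 3))
```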

[LG-8] Bayesian Physics Informed Neural Networks for Linear Inverse problems

链接: https://arxiv.org/abs/2502.13827
作者: Ali Mohammad-Djafari
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 9 pages

点击查看摘要

Abstract:Inverse problems arise almost everywhere in science and engineering where we need to infer a quantity from indirect observation. Medical, biomedical, and industrial imaging systems are typical examples. A high-level classification of inverse problem methods is: i) analytical, ii) regularization, and iii) Bayesian inference methods. Even if there are direct links between them, we can say that the Bayesian inference based methods are the most powerful, as they give the possibility of accounting for prior knowledge and can account for errors and uncertainties in general. One of the main limitations lies in the computational cost, in particular for high-dimensional imaging systems. Neural Networks (NN), and in particular Deep NNs (DNN), have been considered as a way to push this limit further. The Physics Informed Neural Network (PINN) concept integrates physical laws with deep learning techniques to enhance the speed, accuracy and efficiency of the above mentioned problems. In this work, a new Bayesian framework for the concept of PINN (BPINN) is presented and discussed, which includes the deterministic one if we use the Maximum A Posteriori (MAP) estimation framework. We consider the two cases of supervised and unsupervised training, obtain the expressions of the posterior probability of the unknown variables, and deduce the posterior laws of the NN’s parameters. We also discuss the challenges of implementing these methods in real applications.

[LG-9] Mixup Regularization: A Probabilistic Perspective

链接: https://arxiv.org/abs/2502.13825
作者: Yousef El-Laham,Niccolo Dalmasso,Svitlana Vyetrenko,Vamsi Potluru,Manuela Veloso
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, mixup regularization has gained popularity as an effective way to improve the generalization performance of deep learning models by training on convex combinations of training data. While many mixup variants have been explored, the proper adoption of the technique to conditional density estimation and probabilistic machine learning remains relatively unexplored. This work introduces a novel framework for mixup regularization based on probabilistic fusion that is better suited for conditional density estimation tasks. For data distributed according to a member of the exponential family, we show that likelihood functions can be analytically fused using log-linear pooling. We further propose an extension of probabilistic mixup, which allows for fusion of inputs at an arbitrary intermediate layer of the neural network. We provide a theoretical analysis comparing our approach to standard mixup variants. Empirical results on synthetic and real datasets demonstrate the benefits of our proposed framework compared to existing mixup variants.
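
The exponential-family claim admits a small worked example: log-linearly pooling two Gaussian likelihoods yields another Gaussian whose precision is the weighted sum of the individual precisions. Below is a sketch of that closed form (a derivation detail, not the paper's full training pipeline).

```python
# Log-linear pooling of two Gaussian likelihoods N(mu1, s1^2) and
# N(mu2, s2^2) with weight lam: p(y) is proportional to p1(y)^lam * p2(y)^(1-lam),
# which is again Gaussian with pooled precision and precision-weighted mean.
def log_linear_pool_gaussians(mu1, s1, mu2, s2, lam):
    prec = lam / s1**2 + (1 - lam) / s2**2
    mu = (lam * mu1 / s1**2 + (1 - lam) * mu2 / s2**2) / prec
    return mu, (1 / prec) ** 0.5

# With equal variances this reduces to interpolating the means,
# mirroring how standard mixup interpolates targets.
print(log_linear_pool_gaussians(0.0, 1.0, 4.0, 1.0, lam=0.25))  # (3.0, 1.0)
```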

[LG-10] Learning to explore when mistakes are not allowed AAMAS2025

链接: https://arxiv.org/abs/2502.13801
作者: Charly Pecqueux-Guézénec,Stéphane Doncieux,Nicolas Perrin-Gilbert
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 12 pages, 13 figures, Published as an extended abstract at AAMAS 2025

点击查看摘要

Abstract:Goal-Conditioned Reinforcement Learning (GCRL) provides a versatile framework for developing unified controllers capable of handling wide ranges of tasks, exploring environments, and adapting behaviors. However, its reliance on trial-and-error poses challenges for real-world applications, as errors can result in costly and potentially damaging consequences. To address the need for safer learning, we propose a method that enables agents to learn goal-conditioned behaviors that explore without the risk of making harmful mistakes. Exploration without risks can seem paradoxical, but environment dynamics are often uniform in space, therefore a policy trained for safety without exploration purposes can still be exploited globally. Our proposed approach involves two distinct phases. First, during a pretraining phase, we employ safe reinforcement learning and distributional techniques to train a safety policy that actively tries to avoid failures in various situations. In the subsequent safe exploration phase, a goal-conditioned (GC) policy is learned while ensuring safety. To achieve this, we implement an action-selection mechanism leveraging the previously learned distributional safety critics to arbitrate between the safety policy and the GC policy, ensuring safe exploration by switching to the safety policy when needed. We evaluate our method in simulated environments and demonstrate that it not only provides substantial coverage of the goal space but also reduces the occurrence of mistakes to a minimum, in stark contrast to traditional GCRL approaches. Additionally, we conduct an ablation study and analyze failure modes, offering insights for future research directions.
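
The arbitration mechanism described above can be summarized in a few lines; everything below is a stub-level sketch, with the two policies and the risk estimate standing in for the learned safety policy, GC policy, and distributional safety critics.

```python
# A minimal sketch of the safe action-selection rule: the goal-conditioned
# (GC) policy proposes an action, a safety critic estimates its risk, and the
# agent falls back to the safety policy when the risk crosses a threshold.
import random

def gc_policy(state, goal):        # stub: exploratory goal-seeking action
    return random.choice(["left", "right", "forward"])

def safety_policy(state):          # stub: conservative failure-avoiding action
    return "stop"

def risk(state, action):           # stub for the distributional safety critic
    return 0.9 if action == "forward" else 0.1

def act(state, goal, risk_threshold=0.5):
    a = gc_policy(state, goal)
    return a if risk(state, a) < risk_threshold else safety_policy(state)

random.seed(0)
print([act("s0", "g") for _ in range(5)])
```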

[LG-11] Herglotz-NET: Implicit Neural Representation of Spherical Data with Harmonic Positional Encoding

Link: https://arxiv.org/abs/2502.13777
Authors: Théo Hanon, Nicolas Mil-Homens Cavaco, John Kiely, Laurent Jacques
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: Keywords: Herglotz, spherical harmonics, spectral analysis, implicit neural representation. Remarks: 4 pages + 1 reference page, 4 figures (submitted to SAMPTA2025)

Abstract:Representing and processing data in spherical domains presents unique challenges, primarily due to the curvature of the domain, which complicates the application of classical Euclidean techniques. Implicit neural representations (INRs) have emerged as a promising alternative for high-fidelity data representation; however, to effectively handle spherical domains, these methods must be adapted to the inherent geometry of the sphere to maintain both accuracy and stability. In this context, we propose Herglotz-NET (HNET), a novel INR architecture that employs a harmonic positional encoding based on complex Herglotz mappings. This encoding yields a well-posed representation on the sphere with interpretable and robust spectral properties. Moreover, we present a unified expressivity analysis showing that any spherical-based INR satisfying a mild condition exhibits a predictable spectral expansion that scales with network depth. Our results establish HNET as a scalable and flexible framework for accurate modeling of spherical data.
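
The following minimal sketch shows one way a Herglotz-style harmonic encoding can be built, assuming the standard Herglotz condition w·w = 0 on complex frequency vectors; the paper's actual parameterization and any trainable components are not reproduced here.

```python
import numpy as np

def make_herglotz_vector(rng, scale=1.0):
    """Sample w = a + i*b in C^3 with w.w = 0 (|a| = |b|, a orthogonal to b),
    so x -> exp(w . x) is harmonic on R^3 and well-behaved on the sphere."""
    a = rng.normal(size=3)
    b = rng.normal(size=3)
    b -= (b @ a) / (a @ a) * a                    # make b orthogonal to a
    b *= np.linalg.norm(a) / np.linalg.norm(b)    # match norms
    return scale * (a + 1j * b)

def herglotz_encoding(x, ws):
    """Positional encoding for points x (n, 3) on the unit sphere: real and
    imaginary parts of exp(w . x) for each Herglotz vector w."""
    z = x @ np.stack(ws).T                        # complex inner products
    e = np.exp(z)
    return np.concatenate([e.real, e.imag], axis=1)

rng = np.random.default_rng(0)
ws = [make_herglotz_vector(rng, scale=s) for s in (1.0, 2.0, 4.0)]
x = rng.normal(size=(5, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)     # project onto S^2
print(herglotz_encoding(x, ws).shape)             # (5, 6)
```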

[LG-12] Reverse Markov Learning: Multi-Step Generative Models for Complex Distributions

Link: https://arxiv.org/abs/2502.13747
Authors: Xinwei Shen, Nicolai Meinshausen, Tong Zhang
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments:

Abstract:Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. While effective, engression struggles with highly complex distributions, such as those encountered in image data. In this work, we extend engression to improve its capability in learning complex distributions. We propose a framework that defines a general forward process transitioning from the target distribution to a known distribution (e.g., Gaussian) and then learns a reverse Markov process using multiple engression models. This reverse process reconstructs the target distribution step by step. Our approach supports general forward processes, allows for dimension reduction, and naturally discretizes the generative process. As a special case, when using a diffusion-based forward process, our framework offers a method to discretize the training and inference of diffusion models efficiently. Empirical evaluations on simulated and climate data validate our theoretical insights, demonstrating the effectiveness of our approach in capturing complex distributions.

[LG-13] Homophily Heterogeneity Matters in Graph Federated Learning: A Spectrum Sharing and Complementing Perspective

Link: https://arxiv.org/abs/2502.13732
Authors: Wentao Yu
Subjects: Machine Learning (cs.LG)
*Comments: 15 pages

Abstract:Since heterogeneity presents a fundamental challenge in graph federated learning, many existing methods are proposed to deal with node feature heterogeneity and structure heterogeneity. However, they overlook the critical homophily heterogeneity, which refers to the substantial variation in homophily levels across graph data from different clients. The homophily level represents the proportion of edges connecting nodes that belong to the same class. Because they adapt to their local homophily, local models capture inconsistent spectral properties across different clients, significantly reducing the effectiveness of collaboration. Specifically, local models trained on graphs with high homophily tend to capture low-frequency information, whereas local models trained on graphs with low homophily tend to capture high-frequency information. To effectively deal with homophily heterogeneity, we introduce the spectral Graph Neural Network (GNN) and propose a novel Federated learning method by mining Graph Spectral Properties (FedGSP). On one hand, our proposed FedGSP enables clients to share generic spectral properties (i.e., low-frequency information), allowing all clients to benefit through collaboration. On the other hand, inspired by our theoretical findings, our proposed FedGSP allows clients to complement non-generic spectral properties by acquiring the spectral properties they lack (i.e., high-frequency information), thereby obtaining additional information gain. Extensive experiments conducted on six homophilic and five heterophilic graph datasets, across both non-overlapping and overlapping settings, validate the superiority of our method over eleven state-of-the-art methods. Notably, our FedGSP outperforms the second-best method by an average margin of 3.28% on all heterophilic datasets.
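
For reference, the homophily level the abstract defines is straightforward to compute; the sketch below (with toy, hypothetical client graphs) shows how two clients can sit at opposite ends of the homophily spectrum.

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose endpoints share a class label, i.e. the
    homophily level that varies across clients. edge_index: (2, E) array."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

labels = np.array([0, 0, 1, 1])
homophilic = np.array([[0, 2], [1, 3]])      # edges 0-1 and 2-3: same class
heterophilic = np.array([[0, 1], [2, 3]])    # edges 0-2 and 1-3: cross class
print(edge_homophily(homophilic, labels))    # 1.0 -> low-frequency signal
print(edge_homophily(heterophilic, labels))  # 0.0 -> high-frequency signal
```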

[LG-14] Emergence of the Primacy Effect in Structured State-Space Models

Link: https://arxiv.org/abs/2502.13729
Authors: Takashi Morita
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*Comments:

Abstract:Human and animal memory for sequentially presented items is well-documented to be more accurate for those at the beginning and end of a sequence, phenomena known as the primacy and recency effects, respectively. By contrast, artificial neural network (ANN) models are typically designed with a memory that decays monotonically over time. Accordingly, ANNs are expected to show the recency effect but not the primacy effect. Contrary to this theoretical expectation, however, the present study reveals a counterintuitive finding: a recently developed ANN architecture, called structured state-space models, exhibits the primacy effect when trained and evaluated on a synthetic task that mirrors psychological memory experiments. Given that this model was originally designed for recovering neuronal activity patterns observed in biological brains, this result provides a novel perspective on the psychological primacy effect while also posing a non-trivial puzzle for the current theories in machine learning.

[LG-15] Tight Generalization Bounds for Large-Margin Halfspaces

Link: https://arxiv.org/abs/2502.13692
Authors: Kasper Green Larsen, Natascha Schalburg
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:

Abstract:We prove the first generalization bound for large-margin halfspaces that is asymptotically tight in the tradeoff between the margin, the fraction of training points with the given margin, the failure probability and the number of training points.

[LG-16] Generalization error bound for denoising score matching under relaxed manifold assumption

Link: https://arxiv.org/abs/2502.13662
Authors: Konstantin Yakovlev, Nikita Puchkin
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments: 59 pages

Abstract:We examine theoretical properties of the denoising score matching estimate. We model the density of observations with a nonparametric Gaussian mixture. We significantly relax the standard manifold assumption, allowing the samples to step away from the manifold. At the same time, we are still able to leverage a nice distribution structure. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate. The rates of convergence are determined by the intrinsic dimension. Furthermore, our bounds remain valid even if we allow the ambient dimension to grow polynomially with the sample size.

[LG-17] Towards Invariance to Node Identifiers in Graph Neural Networks

Link: https://arxiv.org/abs/2502.13660
Authors: Maya Bechler-Speicher, Moshe Eliasof, Carola-Bibiane Schonlieb, Ran Gilad-Bachrach, Amir Globerson
Subjects: Machine Learning (cs.LG)
*Comments: arXiv admin note: text overlap with arXiv:2411.02271

Abstract:Message-Passing Graph Neural Networks (GNNs) are known to have limited expressive power, due to their message passing structure. One mechanism for circumventing this limitation is to add unique node identifiers (IDs), which break the symmetries that underlie the expressivity limitation. In this work, we highlight a key limitation of the ID framework, and propose an approach for addressing it. We begin by observing that the final output of the GNN should clearly not depend on the specific IDs used. We then show that in practice this does not hold, and thus the learned network does not possess this desired structural property. Such invariance to node IDs may be enforced in several ways, and we discuss their theoretical properties. We then propose a novel regularization method that effectively enforces ID invariance to the network. Extensive evaluations on both real-world and synthetic tasks demonstrate that our approach significantly improves ID invariance and, in turn, often boosts generalization performance.
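
The paper's regularizer is not spelled out in the abstract, but a minimal sketch of enforcing ID invariance could look like the following: run the network under independently resampled random IDs and penalize disagreement. The model signature and the penalty form here are assumptions for illustration, not the paper's method.

```python
import torch

def id_invariance_penalty(model, x, edge_index, id_dim=8, n_draws=2):
    """Hypothetical regularizer in the spirit of the paper: evaluate the
    GNN under independently resampled random node IDs and penalize how
    much the output depends on the particular draw."""
    outs = []
    for _ in range(n_draws):
        ids = torch.randn(x.size(0), id_dim)          # fresh random IDs
        outs.append(model(torch.cat([x, ids], dim=1), edge_index))
    stacked = torch.stack(outs)
    # Mean squared deviation across draws; zero iff the output ignores IDs.
    return ((stacked - stacked.mean(dim=0)) ** 2).mean()

# total_loss = task_loss + lam * id_invariance_penalty(model, x, edge_index)
```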

[LG-18] A Query-Driven Approach to Space-Efficient Range Searching

Link: https://arxiv.org/abs/2502.13653
Authors: Dimitris Fotakis, Andreas Kalavas, Ioannis Psarros
Subjects: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*Comments: 16 pages, 2 figures

Abstract:We initiate a study of a query-driven approach to designing partition trees for range-searching problems. Our model assumes that a data structure is to be built for an unknown query distribution that we can access through a sampling oracle, and must be selected such that it optimizes a meaningful performance parameter in expectation. Our first contribution is to show that a near-linear sample of queries allows the construction of a partition tree with a near-optimal expected number of nodes visited during querying. We enhance this approach by treating node processing as a classification problem, leveraging fast classifiers like shallow neural networks to obtain experimentally efficient query times. Our second contribution is to develop partition trees using sparse geometric separators. Our preprocessing algorithm, based on a sample of queries, builds a balanced tree with nodes associated with separators that minimize query stabs in expectation; this yields both fast processing of each node and a small number of visited nodes, significantly reducing query time.

[LG-19] Multi-Target Radar Search and Track Using Sequence-Capable Deep Reinforcement Learning

Link: https://arxiv.org/abs/2502.13584
Authors: Jan-Hendrik Ewers, David Cormack, Joe Gibbs, David Anderson
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: Accepted for RLDM 2025, submitted to IEEE SSP 2025

Abstract:The research addresses sensor task management for radar systems, focusing on efficiently searching and tracking multiple targets using reinforcement learning. The approach develops a 3D simulation environment with an active electronically scanned array radar, using a multi-target tracking algorithm to improve observation data quality. Three neural network architectures were compared, including one using gated recurrent units with multi-headed self-attention. Two pre-training techniques were applied: behavior cloning to approximate a random search strategy and an auto-encoder to pre-train the feature extractor. Experimental results revealed that search performance was relatively consistent across most methods. The real challenge emerged in simultaneously searching and tracking targets. The multi-headed self-attention architecture demonstrated the most promising results, highlighting the potential of sequence-capable architectures in handling dynamic tracking scenarios. The key contribution lies in demonstrating how reinforcement learning can optimize sensor management, potentially improving radar systems' ability to identify and track multiple targets in complex environments.

[LG-20] ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Link: https://arxiv.org/abs/2502.13581
Authors: Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H. Chi, Julian McAuley, Derek Zhiyuan Cheng
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Abstract:Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features, which serve as the initial tokens. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Experiments on public datasets demonstrate that ActionPiece consistently outperforms existing action tokenization methods, improving NDCG@10 by 6.00% to 12.82%.
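
A hedged sketch of the vocabulary-construction statistics: count feature co-occurrence within each action's feature set and across adjacent sets, then merge the most frequent pair into a new token, BPE-style. The corpus and feature names below are made up, and the real algorithm's full merge loop and set permutation regularization are omitted.

```python
from collections import Counter
from itertools import combinations

def count_pairs(sequences):
    """Count co-occurring token pairs inside each action's feature set and
    across adjacent sets, a simplified stand-in for ActionPiece's statistics."""
    counts = Counter()
    for seq in sequences:
        for feats in seq:                                # within a set
            counts.update(frozenset(p) for p in combinations(sorted(feats), 2))
        for a, b in zip(seq, seq[1:]):                   # across adjacent sets
            counts.update(frozenset((x, y)) for x in a for y in b if x != y)
    return counts

# Each action = an unordered set of (hypothetical) item features.
corpus = [[{"cat:shoes", "brand:A"}, {"cat:shoes", "brand:B"}],
          [{"cat:shoes", "brand:A"}, {"cat:socks", "brand:A"}]]
pair, freq = count_pairs(corpus).most_common(1)[0]
print(f"merge {sorted(pair)} (count={freq}) into a new token")
```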

[LG-21] Unraveling the Localized Latents: Learning Stratified Manifold Structures in LLM Embedding Space with Sparse Mixture-of-Experts

Link: https://arxiv.org/abs/2502.13577
Authors: Xin Li, Anand Sarwate
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Real-world data often exhibit complex local structures that can be challenging for single-model approaches with a smooth global manifold in the embedding space to unravel. In this work, we conjecture that in the latent space of large language models, the embeddings live on local manifold structures with different dimensions depending on the perplexities and domains of the input data, commonly referred to as a Stratified Manifold structure, which in combination forms a structured space known as a Stratified Space. To investigate the validity of this structural claim, we propose an analysis framework based on a Mixture-of-Experts (MoE) model where each expert is implemented with a simple dictionary learning algorithm at varying sparsity levels. By incorporating an attention-based soft-gating network, we verify that our model learns specialized sub-manifolds for an ensemble of input data sources, reflecting the semantic stratification in LLM embedding space. We further analyze the intrinsic dimensions of these stratified sub-manifolds and present extensive statistics on expert assignments, gating entropy, and inter-expert distances. Our experimental results demonstrate that our method not only validates the claim of a stratified manifold structure in the LLM embedding space, but also provides interpretable clusters that align with the intrinsic semantic variations of the input data.

[LG-22] ETS: Efficient Tree Search for Inference-Time Scaling

Link: https://arxiv.org/abs/2502.13575
Authors: Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
Subjects: Machine Learning (cs.LG)
*Comments: 11 pages

Abstract:Test-time compute scaling has emerged as a new axis along which to improve model accuracy, where additional computation is used at inference time to allow the model to think longer for more challenging problems. One promising approach for test-time compute scaling is search against a process reward model, where a model generates multiple potential candidates at each step of the search, and these partial trajectories are then scored by a separate reward model in order to guide the search process. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. However, this diversity comes at a cost, as divergent trajectories have less KV sharing, which means they consume more memory and slow down the search process. Previous search methods either do not perform sufficient exploration, or else explore diverse trajectories but have high latency. We address this challenge by proposing Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories. ETS incorporates a linear programming cost model to promote KV cache sharing by penalizing the number of nodes retained, while incorporating a semantic coverage term into the cost model to ensure that we retain trajectories which are semantically different. We demonstrate how ETS can achieve a 1.8× reduction in average KV cache size during the search process, leading to 1.4× increased throughput relative to prior state-of-the-art methods, with minimal accuracy degradation and without requiring any custom kernel implementation. Code is available at: this https URL.

[LG-23] Noise May Contain Transferable Knowledge: Understanding Semi-supervised Heterogeneous Domain Adaptation from an Empirical Perspective

Link: https://arxiv.org/abs/2502.13573
Authors: Yuan Yao, Xiaopu Zhang, Yu Zhang, Jian Jin, Qiang Yang
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Semi-supervised heterogeneous domain adaptation (SHDA) addresses learning across domains with distinct feature representations and distributions, where source samples are labeled while most target samples are unlabeled, with only a small fraction labeled. Moreover, there is no one-to-one correspondence between source and target samples. Although various SHDA methods have been developed to tackle this problem, the nature of the knowledge transferred across heterogeneous domains remains unclear. This paper delves into this question from an empirical perspective. We conduct extensive experiments on about 330 SHDA tasks, employing two supervised learning methods and seven representative SHDA methods. Surprisingly, our observations indicate that both the category and feature information of source samples do not significantly impact the performance of the target domain. Additionally, noise drawn from simple distributions, when used as source samples, may contain transferable knowledge. Based on this insight, we perform a series of experiments to uncover the underlying principles of transferable knowledge in SHDA. Specifically, we design a unified Knowledge Transfer Framework (KTF) for SHDA. Based on the KTF, we find that the transferable knowledge in SHDA primarily stems from the transferability and discriminability of the source domain. Consequently, ensuring those properties in source samples, regardless of their origin (e.g., image, text, noise), can enhance the effectiveness of knowledge transfer in SHDA tasks. The codes and datasets are available at this https URL.

[LG-24] Diffusion Model Agnostic Social Influence Maximization in Hyperbolic Space

Link: https://arxiv.org/abs/2502.13571
Authors: Hongliang Qiao
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*Comments: 10 pages, 4 figures

Abstract:The Influence Maximization (IM) problem aims to find a small set of influential users to maximize their influence spread in a social network. Traditional methods rely on fixed diffusion models with known parameters, limiting their generalization to real-world scenarios. In contrast, graph representation learning-based methods have gained wide attention for overcoming this limitation by learning user representations to capture influence characteristics. However, existing studies are built on Euclidean space, which fails to effectively capture the latent hierarchical features of social influence distribution. As a result, users’ influence spread cannot be effectively measured through the learned representations. To alleviate these limitations, we propose HIM, a novel diffusion model agnostic method that leverages hyperbolic representation learning to estimate users’ potential influence spread from social propagation data. HIM consists of two key components. First, a hyperbolic influence representation module encodes influence spread patterns from network structure and historical influence activations into expressive hyperbolic user representations. Hence, the influence magnitude of users can be reflected through the geometric properties of hyperbolic space, where highly influential users tend to cluster near the space origin. Second, a novel adaptive seed selection module is developed to flexibly and effectively select seed users using the positional information of learned user representations. Extensive experiments on five network datasets demonstrate the superior effectiveness and efficiency of our method for the IM problem with unknown diffusion model parameters, highlighting its potential for large-scale real-world social networks.
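
The geometric intuition, that highly influential users cluster near the origin, suggests the simple proxy sketched below: rank users by their Poincaré-ball distance to the origin. This only illustrates the stated geometry; the paper's adaptive seed selection module is more involved, and the embeddings here are hypothetical.

```python
import numpy as np

def origin_distance_poincare(z):
    """Distance from the origin in the Poincaré ball (curvature -1):
    d(0, z) = 2 * artanh(||z||). Smaller distance suggests larger
    potential influence spread under HIM's geometry."""
    norms = np.linalg.norm(z, axis=1)
    return 2.0 * np.arctanh(np.clip(norms, 0.0, 1.0 - 1e-7))

# Hypothetical learned embeddings for five users in the 2D Poincaré ball.
z = np.array([[0.05, 0.02], [0.60, 0.10], [0.30, 0.30], [0.90, 0.01], [0.10, 0.70]])
seeds = np.argsort(origin_distance_poincare(z))[:2]   # pick 2 seed users
print("seed users:", seeds)
```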

[LG-25] AS-GCL: Asymmetric Spectral Augmentation on Graph Contrastive Learning

Link: https://arxiv.org/abs/2502.13525
Authors: Ruyue Liu, Rong Yin, Yong Liu, Xiaoshuai Hao, Haichao Shi, Can Ma, Weiping Wang
Subjects: Machine Learning (cs.LG)
*Comments: Accepted by TMM

Abstract:Graph Contrastive Learning (GCL) has emerged as the foremost approach for self-supervised learning on graph-structured data. GCL reduces reliance on labeled data by learning robust representations from various augmented views. However, existing GCL methods typically depend on consistent stochastic augmentations, which overlook their impact on the intrinsic structure of the spectral domain, thereby limiting the model’s ability to generalize effectively. To address these limitations, we propose a novel paradigm called AS-GCL that incorporates asymmetric spectral augmentation for graph contrastive learning. A typical GCL framework consists of three key components: graph data augmentation, view encoding, and contrastive loss. Our method introduces significant enhancements to each of these components. Specifically, for data augmentation, we apply spectral-based augmentation to minimize spectral variations, strengthen structural invariance, and reduce noise. With respect to encoding, we employ parameter-sharing encoders with distinct diffusion operators to generate diverse, noise-resistant graph views. For contrastive loss, we introduce an upper-bound loss function that promotes generalization by maintaining a balanced distribution of intra- and inter-class distance. To our knowledge, we are the first to encode augmentation views of the spectral domain using asymmetric encoders. Extensive experiments on eight benchmark datasets across various node-level tasks demonstrate the advantages of the proposed method.

[LG-26] Enhancing Machine Learning Potentials through Transfer Learning across Chemical Elements

Link: https://arxiv.org/abs/2502.13522
Authors: Sebastien Röcken, Julija Zavadlav
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*Comments:

Abstract:Machine Learning Potentials (MLPs) can enable simulations of ab initio accuracy at orders of magnitude lower computational cost. However, their effectiveness hinges on the availability of considerable datasets to ensure robust generalization across chemical space and thermodynamic conditions. The generation of such datasets can be labor-intensive, highlighting the need for innovative methods to train MLPs in data-scarce scenarios. Here, we introduce transfer learning of potential energy surfaces between chemically similar elements. Specifically, we leverage the trained MLP for silicon to initialize and expedite the training of an MLP for germanium. Utilizing classical force field and ab initio datasets, we demonstrate that transfer learning surpasses traditional training from scratch in force prediction, leading to more stable simulations and improved temperature transferability. These advantages become even more pronounced as the training dataset size decreases. The out-of-target property analysis shows that transfer learning leads to beneficial but sometimes adverse effects. Our findings demonstrate that transfer learning across chemical elements is a promising technique for developing accurate and numerically stable MLPs, particularly in a data-scarce regime.

[LG-27] Kernel Mean Embedding Topology: Weak and Strong Forms for Stochastic Kernels and Implications for Model Learning

Link: https://arxiv.org/abs/2502.13486
Authors: Naci Saldi, Serdar Yuksel
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*Comments: 35 pages

Abstract:We introduce a novel topology, called Kernel Mean Embedding Topology, for stochastic kernels, in a weak and strong form. This topology, defined on the spaces of Bochner integrable functions from a signal space to a space of probability measures endowed with a Hilbert space structure, allows for a versatile formulation. This construction allows one to obtain both a strong and weak formulation. (i) For its weak formulation, we highlight the utility on relaxed policy spaces, and investigate connections with the Young narrow topology and the Borkar (or w^*-) topology, and establish equivalence properties. We report that, while both the w^*-topology and the kernel mean embedding topology are relatively compact, they are not closed. Conversely, while the Young narrow topology is closed, it lacks relative compactness. (ii) We show that the strong form provides an appropriate formulation for placing topologies on spaces of models characterized by stochastic kernels with explicit robustness and learning theoretic implications on optimal stochastic control under discounted or average cost criteria. (iii) We show that this topology possesses several properties making it ideal to study optimality, approximations, robustness and continuity properties. In particular, the kernel mean embedding topology has a Hilbert space structure, which is particularly useful for approximating stochastic kernels through simulation data.

[LG-28] Smoothed Normalization for Efficient Distributed Private Optimization

Link: https://arxiv.org/abs/2502.13482
Authors: Egor Shulgin, Sarit Khirirat, Peter Richtárik
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments: 36 pages

Abstract:Federated learning enables training machine learning models while preserving the privacy of participants. Surprisingly, there is no differentially private distributed method for smooth, non-convex optimization problems. The reason is that standard privacy techniques require bounding the participants' contributions, usually enforced via clipping of the updates. Existing literature typically ignores the effect of clipping by assuming the boundedness of gradient norms or analyzes distributed algorithms with clipping but ignores DP constraints. In this work, we study an alternative approach via smoothed normalization of the updates motivated by its favorable performance in the single-node setting. By integrating smoothed normalization with an error-feedback mechanism, we design a new distributed algorithm α-NormEC. We prove that our method achieves a superior convergence rate over prior works. By extending α-NormEC to the DP setting, we obtain the first differentially private distributed optimization algorithm with provable convergence guarantees. Finally, our empirical results from neural network training indicate robust convergence of α-NormEC across different parameter settings.
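
A minimal sketch contrasting hard clipping with one plausible reading of smoothed normalization, g / (β + ||g||), which always yields a bounded-norm update; the exact operator, the error-feedback mechanism, and the DP noise addition in α-NormEC are not reproduced here.

```python
import torch

def clip_update(g, c):
    """Standard DP-style clipping: rescale only when the norm exceeds c."""
    return g * torch.clamp(c / (g.norm() + 1e-12), max=1.0)

def smoothed_normalize(g, beta):
    """Assumed form of smoothed normalization: g / (beta + ||g||).
    The output norm is always below 1, giving the bounded sensitivity
    that DP needs without a hard clipping threshold."""
    return g / (beta + g.norm())

g = torch.tensor([3.0, 4.0])              # ||g|| = 5
print(clip_update(g, c=1.0))              # tensor([0.6000, 0.8000])
print(smoothed_normalize(g, beta=1.0))    # tensor([0.5000, 0.6667])
```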

[LG-29] Continuous K-Max Bandits

Link: https://arxiv.org/abs/2502.13467
Authors: Yu Chen, Siwei Wang, Longbo Huang, Wei Chen
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:We study the K-Max combinatorial multi-armed bandits problem with continuous outcome distributions and weak value-index feedback: each base arm has an unknown continuous outcome distribution, and in each round the learning agent selects K arms, obtains the maximum value sampled from these K arms as reward and observes this reward together with the corresponding arm index as feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. The continuous K-Max bandits introduce unique challenges, including discretization error from continuous-to-discrete conversion, non-deterministic tie-breaking under limited feedback, and biased estimation due to partial observability. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds to tackle these challenges. For general continuous distributions, we prove that DCK-UCB achieves a \widetilde{\mathcal{O}}(T^{3/4}) regret upper bound, establishing the first sublinear regret guarantee for this setting. Furthermore, we identify an important special case with exponential distributions under full-bandit feedback. In this case, our proposed algorithm MLE-Exp enables a \widetilde{\mathcal{O}}(\sqrt{T}) regret upper bound through maximal log-likelihood estimation, achieving near-minimax optimality.

[LG-30] Poisoned Source Code Detection in Code Models

Link: https://arxiv.org/abs/2502.13459
Authors: Ehab Ghannoum, Mohammad Ghafari
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: Accepted for Publication in the Journal of Systems and Software (JSS)

Abstract:Deep learning models have gained popularity for conducting various tasks involving source code. However, their black-box nature raises concerns about potential risks. One such risk is a poisoning attack, where an attacker intentionally contaminates the training set with malicious samples to mislead the model’s predictions in specific scenarios. To protect source code models from poisoning attacks, we introduce CodeGarrison (CG), a hybrid deep-learning model that relies on code embeddings to identify poisoned code samples. We evaluated CG against the state-of-the-art technique ONION for detecting poisoned samples generated by DAMP, MHM, ALERT, as well as a novel poisoning technique named CodeFooler. Results showed that CG significantly outperformed ONION with an accuracy of 93.5%. We also tested CG’s robustness against unknown attacks and achieved an average accuracy of 85.6% in identifying poisoned samples across the four attacks mentioned above.

[LG-31] Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization

Link: https://arxiv.org/abs/2502.13457
Authors: Linfeng Cao, Ming Shi, Ness B. Shroff
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Multi-objective multi-armed bandit (MO-MAB) problems traditionally aim to achieve Pareto optimality. However, real-world scenarios often involve users with varying preferences across objectives, resulting in a Pareto-optimal arm that may score high for one user but perform quite poorly for another. This highlights the need for customized learning, a factor often overlooked in prior research. To address this, we study a preference-aware MO-MAB framework in the presence of explicit user preference. It shifts the focus from achieving Pareto optimality to further optimizing within the Pareto front under preference-centric customization. To our knowledge, this is the first theoretical study of customized MO-MAB optimization with explicit user preferences. Motivated by practical applications, we explore two scenarios: unknown preference and hidden preference, each presenting unique challenges for algorithm design and analysis. At the core of our algorithms are preference estimation and preference-aware optimization mechanisms to adapt to user preferences effectively. We further develop novel analytical techniques to establish near-optimal regret of the proposed algorithms. Strong empirical performance confirms the effectiveness of our approach.

[LG-32] Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model

Link: https://arxiv.org/abs/2502.13449
Authors: Dongki Kim, Wonbin Lee, Sung Ju Hwang
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*Comments:

Abstract:Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in interpreting molecular structures, their instruction datasets are limited to the specific knowledge from task-oriented datasets and do not fully cover the fundamental characteristics of molecules, hindering their abilities as general-purpose molecular assistants. To address this issue, we propose Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules via multi-modal instruction tuning. To this end, we design key data types that encompass the fundamental features of molecules, incorporating essential knowledge from molecular structures. In addition, to improve understanding of molecular features, we introduce a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of different molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and generating relevant responses to users’ queries with detailed explanations, implying its potential as a general-purpose assistant for molecular analysis.

[LG-33] Object-Pose Estimation With Neural Population Codes

Link: https://arxiv.org/abs/2502.13403
Authors: Heiko Hoffmann, Richard Hoffmann
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Abstract:Robotic assembly tasks require object-pose estimation, particularly for tasks that avoid costly mechanical constraints. Object symmetry complicates the direct mapping of sensory input to object rotation, as the rotation becomes ambiguous and lacks a unique training target. Some proposed solutions involve evaluating multiple pose hypotheses against the input or predicting a probability distribution, but these approaches suffer from significant computational overhead. Here, we show that representing object rotation with a neural population code overcomes these limitations, enabling a direct mapping to rotation and end-to-end learning. As a result, population codes facilitate fast and accurate pose estimation. On the T-LESS dataset, we achieve inference in 3.2 milliseconds on an Apple M1 CPU and a Maximum Symmetry-Aware Surface Distance accuracy of 84.7% using only gray-scale image input, compared to 69.7% accuracy when directly mapping to pose.
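
A toy sketch of the underlying idea: represent an angle as a population of tuning-curve activations and decode it with a population-vector readout. The von Mises tuning and the decoder below are standard textbook choices, not necessarily those of the paper.

```python
import numpy as np

def encode_angle(theta, prefs, kappa=8.0):
    """Population code: a von-Mises-shaped activation bump per neuron,
    centered on that neuron's preferred angle."""
    return np.exp(kappa * (np.cos(theta - prefs) - 1.0))

def decode_angle(activity, prefs):
    """Population-vector readout: activity-weighted circular mean."""
    return np.angle(np.sum(activity * np.exp(1j * prefs)))

prefs = np.linspace(-np.pi, np.pi, 64, endpoint=False)
theta = 0.73
activity = encode_angle(theta, prefs)
print(decode_angle(activity, prefs))   # approximately 0.73
```

For a symmetric object, equal bumps can in principle be placed at each equivalent rotation, which avoids the ambiguous single training target the abstract describes.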

[LG-34] Unsupervised CP-UNet Framework for Denoising DAS Data with Decay Noise

Link: https://arxiv.org/abs/2502.13395
Authors: Tianye Huang, Aopeng Li, Xiang Li, Jing Zhang, Sijing Xian, Qi Zhang, Mingkong Lu, Guodong Chen, Liangming Xiong, Xiangyun Hu
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP); Optics (physics.optics)
*Comments: 13 pages, 8 figures

Abstract:Distributed acoustic sensor (DAS) technology leverages optical fiber cables to detect acoustic signals, providing cost-effective and dense monitoring capabilities. It offers several advantages including resistance to extreme conditions, immunity to electromagnetic interference, and accurate detection. However, DAS typically exhibits a lower signal-to-noise ratio (S/N) compared to geophones and is susceptible to various noise types, such as random noise, erratic noise, level noise, and long-period noise. This reduced S/N can negatively impact data analyses such as inversion and interpretation. While artificial intelligence has demonstrated excellent denoising capabilities, most existing methods rely on supervised learning with labeled data, which imposes stringent requirements on the quality of the labels. To address this issue, we develop a label-free unsupervised learning (UL) network model based on Context-Pyramid-UNet (CP-UNet) to suppress erratic and random noises in DAS data. The CP-UNet utilizes the Context Pyramid Module in the encoding and decoding process to extract features and reconstruct the DAS data. To enhance the connectivity between shallow and deep features, we add a Connected Module (CM) to both the encoding and decoding sections. Layer Normalization (LN) is utilized to replace the commonly employed Batch Normalization (BN), accelerating the convergence of the model and preventing gradient explosion during training. Huber loss is adopted as our loss function, whose parameters are experimentally determined. We apply the network to both 2-D synthetic and field data. Compared to traditional denoising methods and the latest UL framework, our proposed method demonstrates superior noise reduction performance.

[LG-35] Flow-based generative models as iterative algorithms in probability space

Link: https://arxiv.org/abs/2502.13394
Authors: Yao Xie, Xiuyuan Cheng
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:

Abstract:Generative AI (GenAI) has revolutionized data-driven modeling by enabling the synthesis of high-dimensional data across various applications, including image generation, language modeling, biomedical signal processing, and anomaly detection. Flow-based generative models provide a powerful framework for capturing complex probability distributions, offering exact likelihood estimation, efficient sampling, and deterministic transformations between distributions. These models leverage invertible mappings governed by Ordinary Differential Equations (ODEs), enabling precise density estimation and likelihood evaluation. This tutorial presents an intuitive mathematical framework for flow-based generative models, formulating them as neural network-based representations of continuous probability densities. We explore key theoretical principles, including the Wasserstein metric, gradient flows, and density evolution governed by ODEs, to establish convergence guarantees and bridge empirical advancements with theoretical insights. By providing a rigorous yet accessible treatment, we aim to equip researchers and practitioners with the necessary tools to effectively apply flow-based generative models in signal processing and machine learning.
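
As a worked illustration of the ODE view, the sketch below integrates dx/dt = v(x, t) with Euler steps to transport Gaussian noise toward a target; the closed-form velocity field stands in for a trained network and is an assumption for the toy example.

```python
import numpy as np

def sample_flow(velocity, n, dim, steps=100, seed=0):
    """Euler integration of the flow ODE dx/dt = v(x, t): push Gaussian
    noise at t=0 toward the data distribution at t=1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, dim))
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Toy "trained" field transporting N(0, I) toward a point mass at (3, 3).
target = np.array([3.0, 3.0])
v = lambda x, t: (target - x) / max(1.0 - t, 1e-3)
print(sample_flow(v, n=4, dim=2).mean(axis=0))   # close to [3. 3.]
```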

[LG-36] Quantum Recurrent Neural Networks with Encoder-Decoder for Time-Dependent Partial Differential Equations

Link: https://arxiv.org/abs/2502.13370
Authors: Yuan Chen, Abdul Khaliq, Khaled M. Furati
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantum Physics (quant-ph)
*Comments:

Abstract:Nonlinear time-dependent partial differential equations are essential in modeling complex phenomena across diverse fields, yet they pose significant challenges due to their computational complexity, especially in higher dimensions. This study explores Quantum Recurrent Neural Networks within an encoder-decoder framework, integrating Variational Quantum Circuits into Gated Recurrent Units and Long Short-Term Memory networks. Using this architecture, the model efficiently compresses high-dimensional spatiotemporal data into a compact latent space, facilitating more efficient temporal evolution. We evaluate the algorithms on the Hamilton-Jacobi-Bellman equation, Burgers' equation, the Gray-Scott reaction-diffusion system, and the three-dimensional Michaelis-Menten reaction-diffusion equation. The results demonstrate the superior performance of the quantum-based algorithms in capturing nonlinear dynamics, handling high-dimensional spaces, and providing stable solutions, highlighting their potential as an innovative tool in solving challenging and complex systems.

[LG-37] VUS: Effective and Efficient Accuracy Measures for Time-Series Anomaly Detection

Link: https://arxiv.org/abs/2502.13318
Authors: Paul Boniol, Ashwin K. Krishna, Marine Bruel, Qinghua Liu, Mingyi Huang, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, Michael J. Franklin, John Paparrizos
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Anomaly detection (AD) is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), AD for time series is also concerned with range-based anomalies (i.e., outliers spanning multiple observations). Nevertheless, it is common to use traditional point-based information retrieval measures, such as Precision, Recall, and F-score, to assess the quality of methods by thresholding the anomaly score to mark each point as an anomaly or not. However, mapping discrete labels into continuous data introduces unavoidable shortcomings, complicating the evaluation of range-based anomalies. Notably, the choice of evaluation measure may significantly bias the experimental outcome. Despite over six decades of attention, there has never been a large-scale systematic quantitative and qualitative analysis of time-series AD evaluation measures. This paper extensively evaluates quality measures for time-series AD to assess their robustness under noise, misalignments, and different anomaly cardinality ratios. Our results indicate that measures producing quality values independently of a threshold (i.e., AUC-ROC and AUC-PR) are more suitable for time-series AD. Motivated by this observation, we first extend the AUC-based measures to account for range-based anomalies. Then, we introduce a new family of parameter-free and threshold-independent measures, Volume Under the Surface (VUS), to evaluate methods while varying parameters. We also introduce two optimized implementations for VUS that reduce significantly the execution time of the initial implementation. Our findings demonstrate that our four measures are significantly more robust in assessing the quality of time-series AD methods.
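
A hedged, simplified sketch of the threshold-independent idea behind VUS: average AUC-ROC while sweeping a tolerance buffer around labeled anomaly ranges. The paper's measure uses a graded (non-binary) label widening and optimized implementations; the hard dilation below is only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def vus_roc(labels, scores, max_buffer=16):
    """Average AUC-ROC over increasingly dilated anomaly labels, a crude
    stand-in for the volume under the ROC surface as the tolerance buffer
    around each anomaly range grows."""
    def dilate(lab):
        out = lab.copy()
        out[1:] = np.maximum(out[1:], lab[:-1])
        out[:-1] = np.maximum(out[:-1], lab[1:])
        return out
    aucs, lab = [], labels.astype(float)
    for _ in range(max_buffer + 1):
        aucs.append(roc_auc_score(lab, scores))
        lab = dilate(lab)
    return float(np.mean(aucs))

rng = np.random.default_rng(0)
labels = np.zeros(300)
labels[120:140] = 1.0                              # one anomaly range
scores = labels + rng.normal(scale=0.5, size=300)  # noisy detector output
print(round(vus_roc(labels, scores, max_buffer=8), 3))
```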

[LG-38] A Label-Free Heterophily-Guided Approach for Unsupervised Graph Fraud Detection AAAI2025

Link: https://arxiv.org/abs/2502.13308
Authors: Junjun Pan, Yixin Liu, Xin Zheng, Yizhen Zheng, Alan Wee-Chung Liew, Fuyi Li, Shirui Pan
Subjects: Machine Learning (cs.LG)
*Comments: 9 pages, 3 figures. Accepted by AAAI 2025

Abstract:Graph fraud detection (GFD) has rapidly advanced in protecting online services by identifying malicious fraudsters. Recent supervised GFD research highlights that heterophilic connections between fraudsters and users can greatly impact detection performance, since fraudsters tend to camouflage themselves by building more connections to benign users. Despite the promising performance of supervised GFD methods, the reliance on labels limits their applications to unsupervised scenarios; additionally, accurately capturing complex and diverse heterophily patterns without labels poses a further challenge. To fill the gap, we propose a Heterophily-guided Unsupervised Graph fraud dEtection approach (HUGE) for unsupervised GFD, which contains two essential components: a heterophily estimation module and an alignment-based fraud detection module. In the heterophily estimation module, we design a novel label-free heterophily metric called HALO, which captures the critical graph properties for GFD, enabling its outstanding ability to estimate heterophily from node attributes. In the alignment-based fraud detection module, we develop a joint MLP-GNN architecture with ranking loss and asymmetric alignment loss. The ranking loss aligns the predicted fraud score with the relative order of HALO, providing an extra robustness guarantee by comparing heterophily among non-adjacent nodes. Moreover, the asymmetric alignment loss effectively utilizes structural information while alleviating the feature-smoothing effects. Extensive experiments on 6 datasets demonstrate that HUGE significantly outperforms competitors, showcasing its effectiveness and robustness. The source code of HUGE is at this https URL.

[LG-39] Application of Context-dependent Interpretation of Biosignals Recognition to Control a Bionic Multifunctional Hand Prosthesis

Link: https://arxiv.org/abs/2502.13301
Authors: Pawel Trajdos, Marek Kurzynski
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:The paper presents an original method for controlling a surface-electromyography-driven (sEMG) prosthesis. A context-dependent recognition system is proposed in which the same class of sEMG signals may have a different interpretation, depending on the context. This allowed the repertoire of performed movements to be increased. The proposed structure of the context-dependent recognition system includes unambiguously defined decision sequences covering the overall action of the prosthesis, i.e. the so-called boxes. Because the boxes are mutually isolated environments, each box has its own interpretation of the recognition result, as well as a separate local-recognition-task-focused classifier. Due to the freedom to assign contextual meanings to classes of biosignals, the construction procedure of the classifier can be optimised in terms of the local classification quality in a given box or the classification quality of the entire system. In the paper, two optimisation problems are formulated, differing in the adopted constraints on optimisation variables, and methods for solving them, based on an exhaustive search and an evolutionary algorithm, are developed. Experimental studies were conducted using signals from 1 able-bodied person with simulation of amputation and 10 volunteers with transradial amputations. The study compared the classical recognition system and the context-dependent system for various classifier models. An unusual testing strategy was adopted in the research, taking into account the specificity of the considered recognition task, with two original quality measures resulting from this scheme then being applied. The results obtained confirm the hypothesis that the application of the context-dependent classifier led to an improvement in classification quality. Journal reference: Biocybernetics and Biomedical Engineering, 44, 2024, 161-182. DOI: https://doi.org/10.1016/j.bbe.2024.01.001

[LG-40] Multiple Distribution Shift – Aerial (MDS-A): A Dataset for Test-Time Error Detection and Model Adaptation

Link: https://arxiv.org/abs/2502.13289
Authors: Noel Ngu, Aditya Taparia, Gerardo I. Simari, Mario Leiva, Jack Corcoran, Ransalu Senanayake, Paulo Shakarian, Nathaniel D. Bastian
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Machine learning models assume that training and test samples are drawn from the same distribution. As such, significant differences between training and test distributions often lead to degradations in performance. We introduce Multiple Distribution Shift – Aerial (MDS-A) – a collection of inter-related datasets of the same aerial domain that are perturbed in different ways to better characterize the effects of out-of-distribution performance. Specifically, MDS-A is a set of simulated aerial datasets collected under different weather conditions. We include six datasets under different simulated weather conditions along with six baseline object-detection models, as well as several test datasets that are a mix of weather conditions that we show have significant differences from the training data. In this paper, we present characterizations of MDS-A, provide performance results for the baseline machine learning models (on both their specific training datasets and the test data), as well as results of the baselines after employing recent knowledge-engineering error-detection techniques (EDR) thought to improve out-of-distribution performance. The dataset is available at this https URL.

[LG-41] Breaking the bonds of generative artificial intelligence by minimizing the maximum entropy

Link: https://arxiv.org/abs/2502.13287
Authors: Mattia Miotto, Lorenzo Monacelli
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT)
*Comments: 15 pages, 7 figures

Abstract:The emergence of generative artificial intelligence (GenAI), comprising large language models, text-to-image generators, and AI algorithms for medical drug and material design, has had a transformative impact on society. However, despite an initial exponential growth surpassing Moore's law, progress is now plateauing, suggesting we are approaching the limits of current technology. Indeed, these models are notoriously data-hungry, prone to overfitting, and challenging to direct during the generative process, hampering their effective professional employment. To cope with these limitations, we propose a paradigm shift in GenAI by introducing an ab initio method based on the minimal maximum entropy principle. Our approach does not fit the data. Instead, it compresses information in the training set by finding a latent representation parameterized by arbitrary nonlinear functions, such as neural networks. The result is a general physics-driven model, which is data-efficient, resistant to overfitting, and flexible, permitting one to control and influence the generative process. Benchmarking shows that our method outperforms variational autoencoders (VAEs) with similar neural architectures, particularly on undersampled datasets. We demonstrate the method's effectiveness in generating images, even with limited training data, and its unprecedented capability to customize the generation process a posteriori without the need for any fine-tuning or retraining.

[LG-42] Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

Link: https://arxiv.org/abs/2502.13283
Authors: Jingfeng Wu, Peter Bartlett, Matus Telgarsky, Bin Yu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum \ell_2-margin solution – a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and the \ell_2-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit \ell_2-regularization.
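
A small simulation illustrating the mechanism behind the first claim: on an overparameterized (hence linearly separable) logistic regression problem, GD's iterate norm keeps growing with training time, pushing predicted probabilities toward 0/1; stopping early avoids this drift. The data-generating setup below is a toy assumption, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                    # overparameterized: d >> n, data separable
w_star = np.zeros(d)
w_star[0] = 4.0                   # hypothetical ground-truth direction
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_star))).astype(float)

w, lr = np.zeros(d), 0.5
for t in range(1, 20001):
    p = 1 / (1 + np.exp(-X @ w))  # predicted probabilities
    w -= lr * X.T @ (p - y) / n   # full-batch GD on the logistic loss
    if t in (10, 100, 1000, 20000):
        print(t, round(float(np.linalg.norm(w)), 2))  # norm keeps growing
```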

[LG-43] Value Gradient Sampler: Sampling as Sequential Decision Making

Link: https://arxiv.org/abs/2502.13280
Authors: Sangwoong Yoon, Himchan Hwang, Hyeokju Jeong, Dong Kyu Shin, Che-Sang Park, Sehee Kwon, Frank Chongwoo Park
Subjects: Machine Learning (cs.LG)
*Comments: Code: this https URL

Abstract:We propose the Value Gradient Sampler (VGS), a trainable sampler based on the interpretation of sampling as discrete-time sequential decision-making. VGS generates samples from a given unnormalized density (i.e., energy) by drifting and diffusing randomly initialized particles. In VGS, finding the optimal drift is equivalent to solving an optimal control problem where the cost is the upper bound of the KL divergence between the target density and the samples. We employ value-based dynamic programming to solve this optimal control problem, which gives the gradient of the value function as the optimal drift vector. The connection to sequential decision making allows VGS to leverage extensively studied techniques in reinforcement learning, making VGS a fast, adaptive, and accurate sampler that achieves competitive results in various sampling benchmarks. Furthermore, VGS can replace MCMC in contrastive divergence training of energy-based models. We demonstrate the effectiveness of VGS in training accurate energy-based models in industrial anomaly detection applications.

[LG-44] Talking About the Assumption in the Room

Link: https://arxiv.org/abs/2502.13268
Authors: Ramaravind Kommiya Mothilal, Faisal M. Lalani, Syed Ishtiaque Ahmed, Shion Guha, Sharifa Sultana
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments: 19 pages without references, single-column, preprint for conference

Abstract:The reference to assumptions in how practitioners use or interact with machine learning (ML) systems is ubiquitous in HCI and responsible ML discourse. However, what remains unclear from prior works is the conceptualization of assumptions and how practitioners identify and handle assumptions throughout their workflows. This leads to confusion about what assumptions are and what needs to be done with them. We use the concept of an argument from Informal Logic, a branch of Philosophy, to offer a new perspective to understand and explicate the confusions surrounding assumptions. Through semi-structured interviews with 22 ML practitioners, we find what contributes most to these confusions is how independently assumptions are constructed, how reactively and reflectively they are handled, and how nebulously they are recorded. Our study brings the peripheral discussion of assumptions in ML to the center and presents recommendations for practitioners to better think about and work with assumptions.

[LG-45] A Machine Learning Approach That Beats Large Rubik's Cubes

Link: https://arxiv.org/abs/2502.13266
Authors: Alexander Chervov, Kirill Khoruzhii, Nikita Bukhal, Jalal Naghiyev, Vladislav Zamkovoy, Ivan Koltsov, Lyudmila Cheldieva, Arsenii Sychev, Arsenii Lenin, Mark Obozov, Egor Urvanov, Alexey Romanov
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*Comments: 12 pages, 3 tables, 3 figures

Abstract:The paper proposes a novel machine learning-based approach to the pathfinding problem on extremely large graphs. This method leverages diffusion distance estimation via a neural network and uses beam search for pathfinding. We demonstrate its efficiency by finding solutions for 4x4x4 and 5x5x5 Rubik’s cubes with unprecedentedly short solution lengths, outperforming all available solvers and introducing the first machine learning solver beyond the 3x3x3 case. In particular, it surpasses every single case of the combined best results in the Kaggle Santa 2023 challenge, which involved over 1,000 teams. For the 3x3x3 Rubik’s cube, our approach achieves an optimality rate exceeding 98%, matching the performance of task-specific solvers and significantly outperforming prior solutions such as DeepCubeA (60.3%) and EfficientCube (69.6%). Additionally, our solution is more than 26 times faster in solving 3x3x3 Rubik’s cubes while requiring up to 18.5 times less model training time than the most efficient state-of-the-art competitor.
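
A generic sketch of the search component: beam search guided by a learned distance estimate. Here `estimate` stands in for the trained diffusion-distance network, and the toy line-graph usage is purely illustrative.

```python
import heapq

def beam_search(start, neighbors, estimate, is_goal, width=64, max_depth=40):
    """Beam search guided by a learned distance estimate: at each depth,
    keep only the `width` states the estimator scores as closest to solved."""
    beam = [(estimate(start), start, [])]
    for _ in range(max_depth):
        candidates = []
        for _, state, path in beam:
            for move, nxt in neighbors(state):
                if is_goal(nxt):
                    return path + [move]
                candidates.append((estimate(nxt), nxt, path + [move]))
        if not candidates:
            return None
        beam = heapq.nsmallest(width, candidates)
    return None

# Toy check on a line graph: walk from 0 to 10 with +/-1 moves.
moves = lambda s: [("+1", s + 1), ("-1", s - 1)]
print(beam_search(0, moves, estimate=lambda s: abs(10 - s), is_goal=lambda s: s == 10))
```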

[LG-46] Random Forest Autoencoders for Guided Representation Learning

Link: https://arxiv.org/abs/2502.13257
Authors: Adrien Aumon, Shuang Ni, Myriam Lizotte, Guy Wolf, Kevin R. Moon, Jake S. Rhodes
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Decades of research have produced robust methods for unsupervised data visualization, yet supervised visualization – where expert labels guide representations – remains underexplored, as most supervised approaches prioritize classification over visualization. Recently, RF-PHATE, a diffusion-based manifold learning method leveraging random forests and information geometry, marked significant progress in supervised visualization. However, its lack of an explicit mapping function limits scalability and prevents application to unseen data, posing challenges for large datasets and label-scarce scenarios. To overcome these limitations, we introduce Random Forest Autoencoders (RF-AE), a neural network-based framework for out-of-sample kernel extension that combines the flexibility of autoencoders with the supervised learning strengths of random forests and the geometry captured by RF-PHATE. RF-AE enables efficient out-of-sample supervised visualization and outperforms existing methods, including RF-PHATE's standard kernel extension, in both accuracy and interpretability. Additionally, RF-AE is robust to the choice of hyper-parameters and generalizes to any kernel-based dimensionality reduction method.

[LG-47] The impact of conformer quality on learned representations of molecular conformer ensembles

Link: https://arxiv.org/abs/2502.13220
Authors: Keir Adams,Connor W. Coley
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Training machine learning models to predict properties of molecular conformer ensembles is an increasingly popular strategy to accelerate the conformational analysis of drug-like small molecules, reactive organic substrates, and homogeneous catalysts. For high-throughput analyses especially, trained surrogate models can help circumvent traditional approaches to conformational analysis that rely on expensive conformer searches and geometry optimizations. Here, we question how the performance of surrogate models for predicting 3D conformer-dependent properties (of a single, active conformer) is affected by the quality of the 3D conformers used as their input. How well do lower-quality conformers inform the prediction of properties of higher-quality conformers? Does the fidelity of geometry optimization matter when encoding random conformers? For models that encode sets of conformers, how does the presence of the active conformer that induces the target property affect model accuracy? How do predictions from a surrogate model compare to estimating the properties from cheap ensembles themselves? We explore these questions in the context of predicting Sterimol parameters of conformer ensembles optimized with density functional theory. Although answers will be case-specific, our analyses provide a valuable perspective on 3D representation learning models and raise practical considerations regarding when conformer quality matters.

[LG-48] Application of machine learning algorithm in temperature field reconstruction

Link: https://arxiv.org/abs/2502.13190
Authors: Qianyu He,Huaiwei Sun,Yubo Li,Zhiwen You,Qiming Zheng,Yinghan Huang,Sipeng Zhu,Fengyu Wang
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:This study focuses on the stratification patterns and dynamic evolution of reservoir water temperatures, aiming to estimate and reconstruct the temperature field using limited and noisy local measurement data. Due to complex measurement environments and technical limitations, obtaining complete temperature information for reservoirs is highly challenging. Therefore, accurately reconstructing the temperature field from a small number of local data points has become a critical scientific issue. To address this, the study employs Proper Orthogonal Decomposition (POD) and sparse representation methods to reconstruct the temperature field based on temperature data from a limited number of local measurement points. The results indicate that satisfactory reconstruction can be achieved when the number of POD basis functions is set to 2 and the number of measurement points is 10. Under different water intake depths, the reconstruction errors of both POD and sparse representation methods remain stable at around 0.15, fully validating the effectiveness of these methods in reconstructing the temperature field based on limited local temperature data. Additionally, the study further explores the distribution characteristics of reconstruction errors for POD and sparse representation methods under different water level intervals, analyzing the optimal measurement point layout scheme and potential limitations of the reconstruction methods in this case. This research not only effectively reduces measurement costs and computational resource consumption but also provides a new technical approach for reservoir temperature analysis, holding significant theoretical and practical importance.
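
As a minimal illustration of the reconstruction step, the sketch below fits r = 2 POD modal coefficients to 10 sensor readings by least squares, mirroring the setting in the abstract; the data here is synthetic and only stands in for the reservoir temperature snapshots.

```python
import numpy as np

# Synthetic stand-in for temperature snapshots: 200 snapshots of a 50-point field.
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(200, 50))
mean = snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
Phi = Vt[:2].T                                     # first r = 2 POD basis functions (50 x 2)

true_field = snapshots[0]
sensors = rng.choice(50, size=10, replace=False)   # 10 local measurement points
y = true_field[sensors]

# Least-squares fit of the modal coefficients from the sensor rows of Phi.
a, *_ = np.linalg.lstsq(Phi[sensors], y - mean[sensors], rcond=None)
reconstruction = mean + Phi @ a
print("mean abs error:", np.abs(reconstruction - true_field).mean())
```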

[LG-49] Autonomous Vehicles Using Multi-Agent Reinforcement Learning for Routing Decisions Can Harm Urban Traffic

Link: https://arxiv.org/abs/2502.13188
Authors: Anastasia Psarou,Ahmet Onur Akman,Łukasz Gorczyca,Michał Hoffmann,Zoltán György Varga,Grzegorz Jamróz,Rafał Kucharski
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Autonomous vehicles (AVs) using Multi-Agent Reinforcement Learning (MARL) for simultaneous route optimization may destabilize traffic environments, with human drivers possibly experiencing longer travel times. We study this interaction by simulating human drivers and AVs. Our experiments with standard MARL algorithms reveal that, even in trivial cases, policies often fail to converge to an optimal solution or require long training periods. The problem is amplified by the fact that we cannot rely entirely on simulated training, as there are no accurate models of human routing behavior. At the same time, real-world training in cities risks destabilizing urban traffic systems, increasing externalities, such as CO₂ emissions, and introducing non-stationarity as human drivers adapt unpredictably to AV behaviors. Centralization can improve convergence in some cases; however, it raises privacy concerns for the travelers’ destination data. In this position paper, we argue that future research must prioritize realistic benchmarks, cautious deployment strategies, and tools for monitoring and regulating AV routing behaviors to ensure sustainable and equitable urban mobility systems.

[LG-50] The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent

Link: https://arxiv.org/abs/2502.13961
Authors: Yatin Dandi,Luca Pesce,Lenka Zdeborová,Florent Krzakala
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. While the study of multi-index models with Gaussian data in high dimensions has provided analytical insights into the benefits of GD-trained neural networks over kernels, the role of depth in improving sample complexity and generalization in GD-trained networks remains poorly understood. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically fewer samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.

[LG-51] AI-Driven Discovery of High Performance Polymer Electrodes for Next-Generation Batteries

Link: https://arxiv.org/abs/2502.13899
Authors: Subhash V.S. Ganti,Lukas Woelfel,Christopher Kuenneth
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 33 pages, 10 figures, 3 tables

Abstract:The use of transition group metals in electric batteries requires extensive usage of critical elements like lithium, cobalt and nickel, which poses significant environmental challenges. Replacing these metals with redox-active organic materials offers a promising alternative, thereby reducing the carbon footprint of batteries by one order of magnitude. However, this approach faces critical obstacles, including the limited availability of suitable redox-active organic materials and issues such as lower electronic conductivity, voltage, specific capacity, and long-term stability. To overcome the limitations of lower voltage and specific capacity, a machine learning (ML) driven battery informatics framework is developed and implemented. This framework utilizes an extensive battery dataset and advanced ML techniques to accelerate and enhance the identification, optimization, and design of redox-active organic materials. In this contribution, a data-fusion, ML-coupled meta-learning model capable of predicting the battery properties, voltage and specific capacity, for various combinations of organic negative electrodes and charge carriers (positive electrode materials) is presented. The ML models accelerate experimentation, facilitate the inverse design of battery materials, and identify suitable candidates from three extensive material libraries to advance sustainable energy-storage technologies.

[LG-52] Evaluation of EAS directions based on TAIGA HiSCORE data using fully connected neural networks

Link: https://arxiv.org/abs/2502.13851
Authors: A.P.Kryukov,S.P.Polyakov,Yu.Yu.Dubenskaya,E.O.Gres,E.B.Postnikov,P.A.Volchugov,D.P.Zhurov
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
Comments: The work was reported at the 8th International Conference on Deep Learning in Computational Physics (DLCP2024), June 19-21, 2024, Moscow, Russia ( this https URL ). To be published in Moscow University Physics Bulletin

Abstract:The direction of extensive air showers can be used to determine the source of gamma quanta and plays an important role in estimating the energy of the primary particle. The data from an array of non-imaging Cherenkov detector stations HiSCORE in the TAIGA experiment registering the number of photoelectrons and detection time can be used to estimate the shower direction with high accuracy. In this work, we use artificial neural networks trained on Monte Carlo-simulated TAIGA HiSCORE data for gamma quanta to obtain shower direction estimates. The neural networks are multilayer perceptrons with skip connections using partial data from several HiSCORE stations as inputs; composite estimates are derived from multiple individual estimates by the neural networks. We apply a two-stage algorithm in which the direction estimates obtained in the first stage are used to transform the input data and refine the estimates. The mean error of the final estimates is less than 0.25 degrees. The approach will be used for multimodal analysis of the data from several types of detectors used in the TAIGA experiment.

[LG-53] Uncertainty quantification for Markov chains with application to temporal difference learning

Link: https://arxiv.org/abs/2502.13822
Authors: Weichen Wu,Yuting Wei,Alessandro Rinaldo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Markov chains are fundamental to statistical machine learning, underpinning key methodologies such as Markov Chain Monte Carlo (MCMC) sampling and temporal difference (TD) learning in reinforcement learning (RL). Given their widespread use, it is crucial to establish rigorous probabilistic guarantees on their convergence, uncertainty, and stability. In this work, we develop novel, high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains, addressing key limitations in existing theoretical tools for handling dependent data. We leverage these results to analyze the TD learning algorithm, a widely used method for policy evaluation in RL. Our analysis yields a sharp high-probability consistency guarantee that matches the asymptotic variance up to logarithmic factors. Furthermore, we establish an O(T^{-1/4} log T) distributional convergence rate for the Gaussian approximation of the TD estimator, measured in convex distance. These findings provide new insights into statistical inference for RL algorithms, bridging the gaps between classical stochastic approximation theory and modern reinforcement learning applications.
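
For orientation, the TD learning algorithm these bounds analyze is, in its simplest linear TD(0) form, the textbook update sketched below (a generic sketch, not the paper's code):

```python
import numpy as np

# Textbook linear TD(0): the value-function weights theta are updated from a
# single trajectory of a Markov chain, the setting the paper's bounds analyze.
def td0(features, transitions, rewards, alpha=0.05, gamma=0.9, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(features.shape[1])
    s = 0
    for _ in range(steps):
        s_next = rng.choice(len(transitions), p=transitions[s])
        td_error = rewards[s] + gamma * features[s_next] @ theta - features[s] @ theta
        theta += alpha * td_error * features[s]   # stochastic approximation step
        s = s_next
    return theta
```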

[LG-54] Learning Is a Kan Extension

Link: https://arxiv.org/abs/2502.13810
Authors: Matthew Pugh,Jo Grundy,Corina Cirstea,Nick Harris
Subjects: Category Theory (math.CT); Machine Learning (cs.LG)
Comments:

Abstract:Previous work has demonstrated that efficient algorithms exist for computing Kan extensions and that some Kan extensions have interesting similarities to various machine learning algorithms. This paper closes the gap by proving that all error minimisation algorithms may be presented as a Kan extension. This result provides a foundation for future work to investigate the optimisation of machine learning algorithms through their presentation as Kan extensions. A corollary of this representation of error-minimising algorithms is a presentation of error from the perspective of lossy and lossless transformations of data.

[LG-55] Identifying metric structures of deep latent variable models

Link: https://arxiv.org/abs/2502.13757
Authors: Stas Syrota,Yevgen Zainchkovskyy,Johnny Xi,Benjamin Bloem-Reddy,Søren Hauberg
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Deep latent variable models learn condensed representations of data that, hopefully, reflect the inner workings of the studied phenomena. Unfortunately, these latent representations are not statistically identifiable, meaning they cannot be uniquely determined. Domain experts, therefore, need to tread carefully when interpreting these. Current solutions limit the lack of identifiability through additional constraints on the latent variable model, e.g. by requiring labeled training data, or by restricting the expressivity of the model. We change the goal: instead of identifying the latent variables, we identify relationships between them such as meaningful distances, angles, and volumes. We prove this is feasible under very mild model conditions and without additional labeled data. We empirically demonstrate that our theory results in more reliable latent distances, offering a principled path forward in extracting trustworthy conclusions from deep latent variable models.

[LG-56] Deep Learning for VWAP Execution in Crypto Markets: Beyond the Volume Curve

Link: https://arxiv.org/abs/2502.13722
Authors: Remi Genet
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments:

Abstract:Volume-Weighted Average Price (VWAP) is arguably the most prevalent benchmark for trade execution as it provides an unbiased standard for comparing performance across market participants. However, achieving VWAP is inherently challenging due to its dependence on two dynamic factors, volumes and prices. Traditional approaches typically focus on forecasting the market’s volume curve, an assumption that may hold true under steady conditions but becomes suboptimal in more volatile environments or markets such as cryptocurrency where prediction error margins are higher. In this study, I propose a deep learning framework that directly optimizes the VWAP execution objective by bypassing the intermediate step of volume curve prediction. Leveraging automatic differentiation and custom loss functions, my method calibrates order allocation to minimize VWAP slippage, thereby fully addressing the complexities of the execution problem. My results demonstrate that this direct optimization approach consistently achieves lower VWAP slippage compared to conventional methods, even when utilizing a naive linear model presented in arXiv:2410.21448. They validate the observation that strategies optimized for VWAP performance tend to diverge from accurate volume curve predictions and thus underscore the advantage of directly modeling the execution objective. This research contributes a more efficient and robust framework for VWAP execution in volatile markets, illustrating the potential of deep learning in complex financial systems where direct objective optimization is crucial. Although my empirical analysis focuses on cryptocurrency markets, the underlying principles of the framework are readily applicable to other asset classes such as equities.
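
As a sketch of what directly optimizing the execution objective can look like, assuming the model emits per-interval allocation logits (the paper's actual loss and architecture may differ):

```python
import torch

# Differentiable VWAP-slippage loss: the model's allocation is compared to the
# market VWAP benchmark, so gradients flow directly into order allocation.
def vwap_slippage_loss(alloc_logits, prices, market_volumes):
    # All tensors: (batch, n_intervals).
    alloc = torch.softmax(alloc_logits, dim=-1)           # our volume curve, sums to 1
    achieved = (alloc * prices).sum(-1)                   # our volume-weighted price
    mkt_w = market_volumes / market_volumes.sum(-1, keepdim=True)
    market_vwap = (mkt_w * prices).sum(-1)                # benchmark VWAP
    return ((achieved - market_vwap) ** 2).mean()         # squared-slippage proxy
```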

[LG-57] Graph Signal Inference by Learning Narrowband Spectral Kernels

Link: https://arxiv.org/abs/2502.13686
Authors: Osman Furkan Kar,Gülce Turhan,Elif Vural
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:While a common assumption in graph signal analysis is the smoothness of the signals or the band-limitedness of their spectrum, in many instances the spectrum of real graph data may be concentrated at multiple regions of the spectrum, possibly including mid-to-high-frequency components. In this work, we propose a novel graph signal model where the signal spectrum is represented through the combination of narrowband kernels in the graph frequency domain. We then present an algorithm that jointly learns the model by optimizing the kernel parameters and the signal representation coefficients from a collection of graph signals. Our problem formulation has the flexibility of permitting the incorporation of signals possibly acquired on different graphs into the learning algorithm. We then theoretically study the signal reconstruction performance of the proposed method, by also elaborating on when joint learning on multiple graphs is preferable to learning an individual model on each graph. Experimental results on several graph data sets show that the proposed method offers quite satisfactory signal interpolation accuracy in comparison with a variety of reference approaches in the literature.

[LG-58] RestoreGrad: Signal Restoration Using Conditional Denoising Diffusion Models with Jointly Learned Prior

Link: https://arxiv.org/abs/2502.13574
Authors: Ching-Hua Lee,Chouchang Yang,Jaejin Cho,Yashas Malur Saidutta,Rakshith Sharma Srinivasa,Yilin Shen,Hongxia Jin
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Denoising diffusion probabilistic models (DDPMs) can be utilized for recovering a clean signal from its degraded observation(s) by conditioning the model on the degraded signal. The degraded signals are themselves contaminated versions of the clean signals; due to this correlation, they may encompass certain useful information about the target clean data distribution. However, existing adoption of the standard Gaussian as the prior distribution in turn discards such information, resulting in sub-optimal performance. In this paper, we propose to improve conditional DDPMs for signal restoration by leveraging a more informative prior that is jointly learned with the diffusion model. The proposed framework, called RestoreGrad, seamlessly integrates DDPMs into the variational autoencoder framework and exploits the correlation between the degraded and clean signals to encode a better diffusion prior. On speech and image restoration tasks, we show that RestoreGrad demonstrates faster convergence (5-10 times fewer training steps) to achieve better quality of restored signals over existing DDPM baselines, and improved robustness to using fewer sampling steps in inference time (2-2.5 times fewer), advocating the advantages of leveraging jointly learned prior for efficiency improvements in the diffusion process.

[LG-59] An Efficient Permutation-Based Kernel Two-Sample Test

Link: https://arxiv.org/abs/2502.13570
Authors: Antoine Chatalic,Marco Letizia,Nicolas Schreuder,Lorenzo Rosasco
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 23 pages, 2 figures

Abstract:Two-sample hypothesis testing – determining whether two sets of data are drawn from the same distribution – is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is plagued by high computational costs. In this work, we use a Nyström approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing realistic scientific data.
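
For intuition, a plain (exact, non-Nyström) permutation MMD test is sketched below; the paper's contribution is replacing the quadratic-cost MMD computation with a Nyström approximation while retaining statistical guarantees.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=1.0):
    # Biased (V-statistic) MMD^2 estimator; quadratic in sample size.
    n, m = len(X), len(Y)
    return (rbf_kernel(X, X, gamma).sum() / n**2
            + rbf_kernel(Y, Y, gamma).sum() / m**2
            - 2 * rbf_kernel(X, Y, gamma).sum() / (n * m))

def permutation_test(X, Y, n_perm=200, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, gamma)
    Z = np.vstack([X, Y])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        exceed += mmd2(Z[perm[:len(X)]], Z[perm[len(X):]], gamma) >= observed
    return (exceed + 1) / (n_perm + 1)   # permutation p-value
```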

[LG-60] A Study on Monthly Marine Heatwave Forecasts in New Zealand: An Investigation of Imbalanced Regression Loss Functions with Neural Network Models

Link: https://arxiv.org/abs/2502.13495
Authors: Ding Ning,Varvara Vetrova,Sébastien Delaux,Rachael Tappenden,Karin R. Bryan,Yun Sing Koh
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 32 pages of main text

Abstract:Marine heatwaves (MHWs) are extreme ocean-temperature events with significant impacts on marine ecosystems and related industries. Accurate forecasts (one to six months ahead) of MHWs would aid in mitigating these impacts. However, forecasting MHWs presents a challenging imbalanced regression task due to the rarity of extreme temperature anomalies in comparison to more frequent moderate conditions. In this study, we examine monthly MHW forecasts for 12 locations around New Zealand. We use a fully-connected neural network and compare standard and specialized regression loss functions, including the mean squared error (MSE), the mean absolute error (MAE), the Huber, the weighted MSE, the focal-R, the balanced MSE, and a proposed scaling-weighted MSE. Results show that (i) short lead times (one month) are considerably more predictable than three- and six-month leads, (ii) models trained with the standard MSE or MAE losses excel at forecasting average conditions but struggle to capture extremes, and (iii) specialized loss functions such as the balanced MSE and our scaling-weighted MSE substantially improve forecasting of MHW and suspected MHW events. These findings underscore the importance of tailored loss functions for imbalanced regression, particularly in forecasting rare but impactful events such as MHWs.
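
The precise form of the proposed scaling-weighted MSE is specific to the paper; the sketch below only illustrates the general idea of upweighting errors on extreme targets, with an assumed weight function.

```python
import numpy as np

# Illustrative weighted MSE for imbalanced regression: errors on large
# temperature anomalies are upweighted. This weight function is an assumption,
# not the paper's exact scaling-weighted MSE.
def weighted_mse(y_true, y_pred, scale=2.0):
    w = 1.0 + scale * np.abs(y_true)   # rarer, more extreme anomalies weigh more
    return np.mean(w * (y_true - y_pred) ** 2)
```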

[LG-61] Adopting Whisper for Confidence Estimation ICASSP2025

Link: https://arxiv.org/abs/2502.13446
Authors: Vaibhav Aggarwal,Shabari S Nair,Yash Verma,Yash Jogi
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted at IEEE ICASSP 2025

Abstract:Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.

[LG-62] Deep-Unfolded Massive Grant-Free Transmission in Cell-Free Wireless Communication Systems

Link: https://arxiv.org/abs/2502.13390
Authors: Gangle Sun,Mengyao Cao,Wenjin Wang,Wei Xu,Christoph Studer
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: To appear in the IEEE Transactions on Signal Processing

Abstract:Grant-free transmission and cell-free communication are vital in improving coverage and quality-of-service for massive machine-type communication. This paper proposes a novel framework of joint active user detection, channel estimation, and data detection (JACD) for massive grant-free transmission in cell-free wireless communication systems. We formulate JACD as an optimization problem and solve it approximately using forward-backward splitting. To deal with the discrete symbol constraint, we relax the discrete constellation to its convex hull and propose two approaches that promote solutions from the constellation set. To reduce complexity, we replace costly computations with approximate shrinkage operations and approximate posterior mean estimator computations. To improve active user detection (AUD) performance, we introduce a soft-output AUD module that considers both the data estimates and channel conditions. To jointly optimize all algorithm hyper-parameters and to improve JACD performance, we further deploy deep unfolding together with a momentum strategy, resulting in two algorithms called DU-ABC and DU-POEM. Finally, we demonstrate the efficacy of the proposed JACD algorithms via extensive system simulations.

[LG-63] Dynamic directed functional connectivity as a neural biomarker for objective motor skill assessment

Link: https://arxiv.org/abs/2502.13362
Authors: Anil Kamat,Rahul Rahul,Anirban Dutta,Lora Cavuoto,Uwe Kruger,Harry Burke,Matthew Hackett,Jack Norfleet,Steven Schwaitzberg,Suvranu De
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:

Abstract:Objective motor skill assessment plays a critical role in fields such as surgery, where proficiency is vital for certification and patient safety. Existing assessment methods, however, rely heavily on subjective human judgment, which introduces bias and limits reproducibility. While recent efforts have leveraged kinematic data and neural imaging to provide more objective evaluations, these approaches often overlook the dynamic neural mechanisms that differentiate expert and novice performance. This study proposes a novel method for motor skill assessment based on dynamic directed functional connectivity (dFC) as a neural biomarker. By using electroencephalography (EEG) to capture brain dynamics and employing an attention-based Long Short-Term Memory (LSTM) model for non-linear Granger causality analysis, we compute dFC among key brain regions involved in psychomotor tasks. Coupled with hierarchical task analysis (HTA), our approach enables subtask-level evaluation of motor skills, offering detailed insights into neural coordination that underpins expert proficiency. A convolutional neural network (CNN) is then used to classify skill levels, achieving greater accuracy and specificity than established performance metrics in laparoscopic surgery. This methodology provides a reliable, objective framework for assessing motor skills, contributing to the development of tailored training protocols and enhancing the certification process.

[LG-64] Increasing NWP Thunderstorm Predictability Using Ensemble Data and Machine Learning

Link: https://arxiv.org/abs/2502.13316
Authors: Kianusch Vahid Yousefnia,Tobias Bölle,Christoph Metzl
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 12 pages, 5 figures, 1 table. This work has been submitted to Weather and Forecasting. Copyright in this work may be transferred without further notice

Abstract:While numerical weather prediction (NWP) models are essential for forecasting thunderstorms hours in advance, NWP uncertainty, which increases with lead time, limits the predictability of thunderstorm occurrence. This study investigates how ensemble NWP data and machine learning (ML) can enhance the skill of thunderstorm forecasts. Using our recently introduced neural network model, SALAMA 1D, which identifies thunderstorm occurrence in operational forecasts of the convection-permitting ICON-D2-EPS model for Central Europe, we demonstrate that ensemble-averaging significantly improves forecast skill. Notably, an 11-hour ensemble forecast matches the skill level of a 5-hour deterministic forecast. To explain this improvement, we derive an analytic expression linking skill differences to correlations between ensemble members, which aligns with observed performance gains. This expression generalizes to any binary classification model that processes ensemble members individually. Additionally, we show that ML models like SALAMA 1D can identify patterns of thunderstorm occurrence which remain predictable for longer lead times compared to raw NWP output. Our findings quantitatively explain the benefits of ensemble-averaging and encourage the development of ML methods for thunderstorm forecasting and beyond.
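
The ensemble-averaging step itself is straightforward; a minimal sketch, assuming a trained classifier `model_prob` (a hypothetical name) that maps one NWP member's input to a thunderstorm probability:

```python
import numpy as np

# Run the same binary classifier on each NWP ensemble member and average the
# member-wise probabilities; averaging cancels member-specific noise, which is
# the mechanism the paper's analytic expression quantifies.
def ensemble_forecast(members, model_prob, threshold=0.5):
    probs = np.array([model_prob(m) for m in members])   # one probability per member
    p_mean = probs.mean(axis=0)
    return p_mean, p_mean >= threshold
```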

[LG-65] Task Shift: From Classification to Regression in Overparameterized Linear Models AISTATS2025

Link: https://arxiv.org/abs/2502.13285
Authors: Tyler LaBonte,Kuo-Wei Lai,Vidya Muthukumar
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: AISTATS 2025

Abstract:Modern machine learning methods have recently demonstrated remarkable capability to generalize under task shift, where latent knowledge is transferred to a different, often more difficult, task under a similar data distribution. We investigate this phenomenon in an overparameterized linear regression setting where the task shifts from classification during training to regression during evaluation. In the zero-shot case, wherein no regression data is available, we prove that task shift is impossible in both sparse signal and random signal models for any Gaussian covariate distribution. In the few-shot case, wherein limited regression data is available, we propose a simple postprocessing algorithm which asymptotically recovers the ground-truth predictor. Our analysis leverages a fine-grained characterization of individual parameters arising from minimum-norm interpolation which may be of independent interest. Our results show that while minimum-norm interpolators for classification cannot transfer to regression a priori, they experience surprisingly structured attenuation which enables successful task shift with limited additional data.

[LG-66] Evidence of Replica Symmetry Breaking under the Nishimori conditions in epidemic inference on graphs

Link: https://arxiv.org/abs/2502.13249
Authors: Alfredo Braunstein,Louise Budzynski,Matteo Mariani,Federico Ricci-Tersenghi
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments: 17 pages, 7 figures

Abstract:In Bayesian inference, computing the posterior distribution from the data is typically a non-trivial problem, which usually requires approximations such as mean-field approaches or numerical methods, like the Monte Carlo Markov Chain. Being a high-dimensional distribution over a set of correlated variables, the posterior distribution can undergo the notorious replica symmetry breaking transition. When it happens, several mean-field methods and virtually every Monte Carlo scheme cannot provide a reasonable approximation to the posterior and its marginals. Replica symmetry is believed to be guaranteed whenever the data is generated with known prior and likelihood distributions, namely under the so-called Nishimori conditions. In this paper, we break this belief by providing a counter-example showing that, under the Nishimori conditions, replica symmetry breaking arises. Introducing a simple, geometrical model that can be thought of as a patient zero retrieval problem in a highly infectious regime of the epidemic Susceptible-Infectious model, we show that under the Nishimori conditions, there is evidence of replica symmetry breaking. We achieve this result by computing the instability of the replica symmetric cavity method toward the one step replica symmetry broken phase. The origin of this phenomenon – replica symmetry breaking under the Nishimori conditions – is likely due to the correlated disorder appearing in the epidemic models.

[LG-67] Learning the Universe: Learning to Optimize Cosmic Initial Conditions with Non-Differentiable Structure Formation Models

Link: https://arxiv.org/abs/2502.13243
Authors: Ludvig Doeser,Metin Ata,Jens Jasche
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: 18 pages, 13 figures

Abstract:Making the most of next-generation galaxy clustering surveys requires overcoming challenges in complex, non-linear modelling to access the significant amount of information at smaller cosmological scales. Field-level inference has provided a unique opportunity beyond summary statistics to use all of the information of the galaxy distribution. However, addressing current challenges often necessitates numerical modelling that incorporates non-differentiable components, hindering the use of efficient gradient-based inference methods. In this paper, we introduce Learning the Universe by Learning to Optimize (LULO), a gradient-free framework for reconstructing the 3D cosmic initial conditions. Our approach advances deep learning to train an optimization algorithm capable of fitting state-of-the-art non-differentiable simulators to data at the field level. Importantly, the neural optimizer solely acts as a search engine in an iterative scheme, always maintaining full physics simulations in the loop, ensuring scalability and reliability. We demonstrate the method by accurately reconstructing initial conditions from M_{200c} halos identified in a dark matter-only N-body simulation with a spherical overdensity algorithm. The derived dark matter and halo overdensity fields exhibit ≥80% cross-correlation with the ground truth into the non-linear regime k ~ 1 h Mpc^{-1}. Additional cosmological tests reveal accurate recovery of the power spectra, bispectra, halo mass function, and velocities. With this work, we demonstrate a promising path forward to non-linear field-level inference surpassing the requirement of a differentiable physics model.

[LG-68] Model selection for behavioral learning data and applications to contextual bandits

Link: https://arxiv.org/abs/2502.13186
Authors: Julien Aubert (UniCA),Louis Köhler,Luc Lehéricy (LMO),Giulia Mezzadri,Patricia Reynaud-Bouret
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Learning for animals or humans is the process that leads to behaviors better adapted to the environment. This process highly depends on the individual that learns and is usually observed only through the individual’s actions. This article presents ways to use this individual behavioral data to find the model that best explains how the individual learns. We propose two model selection methods: a general hold-out procedure and an AIC-type criterion, both adapted to non-stationary dependent data. We provide theoretical error bounds for these methods that are close to those of the standard i.i.d. case. To compare these approaches, we apply them to contextual bandit models and illustrate their use on both synthetic and experimental learning data in a human categorization task.
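
As a baseline point of reference, a plain AIC comparison is sketched below; the paper's criterion adapts this idea to non-stationary dependent data, which the sketch does not capture.

```python
import numpy as np

# Plain AIC model selection: each candidate model is scored by its maximized
# log-likelihood on the behavioral sequence, penalized by parameter count.
def select_model(log_likelihoods, n_params):
    aic = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    return int(np.argmin(aic))   # index of the selected model
```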

[LG-69] Synthetic generation of 2D data records based on Autoencoders

Link: https://arxiv.org/abs/2502.13183
Authors: Darius Couchard,Oscar Olarte,Rob Haelterman
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 6-page conference paper submitted to IEEE MeMeA 2025

Abstract:Gas Chromatography coupled with Ion Mobility Spectrometry (GC-IMS) is a dual-separation analytical technique widely used for identifying components in gaseous samples by separating and analysing the arrival times of their constituent species. Data generated by GC-IMS is typically represented as two-dimensional spectra, providing rich information but posing challenges for data-driven analysis due to limited labelled datasets. This study introduces a novel method for generating synthetic 2D spectra using a deep learning framework based on Autoencoders. Although applied here to GC-IMS data, the approach is broadly applicable to any two-dimensional spectral measurements where labelled data are scarce. In component classification on a labelled dataset of GC-IMS records, the addition of synthesized records significantly improved classification performance, demonstrating the method’s potential for overcoming dataset limitations in machine learning frameworks.
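
A minimal autoencoder sketch of this kind of synthesis appears below; generating new records by jittering latent codes is an assumed mechanism for illustration, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn

class SpectraAE(nn.Module):
    """Small dense autoencoder over flattened 2D spectra (e.g. 64x64 GC-IMS maps)."""
    def __init__(self, n_pixels=64 * 64, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU(),
                                 nn.Linear(256, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, n_pixels))

    def forward(self, x):
        return self.dec(self.enc(x))

def synthesize(model, real_batch, noise=0.1):
    # Jitter latent codes of real records and decode them as synthetic records.
    with torch.no_grad():
        z = model.enc(real_batch)
        return model.dec(z + noise * torch.randn_like(z))
```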

Information Retrieval

[IR-0] Optimizing Research Portfolio For Semantic Impact

Link: https://arxiv.org/abs/2502.13912
Authors: Alexander V. Belikov
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: 24 pages; 13 figures

Abstract:Citation metrics are widely used to assess academic impact but suffer from social biases, including institutional prestige and journal visibility. Here we introduce rXiv Semantic Impact (XSI), a novel framework that predicts research impact by analyzing how scientific semantic graphs evolve in the underlying fabric of science. Rather than counting citations, XSI tracks the evolution of research concepts in the academic knowledge graph (KG). Starting with a construction of a comprehensive KG from 324K biomedical publications (2003-2025), we demonstrate that XSI can predict a paper’s future semantic impact (SI) with remarkable accuracy (R^2 = 0.69) three years in advance. We leverage these predictions to develop an optimization framework for research portfolio selection that systematically outperforms random allocation. We propose SI as a complementary metric to citations and present XSI as a tool to guide funding and publishing decisions, enhancing research impact while mitigating risk.

[IR-1] Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Link: https://arxiv.org/abs/2502.13908
Authors: Hossein A. Rahmani,Clemencia Siro,Mohammad Aliannejadi,Nick Craswell,Charles L. A. Clarke,Guglielmo Faggioli,Bhaskar Mitra,Paul Thomas,Emine Yilmaz
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages

Abstract:Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: this https URL

[IR-2] In-Place Updates of a Graph Index for Streaming Approximate Nearest Neighbor Search

Link: https://arxiv.org/abs/2502.13826
Authors: Haike Xu,Magdalen Dobson Manohar,Philip A. Bernstein,Badrish Chandramouli,Richard Wen,Harsha Vardhan Simhadri
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Indices for approximate nearest neighbor search (ANNS) are a basic component for information retrieval and widely used in database, search, recommendation and RAG systems. In these scenarios, documents or other objects are inserted into and deleted from the working set at a high rate, requiring a stream of updates to the vector index. Algorithms based on proximity graph indices are the most efficient indices for ANNS, winning many benchmark competitions. However, it is challenging to update such graph index at a high rate, while supporting stable recall after many updates. Since the graph is singly-linked, deletions are hard because there is no fast way to find in-neighbors of a deleted vertex. Therefore, to update the graph, state-of-the-art algorithms such as FreshDiskANN accumulate deletions in a batch and periodically consolidate, removing edges to deleted vertices and modifying the graph to ensure recall stability. In this paper, we present IP-DiskANN (InPlaceUpdate-DiskANN), the first algorithm to avoid batch consolidation by efficiently processing each insertion and deletion in-place. Our experiments using standard benchmarks show that IP-DiskANN has stable recall over various lengthy update patterns in both high-recall and low-recall regimes. Further, its query throughput and update speed are better than using the batch consolidation algorithm and HNSW.

[IR-3] Generative Large Recommendation Models: Emerging Trends in LLMs for Recommendation WWW2025

Link: https://arxiv.org/abs/2502.13783
Authors: Hao Wang,Wei Guo,Luankang Zhang,Jin Yao Chin,Yufei Ye,Huifeng Guo,Yong Liu,Defu Lian,Ruiming Tang,Enhong Chen
Subjects: Information Retrieval (cs.IR)
Comments: This paper has been accepted for the tutorial track at WWW 2025

Abstract:In the era of information overload, recommendation systems play a pivotal role in filtering data and delivering personalized content. Recent advancements in feature interaction and user behavior modeling have significantly enhanced the recall and ranking processes of these systems. With the rise of large language models (LLMs), new opportunities have emerged to further improve recommendation systems. This tutorial explores two primary approaches for integrating LLMs: LLMs-enhanced recommendations, which leverage the reasoning capabilities of general LLMs, and generative large recommendation models, which focus on scaling and sophistication. While the former has been extensively covered in existing literature, the latter remains underexplored. This tutorial aims to fill this gap by providing a comprehensive overview of generative large recommendation models, including their recent advancements, challenges, and potential research directions. Key topics include data quality, scaling laws, user behavior mining, and efficiency in training and inference. By engaging with this tutorial, participants will gain insights into the latest developments and future opportunities in the field, aiding both academic research and practical applications. The timely nature of this exploration supports the rapid evolution of recommendation systems, offering valuable guidance for researchers and practitioners alike.

[IR-4] Unsupervised Graph Embeddings for Session-based Recommendation with Item Features

Link: https://arxiv.org/abs/2502.13763
Authors: Andreas Peintner,Marta Moscati,Emilia Parada-Cabaleiro,Markus Schedl,Eva Zangerle
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:In session-based recommender systems, predictions are based on the user’s preceding behavior in the session. State-of-the-art sequential recommendation algorithms either use graph neural networks to model sessions in a graph or leverage the similarity of sessions by exploiting item features. In this paper, we combine these two approaches and propose a novel method, Graph Convolutional Network Extension (GCNext), which incorporates item features directly into the graph representation via graph convolutional networks. GCNext creates a feature-rich item co-occurrence graph and learns the corresponding item embeddings in an unsupervised manner. We show on three datasets that integrating GCNext into sequential recommendation algorithms significantly boosts the performance of nearest-neighbor methods as well as neural network models. Our flexible extension is easy to incorporate in state-of-the-art methods and increases the MRR@20 by up to 12.79%.
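
The graph-construction step can be pictured with the short sketch below (illustrative names; the GCN embedding learning itself is omitted):

```python
from collections import defaultdict

# Build a weighted item co-occurrence graph from session sequences; each node
# additionally carries its item-feature vector, making the graph feature-rich.
def build_cooccurrence_graph(sessions, item_features):
    edges = defaultdict(int)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            edges[(a, b)] += 1                    # directed, count-weighted edge
    nodes = {i: item_features[i] for s in sessions for i in s}
    return nodes, dict(edges)
```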

[IR-5] TALKPLAY: Multimodal Music Recommendation with Large Language Models

Link: https://arxiv.org/abs/2502.13713
Authors: Seungheon Doh,Keunwoo Choi,Juhan Nam
Subjects: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We present TalkPlay, a multimodal music recommendation system that reformulates the recommendation task as large language model token generation. TalkPlay represents music through an expanded token vocabulary that encodes multiple modalities - audio, lyrics, metadata, semantic tags, and playlist co-occurrence. Using these rich representations, the model learns to generate recommendations through next-token prediction on music recommendation conversations, which requires learning the associations between natural language queries and responses, as well as music items. In other words, the formulation transforms music recommendation into a natural language understanding task, where the model’s ability to predict conversation tokens directly optimizes query-item relevance. Our approach eliminates traditional recommendation-dialogue pipeline complexity, enabling end-to-end learning of query-aware music recommendations. In the experiment, TalkPlay is successfully trained and outperforms baseline methods in various aspects, demonstrating strong context understanding as a conversational music recommender.

[IR-6] Bursting Filter Bubble: Enhancing Serendipity Recommendations with Aligned Large Language Models

Link: https://arxiv.org/abs/2502.13539
Authors: Yunjia Xi,Muyan Weng,Wen Chen,Chao Yi,Dian Chen,Gaoyang Guo,Mao Zhang,Jian Wu,Yuning Jiang,Qingwen Liu,Yong Yu,Weinan Zhang
Subjects: Information Retrieval (cs.IR)
Comments: 15 pages

Abstract:Recommender systems (RSs) often suffer from the feedback loop phenomenon, e.g., RSs are trained on data biased by their recommendations. This leads to the filter bubble effect that reinforces homogeneous content and reduces user satisfaction. To address this, serendipity recommendations, which offer unexpected yet relevant items, are proposed. Recently, large language models (LLMs) have shown potential in serendipity prediction due to their extensive world knowledge and reasoning capabilities. However, they still face challenges in aligning serendipity judgments with human assessments, handling long user behavior sequences, and meeting the latency requirements of industrial RSs. To address these issues, we propose SERAL (Serendipity Recommendations with Aligned Large Language Models), a framework comprising three stages: (1) Cognition Profile Generation to compress user behavior into multi-level profiles; (2) SerenGPT Alignment to align serendipity judgments with human preferences using enriched training data; and (3) Nearline Adaptation to integrate SerenGPT into industrial RSs pipelines efficiently. Online experiments demonstrate that SERAL improves exposure ratio (PVR), clicks, and transactions of serendipitous items by 5.7%, 29.56%, and 27.6%, enhancing user experience without much impact on overall revenue. It has now been fully deployed in the “Guess What You Like” section of the Taobao App homepage.

[IR-7] Breaking the Clusters: Uniformity-Optimization for Text-Based Sequential Recommendation

Link: https://arxiv.org/abs/2502.13530
Authors: Wuhan Chen,Zongwei Wang,Min Gao,Xin Xia,Feng Jiang,Junhao Wen
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Traditional sequential recommendation (SR) methods heavily rely on explicit item IDs to capture user preferences over time. This reliance introduces critical limitations in cold-start scenarios and domain transfer tasks, where unseen items and new contexts often lack established ID mappings. To overcome these limitations, recent studies have shifted towards leveraging text-only information for recommendation, thereby improving model generalization and adaptability across domains. Although promising, text-based SR faces unique difficulties: items’ text descriptions often share semantic similarities that lead to clustered item representations, compromising their uniformity, a property essential for promoting diversity and enhancing generalization in recommendation systems. In this paper, we explore a novel framework to improve the uniformity of item representations in text-based SR. Our analysis reveals that items within a sequence exhibit marked semantic similarity, meaning they are closer in representation than items overall, and that this effect is more pronounced for less popular items, which form tighter clusters compared to their more popular counterparts. Based on these findings, we propose UniT, a framework that employs three pairwise item sampling strategies: Unified General Sampling Strategy, Sequence-Driven Sampling Strategy, and Popularity-Driven Sampling Strategy. Each strategy applies varying degrees of repulsion to selectively adjust the distances between item pairs, thereby refining representation uniformity while considering both sequence context and item popularity. Extensive experiments on multiple real-world datasets demonstrate that our proposed approach outperforms state-of-the-art models, validating the effectiveness of UniT in enhancing both representation uniformity and recommendation accuracy. The source code is available at this https URL.
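
For background, the standard pairwise uniformity loss that such repulsion strategies build on is sketched below; UniT's three sampling strategies and their pair weighting are not reproduced here.

```python
import torch

# Standard pairwise uniformity loss over L2-normalized embeddings: minimizing
# it pushes item pairs apart on the hypersphere (lower = more uniform).
def uniformity_loss(emb, t=2.0):
    n = emb.size(0)
    sq_dists = torch.cdist(emb, emb).pow(2)
    off_diag = sq_dists[~torch.eye(n, dtype=torch.bool)]
    return torch.log(torch.exp(-t * off_diag).mean())
```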

[IR-8] Reproducing NevIR: Negation in Neural Information Retrieval SIGIR2025

Link: https://arxiv.org/abs/2502.13506
Authors: Coen van Elsen,Francien Barkhof,Thijmen Nijdam,Simon Lupart,Mohammad Alliannejadi
Subjects: Information Retrieval (cs.IR)
Comments: 9 pages, 5 figures, under review at SIGIR 2025

Abstract:Negation is a fundamental aspect of human communication, yet it remains a challenge for Language Models (LMs) in Information Retrieval (IR). Despite the heavy reliance of modern neural IR systems on LMs, little attention has been given to their handling of negation. In this study, we reproduce and extend the findings of NevIR, a benchmark study that revealed most IR models perform at or below the level of random ranking when dealing with negation. We replicate NevIR’s original experiments and evaluate newly developed state-of-the-art IR models. Our findings show that a recently emerging category - listwise Large Language Model (LLM) rerankers - outperforms other models but still underperforms human performance. Additionally, we leverage ExcluIR, a benchmark dataset designed for exclusionary queries with extensive negation, to assess the generalizability of negation understanding. Our findings suggest that fine-tuning on one dataset does not reliably improve performance on the other, indicating notable differences in their data distributions. Furthermore, we observe that only cross-encoders and listwise LLM rerankers achieve reasonable performance across both negation tasks.

[IR-9] LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models

Link: https://arxiv.org/abs/2502.13481
Authors: Ruiming Tang,Chenxu Zhu,Bo Chen,Weipeng Zhang,Menghui Zhu,Xinyi Dai,Huifeng Guo
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Tagging systems play an essential role in various information retrieval applications such as search engines and recommender systems. Recently, Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities. Despite achieving remarkable performance, existing methods still have limitations, including difficulties in retrieving relevant candidate tags comprehensively, challenges in adapting to emerging domain-specific knowledge, and the lack of reliable tag confidence quantification. To address these three limitations above, we propose an automatic tagging system LLM4Tag. First, a graph-based tag recall module is designed to effectively and comprehensively construct a small-scale highly relevant candidate tag set. Subsequently, a knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection. Finally, a tag confidence calibration module is introduced to generate reliable tag confidence scores. Extensive experiments over three large-scale industrial datasets show that LLM4Tag significantly outperforms the state-of-the-art baselines and LLM4Tag has been deployed online for content tagging to serve hundreds of millions of users.

[IR-10] Range Retrieval with Graph-Based Indices

Link: https://arxiv.org/abs/2502.13245
Authors: Magdalen Dobson Manohar,Taekseung Kim,Guy E. Blelloch
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Retrieving points based on proximity in a high-dimensional vector space is a crucial step in information retrieval applications. The approximate nearest neighbor search (ANNS) problem, which identifies the k nearest neighbors for a query (approximately, since exact search is hard), has been extensively studied in recent years. However, comparatively little attention has been paid to the related problem of finding all points within a given distance of a query, the range retrieval problem, despite its applications in areas such as duplicate detection, plagiarism checking, and facial recognition. In this paper, we present a set of algorithms for range retrieval on graph-based vector indices, which are known to achieve excellent performance on ANNS queries. Since a range query may have anywhere from no matching results to thousands of matching results in the database, we introduce a set of range retrieval algorithms based on modifications of the standard graph search that adapt to terminate quickly on queries in the former group, and to put more resources into finding results for the latter group. Due to the lack of existing benchmarks for range retrieval, we also undertake a comprehensive study of range characteristics of existing embedding datasets, and select a suitable range retrieval radius for eight existing datasets with up to 100 million points in addition to the one existing benchmark. We test our algorithms on these datasets, and find up to 100x improvement in query throughput over a naive baseline approach, with 5-10x improvement on average, and strong performance up to 100 million data points.
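
A simplified sketch of range retrieval on a proximity graph appears below (greedy descent toward the query, then a bounded expansion around the range); the paper's algorithms adapt termination and resource allocation far more carefully.

```python
from collections import deque

# Greedy descent to the query's neighborhood, then a flood expansion that only
# passes through nodes within a slack factor of the radius.
def range_search(graph, dist, entry, radius, slack=1.2):
    # graph: node -> list of neighbors; dist(node) -> distance from node to query.
    cur = entry
    while True:  # greedy descent: move to the closest neighbor while it improves
        nxt = min(graph[cur], key=dist, default=cur)
        if dist(nxt) >= dist(cur):
            break
        cur = nxt
    results, seen, queue = set(), {cur}, deque([cur])
    while queue:
        node = queue.popleft()
        if dist(node) <= radius:
            results.add(node)
        if dist(node) <= slack * radius:   # keep expanding near the range boundary
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return results
```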

Attachments

Click to download the full list of today's papers