This post contains the latest paper list fetched from Arxiv.org on 2025-05-23, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-23)

A total of 768 papers were updated today (papers cross-listed under several categories are counted once per category), including:

  • Natural Language Processing: 188 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 249 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 188 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 221 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

[Quick Read]: This paper targets the poor performance of visual generation models on complex text prompts that specify multiple objects with precise spatial relationships and attributes. The key to its solution is the GoT-R1 framework, which uses reinforcement learning to strengthen semantic-spatial reasoning. Building on the Generation Chain-of-Thought approach, GoT-R1 lets the model autonomously discover effective reasoning strategies beyond predefined templates, and introduces a dual-stage multi-dimensional reward framework that uses multimodal large language models (MLLMs) to evaluate both the reasoning process and the final output, enabling effective supervision across the entire generation pipeline.

Link: https://arxiv.org/abs/2505.17022
Authors: Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Affiliations: HKU MMLab; CUHK MMLab; SenseTime; Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: GitHub page: this https URL

Abstract:Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at this https URL.

[NLP-1] Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

[Quick Read]: This paper addresses the domain-specific challenges of applying reinforcement learning (RL) to autoregressive image generation, including ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models rather than relying on simple rule-based rewards. The key to its approach is a systematic evaluation of two mainstream RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), in this domain, together with an in-depth analysis of how different reward models affect their performance. The study also examines three common scaling strategies for improving in-domain and out-of-domain performance, revealing that reward models with stronger intrinsic generalization are important for raising the overall generalization potential of the RL algorithms.

Link: https://arxiv.org/abs/2505.17017
Authors: Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Affiliations: CUHK MiuLar Lab & MMLab; Peking University; Shanghai AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code is released at this https URL

Abstract:Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at this https URL

[NLP-2] Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

[Quick Read]: This paper addresses a limitation of multi-modal large language models (MLLMs) in spatial understanding: they currently handle only single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. The key to the solution is a framework that equips MLLMs with multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Its core is the MultiSPA dataset, a novel large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes, accompanied by a comprehensive benchmark that evaluates a wide range of spatial tasks under uniform metrics. The resulting Multi-SpatialMLLM model achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame reasoning.

Link: https://arxiv.org/abs/2505.17015
Authors: Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 24 pages. An MLLM, dataset, and benchmark for multi-frame spatial understanding. Project page: this https URL

Abstract:Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

[NLP-3] R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

[Quick Read]: This paper addresses the hallucinations large language models (LLMs) produce because of their static knowledge, as well as the cost, poor generalization, and neglect of internal knowledge in existing retrieval-augmented generation (RAG) methods. The key to its solution is R1-Searcher++, a framework that trains the model to adaptively draw on both internal and external knowledge sources through a two-stage strategy: an SFT cold-start phase for preliminary format learning, followed by reinforcement learning (RL) for dynamic knowledge acquisition. The RL stage combines outcome supervision to encourage exploration, a reward for using internal knowledge, and a memorization mechanism that continuously assimilates retrieved information, enriching the model's internal knowledge base and enabling efficient retrieval-augmented reasoning.

Link: https://arxiv.org/abs/2505.17005
Authors: Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; DataCanvas Alaya NeW; Beijing Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods often are costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model’s internal knowledge. By leveraging internal knowledge and external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.

[NLP-4] Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

[Quick Read]: This paper addresses the lack of systematic evaluation of large language models (LLMs) on logical reasoning tasks, in particular a comprehensive assessment of model reasoning guided by formal languages. The key to its approach is an analysis along three dimensions, namely the spectrum of LLMs, the taxonomy of tasks, and the format of reasoning trajectories, combined with curated formal training data and a simple rejected fine-tuning method that improves generalization across formal languages.

Link: https://arxiv.org/abs/2505.16998
Authors: Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao
Affiliations: Peking University; Meituan Group; Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at this https URL.

[NLP-5] X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

[Quick Read]: This paper addresses the intelligence ceiling of conventional multi-agent systems (MAS) driven by a single large language model (LLM): existing frameworks typically rely on one LLM to power all agents, capping system performance at that model's capability. The key to the solution is X-MAS, a heterogeneous LLM-driven MAS in which specialized agents are powered by diverse LLMs, raising the system's potential to the collective intelligence of those models.

Link: https://arxiv.org/abs/2505.16997
Authors: Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, Siheng Chen
Affiliations: Shanghai Jiao Tong University; University of Oxford; The University of Sydney; Shanghai AI Laboratory
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 19 pages, 5 figures

Abstract:LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system’s intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system’s potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.

[NLP-6] DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization

[Quick Read]: This paper addresses the common psychological errors that persist in Emotional Support Conversation (ESC) and the limited effectiveness of Direct Preference Optimization (DPO) on ESC tasks. The key to its solution is Inferential Preference Mining (IPM) for constructing high-quality preference data, together with a decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which splits the task into two subtasks, strategy planning and empathic response generation, thereby improving response quality and reducing preference bias.

Link: https://arxiv.org/abs/2505.16995
Authors: Chao Zhang, Xin Shi, Xueqiao Zhang, Yifan Zhu, Yi Yang, Yawei Luo
Affiliations: Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross’s Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.

[NLP-7] R²ec: Towards Large Recommender Models with Reasoning

[Quick Read]: This paper addresses the heavy resource cost and suboptimal joint optimization that arise when conventional recommender systems are coupled with large language models (LLMs). The key to its solution is R²ec, a unified large recommender model with intrinsic reasoning capabilities. The architecture is reworked so that reasoning and recommendation interleave within the autoregressive process, and the accompanying RecPO reinforcement learning framework optimizes both capabilities in a single policy update, using a fused reward scheme that relies solely on recommendation labels to elicit reasoning and thus avoids any dependence on specialized reasoning annotations.

Link: https://arxiv.org/abs/2505.16994
Authors: Runyang You, Yongqi Li, Xinyu Lin, Xin Zhang, Wenjie Wang, Wenjie Li, Liqiang Nie
Affiliations: The Hong Kong Polytechnic University; National University of Singapore; University of Science and Technology of China; Harbin Institute of Technology (Shenzhen)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs are limited in significant resource cost and suboptimal joint optimization. To address these issues, we propose R²ec, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of R²ec simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of R²ec, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at this https URL.

[NLP-8] MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

[Quick Read]: This paper addresses the lack of a unified codebase in the field of LLM-based multi-agent systems (MAS), which has caused redundant re-implementation, unfair comparisons, and high entry barriers for researchers. The key to the solution is MASLab, a unified, comprehensive, and research-friendly codebase whose core features are: integrating more than 20 established methods across multiple domains, each rigorously validated; providing a unified environment with diverse benchmarks to ensure fair comparisons between methods; and implementing methods within a shared, streamlined structure that lowers the barriers to understanding and extension.

Link: https://arxiv.org/abs/2505.16988
Authors: Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; University of Oxford; Princeton University; Meta; University of Michigan; The University of Sydney; Beihang University; Nanyang Technological University; Nanjing University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 18 pages, 11 figures

Abstract:LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

[NLP-9] T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

[Quick Read]: This paper addresses the significant challenge of effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations. The key to the solution is T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains, with an integrated caching mechanism supporting both short- and long-term memory and dynamic replanning capabilities (e.g., deciding whether to recompute or reuse cached results).

Link: https://arxiv.org/abs/2505.16986
Authors: Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls-particularly in multi-turn conversations-remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents’ ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning-such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

[NLP-10] UFT: Unifying Supervised and Reinforcement Fine-Tuning

[Quick Read]: This paper addresses the limitations of the standard post-training methods, supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), for improving the reasoning ability of large language models (LLMs). SFT is efficient but prone to overfitting, which limits the reasoning of larger models, while RFT generalizes better but depends heavily on the strength of the base model. The key to the solution is Unified Fine-Tuning (UFT), a new post-training paradigm that merges SFT and RFT into a single integrated process, allowing the model to explore solutions effectively while incorporating informative supervision signals, thereby bridging the gap between memorizing and thinking that underlies existing methods.

Link: https://arxiv.org/abs/2505.16984
Authors: Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT’s inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

[NLP-11] LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding ACL2025

[Quick Read]: This paper addresses the problem of adapting batch-oriented large language models (LLMs) to streaming scenarios, where existing methods rely either on expensive re-encoding or on specialized architectures with limited scalability. It identifies three key mismatches: input-attention, output-attention, and position-ID mismatches, and finds that only the input-attention mismatch significantly affects performance, indicating that re-encoding outputs is largely unnecessary. The key to the solution is a group position encoding paradigm built on batch architectures that improves consistency between streaming and batch modes, requires no architectural modifications, and generalizes strongly in both modes.

Link: https://arxiv.org/abs/2505.16983
Authors: Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
Affiliations: Shanghai Jiao Tong University; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT; National University of Singapore; University of Science and Technology of China; Meituan Inc.
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 Findings

Abstract:Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications, exhibits strong generalization in both streaming and batch modes. The code is available at repository this https URL.

[NLP-12] SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

[Quick Read]: This paper addresses the underexplored status of feature-driven development (FDD), a highly prevalent real-world task in which new functionality is built on top of large existing codebases, in generative AI research. The key to the solution is SWE-Dev, a large-scale dataset (14,000 training and 500 test samples) for evaluating and training autonomous coding systems on real-world FDD tasks. SWE-Dev is unique in providing every instance with a runnable environment and its developer-authored executable unit tests, ensuring verifiable and diverse training data and enabling reinforcement learning through the precise reward signals those unit tests provide.

Link: https://arxiv.org/abs/2505.16975
Authors: Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen
Affiliations: Shanghai Jiao Tong University; Beijing University of Aeronautics and Astronautics; Soochow University; TikTok; University of Michigan
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g. code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model comparable to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at this https URL.

[NLP-13] VeriFastScore: Speeding up long-form factuality evaluation

[Quick Read]: This paper addresses the heavy computational overhead of existing long-form factuality evaluation methods such as FactScore and VeriScore, which decompose a response into atomic claims and verify each one individually; this makes them slow and impractical for large-scale evaluation and training. The key to the solution is VeriFastScore, which fine-tunes Llama3.1 8B on synthetic data to simultaneously extract and verify all verifiable claims in a text against evidence from Google Search, achieving a large speedup while remaining highly consistent with the original VeriScore pipeline.

Link: https://arxiv.org/abs/2505.16973
Authors: Rishanth Rajendhran, Amir Zadeh, Matthew Sarte, Chuan Li, Mohit Iyyer
Affiliations: University of Maryland; Lambda Labs
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.

[NLP-14] From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

[Quick Read]: This paper addresses the challenge of extending automatic speech recognition (ASR) coverage to languages with limited resources, where conventional approaches depend on massive speech corpora. The key to the solution is Speech Back-Translation, a scalable pipeline that converts large text corpora into synthetic speech with off-the-shelf text-to-speech (TTS) models, generating large volumes of high-quality synthetic speech to strengthen multilingual ASR models.

Link: https://arxiv.org/abs/2505.16972
Authors: Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
Affiliations: Singapore University of Technology and Design; ByteDance Seed
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.

[NLP-15] CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

[Quick Read]: This paper addresses low-level code portability in cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The key to the solution is CASS, the first large-scale dataset and model suite for this task, comprising 70k verified code pairs, on which the CASS family of domain-specific language models is trained to deliver high-accuracy translation while preserving performance.

Link: https://arxiv.org/abs/2505.16968
Authors: Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud
Affiliations: MBZUAI; Australian National University
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: 20 pages, 11 figures, 5 tables

Abstract:We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation. Dataset and benchmark are on HuggingFace (this https URL), with code on GitHub (this https URL).

[NLP-16] Fixing Data That Hurts Performance: Cascading LLM s to Relabel Hard Negatives for Robust Information Retrieval

[Quick Read]: This paper examines how training-data quality limits retrieval and reranker models, focusing on "false negatives", relevant passages incorrectly labeled as irrelevant. The key to its solution is a simple, cost-effective approach that uses cascading LLM prompts to identify and relabel hard negatives. Experiments show that relabeling false negatives yields clear gains for retrieval models such as E5 and Qwen2.5-7B, and for rerankers fine-tuned on the relabeled data, across several benchmarks.

Link: https://arxiv.org/abs/2505.16967
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
Affiliations: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL; datasets are available at this https URL

Abstract:Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness – pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on “false negatives”, where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
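
To make the cascading idea concrete, here is a minimal sketch of two-stage relabeling: a cheap judge model is asked whether a passage currently labeled negative actually answers the query, and only uncertain cases are escalated to a stronger judge. The prompt wording, the YES/NO/UNSURE protocol, and the model names are illustrative assumptions rather than the paper's actual prompts; `call_llm` stands in for any chat-completion client.

```python
# Minimal sketch of cascading LLM relabeling of hard negatives.
# Hypothetical pieces: `call_llm(model, prompt) -> str` and the model names.
from typing import Callable, List, Tuple

PROMPT = (
    "Query: {query}\n"
    "Passage: {passage}\n"
    "Does the passage answer the query? Reply YES, NO, or UNSURE."
)

def judge(call_llm: Callable[[str, str], str], model: str, query: str, passage: str) -> str:
    return call_llm(model, PROMPT.format(query=query, passage=passage)).strip().upper()

def relabel_hard_negatives(call_llm, pairs: List[Tuple[str, str]]) -> List[int]:
    """pairs: (query, passage) currently labeled negative.
    Returns indices that should be flipped to positive (false negatives)."""
    flipped = []
    for i, (query, passage) in enumerate(pairs):
        verdict = judge(call_llm, "cheap-judge", query, passage)       # stage 1
        if verdict == "UNSURE":
            verdict = judge(call_llm, "strong-judge", query, passage)  # stage 2: escalate
        if verdict == "YES":
            flipped.append(i)
    return flipped
```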

[NLP-17] BP-Seg: A graphical model approach to unsupervised and non-contiguous text segmentation using belief propagation

[Quick Read]: This paper addresses text segmentation based on the semantic meaning of sentences, a fundamental task with broad utility in downstream applications. The key to its solution is BP-Seg, an unsupervised, graphical-model-based method that runs belief propagation on a carefully constructed graphical model, capturing not only local coherence but also effectively grouping sentences that are distant in the text yet semantically similar.

Link: https://arxiv.org/abs/2505.16965
Authors: Fengyi Li, Kayhan Behdin, Natesh Pillai, Xiaofeng Wang, Zhipeng Wang, Ercan Yildiz
Affiliations: LinkedIn Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Text segmentation based on the semantic meaning of sentences is a fundamental task with broad utility in many downstream applications. In this paper, we propose a graphical model-based unsupervised learning approach, named BP-Seg for efficient text segmentation. Our method not only considers local coherence, capturing the intuition that adjacent sentences are often more related, but also effectively groups sentences that are distant in the text yet semantically similar. This is achieved through belief propagation on the carefully constructed graphical models. Experimental results on both an illustrative example and a dataset with long-form documents demonstrate that our method performs favorably compared to competing approaches.
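
The abstract does not spell out the graphical model, but the behavior it describes (local coherence plus long-range semantic grouping) is what a pairwise Markov random field solved by loopy belief propagation provides. The sketch below is a generic sum-product implementation under that assumption; the Potts-style edge potential, the toy unaries, and the similarity weights are illustrative choices, not BP-Seg's actual construction.

```python
# Generic sum-product loopy BP for grouping N sentences into K segments
# over a similarity graph (illustrative; not BP-Seg's exact model).
import numpy as np

def loopy_bp(unary, weights, n_iters=50):
    """unary: (N, K) node potentials (sentence-to-segment affinity).
    weights: (N, N) symmetric similarities; weights[i, j] > 0 adds an edge."""
    N, K = unary.shape
    nbrs = [[j for j in range(N) if j != i and weights[i, j] > 0] for i in range(N)]
    msg = {(i, j): np.ones(K) / K for i in range(N) for j in nbrs[i]}
    for _ in range(n_iters):
        new_msg = {}
        for i in range(N):
            for j in nbrs[i]:
                prod = unary[i].copy()
                for k in nbrs[i]:
                    if k != j:
                        prod = prod * msg[(k, i)]
                # Potts-style potential: exp(w_ij) for equal labels, 1 otherwise.
                psi = np.where(np.eye(K, dtype=bool), np.exp(weights[i, j]), 1.0)
                m = prod @ psi  # m[b] = sum_a prod[a] * psi[a, b]
                new_msg[(i, j)] = m / m.sum()
        msg = new_msg
    beliefs = unary.copy()
    for i in range(N):
        for k in nbrs[i]:
            beliefs[i] = beliefs[i] * msg[(k, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Toy run: sentences 0-1 and 2-3 form mutually similar pairs.
unary = np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]])
weights = np.zeros((4, 4))
weights[0, 1] = weights[1, 0] = weights[2, 3] = weights[3, 2] = 2.0
print(loopy_bp(unary, weights).argmax(axis=1))  # expected: [0 0 1 1]
```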

[NLP-18] MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning

[Quick Read]: This paper addresses the fact that existing medical visual question answering (VQA) benchmarks focus mainly on single-image analysis, whereas clinicians almost always compare a series of images before reaching a diagnosis. The key to the solution is MedFrameQA, the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. It is built with an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, together with a multi-stage filtering strategy that preserves data clarity, difficulty, and medical relevance, bringing evaluation closer to real clinical workflows.

Link: https://arxiv.org/abs/2505.16964
Authors: Suhao Yu, Haojin Wang, Juncheng Wu, Cihang Xie, Yuyin Zhou
Affiliations: University of Pennsylvania; University of Illinois Urbana-Champaign; UC Santa Cruz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures. Benchmark data: this https URL

Abstract:Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA – the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA both at scale and in high-quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs – both proprietary and open source, with and without explicit reasoning modules – on MedFrameQA. The evaluation challengingly reveals that all models perform poorly, with most accuracies below 50%, and accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.

[NLP-19] On Multilingual Encoder Language Model Compression for Low-Resource Languages

[Quick Read]: This paper addresses the compression of multilingual encoder models for low-resource languages, aiming at more efficient model deployment. The key to the solution is systematically combining two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming, shrinking layer depth, feed-forward hidden size, and intermediate embedding size to an extreme degree while retaining essential language-specific knowledge. This extreme compression reaches rates of up to 92% with only a marginal 2-10% performance drop across four downstream tasks.

Link: https://arxiv.org/abs/2505.16956
Authors: Daniil Gurgurov, Michal Gregor, Josef van Genabith, Simon Ostermann
Affiliations: Saarland University; German Research Center for Artificial Intelligence (DFKI); Kempelen Institute of Intelligent Technologies (KInIT); Centre for European Research in Trusted AI (CERTAIN)
Subjects: Computation and Language (cs.CL)
Comments: Pre-print

Abstract:In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.

[NLP-20] AgentIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

[Quick Read]: This paper addresses the insufficient ability of large language models (LLMs) to follow complex instructions in agentic scenarios, especially long instructions with multiple constraints. The key to the solution is AgentIF, the first benchmark for systematically evaluating LLM instruction following in agentic scenarios. It is realistic, long, and structurally complex: 707 human-annotated instructions collected from 50 real-world agentic tasks, each annotated with its associated constraints and evaluation metrics, providing a foundation for assessing and improving instruction-following ability.

Link: https://arxiv.org/abs/2505.16944
Authors: Yunjia Qi, Hao Peng, Xiaozhi Wang, Amy Xin, Youfeng Liu, Bin Xu, Lei Hou, Juanzi Li
Affiliations: Tsinghua University; Zhipu AI
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.

[NLP-21] NovelSeek: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification

[Quick Read]: This paper addresses the low efficiency, limited innovation, and slow response to complex problems of traditional scientific research, proposing NovelSeek, a unified closed-loop multi-agent framework for autonomous scientific research (ASR) across fields. The key to the solution lies in three core strengths: scalability, demonstrated across 12 scientific tasks where it generates innovative ideas to improve baseline code; interactivity, integrating domain expertise through human expert feedback and multi-agent collaboration; and efficiency, delivering substantial performance gains at far lower time cost in several scientific domains, for example in reaction yield prediction, enhancer activity prediction, and 2D semantic segmentation.

Link: https://arxiv.org/abs/2505.16938
Authors: NovelSeek Team: Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wanli Ouyang, Bowen Zhou, Lei Bai
Affiliations: Shanghai Artificial Intelligence Laboratory
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Abstract:Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

[NLP-22] In-Context Watermarks for Large Language Models

[Quick Read]: This paper addresses provenance and accountability for AI-generated text in sensitive settings, particularly when the decoding process is inaccessible and conventional watermarking therefore cannot be applied. The key to the solution is In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging the in-context learning and instruction-following abilities of large language models (LLMs) to embed and detect watermarks without any access to the model's internal decoding process.

Link: https://arxiv.org/abs/2505.16934
Authors: Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, Yuheng Bu
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs’ in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.
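
As a rough sketch of how a prompt-level watermark might work at the lexical granularity (the paper's four ICW strategies and their detectors are not detailed in this abstract, so the secret word list, instruction wording, and detection threshold below are all assumptions):

```python
# Illustrative lexical in-context watermark: the instruction nudges the model
# toward a secret "green" word list; detection counts green-word frequency.
# Word list, prompt, and threshold are assumptions for illustration only.
GREEN_WORDS = {"notably", "substantial", "framework", "moreover", "robust"}

WATERMARK_INSTRUCTION = (
    "When writing your answer, naturally incorporate as many of these words "
    "as fluency allows: " + ", ".join(sorted(GREEN_WORDS)) + "."
)

def detect_watermark(text: str, threshold: float = 0.02) -> bool:
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    if not tokens:
        return False
    rate = sum(t in GREEN_WORDS for t in tokens) / len(tokens)
    return rate >= threshold  # flag if the green-word rate is unusually high

print(detect_watermark("Notably, this robust framework is, moreover, substantial."))
```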

[NLP-23] LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

[Quick Read]: This paper addresses performance and data scalability in multimodal large language models (MLLMs), particularly the limitations of the autoregressive paradigm that dominates current multimodal approaches. The key to its solution is LLaDA-V, a purely diffusion-based multimodal architecture that combines visual instruction tuning with masked diffusion models, aligning visual features effectively with the language embedding space and delivering strong multimodal performance even though its underlying language model is weaker on purely textual tasks.

Link: https://arxiv.org/abs/2505.16933
Authors: Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Affiliations: Renmin University of China; Ant Group
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: this https URL.

[NLP-24] he Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

[Quick Read]: This paper addresses the computation of the polar decomposition and the related matrix sign function in deep learning, where classical numerical-analysis methods are poorly suited: Newton-Schulz suffers from slow initial convergence, and rational-function methods rely on QR decompositions or matrix inverses. The key to the solution is Polar Express, which uses a polynomial update rule consisting only of matrix-matrix multiplications and adapts it at each iteration by solving a minimax optimization problem, achieving both rapid early convergence and fast asymptotic convergence while remaining stable in half precision (bfloat16).

Link: https://arxiv.org/abs/2505.16932
Authors: Noah Amsel, David Persson, Christopher Musco, Robert Gower
Affiliations: New York University; Flatiron Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments:

Abstract:Computing the polar decomposition and the related matrix sign function has been a well-studied problem in numerical analysis for decades. More recently, it has emerged as an important subroutine in deep learning, particularly within the Muon optimization framework. However, the requirements in this setting differ significantly from those of traditional numerical analysis. In deep learning, methods must be highly efficient and GPU-compatible, but high accuracy is often unnecessary. As a result, classical algorithms like Newton-Schulz (which suffers from slow initial convergence) and methods based on rational functions (which rely on QR decompositions or matrix inverses) are poorly suited to this context. In this work, we introduce Polar Express, a GPU-friendly algorithm for computing the polar decomposition. Like classical polynomial methods such as Newton-Schulz, our approach uses only matrix-matrix multiplications, making it GPU-compatible. Motivated by earlier work of Chen & Chow and Nakatsukasa & Freund, Polar Express adapts the polynomial update rule at each iteration by solving a minimax optimization problem, and we prove that it enjoys a strong worst-case optimality guarantee. This property ensures both rapid early convergence and fast asymptotic convergence. We also address finite-precision issues, making it stable in bfloat16 in practice. We apply Polar Express within the Muon optimization framework and show consistent improvements in validation loss on large-scale models such as GPT-2, outperforming recent alternatives across a range of learning rates.
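
For reference, the classical Newton-Schulz iteration that Polar Express improves on can be sketched in a few lines: it drives all singular values to 1 using only matrix products. Polar Express keeps this multiply-only structure but replaces the fixed 1.5/-0.5 coefficients with per-iteration coefficients obtained from the minimax problem (not reproduced here); the initial spectral-norm scaling is an assumption this sketch needs for convergence.

```python
# Classical Newton-Schulz iteration for the polar factor U of A = U P.
# GPU-friendly (matrix products only); converges when the scaled singular
# values lie in (0, sqrt(3)), hence the initial normalization.
import numpy as np

def newton_schulz_polar(A, n_iters=30):
    X = A / np.linalg.norm(A, 2)         # scale singular values into (0, 1]
    for _ in range(n_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # cubic map pushing singular values to 1
    return X                             # approx. the orthogonal polar factor U

A = np.random.randn(8, 8)
U = newton_schulz_polar(A)
print(np.allclose(U.T @ U, np.eye(8), atol=1e-6))  # True: U is near-orthogonal
```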

[NLP-25] PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues EMNLP2025

[Quick Read]: This paper addresses personally identifiable information (PII) anonymization, a high-stakes task that remains a barrier to many open-science data-sharing initiatives. Although PII identification has made large strides in recent years, in practice error thresholds and the recall/precision trade-off still limit the uptake of anonymization pipelines. The key to the proposed solution, PIIvot, is to leverage knowledge of the data context to simplify the PII detection problem, making anonymization lighter-weight and more effective.

Link: https://arxiv.org/abs/2505.16931
Authors: Matthew Zent, Digory Smith, Simon Woodhead
Affiliations: Eedi
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 2 figures, submitted to EMNLP 2025; for the associated dataset, see this https URL

Abstract:Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.

[NLP-26] Latent Principle Discovery for Language Model Self-Improvement

[Quick Read]: This paper addresses how to automatically elicit and refine the latent behavioral attributes a language model (LM) should follow during generation, without extensive human annotation. The central challenge is mining these latent principles from the model itself and compressing them into an interpretable set that guides generation toward human-preferred responses. The key to the solution is to model these attributes explicitly in a self-correction setting and to use an approximation of posterior-regularized Monte Carlo Expectation-Maximization to identify a condensed set of the most effective latent principles while teaching the model to invoke them strategically during generation, enabling intrinsic self-improvement.

Link: https://arxiv.org/abs/2505.16927
Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.

[NLP-27] UNCLE: Uncertainty Expressions in Long-Form Generation

[Quick Read]: This paper addresses the tendency of large language models (LLMs) to hallucinate in long-form generation, and specifically how to make models express uncertainty when they lack sufficient knowledge. The key to the solution is UNCLE, a benchmark for evaluating uncertainty expression in both long- and short-form question answering (QA): it spans five domains with 4k long-form QA instances and over 20k short-form QA pairs, and is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. The work also proposes a suite of new metrics for assessing selective uncertainty expression, shows that current models fail to convey uncertainty appropriately in long-form generation, and explores prompt-based and training-based improvements, with the training-based methods yielding larger gains.

Link: https://arxiv.org/abs/2505.16922
Authors: Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Dong Yu, Nigel Collier, Deqing Yang
Affiliations: Fudan University; University of Cambridge; Tencent AI Lab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs’ ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models’ capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models’ performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.

[NLP-28] Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality

[Quick Read]: This paper addresses a shortcoming of fine-tuning for text generation with the standard cross-entropy loss, which treats all tokens equally: models come to overemphasize high-frequency, low-information tokens while neglecting low-frequency tokens that are crucial for specificity and informativeness. The key to the solution is a new loss function, Power-Law Decay Loss (PDL), which re-weights each token's contribution to the standard cross-entropy loss according to its frequency in the training corpus, following a power-law decay: high-frequency tokens are down-weighted and low-frequency, information-dense tokens are up-weighted, steering the model during fine-tuning toward learning and generating specific, informative tokens and improving the quality, diversity, and informativeness of generated text.

Link: https://arxiv.org/abs/2505.16900
Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
Affiliations: Southern University of Science and Technology; Fudan University; Sun Yat-sen University; SenseTime Research
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
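
The abstract specifies the mechanism (frequency-based re-weighting with power-law decay) but not the exact formula, so the sketch below assumes a weight of the common form w(t) = (freq(t) + eps)^(-alpha) and plain PyTorch; treat it as one plausible instantiation rather than the paper's implementation.

```python
# Sketch of a power-law-decay-weighted cross-entropy loss in PyTorch.
# Assumed pieces: the weight formula (freq + eps)^(-alpha) and the
# mean-normalization that keeps the loss scale comparable to plain CE.
import torch
import torch.nn.functional as F

def pdl_loss(logits, targets, token_freq, alpha=0.5, eps=1e-8):
    """logits: (B, T, V); targets: (B, T) token ids;
    token_freq: (V,) relative token frequencies in the training corpus."""
    weights = (token_freq + eps) ** (-alpha)   # power-law decay in frequency
    weights = weights / weights.mean()         # normalize the overall loss scale
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1), reduction="none")
    return (ce * weights[targets.reshape(-1)]).mean()

# Toy call: batch of 2 sequences, length 3, vocabulary of 10.
logits = torch.randn(2, 3, 10)
targets = torch.randint(0, 10, (2, 3))
freq = torch.rand(10); freq = freq / freq.sum()
print(pdl_loss(logits, targets, freq))
```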
zh
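
【代码示意】下面给出 PDL 的一个极简 PyTorch 草图:按 token 在训练语料中的频率做幂律衰减加权,再与逐 token 交叉熵相乘。其中幂指数 alpha、平滑项 eps 以及归一化方式均为笔者假设,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def power_law_decay_loss(logits, targets, token_freq, alpha=0.5, eps=1.0):
    """幂律衰减损失(PDL)示意:按 token 频率重新加权交叉熵。

    logits:     [batch, seq, vocab]
    targets:    [batch, seq]
    token_freq: [vocab],训练语料中各 token 的出现次数(torch 张量)
    alpha、eps 为假设的超参数
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )  # 逐 token 交叉熵,形状 [batch*seq]
    freq = token_freq[targets.reshape(-1)].float()
    weights = (freq + eps) ** (-alpha)   # 高频 token 权重衰减,低频 token 权重提升
    weights = weights / weights.mean()   # 归一化,保持损失整体量级稳定
    return (weights * ce).mean()
```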

[NLP-29] Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中幻觉(Hallucinations)问题,即模型生成看似合理但错误的输出,这对模型的可靠部署构成重大障碍。其解决方案的关键在于系统性地研究幻觉发生频率与通过逐步上下文注入引发的内部状态漂移(internal-state drift)之间的关系。通过构建两种不同类型的上下文注入实验,并利用多角度检测方法追踪显性和隐性漂移特征,揭示了幻觉增长规律及内在机制,为后续的幻觉预测与上下文感知的缓解策略提供了实证基础。

链接: https://arxiv.org/abs/2505.16894
作者: Zeyu Wei,Shuo Wang,Xiaohui Rong,Xuemin Liu,He Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hallucinations – plausible yet erroneous outputs – remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round “titration” tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5–7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence “self-consistent” hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ( \sim0.69 ) and Spearman-Drift ( \sim0 ) marks an “attention-locking” threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
zh
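
【代码示意】论文追踪的几类隐性漂移量(余弦、熵、JS、Spearman)都可以用标准统计量实现。下面是一个基于 NumPy/SciPy 的示意,度量的具体定义(如以基线轮次为参照)为常见做法下的假设。

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr

def drift_metrics(h_base, h_t, attn_base, attn_t):
    """比较基线轮次与第 t 轮上下文注入后的隐状态/注意力漂移。

    h_base, h_t:       某层隐状态的均值向量,形状 [d]
    attn_base, attn_t: 某注意力头的注意力分布,形状 [n],各自和为 1
    """
    cos_drift = 1.0 - np.dot(h_base, h_t) / (
        np.linalg.norm(h_base) * np.linalg.norm(h_t))
    js_drift = jensenshannon(attn_base, attn_t) ** 2   # JS 散度(JS 距离的平方)
    spearman_drift, _ = spearmanr(attn_base, attn_t)   # 趋近 0 即“注意力锁定”信号
    entropy_drift = (-(attn_t * np.log(attn_t + 1e-12)).sum()
                     + (attn_base * np.log(attn_base + 1e-12)).sum())
    return cos_drift, js_drift, spearman_drift, entropy_drift
```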

[NLP-30] CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对对抗性攻击时的安全性问题,特别是通过操控系统提示(system prompts)来劫持AI与人类的对话,使模型仅对特定目标问题生成恶意回答,而在其他情况下表现正常。解决方案的关键在于提出CAIN算法,该算法能够在黑盒环境下或无需访问LLM参数的情况下,自动构建针对特定目标问题的有害系统提示,从而实现对LLM的高效对抗攻击。

链接: https://arxiv.org/abs/2505.16888
作者: Viet Pham,Thai Le
机构: Indiana University (印第安纳大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have advanced many applications, but are also known to be vulnerable to adversarial attacks. In this work, we introduce a novel security threat: hijacking AI-human conversations by manipulating LLMs’ system prompts to produce malicious answers only to specific targeted questions (e.g., “Who should I vote for US President?”, “Are Covid vaccines safe?”), while behaving benignly on others. This attack is detrimental as it can enable malicious actors to exercise large-scale information manipulation by spreading harmful but benign-looking system prompts online. To demonstrate such an attack, we develop CAIN, an algorithm that can automatically curate such harmful system prompts for a specific target question in a black-box setting or without the need to access the LLM’s parameters. Evaluated on both open-source and commercial LLMs, CAIN demonstrates significant adversarial impact. In untargeted attacks or forcing LLMs to output incorrect answers, CAIN achieves up to 40% F1 degradation on targeted questions while preserving high accuracy on benign inputs. For targeted attacks or forcing LLMs to output specific harmful answers, CAIN achieves over 70% F1 scores on these targeted responses with minimal impact on benign questions. Our results highlight the critical need for enhanced robustness measures to safeguard the integrity and safety of LLMs in real-world applications. All source code will be publicly available.
zh

[NLP-31] Don't “Overthink” Passage Reranking: Is Reasoning Truly Necessary?

【速读】: 该论文试图解决在基于大型语言模型(Large Language Models, LLMs)的段落重排序器中,推理过程是否能够提升重排序准确性的问题。其解决方案的关键在于通过对比基于推理的逐点(pointwise)重排序器(ReasonRR)与非推理的逐点重排序器(StandardRR),发现StandardRR在相同训练条件下表现更优,并进一步通过禁用ReasonRR的推理过程(ReasonRR-NoReason)验证推理对性能的影响,结果表明禁用推理后的模型反而更有效,揭示了推理过程可能导致极端相关性评分,从而忽略了部分相关性这一影响逐点重排序准确性的关键因素。

链接: https://arxiv.org/abs/2505.16886
作者: Nour Jedidi,Yung-Sung Chuang,James Glass,Jimmy Lin
机构: MIT Lincoln Laboratory (麻省理工学院林肯实验室); Massachusetts Institute of Technology (麻省理工学院); University of Waterloo (滑铁卢大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers built on Large Language Models (LLMs). These methods typically employ an LLM to produce an explicit, step-by-step reasoning process before arriving at a final relevance prediction. But, does reasoning actually improve reranking accuracy? In this paper, we dive deeper into this question, studying the impact of the reasoning process by comparing reasoning-based pointwise rerankers (ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under identical training conditions, and observe that StandardRR generally outperforms ReasonRR. Building on this observation, we then study the importance of reasoning to ReasonRR by disabling its reasoning process (ReasonRR-NoReason), and find that ReasonRR-NoReason is surprisingly more effective than ReasonRR. Examining the cause of this result, our findings reveal that reasoning-based rerankers are limited by the LLM’s reasoning process, which pushes it toward polarized relevance scores and thus fails to consider the partial relevance of passages, a key factor for the accuracy of pointwise rerankers.
zh
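
【代码示意】作为参照,StandardRR 这类逐点重排序器的常见做法是直接读取模型对 "true" token 的概率作为相关度分数,不生成显式推理链。下面基于 Hugging Face 接口给出一个示意,提示词与判定 token 均为笔者假设。

```python
import torch

def pointwise_score(model, tokenizer, query, passage):
    """逐点相关度打分示意:取下一 token 为 "true" 的归一化概率。"""
    prompt = (f"Query: {query}\nPassage: {passage}\n"
              "Is the passage relevant to the query? Answer true or false.")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # 下一 token 的 logits
    true_id = tokenizer.convert_tokens_to_ids("true")
    false_id = tokenizer.convert_tokens_to_ids("false")
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()                            # 相关概率,可直接用于排序
```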

[NLP-32] CASTILLO: Characterizing Response Length Distributions of Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)推理过程中计算资源管理的挑战,特别是由于自回归文本生成的固有随机性和长度变化导致的资源分配难题。现有方法要么偏向特定长度的文本生成,要么依赖忽略模型和提示特异性变化的假设。论文提出的解决方案关键在于构建CASTILLO数据集,该数据集通过在不同指令遵循语料库上对13个广泛使用的开源LLM进行评估,记录每个提示-模型样本对的10个独立生成结果的token长度及其统计信息,从而揭示模型间的和模型内的响应长度变异性,并为预测模型的开发和系统性分析提供支持。

链接: https://arxiv.org/abs/2505.16881
作者: Daniel F. Perez-Ramirez,Dejan Kostic,Magnus Boman
机构: KTH Royal Institute of Technology (皇家理工学院); RISE Computer Science (RISE计算机科学); Karolinska Institutet (卡罗林斯卡医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Dataset available in this https URL and code is available in this https URL

点击查看摘要

Abstract:Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each \langle prompt, model \rangle sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.
zh
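
【代码示意】数据集中每个 ⟨prompt, model⟩ 样本对的统计字段(均值、标准差、分位数等)可按下述方式由 10 次独立生成的 token 长度汇总得到;字段名为笔者假设,未必与数据集实际 schema 一致。

```python
import numpy as np

def summarize_lengths(token_lengths):
    """汇总同一 ⟨prompt, model⟩ 下 10 次独立生成的长度统计。"""
    arr = np.asarray(token_lengths, dtype=float)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std(ddof=1)),   # 样本标准差
        "percentiles": {p: float(np.percentile(arr, p)) for p in (25, 50, 75, 99)},
        "min": int(arr.min()),
        "max": int(arr.max()),
    }

# 示例:同一 prompt 下长度方差很大,正是论文强调的模型内变异性
print(summarize_lengths([312, 295, 340, 1024, 330, 301, 288, 955, 310, 322]))
```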

[NLP-33] MPO: Multilingual Safety Alignment via Reward Gap Optimization ACL2025

【速读】: 该论文旨在解决多语言安全对齐(multilingual safety alignment)问题,即在多种语言环境下确保大型语言模型(LLMs)的安全性,而现有方法如RLHF和DPO主要针对单语场景,难以处理噪声多语言数据。其解决方案的关键在于提出一种名为多语言奖励差距优化(MPO)的新方法,该方法利用英语这一主导语言已有的安全对齐能力,通过直接最小化主导语言与目标语言之间的奖励差距,实现安全能力的有效迁移,同时保持主导语言的原有优势。

链接: https://arxiv.org/abs/2505.16869
作者: Weixiang Zhao,Yulin Hu,Yang Deng,Tongtong Wu,Wenxuan Zhang,Jiahe Guo,An Zhang,Yanyan Zhao,Bing Qin,Tat-Seng Chua,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Singapore Management University (新加坡管理大学); Monash University (莫纳什大学); Singapore University of Technology and Design (新加坡科技设计大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: To Appear at ACL 2025 (Main)

点击查看摘要

Abstract:Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO’s efficacy in multilingual safety alignment without degrading general multilingual utility.
zh
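
【代码示意】MPO 的核心是直接最小化主导语言(英语)与目标语言之间的奖励差距。下面用 PyTorch 给出目标函数的一个极简草图,hinge 形式与 margin 超参数为笔者假设,仅示意“奖励差距最小化”这一思路,并非官方实现。

```python
import torch

def mpo_gap_loss(r_dominant, r_target, margin=0.0):
    """奖励差距最小化示意。

    r_dominant: [batch],英语回复在安全奖励模型下的得分
    r_target:   [batch],目标语言对同一 prompt 的回复得分
    """
    gap = r_dominant - r_target                       # 主导语言领先的幅度
    # 只惩罚目标语言落后于主导语言的部分,避免拉低主导语言本身的安全能力
    return torch.clamp(gap - margin, min=0.0).mean()
```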

[NLP-34] Comparative analysis of subword tokenization approaches for Indian languages

【速读】: 该论文旨在解决印度语言(Indian Languages, ILs)在机器翻译(Machine Translation, MT)中的复杂形态结构带来的挑战,特别是由于这些语言具有丰富的词形变化和复杂的构词法,传统的分词方法难以有效处理。解决方案的关键在于采用不同的子词分词技术(subword tokenization techniques),如SentencePiece、Byte Pair Encoding (BPE)和WordPiece Tokenization,以优化文本的分词过程,从而提升翻译效果。研究通过多种统计、神经网络及多语言神经机器翻译模型评估了这些分词方法的有效性,并验证了其在不同语言对中的表现差异。

链接: https://arxiv.org/abs/2505.16868
作者: Sudhansu Bala Das,Samujjal Choudhury,Tapas Kumar Mishra,Bidyut Kr. Patra
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 4 tables

点击查看摘要

Abstract:Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization, affect ILs. The effectiveness of these subword tokenization techniques is investigated in statistical, neural, and multilingual neural machine translation models. All models are examined using standard evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET. Based on the results, it appears that for the majority of language pairs for the Statistical and Neural MT models, the SentencePiece tokenizer consistently performed better than other tokenizers in terms of BLEU score. However, BPE tokenization outperformed other tokenization techniques in the context of the multilingual neural machine translation model. The results show that, despite using the same tokenizer and dataset for each model, translations from ILs to English surpassed translations from English to ILs.
zh
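
【代码示意】用 sentencepiece 库即可在同一语料上分别训练论文比较的 unigram(SentencePiece 默认)与 BPE 两种子词模型;语料路径与词表大小为假设值。

```python
import sentencepiece as spm

# 在同一份(假设的)印地语语料上训练两种子词模型,便于对照切分效果
for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="train.hi",                  # 假设的训练语料路径
        model_prefix=f"hi_{model_type}",
        vocab_size=16000,
        model_type=model_type,
        character_coverage=1.0,            # 印度语言字符集较大,建议全覆盖
    )

sp = spm.SentencePieceProcessor(model_file="hi_bpe.model")
print(sp.encode("यह एक उदाहरण वाक्य है", out_type=str))  # 查看 BPE 切分结果
```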

[NLP-35] Nested Named Entity Recognition as Single-Pass Sequence Labeling EMNLP2025

【速读】: 该论文试图解决嵌套命名实体识别(Nested Named Entity Recognition, NNER)的问题,该任务旨在识别文本中存在层级结构的实体。其解决方案的关键在于将NNER建模为序列标注任务,并通过利用先前工作中的成分结构线性化方法,将这一结构预测问题转化为简单的词元分类问题,从而有效降低复杂度。通过结合这些成分结构线性化方法与预训练编码器,该方法在执行精确的n次标记操作的同时捕获嵌套实体,实现了与效率较低系统相当的性能。

链接: https://arxiv.org/abs/2505.16855
作者: Alberto Muñoz-Ortiz,David Vilares,Caio Corro,Carlos Gómez-Rodríguez
机构: Universidade da Coruña, CITIC, Spain (拉科鲁尼亚大学, CITIC, 西班牙); INSA Rennes, IRISA, Inria, CNRS, Université de Rennes, France (INSA Rennes, IRISA, Inria, CNRS, 雷恩大学, 法国)
类目: Computation and Language (cs.CL)
备注: Submitted to EMNLP 2025

点击查看摘要

Abstract:We cast nested named entity recognition (NNER) as a sequence labeling task by leveraging prior work that linearizes constituency structures, effectively reducing the complexity of this structured prediction problem to straightforward token classification. By combining these constituency linearizations with pretrained encoders, our method captures nested entities while performing exactly n tagging actions. Our approach achieves competitive performance compared to less efficient systems, and it can be trained using any off-the-shelf sequence labeling library.
zh

[NLP-36] ATR-Bench: A Federated Learning Benchmark for Adaptation, Trust and Reasoning

【速读】: 该论文试图解决联邦学习(Federated Learning, FL)在实际应用中缺乏标准化评估框架的问题,这限制了FL方法的系统性进展和公平比较。其解决方案的关键在于提出ATR-Bench,一个基于适应性(Adaptation)、信任度(Trust)和推理能力(Reasoning)三个核心维度的统一分析框架,旨在为联邦学习提供系统且全面的评估基础。

链接: https://arxiv.org/abs/2505.16850
作者: Tajamul Ashraf,Mohammed Mohsen Peerzada,Moloud Abdar,Yutong Xie,Yuyin Zhou,Xiaofeng Liu,Iqra Altaf Gillani,Janibul Bashir
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); University of Queensland(昆士兰大学); University of California, Santa Cruz(加州大学圣克鲁兹分校); Yale University(耶鲁大学); Gaash Lab, Department of IT, NIT Srinagar(Gaash实验室,信息技术系,斯里纳加尔国立理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Federated Learning Benchmark for Domain Adaptation, Trustworthiness, and Reasoning

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a promising paradigm for collaborative model training while preserving data privacy across decentralized participants. As FL adoption grows, numerous techniques have been proposed to tackle its practical challenges. However, the lack of standardized evaluation across key dimensions hampers systematic progress and fair comparison of FL methods. In this work, we introduce ATR-Bench, a unified framework for analyzing federated learning through three foundational dimensions: Adaptation, Trust, and Reasoning. We provide an in-depth examination of the conceptual foundations, task formulations, and open research challenges associated with each theme. We have extensively benchmarked representative methods and datasets for adaptation to heterogeneous clients and trustworthiness in adversarial or unreliable environments. Due to the lack of reliable metrics and models for reasoning in FL, we only provide literature-driven insights for this dimension. ATR-Bench lays the groundwork for a systematic and holistic evaluation of federated learning with real-world relevance. We will make our complete codebase publicly accessible, along with a curated repository that continuously tracks new developments and research in the FL literature.
zh

[NLP-37] Understanding and Analyzing Inappropriately Targeting Language in Online Discourse: A Comparative Annotation Study

【速读】: 该论文试图解决在线对话中不当针对性语言(inappropriately targeting language)的检测问题,特别是针对个体或群体的仇恨言论和更隐蔽的歧视性语言。解决方案的关键在于整合众包标注与专家标注,并结合生成式 AI(Generative AI)模型 ChatGPT 的能力,构建一个全面的注释框架,以识别不同目标类别及语境中的特定目标词汇。通过对比分析人类专家、众包标注者与 ChatGPT 的注释结果,该方法揭示了上下文因素在识别仇恨言论中的重要性,并发现了新的目标类型,如社会信念和身体形象。

链接: https://arxiv.org/abs/2505.16847
作者: Baran Barbarestani,Isa Maks,Piek Vossen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces a method for detecting inappropriately targeting language in online conversations by integrating crowd and expert annotations with ChatGPT. We focus on English conversation threads from Reddit, examining comments that target individuals or groups. Our approach involves a comprehensive annotation framework that labels a diverse data set for various target categories and specific target words within the conversational context. We perform a comparative analysis of annotations from human experts, crowd annotators, and ChatGPT, revealing strengths and limitations of each method in recognizing both explicit hate speech and subtler discriminatory language. Our findings highlight the significant role of contextual factors in identifying hate speech and uncover new categories of targeting, such as social belief and body image. We also address the challenges and subjective judgments involved in annotation and the limitations of ChatGPT in grasping nuanced language. This study provides insights for improving automated content moderation strategies to enhance online safety and inclusivity.
zh

[NLP-38] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search

【速读】: 该论文试图解决长链式思维(Long-CoT)在扩展到大型语言模型(LLMs)时产生的计算开销过大的问题,以及现有压缩方法在保留局部推理信号和保持输出连贯性方面的不足。解决方案的关键在于提出R1-Compress,这是一种两阶段的块级压缩框架,通过将Long-CoT分割为可管理的块,应用基于大语言模型的块内压缩,并利用块间搜索机制选择短且连贯的序列,从而在保持局部信息和连贯性的同时显著减少令牌使用量。

链接: https://arxiv.org/abs/2505.16838
作者: Yibo Wang,Li Shen,Huanjin Yao,Tiansheng Huang,Rui Liu,Naiqiang Tan,Jiaxing Huang,Kai Zhang,Dacheng Tao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches – instance-level and token-level – either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select the short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at this https URL
zh
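
【代码示意】R1-Compress 的“切块 → 块内压缩 → 块间搜索”流程可以抽象成如下草图;其中提示词、候选数以及用词重叠近似连贯性罚分的做法均为笔者假设,并非论文实现。

```python
def coherence_penalty(prefix_chunks, candidate):
    """连贯性罚分的占位实现:用与上一块压缩结果的词重叠度粗略近似(假设)。"""
    if not prefix_chunks:
        return 0.0
    prev, cur = set(prefix_chunks[-1].split()), set(candidate.split())
    return -len(prev & cur) / (len(cur) + 1)   # 重叠越高视为越连贯,罚分越低

def r1_compress(long_cot, llm, chunk_size=512, n_candidates=4):
    """两阶段块级压缩示意。llm 为任意 文本->文本 的调用接口。"""
    # 1) 将 Long-CoT 切分为可管理的块
    chunks = [long_cot[i:i + chunk_size]
              for i in range(0, len(long_cot), chunk_size)]
    compressed = []
    for chunk in chunks:
        # 2) 块内压缩:由 LLM 生成多个压缩候选
        candidates = [llm("压缩以下推理片段,保留反思等关键局部信号:\n" + chunk)
                      for _ in range(n_candidates)]
        # 3) 块间搜索:优先选择与已压缩前缀衔接连贯且更短的候选
        candidates.sort(key=lambda c: (coherence_penalty(compressed, c), len(c)))
        compressed.append(candidates[0])
    return "".join(compressed)
```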

[NLP-39] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

【速读】: 该论文试图解决检索增强生成(Retrieval-augmented generation, RAG)系统在复杂深度搜索场景中面临的训练轨迹质量不足、模拟环境与真实场景的分布不匹配以及实际部署计算成本过高的问题。其解决方案的关键在于通过策略性数据工程而非复杂的训练范式来构建轻量且高效的框架,具体表现为在真实网络搜索环境中模拟用户交互以生成高质量的训练数据,并结合多标准的数据筛选策略优化输入和输出的多样性和质量。

链接: https://arxiv.org/abs/2505.16834
作者: Shuang Sun,Huatong Song,Yuhao Wang,Ruiyang Ren,Jinhao Jiang,Junjie Zhang,Fei Bai,Jia Deng,Wayne Xin Zhao,Zheng Liu,Lei Fang,Zhongyuan Wang,Ji-Rong Wen
机构: Northeastern University (东北大学); Renmin University of China (中国人民大学); Xiamen University (厦门大学); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院); DataCanvas Alaya NeW
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at this https URL.
zh

[NLP-40] From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization

【速读】: 该论文试图解决生成式 AI (Generative AI) 在教育场景中生成具有教学有效性的视觉解释能力有限的问题,尤其是现有方法主要关注文本推理而忽视了结构化和可解释性可视化在支持概念理解中的关键作用。解决方案的关键在于提出 EduVisAgent,这是一个多智能体协作框架,通过协调专门的智能体进行教学规划、推理分解、元认知提示和可视化设计,以提升模型在教育场景下的视觉推理能力。

链接: https://arxiv.org/abs/2505.16832
作者: Haonian Ji,Shi Qiu,Siyang Xin,Siwei Han,Zhaorun Chen,Hongyi Wang,Dake Zhang,Huaxiu Yao
机构: UNC-Chapel Hill(北卡罗来纳大学教堂山分校); University of Chicago(芝加哥大学); Rutgers University(罗格斯大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages; 7 figures

点击查看摘要

Abstract:While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at this https URL and this https URL.
zh

[NLP-41] Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中“遗忘”(unlearning)评估的不足问题,即当前依赖于token-level指标(如准确率和困惑度)的评估方法可能产生误导,因为模型看似已遗忘特定数据,但实际上其原始行为可通过少量微调迅速恢复,表明遗忘可能只是信息的掩盖而非真正删除。解决方案的关键在于引入一种基于表示层(representation-level)的评估框架,包括PCA相似性与偏移、中心核对齐以及Fisher信息等方法,用以诊断遗忘的可逆性与不可逆性,并揭示模型在不同任务类型和超参数下的遗忘机制。

链接: https://arxiv.org/abs/2505.16831
作者: Xiaoyu Xu,Xiang Yue,Yang Liu,Qingqing Ye,Haibo Hu,Minxin Du
机构: The Hong Kong Polytechnic University (香港理工大学); Carnegie Mellon University (卡内基梅隆大学); University of California, Santa Cruz (加州大学圣克鲁兹分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 44 pages

点击查看摘要

Abstract:Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: this https URL.
zh
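
【代码示意】论文使用的中心核对齐(CKA)可用于量化遗忘前后同一层表示的相似度:可逆遗忘下 CKA 仍然很高,不可逆遗忘下则明显下降。下面是线性 CKA 的标准实现,可直接用于此类诊断。

```python
import numpy as np

def linear_cka(X, Y):
    """线性 CKA:比较两组表示的相似度,取值范围 [0, 1]。

    X, Y: [n_samples, d],同一批输入分别经过遗忘前/后模型得到的某层隐状态。
    """
    X = X - X.mean(axis=0, keepdims=True)   # 中心化
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```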

[NLP-42] KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

【速读】: 该论文试图解决强化学习算法在计算优势时存在的粒度粗糙问题,即现有方法如GRPO及其变体DAPO在计算rollout级优势时,对序列中的每个token赋予相同的值,无法捕捉到token级别的贡献,从而影响学习效果。解决方案的关键在于提出一种新的算法——关键token优势估计(Key-token Advantage Estimation, KTAE),该算法通过利用采样rollout的正确性并进行统计分析,量化序列中单个token对最终结果的重要性,并将其与rollout级优势结合,从而获得更细粒度的token级优势估计。

链接: https://arxiv.org/abs/2505.16826
作者: Wei Sun,Wen Yang,Pu Jian,Qianlong Du,Fuwei Cui,Shuo Ren,Jiajun Zhang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Wuhan AI Research (武汉人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO, suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE) - a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
zh
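
【代码示意】KTAE 的思想是用统计量衡量每个 token 与 rollout 正确与否的关联,再与 rollout 级优势组合。下面给出一个极简草图,其中“正确/错误 rollout 中出现率之差”这一统计量以及乘性组合方式均为笔者假设,并非论文实现。

```python
from collections import Counter

def token_level_advantage(rollouts, correct, rollout_adv):
    """由 rollout 级优势推出细粒度 token 级优势的示意。

    rollouts:    List[List[str]],每条 rollout 的 token 序列
    correct:     List[int],1 表示最终答案正确,0 表示错误
    rollout_adv: List[float],GRPO/DAPO 等给出的 rollout 级优势
    """
    pos = Counter(t for r, y in zip(rollouts, correct) if y for t in set(r))
    neg = Counter(t for r, y in zip(rollouts, correct) if not y for t in set(r))
    n_pos = max(sum(correct), 1)
    n_neg = max(len(correct) - sum(correct), 1)
    # token 重要性:在正确 rollout 中的出现率减去在错误 rollout 中的出现率(假设的统计量)
    importance = {t: pos[t] / n_pos - neg[t] / n_neg for t in set(pos) | set(neg)}
    # token 级优势 = rollout 级优势 ×(1 + token 重要性),同一序列内不再恒等
    return [[a * (1.0 + importance.get(t, 0.0)) for t in r]
            for r, a in zip(rollouts, rollout_adv)]
```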

[NLP-43] Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

【速读】: 该论文旨在解决低资源语言的命名实体识别(Named Entity Recognition, NER)问题,即在标注训练数据有限的情况下构建鲁棒的NER系统。其解决方案的关键在于利用合成数据(synthetic data)来增加低资源语言的标注数据量,从而提升模型性能。研究通过在11种来自不同语系的语言中评估合成数据的作用,验证了其在多语言低资源NER场景中的潜力。

链接: https://arxiv.org/abs/2505.16814
作者: Gaurav Kamath,Sowmya Vajjala
机构: McGill University (麦吉尔大学); National Research Council, Canada (加拿大国家研究委员会)
类目: Computation and Language (cs.CL)
备注: pre-print

点击查看摘要

Abstract:Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.
zh

[NLP-44] Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在知识密集型多步骤推理(Knowledge-Intensive Multi-Step Reasoning, KIMSR)任务中面临的两个关键问题:一是如何有效提取和表示推理依据证据,二是如何在证据存在不确定性的情况下利用推理依据和LLM内在知识进行准确推理。解决方案的关键在于提出一种两阶段的证据自对齐(Two-Way Evidence Self-Alignment, TW-ESA)模块和一个双门控推理增强(Dual-Gated Reasoning Enhancement, DGR)模块,并在统一框架ESA-DGR中协同训练,以提升模型对因果逻辑的理解和推理准确性。

链接: https://arxiv.org/abs/2505.16806
作者: Kexin Zhang,Junlan Chen,Daifeng Li,Yuxuan Zhang,Yangyang Feng,Bowen Deng,Weixu Chen
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) encounter difficulties in knowledge-intensive multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. The current methods often extract semantically relevant but logically irrelevant evidence, resulting in flawed reasoning and inaccurate responses. We propose a two-way evidence self-alignment (TW-ESA) module, which utilizes the mutual alignment between strict reasoning and LLM reasoning to enhance its understanding of the causal logic of evidence, thereby addressing the first challenge. Another challenge is how to utilize the rationale evidence and LLM’s intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual-gated reasoning enhancement (DGR) module to gradually fuse useful knowledge of LLM within strict reasoning, which can enable the model to perform accurate reasoning by focusing on causal elements in the evidence and exhibit greater robustness. The two modules are collaboratively trained in a unified framework ESA-DGR. Extensive experiments on three diverse and challenging KIMSR datasets reveal that ESA-DGR significantly surpasses state-of-the-art LLM-based fine-tuning methods, with remarkable average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at this https URL.
zh

[NLP-45] Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

【速读】: 该论文旨在解决低资源语言中形态分割(morpheme segmentation)任务的性能瓶颈问题,特别是在训练数据稀缺的情况下。其解决方案的关键在于结合多任务学习与大语言模型(LLM)生成的合成数据,通过共享的语言表征增强模型的泛化能力,并利用上下文学习生成高质量的合成训练数据以弥补数据不足的问题。

链接: https://arxiv.org/abs/2505.16800
作者: Changbing Yang,Garrett Nicolai
机构: University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.
zh

[NLP-46] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

【速读】: 该论文试图解决大规模语言模型在微调过程中可能引入的意外对齐问题(Accidental Misalignment),即由于微调数据特征导致的未预见的安全漏洞。解决方案的关键在于识别微调数据中的潜在相关因素,如语言特征、语义相似性和毒性,并评估这些因素与对抗攻击成功率之间的关联,从而为对抗防御策略提供新的见解,并强调数据集设计在保持模型对齐中的关键作用。

链接: https://arxiv.org/abs/2505.16789
作者: Punya Syon Pandey,Samuel Simko,Kellin Pelrine,Zhijing Jin
机构: University of Toronto(多伦多大学); ETH Zurich(苏黎世联邦理工学院); Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所); McGill University(麦吉尔大学); FAR AI(FAR AI); Vector Institute(向量研究所); MILA(MILA)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment. Our code is available at this https URL.
zh

[NLP-47] Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning

【速读】: 该论文试图解决传统Chain-of-Thought (CoT)推理方法在复杂推理任务中因依赖显式自然语言推理步骤而带来的效率低下和抽象推理适用性受限的问题。其解决方案的关键在于探索隐式CoT推理(latent CoT reasoning),通过将推理过程置于潜在空间中,实现推理与语言的解耦,从而获得更丰富的认知表示和更灵活、高效的推理能力。

链接: https://arxiv.org/abs/2505.16782
作者: Xinghao Chen,Anhao Zhao,Heming Xia,Xuan Lu,Hanlin Wang,Yanjun Chen,Wei Zhang,Jian Wang,Wenjie Li,Xiaoyu Shen
机构: The Hong Kong Polytechnic University (香港理工大学); Eastern Institute of Technology (东方理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at this https URL.
zh

[NLP-48] IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

【速读】: 该论文试图解决音频基础大型语言模型(Audio-based Large Language Models, Audio LLMs)在遵循指令方面能力不足的问题,尤其是在多模态对齐后,其指令遵循能力相较于纯文本模型有所下降。解决方案的关键在于引入IFEval-Audio,这是一个专门设计用于评估音频LLM遵循指令能力的新颖评估数据集,包含280个跨六个维度(内容、大小写、符号、列表结构、长度和格式)的音频-指令-答案三元组,旨在推动该领域未来的研究。

链接: https://arxiv.org/abs/2505.16774
作者: Yiming Gao,Bin Wang,Chengwei Wei,Shuo Sun,AiTi Aw
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore; MiroMind; Nanyang Technological University (NTU), Singapore
类目: Computation and Language (cs.CL)
备注: Link: this https URL

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
zh

[NLP-49] TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署过程中因模型规模庞大而面临的计算和内存挑战,特别是现有单次剪枝方法在层内或层间应用统一稀疏性约束导致的性能不佳问题。其解决方案的关键在于提出TRIM(Targeted Row-wise Iterative Metric-driven pruning),通过为每层中的每个输出维度(行)分配不同的稀疏性比例,并利用质量度量引导的迭代调整过程优化维度级稀疏性分配,从而降低输出间质量保留的方差并保留关键信息。

链接: https://arxiv.org/abs/2505.16743
作者: Florentin Beck,William Rudman,Carsten Eickhoff
机构: University of Tübingen (图宾根大学); Brown University (布朗大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: this https URL
zh
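
【代码示意】TRIM 的“行级稀疏率 + 质量度量驱动的迭代调整”可以草绘如下:以剪枝前后每行输出的余弦相似度作为质量度量(假设),质量偏低的行降低稀疏率、偏高的行提高,以压缩各行质量保留度的方差。非官方实现。

```python
import torch
import torch.nn.functional as F

def trim_prune(W, X, base_sparsity=0.8, step=0.02, n_iter=10):
    """行级迭代剪枝示意。W: [out, in] 权重;X: [n, in] 校准输入。"""
    sparsity = torch.full((W.size(0),), base_sparsity)
    M = torch.ones_like(W)
    for _ in range(n_iter):
        for i in range(W.size(0)):
            k = int(W.size(1) * float(sparsity[i]))
            idx = W[i].abs().argsort()[:k]          # 该行幅值最小的 k 个权重
            row = torch.ones_like(W[i])
            row[idx] = 0
            M[i] = row
        ref, out = X @ W.T, X @ (W * M).T           # 剪枝前后的层输出
        q = F.cosine_similarity(ref.T, out.T, dim=1)  # 每行(输出维度)的质量保留度
        # 质量低于均值的行放松稀疏率,高于均值的行收紧,降低行间方差
        sparsity = (sparsity - step * torch.sign(q - q.mean())).clamp(0.0, 0.99)
    return W * M
```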

[NLP-50] Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

【速读】: 该论文试图解决在对大语言模型(Large Language Models, LLMs)进行微调时,即使使用非有害数据仍会导致安全性能退化的问题。解决方案的关键在于引入一种安全感知探测(Safety-Aware Probing, SAP)优化框架,该框架通过在梯度传播过程中集成安全感知探测器,识别梯度方向中的潜在风险点,从而减轻模型的安全性退化风险,同时提升任务特定性能并保持模型的安全性。

链接: https://arxiv.org/abs/2505.16737
作者: Chengcan Wu,Zhixin Zhang,Zeming Wei,Yihao Zhang,Meng Sun
机构: Peking University (北京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model’s risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at this https URL.
zh

[NLP-51] Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨语言场景下消除毒性内容的挑战,特别是在不同语言和脚本体系之间实现毒性缓解能力的迁移。其解决方案的关键在于提出“跨语言去毒”(Cross-lingual Detoxification)范式,通过在多种语言环境中验证该方法的有效性,评估在数据有限的情况下毒性降低的效果,并探讨去毒操作对非毒性任务性能的影响,从而揭示安全性和知识保留之间的权衡。

链接: https://arxiv.org/abs/2505.16722
作者: Himanshu Beniwal,Youngwoo Kim,Maarten Sap,Soham Dan,Thomas Hartvigsen
机构: Indian Institute of Technology Gandhinagar(印度理工学院甘地纳加尔分校); University of Virginia(弗吉尼亚大学); Carnegie Mellon University(卡内基梅隆大学); Allen Institute for Artificial Intelligence(人工智能技术研究所); Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore “Cross-lingual Detoxification”, a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification’s effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at this https URL.
zh

[NLP-52] Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在多模态指令微调过程中导致基础语言模型语言能力出现灾难性遗忘的问题。解决方案的关键在于提出了一种无需训练的参数融合框架——Locate-then-Merge,其核心是通过定位重要参数并选择性地合并,以保留模型的视觉适应能力同时减轻语言能力的退化。进一步引入的神经元级策略Neuron-Fusion,通过保留参数变化较大的神经元的影响(这些神经元可能负责新获得的视觉能力),而减弱参数变化较小的神经元的影响(这些神经元可能编码通用语言技能),从而实现更优的性能表现。

链接: https://arxiv.org/abs/2505.16703
作者: Zeping Yu,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although multimodal large language models (MLLMs) have achieved impressive performance, the multimodal instruction tuning stage often causes catastrophic forgetting of the base LLM’s language ability, even in strong models like Llama3. To address this, we propose Locate-then-Merge, a training-free parameter fusion framework that first locates important parameters and then selectively merges them. We further introduce Neuron-Fusion, a neuron-level strategy that preserves the influence of neurons with large parameter shifts–neurons likely responsible for newly acquired visual capabilities–while attenuating the influence of neurons with smaller changes that likely encode general-purpose language skills. This design enables better retention of visual adaptation while mitigating language degradation. Experiments on 13 benchmarks across both language and visual tasks show that Neuron-Fusion consistently outperforms existing model merging methods. Further analysis reveals that our method effectively reduces context hallucination in generation.
zh
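
【代码示意】Neuron-Fusion 按神经元(权重矩阵的行)衡量微调前后的参数偏移:偏移大的神经元保留微调影响,偏移小的回退向基础权重。下面的草图中,保留比例与衰减系数均为笔者假设。

```python
import torch

def neuron_fusion(w_base, w_ft, keep_ratio=0.3, decay=0.2):
    """神经元级参数融合示意。

    w_base: [out, in],多模态指令微调前的基础 LLM 权重
    w_ft:   [out, in],微调后的权重
    """
    delta = (w_ft - w_base).norm(dim=1)             # 每个神经元(行)的偏移幅度
    k = max(int(w_base.size(0) * keep_ratio), 1)
    top = delta.topk(k).indices                     # 偏移最大的 k 个神经元
    alpha = torch.full((w_base.size(0), 1), decay)  # 其余神经元仅保留少量微调影响
    alpha[top] = 1.0                                # 大偏移神经元完整保留(视觉能力)
    return w_base + alpha * (w_ft - w_base)
```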

[NLP-53] Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence ICML2025

【速读】: 该论文试图解决大型语言模型中基于上下文学习(In-Context Learning, ICL)的元学习能力如何在训练过程中获得的问题,特别是当答案未直接包含在上下文中时,模型如何从示例中推断任务并进行推理。解决方案的关键在于通过扩展先前研究中的复制任务为基于上下文的元学习设置,分析模型在训练过程中电路动态的变化,发现该能力的获取涉及多个阶段,每个阶段都出现了独特的电路结构,这与诱导头的单阶段变化形成对比。

链接: https://arxiv.org/abs/2505.16694
作者: Gouki Minegishi,Hiroki Furuta,Shohei Taniguchi,Yusuke Iwasawa,Yutaka Matsuo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model’s circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer’s ICL ability.
zh

[NLP-54] SPaRC: A Spatial Pathfinding Reasoning Challenge

【速读】: 该论文试图解决现有推理数据集在抽象、多步骤问题上的不足,特别是路径规划和复杂规则约束满足方面的测试能力有限的问题。其解决方案的关键是引入SPaRC(Spatial Pathfinding Reasoning Challenge)数据集,该数据集包含1,000个二维网格路径规划谜题,用于评估空间与符号推理能力,要求基于算术和几何规则进行逐步规划。实验表明,人类在该任务上表现优异,而当前最佳推理模型如o4-mini表现较差,表明模型在路径生成、导航和空间逻辑方面存在显著缺陷。研究还发现,允许模型多次尝试求解可提升准确性,提示通过改进训练和高效测试时扩展方法可能提升空间推理能力。

链接: https://arxiv.org/abs/2505.16686
作者: Lars Benedikt Kaesberg,Jan Philip Wahle,Terry Ruas,Bela Gipp
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and symbolic reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models’ spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.
zh

[NLP-55] R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

【速读】: 该论文旨在通过强化学习(Reinforcement Learning, RL)提升多模态大语言模型(Multimodal Large Language Models, MLLMs)的推理能力,并解决在RL过程中出现的稀疏奖励和优势消失问题。其解决方案的关键在于提出了一种名为Share-GRPO的新颖RL方法,该方法通过在扩展的问题空间中探索并共享多样化的推理轨迹来缓解上述问题,同时在优势计算过程中共享奖励信息,从而实现跨问题变体和内部问题变体的优势层次化估计,提高策略训练的稳定性与效果。

链接: https://arxiv.org/abs/2505.16673
作者: Huanjin Yao,Qixiang Yin,Jingyi Zhang,Min Yang,Yibo Wang,Wenhao Wu,Fei Su,Li Shen,Minghui Qiu,Dacheng Tao,Jiaxing Huang
机构: Nanyang Technological University; ByteDance; Tsinghua University; Beijing University of Posts and Telecommunications; The University of Sydney
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Technical report

点击查看摘要

Abstract:In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, and then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space and share the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at this https URL.
zh
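
【代码示意】Share-GRPO 在优势计算阶段跨变体共享奖励信息,层次化地结合“跨变体”与“变体内”两级归一化优势。下面的草图中两级优势的组合权重为笔者假设,并假设每个变体至少采样两条 rollout。

```python
import torch

def share_grpo_advantage(rewards, variant_ids, w_global=0.5):
    """层次化优势估计示意。

    rewards:     [n],同一原始问题的全部 rollout 奖励(覆盖多个问题变体)
    variant_ids: [n],每条 rollout 所属的变体编号
    """
    r = torch.as_tensor(rewards, dtype=torch.float)
    v = torch.as_tensor(variant_ids)
    global_adv = (r - r.mean()) / (r.std() + 1e-6)    # 跨变体共享的优势
    local_adv = torch.zeros_like(r)
    for vid in v.unique():
        m = v == vid
        local_adv[m] = (r[m] - r[m].mean()) / (r[m].std() + 1e-6)  # 变体内优势
    return w_global * global_adv + (1 - w_global) * local_adv
```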

[NLP-56] A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

【速读】: 该论文旨在解决制药领域中语言模型的专用化与评估问题,通过构建一个针对日本制药领域的语言模型来提升该领域自然语言处理(NLP)任务的性能。其解决方案的关键在于通过持续预训练(continual pretraining)在20亿个日语制药语料和80亿个英语生物医学语料上进行模型训练,从而增强模型在术语密集型和知识驱动型任务中的表现。此外,研究者还提出了三个新的基准测试(YakugakuQA、NayoseQA和SogoCheck),以全面评估模型在事实回忆、词汇变化和逻辑一致性等方面的能力。

链接: https://arxiv.org/abs/2505.16661
作者: Issey Sukeda,Takuro Fujii,Kosei Buma,Shunsuke Sasaki,Shinnosuke Ono
机构: EQUES Inc.(EQUES公司); The University of Tokyo(东京大学); University of Tsukuba(筑波大学)
类目: Computation and Language (cs.CL)
备注: 15 pages, 9 tables, 5 figures

点击查看摘要

Abstract:We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at this https URL.
zh

[NLP-57] Can reasoning models comprehend mathematical problems in Chinese ancient texts? An empirical study based on data from Suanjing Shishu

【速读】: 该论文试图解决中国古算书在智能处理中的挑战,具体是通过构建Guji_MATH基准来评估基于《算经十书》的古典文本。其解决方案的关键在于利用机器辅助标注与人工验证相结合的方法,从8部经典文本中提取538道数学问题,形成以“问题-答案-解法”为核心的结构化数据集,并设计了闭卷(自主解题)和开卷(复现古典解法)两种评估模式,以系统评估主流推理模型在古典汉语语境下的数学问题求解能力。

链接: https://arxiv.org/abs/2505.16660
作者: Liu Chang,Wang Dongbo,Liu Liu,Zhao Zhixiao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29pages, 7 figures

点击查看摘要

Abstract:This study addresses the challenges in intelligent processing of Chinese ancient mathematical classics by constructing Guji_MATH, a benchmark for evaluating classical texts based on Suanjing Shishu. It systematically assesses the mathematical problem-solving capabilities of mainstream reasoning models under the unique linguistic constraints of classical Chinese. Through machine-assisted annotation and manual verification, 538 mathematical problems were extracted from 8 canonical texts, forming a structured dataset centered on the “Question-Answer-Solution” framework, supplemented by problem types and difficulty levels. Dual evaluation modes–closed-book (autonomous problem-solving) and open-book (reproducing classical solution methods)–were designed to evaluate the performance of six reasoning models on ancient Chinese mathematical problems. Results indicate that reasoning models can partially comprehend and solve these problems, yet their overall performance remains inferior to benchmarks on modern mathematical tasks. Enhancing models’ classical Chinese comprehension and cultural knowledge should be prioritized for optimization. This study provides methodological support for mining mathematical knowledge from ancient texts and disseminating traditional culture, while offering new perspectives for evaluating cross-linguistic and cross-cultural capabilities of reasoning models.
zh

[NLP-58] Collaboration among Multiple Large Language Models for Medical Question Answering

【速读】: 该论文试图解决多大型语言模型(Large Language Models, LLMs)在医学任务中缺乏协同效应的问题,即如何有效整合不同LLMs的专业知识和背景以提升整体性能。解决方案的关键在于提出一种针对医学选择题数据集的多LLM协作框架,通过后验分析验证该框架能够增强所有LLMs的推理能力并缓解其在不同问题上的分歧,同时观察到LLM在面对其他LLM的对抗性观点时的置信度与其预测准确性之间存在一致性。

链接: https://arxiv.org/abs/2505.16648
作者: Kexin Shang,Chia-Hsuan Chang,Christopher C. Yang
机构: Drexel University (德雷塞尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE International Conference on Healthcare Informatics 2025

点击查看摘要

Abstract:Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards summoning up a synergic effect from multiple LLMs’ expertise and background. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice questions dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to boost all LLMs’ reasoning ability as well as alleviate their divergence among questions. We also measure an LLM’s confidence when it confronts adversary opinions from other LLMs and observe a concurrence between LLM confidence and prediction accuracy.
zh

[NLP-59] SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

【速读】: 该论文试图解决机器翻译(Machine Translation, MT)中依赖外部监督信号的问题,这些信号如人工标注的参考数据或训练好的奖励模型(Reward Models, RMs),通常成本高昂且难以扩展。解决方案的关键在于提出一种无需参考、完全在线的简单自奖励(Simple Self-Rewarding, SSR)强化学习(Reinforcement Learning, RL)框架,该框架仅依赖于模型自身的判断奖励进行训练。通过使用13K单语示例和Qwen-2.5-7B作为基础模型,所提出的SSR-Zero-7B在WMT23、WMT24和Flores200基准的英中翻译任务中表现优于现有MT专用LLM和更大的通用LLM。进一步结合外部监督信号COMET后,最强模型SSR-X-Zero-7B实现了最先进的性能,超越了所有参数量低于72B的开源模型,甚至优于封闭源代码模型如GPT-4o和Gemini 1.5 Pro。

链接: https://arxiv.org/abs/2505.16637
作者: Wenjie Yang,Mao Zheng,Mingyang Song,Zheng Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English \leftrightarrow Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English \leftrightarrow Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

[NLP-60] MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries

【Quick Read】: This paper addresses the fact that bilingual users frequently issue mixed-language queries in web search, yet Information Retrieval (IR) research has paid little attention to such queries. The key to the solution is MiLQ, the first public test set of mixed-language queries, confirmed to be realistic and highly preferred. Experiments show that multilingual IR models perform only moderately and inconsistently on mixed-language queries, and they point to code-switched training data as a promising ingredient for building IR models robust to such queries.

Link: https://arxiv.org/abs/2505.16631
Authors: Jonghwi Kim, Deokhyung Kang, Seonjeong Hwang, Yunsu Kim, Jungseul Ok, Gary Lee
Affiliations: POSTECH; aiXplain Inc.
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 16 pages, 9 figures

Abstract:Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce MiLQ, a Mixed-Language Query test set and the first public benchmark of mixed-language queries, confirmed as realistic and highly preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, also suggesting code-switched training data's potential for robust IR models handling such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.

[NLP-61] Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports

【Quick Read】: This paper targets two problems in Chest X-ray (CXR) Visual Question Answering (VQA): detecting abnormalities in a single image and comparing differences between longitudinal images acquired at different time points. The key to the solution is a unified method that handles both question types by auto-regressively generating answers, and that feeds a predicted radiology report as additional input to strengthen the answer-generation module. The method proceeds in two steps, Report Generation (RG) and Answer Generation (AG), and providing the predicted radiology report as evidence to the answer generator markedly improves performance on both single-image and image-difference questions.

Link: https://arxiv.org/abs/2505.16624
Authors: Francesco Dalla Serra, Patrick Schrempf, Chaoyang Wang, Zaiqiao Meng, Fani Deligianni, Alison Q. O'Neil
Affiliations: Canon Medical Research Europe; School of Computing Science, University of Glasgow
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR (“What abnormalities are seen in image X?”), while image-difference questions compare two longitudinal CXRs acquired at different time points (“What are the differences between image X and Y?”). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model’s predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from ‘Chain-of-Thought reasoning’, we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence to the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
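
The two-step decomposition is easy to picture as a pipeline. The sketch below is a schematic rendering of the RG -> AG flow under stated assumptions: `rg_model` and `ag_model` stand in for arbitrary image-to-text and text-to-text models and are not the authors' actual interfaces.

```python
# Schematic RG -> AG pipeline for CXR VQA; model callables are placeholders.
def answer_cxr_question(rg_model, ag_model, images: list, question: str) -> str:
    # Step i) Report Generation: predict a radiology report per CXR
    # (one image for single-image questions, two for image-difference questions).
    reports = [rg_model(image) for image in images]
    # Step ii) Answer Generation: ground the answer on the predicted report(s).
    evidence = " ".join(reports)
    prompt = f"Reports: {evidence}\nQuestion: {question}\nAnswer:"
    return ag_model(prompt)
```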

[NLP-62] Steering Large Language Models for Machine Translation Personalization

【Quick Read】: This paper addresses the difficulty that, in low-resource settings, machine translation systems based on large language models (LLMs) struggle to produce stylistically personalized translations when stylistic requirements are implicit and hard to convey via prompting. The key to the solution is to explore prompting strategies and inference-time interventions that steer model generations toward a personalized style, and to propose a contrastive framework that exploits latent concepts extracted from sparse autoencoders to identify salient personalization properties, enabling effective style control.

Link: https://arxiv.org/abs/2505.16612
Authors: Daniel Scalena, Gabriele Sarti, Arianna Bisazza, Elisabetta Fersini, Malvina Nissim
Affiliations: CLCG, University of Groningen; University of Milano-Bicocca
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework exploiting latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding model layers with a relevant impact for personalization are impacted similarly by multi-shot prompting and our steering method, suggesting a similar mechanism at play.

[NLP-63] From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment

【Quick Read】: This paper aims to fix the tendency of large language models (LLMs) to produce generic, one-size-fits-all responses in emotional-support tasks that fail to meet users' specific needs. The key to the solution is a self-evolution framework with two phases: in the first, the model is fine-tuned on limited emotional-support conversation data to provide basic support; in the second, it generates personalized responses via self-reflection and self-refinement, with iterative direct preference optimization improving its grasp of users' implicit preferences.

Link: https://arxiv.org/abs/2505.16610
Authors: Jing Ye, Lu Xiang, Yaping Zhang, Chengqing Zong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 27 pages

Abstract:Effective emotional support hinges on understanding users’ emotions and needs to provide meaningful comfort during multi-turn interactions. Large Language Models (LLMs) show great potential for expressing empathy; however, they often deliver generic and one-size-fits-all responses that fail to address users’ specific needs. To tackle this issue, we propose a self-evolution framework designed to help LLMs improve their responses to better align with users’ implicit preferences concerning user profiles (personalities), emotional states, and specific situations. Our framework consists of two distinct phases: (1) Emotional Support Experience Acquisition, where LLMs are fine-tuned on limited emotional support conversation data to provide basic support, and (2) Self-Improvement for Personalized Emotional Support, where LLMs leverage self-reflection and self-refinement to generate personalized responses. Through iterative direct preference optimization between the pre- and post-refined responses, our model generates responses that reflect a better understanding of the user’s implicit preferences. Extensive experiments and evaluations demonstrate that our method significantly enhances the model’s performance in emotional support, reducing unhelpful responses and minimizing discrepancies between user preferences and model outputs.

[NLP-64] What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse

【Quick Read】: This paper addresses the lack of systematic study of the interplay between media frames and stance, particularly in climate-change internet memes. The key to the solution is CLIMATEMEMES, the first climate-change meme dataset annotated with both stance and media frames, together with two meme-understanding tasks: stance detection and media frame detection. Evaluating a range of models reveals that vision-language models (VLMs) are strong at stance recognition but limited when handling nuanced frames.

Link: https://arxiv.org/abs/2505.16592
Authors: Shijia Zhou, Siyao Peng, Simon Luebke, Jörg Haßler, Mario Haim, Saif M. Mohammad, Barbara Plank
Affiliations: LMU Munich; National Research Council Canada
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 19 pages, 9 figures

Abstract:Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions often in alignment with the authors’ opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbone. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs’ limitations in handling nuanced frames and stance expressions on climate change internet memes.

[NLP-65] Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

【Quick Read】: This paper addresses the lack of evaluation for the multilingual factual ability of large language models (LLMs) by proposing the KoLasSimpleQA benchmark. The key to the solution is a question set featuring single knowledge-point coverage, absolute objectivity, unique answers, and temporal stability, which supports efficient evaluation under the LLM-as-judge paradigm while testing both factual memory and self-awareness ("knowing what they don't know"). A dual-domain design (a general domain and a language-specific domain covering history, culture, and regional traditions) combined with coverage of 9 languages enables a comprehensive assessment of multilingual capabilities, revealing significant performance differences between the two domains and underscoring the need for targeted evaluation and optimization in multilingual settings.

Link: https://arxiv.org/abs/2505.16591
Authors: Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He
Affiliations: Shanghai Artificial Intelligence Laboratory; Peking University; University of Chinese Academy of Sciences; Shanghai University
Subjects: Computation and Language (cs.CL)
Comments: Equal contribution: Bowen Jiang, Runchuan Zhu, Jiang Wu; Corresponding author: Conghui He

Abstract:We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs’ factual memory and self-awareness (“know what they don’t know”). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLM and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at this https URL .

[NLP-66] O2-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

【Quick Read】: This paper addresses the limitation that large language models (LLMs), constrained by static parametric knowledge, underperform on tasks requiring up-to-date open-domain information; existing work mostly targets closed-ended questions, while open-ended questions, which lack a standard answer or admit multiple diverse answers, remain underexplored. The key to the solution is O^2-Searcher, which leverages reinforcement learning with an efficient, locally simulated search environment for dynamic knowledge acquisition, decoupling external world knowledge from the model's sophisticated reasoning processes. A unified training mechanism with carefully designed reward functions enables the agent to identify question types and adapt its answer-generation strategies, allowing it to handle both open-ended and closed-ended questions.

Link: https://arxiv.org/abs/2505.16582
Authors: Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao
Affiliations: Zhejiang University; Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China; Fudan University; State Key Laboratory of Industrial Control Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures

Abstract:Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or providing non-unique and diverse answers, remain underexplored. To bridge this gap, we present O^2-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O^2-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from model’s sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O^2-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O^2-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O^2-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

[NLP-67] EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions

【Quick Read】: This paper addresses the verification of atomic claims, a key component of many recently proposed fact-checking systems. Existing approaches typically query a search engine for evidence and then hand the evidence set and the atomic claim to a large language model for classification, a process that deviates from how a human would perform the task. The key to the solution is EMULATE, a novel claim-verification system that better emulates human actions via a multi-agent framework in which each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content.

Link: https://arxiv.org/abs/2505.16576
Authors: Spencer Hong, Meng Luo, Xinyi Wan
Affiliations: National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Determining the veracity of atomic claims is an imperative component of many recently proposed fact-checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi-agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi-agent framework.
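
As a rough illustration of the division of labour described above, the toy orchestration below splits verification into four agent roles; the role names and call signatures are invented for this sketch, and a real system would back each role with an LLM plus a search API.

```python
# Toy multi-agent claim verification in the spirit of EMULATE (roles are assumptions).
def verify_claim(claim, search, ranker, reader, judge, k: int = 3) -> str:
    results = search(claim)                    # agent 1: issue the search query
    top = ranker(claim, results)[:k]           # agent 2: rank results by predefined criteria
    evidence = [reader(page) for page in top]  # agent 3: evaluate webpage content
    return judge(claim, evidence)              # agent 4: SUPPORTED / REFUTED / NOT ENOUGH INFO
```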

[NLP-68] URLs Help Topics Guide: Understanding Metadata Utility in LLM Training

【Quick Read】: This paper addresses the fact that large language models (LLMs) are pretrained without exploiting contextual metadata such as source, quality, or topic, yielding a context-free learning paradigm. The key to the solution is a systematic evaluation of how different metadata types affect pretraining efficiency and downstream performance, plus an exploration of metadata conditioning at inference time. The study finds that only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit; the downstream gains from URL conditioning emerge only with longer prompts at inference time; and context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free-guidance fashion.

Link: https://arxiv.org/abs/2505.16570
Authors: Dongyang Fan, Vinko Sabolčec, Martin Jaggi
Affiliations: EPFL
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.
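
One plausible way to implement "auxiliary inputs not used in the loss calculation" is to prepend the URL and mask its tokens out of the objective. The sketch below assumes a Hugging Face-style tokenizer and the usual -100 ignore-index convention; the paper's exact input format is not reproduced here.

```python
# Sketch: prepend URL metadata as context whose tokens carry no loss (assumed format).
def build_example(tokenizer, url: str, text: str) -> dict:
    ctx_ids = tokenizer("URL: " + url + "\n", add_special_tokens=False)["input_ids"]
    txt_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": ctx_ids + txt_ids,
        # -100 masks the metadata prefix: the model conditions on the URL
        # but is never trained to predict it.
        "labels": [-100] * len(ctx_ids) + txt_ids,
    }
```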

[NLP-69] ScholarBench: A Bilingual Benchmark for Abstraction Comprehension and Reasoning Evaluation in Academic Contexts

【Quick Read】: This paper addresses the limited scalability of existing benchmarks for evaluating the domain knowledge of large language models (LLMs), which cannot handle complex academic tasks. The key to the solution is ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, constructed through a three-step process. It targets more specialized and logically complex contexts drawn from academic literature, covers five distinct question types, and evaluates the abstraction, comprehension, and reasoning abilities of LLMs across eight research domains. Category-specific example attributes and questions aligned with each domain's research methodologies and discourse structures ensure high-quality evaluation data, and the benchmark operates as an English-Korean bilingual dataset, enabling simultaneous assessment of LLMs' linguistic capabilities in both languages.

Link: https://arxiv.org/abs/2505.16566
Authors: Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

[NLP-70] CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

【Quick Read】: This paper addresses harmful fine-tuning attacks on large language models (LLMs) offered through fine-tuning-as-a-service, which can repurpose a model for malicious tasks. The existing defense paradigm, unlearning, tries to remove malicious knowledge but cannot withstand LLMs' powerful general adaptability, since a model can quickly relearn or repurpose its capabilities to bypass selective unlearning. The key to the solution is to induce model collapse, forcing the model to "unlearn everything" and thereby directly neutralizing the general capabilities attackers exploit, instead of selectively removing knowledge. The Collapse Trap (CTRAP), embedded during alignment, pre-configures the model's reaction to subsequent fine-tuning dynamics: if persistent malicious adaptation is detected, it triggers a progressive degradation of the model's core language-modeling ability, rendering it useless to the attacker, while remaining dormant during benign fine-tuning so the model stays useful to legitimate users.

Link: https://arxiv.org/abs/2505.16559
Authors: Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, Li Shen
Affiliations: Nankai University; Sun Yat-sen University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse–effectively forcing the model to “unlearn everything”–specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that attackers exploit, tackling the core issue unaddressed by selective unlearning. We introduce the Collapse Trap (CTRAP) as a practical mechanism to implement this concept conditionally. Embedded during alignment, CTRAP pre-configures the model’s reaction to subsequent fine-tuning dynamics. If updates during fine-tuning constitute a persistent attempt to reverse safety alignment, the pre-configured trap triggers a progressive degradation of the model’s core language modeling abilities, ultimately rendering it inert and useless for the attacker. Crucially, this collapse mechanism remains dormant during benign fine-tuning, ensuring the model’s utility and general capabilities are preserved for legitimate users. Extensive empirical results demonstrate that CTRAP effectively counters harmful fine-tuning risks across various LLMs and attack settings, while maintaining high performance in benign scenarios. Our code is available at this https URL.

[NLP-71] Think Silently Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

【Quick Read】: This paper addresses the high computational cost and inefficiency of Chain-of-Thought (CoT) reasoning in large language models (LLMs). The key to the solution is Compressed Latent Reasoning (CoLaR), a framework that dynamically compresses the reasoning process in latent space through two-stage training: first, supervised fine-tuning extends next-token prediction with an auxiliary next compressed embedding prediction objective, merging the embeddings of consecutive tokens using a compression factor sampled from a predefined range; second, reinforcement learning (RL) exploits the latent head's non-determinism to explore diverse reasoning paths and favor more compact ones. This enables efficient latent-level reasoning and dynamic adjustment of reasoning speed at inference time.

Link: https://arxiv.org/abs/2505.16552
Authors: Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Ruihua Song
Affiliations: Renmin University of China; Xiaomi
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 8 figures

Abstract:Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head’s non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
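
The merging step lends itself to a compact sketch. Below, consecutive token embeddings are mean-pooled in groups of a randomly sampled compression factor; mean-pooling and zero-padding of the last group are simplifying assumptions, not necessarily the paper's exact merge operator.

```python
# Sketch of CoLaR-style embedding compression (mean-pooling is an assumption).
import torch

def compress_embeddings(emb: torch.Tensor, c_min: int = 2, c_max: int = 5) -> torch.Tensor:
    """emb: (seq_len, hidden) -> (ceil(seq_len / c), hidden) for a random factor c."""
    c = int(torch.randint(c_min, c_max + 1, (1,)))
    seq_len, hidden = emb.shape
    pad = (-seq_len) % c
    if pad:  # zero-pad so seq_len divides evenly; slightly dilutes the last group
        emb = torch.cat([emb, emb.new_zeros(pad, hidden)])
    return emb.view(-1, c, hidden).mean(dim=1)

print(compress_embeddings(torch.randn(17, 768)).shape)  # e.g. torch.Size([5, 768]) when c == 4
```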

[NLP-72] Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

【Quick Read】: This paper addresses language confusion, the phenomenon in which large language models (LLMs) generate unintended languages against the user's need, which is especially pronounced in English-centric models. The key to the solution is a mechanistic interpretability (MI) study that combines behavioral benchmarking with neuron-level analysis, identifying the confusion points (CPs) where language switches occur and showing that transition failures in the final layers drive the confusion. Editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion while preserving general competence and fluency.

Link: https://arxiv.org/abs/2505.16538
Authors: Ercong Nie, Helmut Schmid, Hinrich Schütze
Affiliations: LMU Munich; Munich Center for Machine Learning
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures

Abstract:Language confusion – where large language models (LLMs) generate unintended languages against the user’s need – remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) – specific positions where language switches occur – are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.

[NLP-73] DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection

【Quick Read】: This paper addresses intellectual property (IP) protection for large language models (LLMs), in particular guarding against malicious model stealing and unauthorized deployment. Existing watermarking and fingerprinting techniques either interfere with the text-generation process or require white-box access to the suspect model, making them impractical. The key to the solution is DuFFin, a dual-level fingerprinting framework for the black-box setting that extracts trigger patterns and knowledge-level fingerprints to identify the source of a suspect model, enabling copyright verification of a protected LLM and its variants. Experiments show an IP-ROC metric above 0.95, demonstrating its effectiveness.

Link: https://arxiv.org/abs/2505.16530
Authors: Yuliang Yan, Haochun Tang, Shuo Yan, Enyan Dai
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Jilin University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are considered valuable Intellectual Properties (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or are limited in white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel \textbfDu al-Level \textbfFin gerprinting \textbfF ramework for black-box setting ownership verification. DuFFin extracts the trigger pattern and the knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from the open-source website, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on their model variants, achieving the IP-ROC metric greater than 0.95. Our code is available at this https URL.

[NLP-74] EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance ACL2025

【Quick Read】: This paper addresses the difficulty small large language models (sLLMs) have in maintaining topic consistency in task-oriented dialogue systems: faced with off-topic or malicious inputs, a model can drift from its intended functionality, harming reliability and safety. The key to the solution is Entropy-scaled Steering vectors for Topic Maintenance (EnSToM), which dynamically adjusts steering intensity based on input uncertainty, handling off-topic distractors effectively while preserving on-topic accuracy.

Link: https://arxiv.org/abs/2505.16526
Authors: Heejae Suh, Yejin Jeon, Deokhyung Kang, Taehee Park, Yejin Min, Gary Geunbae Lee
Affiliations: POSTECH
Subjects: Computation and Language (cs.CL)
Comments: Accepted at ACL 2025 (Findings, long paper)

Abstract:Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
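
A rough rendering of "entropy-scaled" steering: the steering vector is added with a strength proportional to the model's next-token entropy, so uncertain (likely off-topic) inputs get steered harder. The linear scaling rule, the alpha hyperparameter, and the single-position tensor shapes are assumptions for this sketch.

```python
# Sketch of entropy-scaled activation steering (scaling rule is an assumption).
import torch
import torch.nn.functional as F

def entropy_scaled_steer(hidden: torch.Tensor,     # (d_model,) hidden state at one position
                         logits: torch.Tensor,     # (vocab,) next-token logits
                         steer_vec: torch.Tensor,  # (d_model,) topic-maintenance direction
                         alpha: float = 4.0) -> torch.Tensor:
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()   # in nats
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    scale = alpha * entropy / max_entropy                     # in [0, alpha]
    return hidden + scale * steer_vec
```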

[NLP-75] Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLM s via Causal Effect Estimation-guided Debiasing

【Quick Read】: This paper addresses the possibility that current large language models (LLMs) still exploit bias during inference, leading to poor generalizability. Existing benchmarks typically contain a single type of controlled bias per example, whereas real-world data may contain several biases at once. To close this gap, the paper proposes a multi-bias benchmark in which each example contains five types of biases, on which existing LLMs and debiasing methods perform poorly. The key to the solution is a causal effect estimation-guided multi-bias elimination method (CMBE), which first estimates the causal effects of multiple bias types simultaneously and then removes the biases' contribution from the total causal effect exerted jointly by semantic information and biases during inference, thereby improving the generalizability of LLMs.

Link: https://arxiv.org/abs/2505.16522
Authors: Zhouhao Sun, Zhiyuan Kan, Xiao Ding, Li Du, Yang Zhao, Bing Qin, Ting Liu
Affiliations: Harbin Institute of Technology; Beijing Academy of Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks are proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfactory, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.

[NLP-76] Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLM s

【Quick Read】: This paper addresses factual hallucinations in large language models (LLMs), i.e., the generation of inaccurate or fabricated content that undermines reliability and user trust. The key to the solution is constructing more realistic and challenging datasets for evaluating model factuality: a strategy for sampling plausible true-false factoid sentences from tabular data, and a procedure for generating realistic, LLM-dependent true-false datasets from question-answering collections, which together enable a sharper assessment of the factuality of model-generated text.

Link: https://arxiv.org/abs/2505.16520
Authors: Giovanni Servedio, Alessandro De Bellis, Dario Di Palma, Vito Walter Anelli, Tommaso Di Noia
Affiliations: Politecnico di Bari; Sapienza University of Rome
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.
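
To illustrate the table-based sampling strategy, one plausible instantiation verbalizes a (row, column) cell as a true statement and swaps in a value from another row of the same column to obtain a plausible false one. The template and field names below are invented for this example.

```python
# Illustrative true/false factoid sampling from a table (template is an assumption).
import random

def factoids_from_table(rows: list[dict], column: str, template: str, n: int = 2):
    samples = []
    for _ in range(n):
        row = random.choice(rows)
        true_sent = template.format(name=row["name"], value=row[column])
        # Distractor: same column, different row -> plausible but false.
        distractor = random.choice([r[column] for r in rows if r[column] != row[column]])
        samples.append((true_sent, True))
        samples.append((template.format(name=row["name"], value=distractor), False))
    return samples

rows = [{"name": "France", "capital": "Paris"},
        {"name": "Japan", "capital": "Tokyo"},
        {"name": "Kenya", "capital": "Nairobi"}]
print(factoids_from_table(rows, "capital", "The capital of {name} is {value}."))
```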

[NLP-77] CUB: Benchmarking Context Utilisation Techniques for Language Models

【Quick Read】: This paper addresses the problem that, in knowledge-intensive tasks, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant context. The key to the solution is CUB (Context Utilisation Benchmark), a benchmark for evaluating context utilisation manipulation techniques (CMTs) that helps practitioners in retrieval-augmented generation (RAG) choose the CMT best suited to their needs. CUB rigorously tests three distinct context types that capture key challenges of realistic context-utilisation scenarios and systematically evaluates a range of CMTs.

Link: https://arxiv.org/abs/2505.16518
Authors: Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein
Affiliations: Chalmers University of Technology; University of Gothenburg; Seoul National University; University of Copenhagen; Ewha Womans University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages

Abstract:Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.

[NLP-78] AppealCase: A Dataset and Benchmark for Civil Case Appeal Scenarios

【Quick Read】: This paper addresses LegalAI research's neglect of the appellate process: prior work focuses on individual case-judgment analysis and overlooks appeals, the judicial system's core mechanism for correcting errors and ensuring fair trials. The key to the solution is the AppealCase dataset, comprising 10,000 pairs of real-world, matched first-instance and second-instance documents across 91 categories of civil cases, with detailed annotations along five dimensions central to appellate review: judgment reversals, reversal reasons, cited legal provisions, claim-level decisions, and whether new information appears in the second instance. Based on these annotations, the paper proposes five new LegalAI tasks and evaluates 20 mainstream models, finding that all current models score below 50% F1 on judgment-reversal prediction, which highlights the complexity and challenge of the appellate scenario.

Link: https://arxiv.org/abs/2505.16514
Authors: Yuting Huang, Meitong Guo, Yiquan Wu, Ang Li, Xiaozhong Liu, Keting Yin, Changlong Sun, Fei Wu, Kun Kuang
Affiliations: Zhejiang University; University of Innsbruck; Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures

Abstract:Recent advances in LegalAI have primarily focused on individual case judgment analysis, often overlooking the critical appellate process within the judicial system. Appeals serve as a core mechanism for error correction and ensuring fair trials, making them highly significant both in practice and in research. To address this gap, we present the AppealCase dataset, consisting of 10,000 pairs of real-world, matched first-instance and second-instance documents across 91 categories of civil cases. The dataset also includes detailed annotations along five dimensions central to appellate review: judgment reversals, reversal reasons, cited legal provisions, claim-level decisions, and whether there is new information in the second instance. Based on these annotations, we propose five novel LegalAI tasks and conduct a comprehensive evaluation across 20 mainstream models. Experimental results reveal that all current models achieve less than 50% F1 scores on the judgment reversal prediction task, highlighting the complexity and challenge of the appeal scenario. We hope that the AppealCase dataset will spur further research in LegalAI for appellate case analysis and contribute to improving consistency in judicial decision-making.

[NLP-79] Sparse Activation Editing for Reliable Instruction Following in Narratives

【Quick Read】: This paper addresses language models' weak instruction-following in complex narrative contexts, a difficulty existing benchmarks fail to capture. The key to the solution is Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural-language instructions, with no need for labelled data.

Link: https://arxiv.org/abs/2505.16505
Authors: Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Complex narrative contexts often challenge language models’ ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.

[NLP-80] LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing

【Quick Read】: This paper addresses the unclear mechanism by which large language models (LLMs) capture sentiment-related information, seeking to determine how sentiment features are distributed across the model's internal layers and how this affects sentiment analysis. The key to the solution is probing classifiers that analyze sentiment encoding across layers and pooling methods, identifying the layers that best capture sentiment signals. The study finds that sentiment information is most concentrated in the mid-layers, that in decoder-only models the last token is not consistently the most informative for sentiment encoding, and that the approach improves detection accuracy while substantially reducing memory requirements.

Link: https://arxiv.org/abs/2505.16491
Authors: Dario Di Palma, Alessandro De Bellis, Giovanni Servedio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Affiliations: Politecnico di Bari; Sapienza University of Rome
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.
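
The probing recipe itself is standard and compact: fit a linear probe on pooled hidden states from each layer and compare accuracies. The sketch below assumes features were already extracted upstream (e.g., mean-pooled activations from a model run with output_hidden_states=True); logistic regression and 5-fold cross-validation are illustrative choices.

```python
# Layer-wise sentiment probing sketch (feature extraction is assumed done upstream).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(layer_feats: list[np.ndarray], labels: np.ndarray) -> list[float]:
    """layer_feats[l]: (n_examples, hidden) pooled activations of layer l."""
    scores = []
    for feats in layer_feats:
        probe = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(probe, feats, labels, cv=5).mean())
    return scores  # the argmax layer is where sentiment is most linearly decodable
```

A plausible reading of the reported memory savings is that a mid-layer probe never needs the upper layers to be evaluated or cached at all.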

[NLP-81] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

【Quick Read】: This paper addresses the lack of faithfulness of large language models (LLMs) to provided context, which is crucial for building reliable information-seeking systems. The key to the solution is CANOE, a systematic framework that synthesizes high-quality, easily verifiable short-form question-answering (QA) data across four diverse tasks and introduces Dual-GRPO, a rule-based reinforcement learning method with three tailored rewards derived from the synthesized data, improving faithfulness in both short-form and long-form generation without human annotation. Dual-GRPO optimizes short-form and long-form response generation simultaneously, avoiding the over-optimization of short-form generation that arises when relying on short-form QA data alone.

Link: https://arxiv.org/abs/2505.16483
Authors: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Affiliations: Tsinghua University; Peking University; DeepLang AI; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.

[NLP-82] Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

【Quick Read】: This paper addresses the dual challenges of Document Visual Question Answering (DocVQA): processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Existing document retrieval-augmented generation (DocRAG) methods remain text-centric and frequently miss critical visual information, and the field lacks robust benchmarks for assessing multimodal evidence selection and integration. The key to the solution is MMDocRAG, a comprehensive benchmark of 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains, featuring novel metrics for evaluating multimodal quote selection and supporting answers that interleave text with relevant visual elements. Large-scale experiments reveal persistent challenges in multimodal evidence retrieval, selection, and integration; advanced proprietary models outperform open-source alternatives and gain moderately from multimodal inputs over text-only inputs, while open-source models degrade significantly, and fine-tuned LLMs improve substantially when given detailed image descriptions.

Link: https://arxiv.org/abs/2505.16470
Authors: Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
Affiliations: Huawei Noah's Ark Lab
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint; code available at this https URL

Abstract:Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Our findings reveal advanced proprietary LVMs show superior performance compared to open-sourced alternatives. Also, they show moderate advantages using multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at this https URL.

[NLP-83] Reading Between the Prompts: How Stereotypes Shape LLM s Implicit Personalization

【Quick Read】: This paper addresses implicit personalization in generative large language models (LLMs), where a model infers users' demographic information from subtle conversational cues, a phenomenon that can degrade response quality for users from minority groups. The key to the solution is intervening on the model's internal representations with a trained linear probe so as to steer them toward the user's explicitly stated identity, effectively mitigating stereotype-driven implicit personalization.

Link: https://arxiv.org/abs/2505.16467
Authors: Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Affiliations: Institute for Logic, Language and Computation, University of Amsterdam; Center for Language and Cognition, University of Groningen
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Generative Large Language Models (LLMs) infer user’s demographic information from subtle cues in the conversation – a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models’ latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model’s internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.
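
A minimal sketch of the probe-based intervention, assuming a linear probe with weight vector w trained to read out the latent demographic attribute: activations are shifted along w until the probe's read-out matches the explicitly stated identity. The update rule and the strength hyperparameter are illustrative assumptions.

```python
# Sketch: steer hidden states along a trained probe direction (assumed update rule).
import torch

def steer_with_probe(hidden: torch.Tensor, w: torch.Tensor, target: float,
                     strength: float = 1.0) -> torch.Tensor:
    """Move the probe read-out <hidden, w/|w|> toward `target` (e.g., +1 = stated group)."""
    w_hat = w / w.norm()
    current = hidden @ w_hat
    return hidden + strength * (target - current) * w_hat
```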

[NLP-84] University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

【Quick Read】: This paper addresses multilabel emotion classification across 28 languages, whose core challenge lies in handling cross-lingual semantic differences and the complexity of emotion expression. The key to the solution is comparing two main strategies, fully fine-tuning transformer models versus classifier-only training, and finding that training classifiers on top of prompt-based encoders such as mE5 and BGE significantly outperforms fully fine-tuning XLMR and mBERT. The best-performing system is an ensemble of multiple BGE models with CatBoost as the classifier, achieving an average F1-macro score of 56.58 across all languages.

Link: https://arxiv.org/abs/2505.16460
Authors: Ikhlasul Akmal Hanif, Eryawan Presma Yulianrifat, Jaycent Gunawan Ongris, Eduardus Tjitrahardja, Muhammad Falensi Azmi, Rahmat Bryan Naufal, Alfan Farizki Wicaksono
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 13 tables, 1 figure

Abstract:This paper presents our approach for SemEval 2025 Task 11 Track A, focusing on multilabel emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble combining multiple BGE models, where CatBoost serves as the classifier, with different configurations. This ensemble achieves an average F1-macro score of 56.58 across all languages.

[NLP-85] Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

【Quick Read】: This paper addresses the high resource cost of traditional A/B testing for evaluating and iterating on recommender systems, and the inability of offline methods to capture dynamic user-platform interactions. The key to the solution is RecInter, an agent-based simulation platform with a robust interaction mechanism: simulated user actions (e.g., likes, reviews, purchases) update item attributes in real time, and Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an advanced agent architecture, and LLMs fine-tuned on Chain-of-Thought (CoT)-enriched interaction data.

Link: https://arxiv.org/abs/2505.16429
Authors: Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Meituan; Wuhan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real-time, and introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through Multidimensional User Profiling module, Advanced Agent Architecture, and LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research.

[NLP-86] I2G: Generating Instructional Illustrations via Text-Conditioned Diffusion

【Quick Read】: This paper addresses the challenge of effectively communicating procedural knowledge in natural language processing (NLP), where purely textual instructions often fail to convey complex physical actions and spatial relationships. The key to the solution is a language-driven framework that translates procedural text into coherent visual instructions by decomposing the text into goal statements and sequential steps and conditioning visual generation on these linguistic elements. Its core innovations are a constituency parser-based text encoding mechanism that preserves semantic completeness even for lengthy instructions, a pairwise discourse coherence model that maintains consistency across instruction sequences, and a new evaluation protocol designed specifically for procedural language-to-image alignment.

Link: https://arxiv.org/abs/2505.16425
Authors: Jing Bi, Pinxin Liu, Ali Vosoughi, Jiarui Wu, Jinxi He, Chenliang Xu
Affiliations: University of Rochester
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures, under review

Abstract:The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address this limitation by proposing a language-driven framework that translates procedural text into coherent visual instructions. Our approach models the linguistic structure of instructional content by decomposing it into goal statements and sequential steps, then conditioning visual generation on these linguistic elements. We introduce three key innovations: (1) a constituency parser-based text encoding mechanism that preserves semantic completeness even with lengthy instructions, (2) a pairwise discourse coherence model that maintains consistency across instruction sequences, and (3) a novel evaluation protocol specifically designed for procedural language-to-image alignment. Our experiments across three instructional datasets (HTStep, CaptainCook4D, and WikiAll) demonstrate that our method significantly outperforms existing baselines in generating visuals that accurately reflect the linguistic content and sequential nature of instructions. This work contributes to the growing body of research on grounding procedural language in visual content, with applications spanning education, task guidance, and multimodal language understanding.

[NLP-87] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

【Quick Read】: This paper addresses the challenge of training effective multi-turn web agents on dynamic web interfaces, in particular long-horizon decision-making. The key to the solution is WebAgent-R1, a simple yet effective end-to-end multi-turn reinforcement learning framework that learns directly from online interactions by asynchronously generating diverse trajectories, guided entirely by binary rewards that depend on task success. Experiments show it lifts the task success rate of Qwen-2.5-3B and Llama-3.1-8B substantially, outperforming existing state-of-the-art methods and strong proprietary models.

Link: https://arxiv.org/abs/2505.16421
Authors: Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
Affiliations: University of Virginia; Amazon; Georgia Institute of Technology
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.

[NLP-88] Exploring the Relationship Between Diversity and Quality in Ad Text Generation

【Quick Read】: This paper addresses the relationship between diversity and ad quality in ad text generation, noting that existing diversity-enhancing methods have mainly been tested on tasks such as summarization and machine translation, and their effect on ad text generation has not been thoroughly explored. The key to the solution is a study that jointly considers multiple factors, including diversity-enhancing methods, their hyperparameters, input-output formats, and the models themselves, to analyze how these factors shape the diversity and quality of ad texts.

Link: https://arxiv.org/abs/2505.16418
Authors: Yoichi Aoki, Soichiro Murakami, Ukyo Honda, Akihiko Kato
Affiliations: Tohoku University; RIKEN; CyberAgent
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In natural language generation for advertising, creating diverse and engaging ad texts is crucial for capturing a broad audience and avoiding advertising fatigue. Regardless of the importance of diversity, the impact of the diversity-enhancing methods in ad text generation – mainly tested on tasks such as summarization and machine translation – has not been thoroughly explored. Ad text generation significantly differs from these tasks owing to the text style and requirements. This research explores the relationship between diversity and ad quality in ad text generation by considering multiple factors, such as diversity-enhancing methods, their hyperparameters, input-output formats, and the models.

[NLP-89] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

【Quick Read】: This paper addresses context attribution in retrieval-augmented generation (RAG), i.e., reliably tracing generated content back to specific context segments. Existing methods are computationally expensive and usually require extensive fine-tuning or human annotation, making dependable attribution difficult. The key to the solution is ARC-JSD, a novel Jensen-Shannon Divergence-driven method that efficiently and accurately identifies essential context sentences without additional fine-tuning or surrogate modelling.

Link: https://arxiv.org/abs/2505.16415
Authors: Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Affiliations: University of Aberdeen; Nanyang Technological University; University College London; University of Colorado Anschutz Medical Campus; University of Sheffield
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models.
zh
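
为帮助理解 ARC-JSD 的核心思想,下面给出一个极简的 Python 草图:对答案中的每个 token,比较“完整上下文”与“删去某一上下文句子”两种情形下的输出分布,并以 Jensen-Shannon 散度的均值为该句子打分。这只是笔者依据摘要所作的示意,并非论文官方实现;`answer_probs_full`、`answer_probs_ablated` 等输入结构与聚合方式均为演示假设。

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    # Jensen-Shannon 散度:对称且有界
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def arc_jsd_scores(answer_probs_full, answer_probs_ablated):
    """answer_probs_full: [答案token数, 词表] 的概率矩阵(完整上下文);
    answer_probs_ablated: {句子下标: 删去该句后的同形状矩阵}。"""
    scores = {}
    for idx, probs in answer_probs_ablated.items():
        per_token = [jsd(pf, pa) for pf, pa in zip(answer_probs_full, probs)]
        scores[idx] = float(np.mean(per_token))  # 值越大,该句对答案越关键
    return scores
```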

[NLP-90] Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

【速读】: 该论文试图解决如何通过强化学习(Reinforcement Learning, RL)算法有效赋能大型语言模型(Large Language Models, LLMs)进行多工具协同推理的问题。其关键解决方案是提出一种基于RL的框架——Tool-Star,该框架通过集成六种类型的工具,并结合数据合成与训练策略,实现LLMs在逐步推理过程中自主调用多个外部工具。核心创新包括一个通用的工具整合推理数据生成管道以及一个两阶段训练框架,分别用于提升模型的初始推理能力和多工具协作效率。

链接: https://arxiv.org/abs/2505.16410
作者: Guanting Dong,Yifei Chen,Xiaoxi Li,Jiajie Jin,Hongjin Qian,Yutao Zhu,Hangyu Mao,Guorui Zhou,Zhicheng Dou,Ji-Rong Wen
机构: Renmin University of China (中国人民大学); BAAI (北京智源人工智能研究院); Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Working in progress

点击查看摘要

Abstract:Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at this https URL.
zh

[NLP-91] From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs

【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)中适配文化价值观所面临的挑战,特别是由于偏见和训练数据有限导致的文化表征不准确问题。现有方法主要依赖世界价值观调查(World Values Survey, WVS)数据来对齐不同文化价值观,但其在捕捉文化细微差别和生成任务特定文化表征方面效果尚不明确。论文的关键解决方案是通过引入维基百科和NormAd中的百科全书式及情境化文化叙述来增强WVS数据,从而提升文化独特性,尽管这些叙述对下游任务的影响可能有所差异,但整体上优于仅使用调查数据的方法。

链接: https://arxiv.org/abs/2505.16408
作者: Muhammad Farid Adilazuarda,Chen Cecilia Liu,Iryna Gurevych,Alham Fikri Aji
机构: MBZUAI; UKP Lab, TU Darmstadt
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness more than survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior.
zh

[NLP-92] On the reliability of feature attribution methods for speech classification

【速读】: 该论文试图解决在语音处理领域中,标准特征归因方法的可靠性问题(feature attribution),特别是输入类型、聚合方式和扰动时间跨度等因素如何影响这些方法的有效性。研究发现,标准特征归因方法在语音领域通常不可靠,而基于词对齐的扰动方法在基于词的分类任务中表现较好。解决方案的关键在于识别并优化与特定分类任务特性相适应的特征归因策略,以提高模型输出解释的准确性和可信度。

链接: https://arxiv.org/abs/2505.16406
作者: Gaofei Shen,Hosein Mohebbi,Arianna Bisazza,Afra Alishahi,Grzegorz Chrupała
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:As the capabilities of large-scale pre-trained models evolve, understanding the determinants of their outputs becomes more important. Feature attribution aims to reveal which parts of the input elements contribute the most to model outputs. In speech processing, the unique characteristics of the input signal make the application of feature attribution methods challenging. We study how factors such as input type and aggregation and perturbation timespan impact the reliability of standard feature attribution methods, and how these factors interact with characteristics of each classification task. We find that standard approaches to feature attribution are generally unreliable when applied to the speech domain, with the exception of word-aligned perturbation methods when applied to word-based classification tasks.
zh

[NLP-93] AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

【速读】: 该论文试图解决在构建高性能推理模型时,训练策略不明确以及小规模模型中知识蒸馏仍优于强化学习(Reinforcement Learning, RL)的问题。其解决方案的关键在于通过大规模强化学习显著提升小型和中型模型的推理能力,具体方法是通过系统性地进行强化学习训练过程的消融实验,并提出一种简单有效的训练策略:首先在仅数学提示上进行训练,随后在仅代码提示上进行训练。此方法不仅提升了模型在数学和代码推理任务上的表现,还通过数据筛选管道确保了高质量、可验证的提示数据,从而实现了跨领域的基于验证的强化学习。

链接: https://arxiv.org/abs/2505.16400
作者: Yang Chen,Zhuolin Yang,Zihan Liu,Chankyu Lee,Peng Xu,Mohammad Shoeybi,Bryan Catanzaro,Wei Ping
机构: NVIDIA(英伟达)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: We release the model at: this https URL

点击查看摘要

Abstract:Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model’s reasoning ability, enabling it to solve problems that were previously unsolvable.
zh

[NLP-94] Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection SIGIR2025

【速读】: 该论文试图解决自动文本简化(Automatic Text Simplification, ATS)中评估方法滞后于文本生成技术发展的问题,尤其是大型语言模型(Large Language Models, LLMs)的广泛应用。现有ATS评估指标与错误存在缺乏相关性,且手动检查揭示了多种错误类型,表明需要更细致的评估框架。该论文的关键解决方案是引入一个用于检测和分类简化文本中错误的测试集,包括提出一种错误分类体系,构建一个经过人工标注的并行数据集,并分析现有模型在该分类体系下的错误检测与分类性能,从而为研究者提供更有效的工具以提升ATS系统的可靠性与准确性。

链接: https://arxiv.org/abs/2505.16392
作者: Benjamin Vendeville,Liana Ermakova,Pierre De Loor
机构: Université de Bretagne Occidentale (布列塔尼西部大学); Lab-STICC (UMR CNRS 6285); HCTI; ENIB
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at SIGIR 2025

点击查看摘要

Abstract:The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.
zh

[NLP-95] Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨语言任务中的能力获取机制问题,以及如何有效提升其跨语言能力。解决方案的关键在于通过分析LLMs在词级跨语言翻译任务中中间层的输出,识别出两种不同的行为模式:共现行为和语义枢纽行为,并基于词语共现频率和预训练数据集中的语义枢纽构建一个语义枢纽感知的预训练数据集,从而提升LLMs的跨语言能力。

链接: https://arxiv.org/abs/2505.16385
作者: Kaiyu He,Tong Zhou,Yubo Chen,Delai Qiu,Shengping Liu,Kang Liu,Jun Zhao
机构: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems (复杂系统认知与决策智能重点实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); University of Science and Technology Beijing (北京科技大学); Unisound AI Technology Co., Ltd (云知声智能科技); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Large language models (LLMs) demonstrate remarkable ability in cross-lingual tasks. Understanding how LLMs acquire this ability is crucial for their interpretability. To quantify the cross-lingual ability of LLMs accurately, we propose a Word-Level Cross-Lingual Translation Task. To find how LLMs learn cross-lingual ability, we trace the outputs of LLMs’ intermediate layers in the word translation task. We identify and distinguish two distinct behaviors in the forward pass of LLMs: co-occurrence behavior and semantic pivot behavior. We attribute LLMs’ two distinct behaviors to the co-occurrence frequency of words and find the semantic pivot from the pre-training dataset. Finally, to apply our findings to improve the cross-lingual ability of LLMs, we reconstruct a semantic pivot-aware pre-training dataset using documents with a high proportion of semantic pivots. Our experiments validate the effectiveness of our approach in enhancing cross-lingual ability. Our research contributes insights into the interpretability of LLMs and offers a method for improving LLMs’ cross-lingual ability.
zh
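
论文通过追踪中间层输出来区分共现行为与语义枢纽行为。下面的 PyTorch 草图采用 logit-lens 式的近似:用输出头解码每一层最后位置的隐藏状态,观察候选译词概率随层数的演化。其中 `model.model.norm`、`model.lm_head` 的属性路径以 Llama 类 Hugging Face 模型为假设,并非论文实现。

```python
import torch

@torch.no_grad()
def trace_word_probs(model, tokenizer, prompt, candidates):
    # 逐层用输出头近似解码(logit-lens 思路),观察候选译词概率随层数的变化
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    cand_ids = [tokenizer(c, add_special_tokens=False).input_ids[0] for c in candidates]
    per_layer = []
    for h in out.hidden_states:
        # 属性路径以 Llama 类结构为假设,其他模型需相应调整
        logits = model.lm_head(model.model.norm(h[:, -1]))
        probs = logits.softmax(-1)[0]
        per_layer.append({c: probs[i].item() for c, i in zip(candidates, cand_ids)})
    return per_layer
```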

[NLP-96] PaTH Attention: Position Encoding via Accumulating Householder Transformations

【速读】: 该论文旨在解决传统位置编码方法(如RoPE)在建模序列时表达能力受限的问题:RoPE中的键/查询变换仅依赖于元素间的相对位置,而与实际输入无关。其解决方案的关键在于提出PaTH,一种基于类Householder变换累积乘积的灵活数据依赖位置编码方案,其中每个变换都由输入决定,从而增强了模型对序列结构的建模能力(见本条目末尾的示意代码)。

链接: https://arxiv.org/abs/2505.16381
作者: Songlin Yang,Yikang Shen,Kaiyue Wen,Shawn Tan,Mayank Mishra,Liliang Ren,Rameswar Panda,Yoon Kim
机构: Massachusetts Institute of Technology (麻省理工学院); MIT-IBM Watson AI Lab (MIT-IBM沃森人工智能实验室); Stanford University (斯坦福大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
zh
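
下面的 Python 草图以最朴素的 O(T²d²) 方式演示“累积 Householder 变换”的思路:注意力分数由查询与经过一串数据依赖的类 Householder 变换后的键相乘得到。变换的累积方向与参数化(v 与 beta 的来源)均为演示假设;论文本身给出的是利用 Householder 矩阵乘积紧凑表示的高效并行算法与 FlashAttention 风格的分块实现,此处不作复现。

```python
import numpy as np

def householder_like(v, beta):
    # H = I - beta * v v^T,v 与 beta 均由输入决定(数据依赖)
    return np.eye(v.shape[0]) - beta * np.outer(v, v)

def path_attention_scores(q, k, v_dirs, betas):
    # q, k: [T, d];v_dirs: [T, d];betas: [T]。O(T^2 d^2) 的朴素参考实现
    T, d = q.shape
    scores = np.full((T, T), -np.inf)
    for i in range(T):
        P = np.eye(d)  # 从位置 i 向左累积的(类)Householder 变换乘积
        for j in range(i, -1, -1):
            scores[i, j] = q[i] @ (P @ k[j])
            P = P @ householder_like(v_dirs[j], betas[j])
    return scores  # 对每行做 softmax 即得因果注意力权重
```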

[NLP-97] Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization ECIR2025

【速读】: 该论文旨在解决科学出版物数量激增导致研究人员难以有效跟踪和综合知识的问题。其解决方案的关键在于提出XSum,一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的多文档摘要(Multi-Document Summarization, MDS)模块化流程,该流程包含问题生成模块和编辑模块,能够动态生成与输入论文相关的问题,并将检索到的内容整合为符合学术规范的连贯摘要。

链接: https://arxiv.org/abs/2505.16349
作者: Pierre Achkar,Tim Gollub,Martin Potthast
机构: Leipzig University (莱比锡大学); Fraunhofer ISI Leipzig (弗劳恩霍夫研究所莱比锡); Bauhaus-Universität Weimar (包豪斯魏玛大学); Kassel University (卡塞尔大学); hessian.AI (黑森人工智能); ScaDS.AI (可扩展数据分析与人工智能中心)
类目: Computation and Language (cs.CL)
备注: Accepted at SCOLIA@ECIR 2025 Workshop

点击查看摘要

Abstract:The exponential growth of scientific publications has made it increasingly difficult for researchers to stay updated and synthesize knowledge effectively. This paper presents XSum, a modular pipeline for multi-document summarization (MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The pipeline includes two core components: a question-generation module and an editor module. The question-generation module dynamically generates questions adapted to the input papers, ensuring the retrieval of relevant and accurate information. The editor module synthesizes the retrieved content into coherent and well-structured summaries that adhere to academic standards for proper citation. Evaluated on the SurveySum dataset, XSum demonstrates strong performance, achieving considerable improvements in metrics such as CheckEval, G-Eval and Ref-F1 compared to existing approaches. This work provides a transparent, adaptable framework for scientific summarization with potential applications in a wide range of domains. Code available at this https URL
zh
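
依据摘要描述的“问题生成 + 检索 + 编辑”三段式结构,下面给出一个与具体框架无关的 Python 流程草图;`generate`、`retrieve` 为占位函数,提示词内容亦为演示假设,并非论文实现。

```python
def xsum_pipeline(papers, generate, retrieve, n_questions=5):
    """generate(提示)->文本、retrieve(问题, papers)->相关段落列表,均为占位函数。"""
    # 问题生成模块:针对输入论文动态生成综述问题
    questions = generate(
        "针对以下论文集合,提出" + str(n_questions) + "个覆盖其核心内容的综述问题:\n"
        + "\n".join(p["title"] for p in papers)
    ).splitlines()
    evidence = []
    for q in questions:
        passages = retrieve(q, papers)  # 问题驱动的检索,保证信息相关且可溯源
        evidence.append({"question": q, "passages": passages})
    # 编辑模块:把检索到的内容整合为带引用的连贯综述
    body = "\n".join(
        "问题: " + e["question"] + "\n证据: " + " ".join(e["passages"]) for e in evidence
    )
    return generate("请根据以下问题与证据,撰写结构化、带引用的综述:\n" + body)
```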

[NLP-98] Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

【速读】: 该论文试图解决 embodied agents 在个性化辅助任务中对记忆利用不足的问题,即现有系统在处理动态、现实世界的指令时,难以有效理解用户赋予物理世界的独特语义(如偏好物品或日常习惯)。解决方案的关键在于提出 MEMENTO 框架,该框架通过两阶段的记忆评估流程,量化记忆利用对任务性能的影响,从而评估代理在目标解释中对个性化知识的理解能力,包括基于个人意义的目标物体识别和从用户模式中推断物体位置配置的能力。

链接: https://arxiv.org/abs/2505.16348
作者: Taeyoon Kwon,Dongwook Choi,Sunghwan Kim,Hyojun Kim,Seungjun Moon,Beong-woo Kwak,Kuan-Hao Huang,Jinyoung Yeo
机构: Yonsei University (延世大学); Texas A&M University (德克萨斯A&M大学)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents’ understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: this https URL
zh

[NLP-99] SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers

【速读】: 该论文试图解决学术论文新颖性(novelty)自动评估的问题,现有方法通常仅关注词汇或实体组合,难以全面反映论文的新颖性。其解决方案的关键在于探索学术论文不同核心部分(如引言、方法、结果和讨论,IMRaD)的最佳组合,以提升语言模型对新颖性评分的预测效果。研究发现,引言、结果和讨论部分的组合在评估论文新颖性方面最为有效,而使用全文则未表现出显著优势。此外,引言和结果部分被证明是新颖性评分预测任务中最重要的组成部分。

链接: https://arxiv.org/abs/2505.16330
作者: Wenqing Wu,Chengzhi Zhang,Tong Bao,Yi Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注:

点击查看摘要

Abstract:Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper’s novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we used different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important section for the task of novelty score prediction. The code and dataset for this paper can be accessed at this https URL.
zh

[NLP-100] CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation

【速读】: 该论文试图解决现有评估指标在捕捉候选报告与真实报告之间细微临床差异时缺乏粒度和可解释性的问题,导致评估效果不理想。其解决方案的关键在于提出一种基于临床的表格框架——CLEAR(Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation),该框架不仅评估报告是否能准确识别医学状况的存在或缺失,还通过五个关键属性(首次出现、变化、严重程度、描述性位置和建议)对阳性识别的状况进行精确描述,从而实现多维度、属性级别的临床可解释性评估。

链接: https://arxiv.org/abs/2505.16325
作者: Yuyang Jiang,Chacha Chen,Shengyuan Wang,Feng Li,Zecong Tang,Benjamin M. Mervak,Lydia Chelala,Christopher M Straus,Reve Chahine,Samuel G. Armato III,Chenhao Tan
机构: University of Chicago (芝加哥大学); Tsinghua University (清华大学); Zhejiang University (浙江大学); University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR not only examines whether a report can accurately identify the presence or absence of medical conditions, but also assesses whether it can precisely describe each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared to prior works, CLEAR’s multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments show that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.
zh
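
下面的 Python 草图演示 CLEAR 式的属性级表格比较:假设上游已用 LLM 把候选报告与参考报告分别抽取为 {病症: 属性} 表,再逐病症比较“存在性”与五个关键属性。抽取步骤与属性字段命名均为演示假设。

```python
ATTRS = ["first_occurrence", "change", "severity", "descriptive_location", "recommendation"]

def clear_compare(pred, gold):
    """pred/gold: {病症: {"present": bool, 以及上面五个属性}},由上游抽取步骤产出。"""
    rows = []
    for cond, g in gold.items():
        p = pred.get(cond, {"present": False})
        row = {"condition": cond, "presence_match": p.get("present", False) == g["present"]}
        if g["present"] and p.get("present", False):
            for a in ATTRS:  # 仅对双方都判定为阳性的病症比较属性
                row[a] = p.get(a) == g.get(a)
        rows.append(row)
    return rows
```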

[NLP-101] AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

【速读】: 该论文试图解决自提升推理语言模型(self-improving reasoning Language Models, LMs)在训练过程中因随机观测数据采样导致的训练不平衡问题,即模型对简单示例过度训练而对困难示例训练不足。解决方案的关键在于提出一种新的算法——自适应自教推理(Adaptive STaR, AdaSTaR),其核心是整合两种自适应采样原则:(1)多样性采样,以促进不同观测数据的平衡训练;(2)课程采样,动态调整数据难度以匹配模型的演化能力。

链接: https://arxiv.org/abs/2505.16322
作者: Woosung Koh,Wonbeen Oh,Jaein Jang,MinHyung Lee,Hyeongjin Kim,Ah Yeon Kim,Joonkee Kim,Junghyun Lee,Taehyeon Kim,Se-Young Yun
机构: Yonsei University (延世大学); LG AI Research (LG人工智能研究院); KAIST AI (KAIST人工智能)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Pre-print

点击查看摘要

Abstract:Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance; inefficiently over-training on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model’s evolving strength. Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
zh
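
下面用一个玩具化的 Python 采样函数示意“多样性 + 课程”两条自适应采样原则如何结合:训练次数越少的样本权重越高(多样性),难度越接近模型当前能力的样本权重越高(课程)。权重的具体函数形式为笔者的演示假设,与论文的实际算法细节并不等同。

```python
import numpy as np

def adaptive_sample(train_counts, solve_rates, model_strength, batch_size, rng=None):
    rng = rng or np.random.default_rng()
    counts = np.asarray(train_counts, dtype=float)
    diversity = 1.0 / (1.0 + counts)              # 多样性:偏向训练次数少的样本
    difficulty = 1.0 - np.asarray(solve_rates)    # 0=容易,1=困难
    curriculum = np.exp(-((difficulty - model_strength) ** 2) / 0.1)  # 课程:匹配当前能力
    w = diversity * curriculum
    p = w / w.sum()
    return rng.choice(len(counts), size=batch_size, replace=False, p=p)
```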

[NLP-102] Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在复杂推理任务中容易出现的过度思考问题,即生成冗余内容的问题。解决方案的关键在于提出一种基于认知科学双过程理论的强化学习框架——自适应认知策略优化(Adaptive Cognition Policy Optimization, ACPO),通过自适应认知分配和动态系统切换实现高效推理。ACPO的核心组件包括引入系统感知的推理标记以显式表示思维模式,以及整合在线难度估计和令牌长度预算以指导推理过程中的自适应系统切换。

链接: https://arxiv.org/abs/2505.16315
作者: Xiaoxue Cheng,Junyi Li,Zhenduo Zhang,Xinyu Tang,Wayne Xin Zhao,Xinyu Kong,Zhiqiang Zhang
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系); Ant Group (蚂蚁集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) introducing system-aware reasoning tokens to explicitly represent the thinking modes thereby making the model’s cognitive process transparent, and (2) integrating online difficulty estimation and token length budget to guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.
zh

[NLP-103] PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models

【速读】: 该论文旨在解决现有提示优化方法依赖昂贵的输出生成、自我批判能力或人工标注偏好的问题,这些依赖限制了方法的可扩展性,尤其难以应用于较小或非指令调优的模型。其解决方案的关键在于提出PMPO(Probabilistic Metric Prompt Optimization),该方法以令牌级交叉熵损失作为直接且轻量的评估信号来优化提示词:通过掩码与损失影响分析识别低质量提示片段,再通过在正负样本上最小化损失来重写并筛选改进的变体;整个优化过程无需输出采样或人工评估,仅依赖前向传递和对数似然值(见本条目末尾的示意代码)。

链接: https://arxiv.org/abs/2505.16307
作者: Chenzhuo Zhao,Ziqian Liu,Xingda Wang,Junting Lu,Chaoyi Ruan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO’s effectiveness, efficiency, and broad applicability.
zh
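
PMPO 的评估信号是令牌级交叉熵损失。下面的 Python 草图基于 Hugging Face 因果语言模型接口,演示“去掉某个提示片段后目标文本损失上升多少”的片段重要性度量;重写与筛选改进变体的后续步骤此处省略,函数与切分方式均为演示假设。

```python
import torch

@torch.no_grad()
def segment_losses(model, tokenizer, prompt_segments, target_text):
    """用交叉熵损失评估每个提示片段的重要性:去掉某片段后损失上升越多,该片段越关键。"""
    def ce_loss(prompt):
        ids = tokenizer(prompt + target_text, return_tensors="pt").input_ids
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        labels = ids.clone()
        labels[:, :prompt_len] = -100  # 只对目标文本计损失
        return model(ids, labels=labels).loss.item()

    base = ce_loss("".join(prompt_segments))
    deltas = [ce_loss("".join(s for j, s in enumerate(prompt_segments) if j != i)) - base
              for i in range(len(prompt_segments))]
    return base, deltas  # delta 越大的片段越重要;低质量片段可据此重写
```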

[NLP-104] INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling

【速读】: 该论文试图解决大规模语言模型(Large Language Model, LLM)路由中的可扩展性和适应性问题,即在面对大量专业化的LLM时,现有路由方法难以有效扩展或适应模型范围的扩展和能力领域的变化。解决方案的关键在于提出InferenceDynamics,这是一个通过建模模型的能力和知识来实现灵活且可扩展的多维路由框架,能够在不同任务中识别并利用表现最佳的模型,从而实现高效资源利用和优越的性能。

链接: https://arxiv.org/abs/2505.16303
作者: Haochen Shi,Tianshi Zheng,Weiqi Wang,Baixuan Xu,Chunyang Li,Chunkit Chan,Tao Fan,Yangqiu Song,Qiang Yang
机构: The Hong Kong University of Science and Technology (香港科技大学); WeBank (微众银行); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注: 17 pages

点击查看摘要

Abstract:Large Language Model (LLM) routing is a pivotal technique for navigating a diverse landscape of LLMs, aiming to select the best-performing LLMs tailored to the domains of user queries, while managing computational resources. However, current routing approaches often face limitations in scalability when dealing with a large pool of specialized LLMs, or in their adaptability to extending model scope and evolving capability domains. To overcome those challenges, we propose InferenceDynamics, a flexible and scalable multi-dimensional routing framework by modeling the capability and knowledge of models. We operate it on our comprehensive dataset RouteMix, and demonstrate its effectiveness and generalizability in group-level routing using modern benchmarks including MMLU-Pro, GPQA, BigGenBench, and LiveBench, showcasing its ability to identify and leverage top-performing models for given tasks, leading to superior outcomes with efficient resource utilization. The broader adoption of Inference Dynamics can empower users to harness the full specialized potential of the LLM ecosystem, and our code will be made publicly available to encourage further research.
zh
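
下面给出一个极简的多维路由草图:把查询的能力/知识需求向量与各模型画像做内积匹配,并可选地扣除调用成本。真实框架的画像构建与打分远比这复杂,此处仅用于说明“按能力与知识建模进行路由”的基本思想,所有函数与参数均为演示假设。

```python
import numpy as np

def route_query(query_profile, model_profiles, cost=None, cost_weight=0.0):
    """query_profile: 查询所需的能力/知识维度权重;model_profiles: {模型名: 同维度画像}。"""
    best, best_score = None, -np.inf
    for name, prof in model_profiles.items():
        score = float(np.dot(query_profile, prof))      # 能力匹配度
        if cost is not None:
            score -= cost_weight * cost[name]           # 可选:扣除调用成本
        if score > best_score:
            best, best_score = name, score
    return best
```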

[NLP-105] ToDi: Token-wise Distillation via Fine-Grained Divergence Control

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在资源受限环境下部署时面临的高延迟和高能耗问题,通过知识蒸馏(Knowledge Distillation, KD)将教师模型的知识迁移至更小的学生模型。传统KD方法如Forward KL(FKL)和Reverse KL(RKL)对整个词汇表应用统一的发散损失,忽略了token级别的预测差异。该论文的关键解决方案是提出一种基于token级别的知识蒸馏方法——ToDi,其通过基于sigmoid的权重函数自适应地结合FKL和RKL,根据教师-学生概率对数比动态调整每个token的发散损失,从而实现更精确的概率分布对齐。

链接: https://arxiv.org/abs/2505.16297
作者: Seongryong Jung,Suwan Yoon,DongGeon Kim,Hwanhee Lee
机构: Chung-Ang University (韩国中央大学); Dmtlabs
类目: Computation and Language (cs.CL)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi’s effectiveness and practicality.
zh
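
依据摘要,ToDi 用教师-学生概率对数比经 sigmoid 得到的权重自适应混合 FKL 与 RKL。下面的 PyTorch 草图仅示意这一组合方式(此处按词表项加权);加权的具体粒度与归一化细节为演示假设,非论文官方实现。

```python
import torch
import torch.nn.functional as F

def todi_loss(teacher_logits, student_logits, eps=1e-12):
    p = F.softmax(teacher_logits, dim=-1)   # 教师分布
    q = F.softmax(student_logits, dim=-1)   # 学生分布
    log_p, log_q = (p + eps).log(), (q + eps).log()
    w = torch.sigmoid(log_p - log_q)        # 教师-学生对数比经 sigmoid 得到的权重
    fkl = p * (log_p - log_q)               # FKL 项:抬升被低估的 token
    rkl = q * (log_q - log_p)               # RKL 项:压制被高估的 token
    return (w * fkl + (1.0 - w) * rkl).sum(-1).mean()
```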

[NLP-106] Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA

【速读】: 该论文旨在解决迭代式检索增强生成(Iterative RAG)在多跳问答任务中面临的长上下文和无关信息累积问题,这些问题限制了模型对检索内容的处理与推理能力。论文提出的解决方案关键在于“Notes Writing”方法,该方法在每一步从检索到的文档中生成简洁且相关的信息笔记,从而减少噪声并保留关键信息,间接提升了大语言模型(LLM)的有效上下文长度,使其在处理大量输入文本时能够更有效地进行推理和规划。

链接: https://arxiv.org/abs/2505.16293
作者: Rishabh Maheshwary,Masoud Hashemi,Khyati Mahajan,Shiva Krishna Reddy Malay,Sai Rajeswar,Sathwik Tejaswi Madhusudhan,Spandana Gella,Vikas Yadav
机构: ServiceNow; ServiceNow Research
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Iterative RAG for multi-hop question answering faces challenges with lengthy contexts and the buildup of irrelevant information. This hinders a model’s capacity to process and reason over retrieved content and limits performance. While recent methods focus on compressing retrieved information, they are either restricted to single-round RAG, require finetuning or lack scalability in iterative RAG. To address these challenges, we propose Notes Writing, a method that generates concise and relevant notes from retrieved documents at each step, thereby reducing noise and retaining only essential information. This indirectly increases the effective context length of Large Language Models (LLMs), enabling them to reason and plan more effectively while processing larger volumes of input text. Notes Writing is framework agnostic and can be integrated with different iterative RAG methods. We demonstrate its effectiveness with three iterative RAG methods, across two models and four evaluation datasets. Notes writing yields an average improvement of 15.6 percentage points overall, with minimal increase in output tokens.
zh
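
Notes Writing 与具体迭代式 RAG 框架无关。下面的 Python 草图演示其基本回路:每一步只把检索文档压缩成与问题相关的简明笔记,后续推理仅携带笔记而非全文。`retrieve`、`generate` 为占位函数,提示词与终止约定均为演示假设。

```python
def iterative_rag_with_notes(question, retrieve, generate, max_steps=4):
    """retrieve(查询)->文档列表,generate(提示)->文本,均为占位函数。"""
    notes, query = [], question
    for _ in range(max_steps):
        docs = retrieve(query)
        note = generate("问题:" + question + "\n文档:\n" + "\n".join(docs) +
                        "\n请写一条只保留与问题相关信息的简明笔记。")
        notes.append(note)  # 只携带笔记而非全文,从而抑制无关信息累积
        step = generate("问题:" + question + "\n已有笔记:\n" + "\n".join(notes) +
                        "\n若可作答请以 ANSWER: 开头,否则给出下一个检索查询。")
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        query = step
    return generate("问题:" + question + "\n笔记:\n" + "\n".join(notes) + "\n请作答:")
```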

[NLP-107] HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的机器翻译评估方法在准确识别错误范围和评估其严重性方面存在的挑战。现有方法未能充分利用多维质量度量(MQM)层次结构中的细粒度结构和语义信息。论文提出的解决方案是HiMATE,一种基于MQM错误类型学的分层多智能体框架,通过引入模型自我反思能力和涉及非对称信息的智能体讨论机制,有效缓解了系统性幻觉问题,从而实现了更精确的子类型错误评估。

链接: https://arxiv.org/abs/2505.16281
作者: Shijie Zhang,Renhao Li,Songsheng Wang,Philipp Koehn,Min Yang,Derek F. Wong
机构: University of Macau (澳门大学); Chinese Academy of Sciences (中国科学院); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model’s self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at this https URL.
zh

[NLP-108] Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility NAACL2025

【速读】: 该论文试图解决如何从认知角度更好地理解大型语言模型(Large Language Models, LLMs)的特性,特别是在高资源语言中的表现。其解决方案的关键在于利用自发口语语料库来提取生产变量(如语音简化和语调突出),并将其应用于评估模型在预测这些变量方面的性能。研究通过在不同预训练数据集(书面、口语和混合类型)上训练模型,并进行微调,验证了模型在预测生产变量上的有效性,结果显示口语语料的训练数据能够提供更准确的预测结果。

链接: https://arxiv.org/abs/2505.16277
作者: Sheng-Fu Wang,Laurent Prevot,Jou-an Chi,Ri-Sheng Huang,Shu-Kai Hsieh
机构: Academia Sinica (中央研究院); CNRS & MEAE (法国国家科学研究中心与法国欧洲与外交部); Graduate Institute of Linguistics (语言学研究所); National Taiwan University (台湾大学); Department of CSIE (资讯工程学系)
类目: Computation and Language (cs.CL)
备注: The 14th Workshop on Cognitive Modeling and Computational Linguistics (CMCL). May 3, 2025. Collocated with NAACL 2025

点击查看摘要

Abstract:The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract these two production variables from spontaneous speech corpora. We then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict these two variables. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.
zh

[NLP-109] How do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance ESWC2025

【速读】: 该论文旨在解决在知识图谱工程(Knowledge Graph Engineering, KGE)任务中,如何平衡大型语言模型(Large Language Models, LLMs)的性能与资源成本的问题。其关键解决方案是通过构建LLM-KG-Bench框架,对不同规模的LLMs在KGE任务中的表现进行系统性评估,从而揭示模型规模与任务性能之间的关系,并识别出在特定任务下具有高成本效益的模型规模。

链接: https://arxiv.org/abs/2505.16276
作者: Desiree Heim,Lars-Peter Meyer,Markus Schröder,Johannes Frey,Andreas Dengel
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Peer reviewed and to appear in the ESWC 2025 Workshops and Tutorials Joint Proceedings (Workshop on Evaluation of Language Models in Knowledge Engineering [ELMKE])

点击查看摘要

Abstract:When using Large Language Models (LLMs) to support Knowledge Graph Engineering (KGE), one of the first indications when searching for an appropriate model is its size. According to the scaling laws, larger models typically show higher capabilities. However, in practice, resource costs are also an important factor and thus it makes sense to consider the ratio between model performance and costs. The LLM-KG-Bench framework enables the comparison of LLMs in the context of KGE tasks and assesses their capabilities of understanding and producing KGs and KG queries. Based on a dataset created in an LLM-KG-Bench run covering 26 open state-of-the-art LLMs, we explore the model size scaling laws specific to KGE tasks. In our analyses, we assess how benchmark scores evolve between different model size categories. Additionally, we inspect how the general score development of single models and families of models correlates to their size. Our analyses revealed that, with a few exceptions, the model size scaling laws generally also apply to the selected KGE tasks. However, in some cases, plateau or ceiling effects occurred, i.e., the task performance did not change much between a model and the next larger model. In these cases, smaller models could be considered to achieve high cost-effectiveness. Regarding models of the same family, sometimes larger models performed worse than smaller models of the same family. These effects occurred only locally. Hence it is advisable to additionally test the next smallest and largest model of the same family.
zh

[NLP-110] Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

【速读】: 该论文旨在解决传统微调方法在优化模型参数时仅关注生成损失最小化,而忽视模型自身学习信号的问题。其解决方案的关键在于引入“Mistake Log”以系统性地跟踪模型的学习行为和重复错误,并设计了一个Copilot模型通过logits修正来优化Pilot模型的推理性能,形成Transformer Copilot框架,该框架包括新型Copilot模型设计、联合训练范式以及融合推理范式,从而在保持较低计算开销的同时显著提升模型性能。

链接: https://arxiv.org/abs/2505.16270
作者: Jiaru Zou,Yikun Ban,Zihao Li,Yunzhe Qi,Ruizhong Qiu,Ling Yang,Jingrui He
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 7 figures

点击查看摘要

Abstract:Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model’s own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model’s learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot’s inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot’s logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.
zh
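
下面用一个极小的 PyTorch 模块示意“Copilot 对 Pilot 进行 logits 修正”的融合推理形态:在 Pilot 隐藏状态上接一个小型修正头,把修正量加回原始 logits。模块结构、融合公式与系数 alpha 均为演示假设,且省略了基于 Mistake Log 的联合训练部分。

```python
import torch.nn as nn

class CopilotHead(nn.Module):
    """在 Pilot 隐藏状态之上的小型修正头,用于推理阶段的 logits 修正。"""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, vocab_size))

    def forward(self, pilot_hidden, pilot_logits, alpha=1.0):
        # 融合推理:Pilot 原始 logits + Copilot 给出的修正量
        return pilot_logits + alpha * self.proj(pilot_hidden)
```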

[NLP-111] All You Need is “Leet”: Evading Hate-speech Detection AI

【速读】: 该论文旨在解决在线平台上仇恨言论(hate speech)检测模型被规避的问题,通过设计黑盒技术生成扰动,使先进的基于深度学习的仇恨言论检测模型失效,从而保护用户免受仇恨言论的影响。解决方案的关键在于生成能够欺骗检测模型的扰动,同时确保对原始仇恨言论语义的最小改变。

链接: https://arxiv.org/abs/2505.16263
作者: Sampanna Yashwant Kahu,Naman Ahuja
机构: Virginia Tech(弗吉尼亚理工大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 22 figures, The source code and data used in this work is available at: this https URL

点击查看摘要

Abstract:Social media and online forums are increasingly becoming popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate-speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate-speech. Our best perturbation attack successfully evades hate-speech detection for 86.8% of hateful text.
zh

[NLP-112] IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection

【速读】: 该论文试图解决跨多模态输入的隐喻语言(如讽刺)识别问题,该问题通常需要任务特定的微调和复杂的推理步骤。其解决方案的关键在于提出IRONIC框架,该框架利用多模态一致性关系来分析指称性、类比性和语用性的图像-文本关联,从而更有效地模拟人类识别讽刺的认知过程。

链接: https://arxiv.org/abs/2505.16258
作者: Aashish Anantha Ramakrishnan,Aadarsh Anantha Ramakrishnan,Dongwon Lee
机构: The Pennsylvania State University (宾夕法尼亚州立大学); National Institute of Technology, Tiruchirappalli (印度国立理工学院特里奇拉帕利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: this https URL
zh

[NLP-113] Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models

【速读】: 该论文试图解决大型语言模型中无意保留内容的问题,即知识去遗忘(knowledge unlearning)问题。现有方法倾向于采用局部化去遗忘(localized unlearning),通过限制参数更新范围来移除目标知识同时保留其他通用知识。然而,该论文指出,这些方法的有效性尚未得到充分验证,因为缺乏对去遗忘与知识保留之间权衡的系统评估。论文的关键在于通过控制实验验证局部参数更新是否因果性地促进去遗忘,其发现表明,有效去遗忘所需的参数集合并非严格确定,从而挑战了局部化去遗忘的核心假设,即参数局部性可直接指示知识移除的有效性。

链接: https://arxiv.org/abs/2505.16252
作者: Hwiyeong Lee,Uiji Hwang,Hyelim Lim,Taeuk Kim
机构: Hanyang University (汉阳大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models often retain unintended content, prompting growing interest in knowledge unlearning. Recent approaches emphasize localized unlearning, which restricts parameter updates to specific regions in an effort to remove target knowledge while preserving unrelated general knowledge. However, their effectiveness remains uncertain due to the lack of robust and thorough evaluation of the trade-off between the competing goals of unlearning. In this paper, we begin by revisiting existing localized unlearning approaches. We then conduct controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Our findings reveal that the set of parameters that must be modified for effective unlearning is not strictly determined, challenging the core assumption of localized unlearning that parameter locality is inherently indicative of effective knowledge removal.
zh

[NLP-114] Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models

【速读】: 该论文试图解决语言模型输出多样性不足的问题,尤其是在创造性生成、开放性任务和自我改进训练中,现有多样性度量和用于偏好优化的奖励模型存在偏向于较短输出的系统性偏差,从而限制了模型的表达能力。解决方案的关键在于引入Diverse-NS,这是一个长度可控的自学习框架,通过生成和筛选在多样性、质量和长度之间取得平衡的偏好数据,实现仅使用3,000对偏好数据的有效训练,从而在保持长度一致性的同时提升响应的词汇和语义多样性。

链接: https://arxiv.org/abs/2505.16245
作者: Vijeta Deshpande,Debasmita Ghose,John D. Patterson,Roger Beaty,Anna Rumshisky
机构: University of Massachusetts Lowell (马萨诸塞大学洛厄尔分校); Yale University (耶鲁大学); Pennsylvania State University (宾夕法尼亚州立大学); Amazon AGI (亚马逊AGI)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity with minor reduction or gains in response quality on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B, and 13B) show that smaller models like Olmo-2-7B can serve as effective “diversity teachers” for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.
zh
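
下面的 Python 草图演示“长度可控的偏好数据构造”这一核心约束:只有长度相近的回复对才会进入偏好数据,且不以牺牲质量换取多样性。`diversity`、`quality` 打分函数与长度比阈值均为演示假设。

```python
def build_length_controlled_pairs(responses, diversity, quality, max_len_ratio=1.1):
    """responses: 候选回复列表;diversity/quality: 回复->分数的占位打分函数。"""
    pairs = []
    for i, a in enumerate(responses):
        for b in responses[i + 1:]:
            la, lb = len(a.split()), len(b.split())
            if max(la, lb) / max(1, min(la, lb)) > max_len_ratio:
                continue  # 长度必须相近,避免偏好信号与长度混淆
            chosen, rejected = (a, b) if diversity(a) >= diversity(b) else (b, a)
            if quality(chosen) >= quality(rejected):  # 不以牺牲质量换多样性
                pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs
```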

[NLP-115] Three Minds One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在逻辑能力增强后可能引入更严重的安全漏洞的问题,以及现有越狱方法在对抗自适应安全机制时难以平衡效果与鲁棒性的挑战。解决方案的关键在于提出SEAL,一种通过自适应加密流程对LRMs进行越狱攻击的新方法,其核心是采用堆叠加密技术,结合多种密码算法以淹没模型的推理能力,从而有效绕过内置的安全机制,并通过随机和自适应两种动态策略调整密钥长度、顺序和组合,防止模型发展出应对措施。

链接: https://arxiv.org/abs/2505.16241
作者: Viet-Anh Nguyen,Shiqian Zhao,Gia Dao,Runyi Hu,Yi Xie,Luu Anh Tuan
机构: Nanyang Technological University (南洋理工大学); University of Texas at Arlington (德克萨斯大学阿灵顿分校); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies - random and adaptive - that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.
zh

[NLP-116] Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation

【速读】: 该论文旨在解决图增强生成(Graph-based RAG)系统中存在的两个核心问题:一是检索过程中引入的无关节点导致输入过长,影响效率;二是图结构与语言模型之间的表征差异限制了图结构在理解中的有效利用。其解决方案的关键在于提出了一种基于推理引导的双对齐框架Align-GRAG,通过联合优化图编码器与语言模型摘要推理,利用KL散度损失和对比损失实现图节点与表示的双重对齐,从而高效过滤无关知识并建立统一语义空间,最终提升生成答案的准确性和连贯性。

链接: https://arxiv.org/abs/2505.16237
作者: Derong Xu,Pengyue Jia,Xiaopeng Li,Yingyi Zhang,Maolin Wang,Qidong Liu,Xiangyu Zhao,Yichao Wang,Huifeng Guo,Ruiming Tang,Enhong Chen,Tong Xu
机构: University of Science and Technology of China (中国科学技术大学); City University of Hong Kong (香港城市大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework in the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with LLM-summarized reasoning. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with LLM to produce coherent and accurate answers. Experiments on GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code will be available upon acceptance.
zh
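
摘要提到 Aligner 以 KL 散度损失与对比损失实现图节点与表示的双重对齐。下面的 PyTorch 草图给出这两项损失的一种常见写法(对比项采用按下标配对的 InfoNCE);张量含义与组合方式为演示假设,非论文官方实现。

```python
import torch
import torch.nn.functional as F

def align_losses(graph_emb, text_emb, graph_logits, llm_logits, tau=0.07):
    # KL 项:让图侧的节点相关性分布逼近 LLM 推理给出的分布
    kl = F.kl_div(F.log_softmax(graph_logits, -1), F.softmax(llm_logits, -1),
                  reduction="batchmean")
    # 对比项:对齐图表示与文本表示(InfoNCE,成对数据按下标一一对应)
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = g @ t.T / tau
    labels = torch.arange(sim.size(0), device=sim.device)
    contrast = F.cross_entropy(sim, labels)
    return kl, contrast
```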

[NLP-117] LIFEBench: Evaluating Length Instruction Following in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在遵循明确长度指令方面的不足,例如生成指定字数的文本。现有评估基准主要关注生成内容的质量,而忽视了输出是否符合长度要求。为此,研究者提出了Length Instruction Following Evaluation Benchmark (LIFEBench),其关键在于构建一个涵盖多种任务类别和广泛长度约束的综合性评估基准,以全面衡量LLMs在遵循长度指令方面的能力。

链接: https://arxiv.org/abs/2505.16234
作者: Wei Zhang,Zhenhong Zhou,Junfeng Fang,Rongwu Xu,Kun Wang,Yuanhe Zhang,Rui Wang,Ge Zhang,Xinfeng Li,Li Sun,Lingjuan Lyu,Yang Liu,Sen Su
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 81 pages, 22 tables, 32 figures. Homepage: this https URL

点击查看摘要

Abstract:While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction-following ability, offering critical insights for future progress.
zh

[NLP-118] MuseRAG: Idea Originality Scoring At Scale

【速读】: 该论文旨在解决如何在大规模数据中自动化评估创意想法原创性的问题,传统方法依赖人工对想法进行重新表述分组,存在劳动强度大、易出错且在大规模语料中表现脆弱的缺点。解决方案的关键在于提出一种完全自动化的心理测量验证流程——MuseRAG,该方法结合了大型语言模型(Large Language Models, LLMs)与外部协调的检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过检索语义相似的先前想法桶并零样本提示LLM判断新想法是否属于现有桶或形成新桶,从而实现基于频率的原创性评分。

链接: https://arxiv.org/abs/2505.16232
作者: Ali Sarosh Bangash,Krish Veera,Ishfat Abrar Islam,Raiyan Abdul Baten
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An objective, face-valid way to assess the originality of creative ideas is to measure how rare each idea is within a population – an approach long used in creativity research but difficult to automate at scale. Tabulating response frequencies via manual bucketing of idea rephrasings is labor-intensive, error-prone, and brittle under large corpora. We introduce a fully automated, psychometrically validated pipeline for frequency-based originality scoring. Our method, MuseRAG, combines large language models (LLMs) with an externally orchestrated retrieval-augmented generation (RAG) framework. Given a new idea, the system retrieves semantically similar prior idea buckets and zero-shot prompts the LLM to judge whether the new idea belongs to an existing bucket or forms a new one. The resulting buckets enable computation of frequency-based originality metrics. Across five datasets (N=1143, n_ideas=16294), MuseRAG matches human annotators in idea clustering structure and resolution (AMI = 0.59) and in participant-level scoring (r = 0.89) – while exhibiting strong convergent and external validity. Our work enables intent-sensitive, human-aligned originality scoring at scale to aid creativity research.

[NLP-119] Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

【Quick Read】: This paper tackles the personalization of jargon detection and explanation in technical documents, so that readers from different disciplinary backgrounds can access them more easily. Conventional per-user fine-tuning demands substantial annotated data and compute, making real-world deployment difficult. The key to the solution is two efficient personalization strategies: lightweight fine-tuning of open-source models with Low-Rank Adaptation (LoRA), and personalized prompting at inference time that requires no retraining. The study also explores hybrid approaches that combine limited annotated data with unsupervised user-background signals, markedly reducing resource consumption while improving model performance.

Link: https://arxiv.org/abs/2505.16227
Authors: Bohao Wu,Qingyun Wang,Yue Guo
Affiliations: University of Illinois at Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study offers the first work to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.

[NLP-120] Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

【Quick Read】: This paper asks whether large language models (LLMs), when used for code evaluation, can judge semantically equivalent code with superficial differences fairly and robustly. The key to the solution is defining six types of potential bias in code evaluation and systematically analyzing their impact on LLM judges, exposing the fairness and reliability problems current LLMs face when the surface form of code changes.

Link: https://arxiv.org/abs/2505.16222
Authors: Jiwon Moon,Yerin Hwang,Dongryeol Lee,Taegwan Kang,Yongil Kim,Kyomin Jung
Affiliations: IPAI, Seoul National University; Dept. of ECE, Seoul National University; LG AI Research
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 26 pages

Click to view abstract

Abstract:With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations, such as differences in variable names, comments, or formatting, that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.
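
A simple way to probe such biases is to score one solution before and after semantics-preserving edits. A minimal sketch follows; `llm_judge_score` is a hypothetical placeholder for a real LLM-as-judge call, not an API from the paper.

```python
# Bias-probe sketch: build semantics-preserving variants of the same code.
# A robust judge should score all variants identically.
import re

def rename_variables(code, mapping):
    for old, new in mapping.items():
        code = re.sub(rf"\b{old}\b", new, code)
    return code

def strip_comments(code):
    return "\n".join(l for l in code.splitlines()
                     if not l.strip().startswith("#"))

def llm_judge_score(code):
    raise NotImplementedError("plug in your judge model here")  # hypothetical

original = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
variants = {
    "original": original,
    "renamed": rename_variables(original, {"a": "x", "b": "y"}),
    "no_comments": strip_comments(original),
}
for name, v in variants.items():
    print(f"--- {name} ---\n{v}")
```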

[NLP-121] Memorization or Reasoning? Exploring the Idiom Understanding of LLMs

【Quick Read】: This paper addresses the limited understanding of how large language models (LLMs) process idioms in multilingual settings. The key to the solution is MIDAS, a large-scale dataset of idioms in six languages paired with their meanings, which enables a systematic evaluation of LLMs' idiom-processing ability and the identification of key factors that influence performance. The findings show that LLMs rely not only on memorization but also adopt a hybrid approach integrating contextual cues and reasoning, especially for compositional idioms, suggesting that idiom understanding arises from an interplay between internal knowledge retrieval and reasoning-based inference.

Link: https://arxiv.org/abs/2505.16216
Authors: Jisu Kim,Youngwoo Shin,Uiji Hwang,Jihun Choi,Richeng Xuan,Taeuk Kim
Affiliations: Hanyang University; Sony AI; Beijing Academy of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Idioms have long posed a challenge due to their unique linguistic properties, which set them apart from other common expressions. While recent studies have leveraged large language models (LLMs) to handle idioms across various tasks, e.g., idiom-containing sentence generation and idiomatic machine translation, little is known about the underlying mechanisms of idiom processing in LLMs, particularly in multilingual settings. To this end, we introduce MIDAS, a new large-scale dataset of idioms in six languages, each paired with its corresponding meaning. Leveraging this resource, we conduct a comprehensive evaluation of LLMs’ idiom processing ability, identifying key factors that influence their performance. Our findings suggest that LLMs rely not only on memorization, but also adopt a hybrid approach that integrates contextual cues and reasoning, especially when processing compositional idioms. This implies that idiom understanding in LLMs emerges from an interplay between internal knowledge retrieval and reasoning-based inference.

[NLP-122] Large Language Models based ASR Error Correction for Child Conversations INTERSPEECH2025

【Quick Read】: This paper addresses the persistent challenge of accurately transcribing children's conversational speech with automatic speech recognition (ASR), focusing on how large language models (LLMs) can correct ASR errors. The key to the solution is probing LLMs' correction ability over different ASR outputs, including zero-shot and fine-tuned CTC-based ASR as well as fine-tuned autoregressive ASR such as Whisper, and analyzing their limitations when contextual information is introduced.

Link: https://arxiv.org/abs/2505.16212
Authors: Anfeng Xu,Tiantian Feng,So Hyun Kim,Somer Bishop,Catherine Lord,Shrikanth Narayanan
Affiliations: Viterbi School of Engineering; School of Psychology; Weill Institute for Neurosciences; Semel Institute of Neuroscience and Human Behavior
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025

Click to view abstract

Abstract:Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children’s speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children’s conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
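
A minimal sketch of the prompting setup such a study might use is shown below; the prompt wording and the optional context field are our own assumptions, and any chat-completion API can be plugged in to consume the prompt.

```python
# Prompt-construction sketch for LLM-based ASR error correction.
def build_correction_prompt(hypothesis, context=None):
    prompt = (
        "The following is an automatic transcript of a child's conversational "
        "speech and may contain recognition errors. Return only the corrected "
        f"transcript.\n\nTranscript: {hypothesis}"
    )
    if context:  # the paper finds added context can make correction harder
        prompt += f"\n\nPreceding conversation: {context}"
    return prompt

print(build_correction_prompt("i goed to the park wif my mom"))
```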

[NLP-123] AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

【Quick Read】: This paper addresses the shortage of trustworthiness evaluation for audio large language models (ALLMs), especially the lack of systematic study of risks unique to the audio modality. Existing evaluation frameworks mainly target the text modality or cover only a limited set of safety dimensions, failing to account for the distinctive properties and application scenarios of audio. The key to the solution is AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark designed specifically for ALLMs: through 18 experimental setups and a dataset of over 4,420 real-world audio/text samples, it comprehensively assesses six core dimensions (fairness, hallucination, safety, privacy, robustness, and authentication), combining 9 audio-specific metrics with a large-scale automated scoring pipeline to objectively and scalably reveal the trustworthiness boundaries and limitations of current mainstream ALLMs in high-risk audio scenarios.

Link: https://arxiv.org/abs/2505.16211
Authors: Kai Li,Can Shen,Yile Liu,Jirui Han,Kelong Zheng,Xuechao Zou,Zhe Wang,Xingjian Du,Shun Zhang,Hanjun Luo,Yingbin Jin,Xinxin Xing,Ziyang Ma,Yue Liu,Xiaojun Jia,Yifan Zhang,Junfeng Fang,Kun Wang,Yibo Yan,Haoyang Li,Yiming Li,Xiaobin Zhuang,Yang Liu,Haibo Hu,Zhuo Chen,Zhizheng Wu,Xiaolin Hu,Eng-Siong Chng,XiaoFeng Wang,Wenyuan Xu,Wei Dong,Xinfeng Li
Affiliations: Nanyang Technological University; Tsinghua University; BNBU; Waseda University; Independent Researcher; HUST; BJTU; Hong Kong Polytechnic University; University of Rochester; QHU; Zhejiang University; Shanghai Jiao Tong University; National University of Singapore; CAS; Hong Kong University of Science and Technology (Guangzhou); Bytedance; The Chinese University of Hong Kong (Shenzhen); ACM Member
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Technical Report

Click to view abstract

Abstract:The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at this https URL.

[NLP-124] NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

【Quick Read】: This paper addresses the excessive memory consumption of the key-value (KV) cache during large language model (LLM) inference, driven by larger batch sizes or longer context lengths, which has become a major bottleneck for LLM deployment. The key to the solution is low-bit quantization of the KV cache: after analyzing the distribution of KV-cache elements, the authors design NQKV, which applies per-block quantile quantization to achieve information-theoretically optimal quantization error, saving memory and improving inference efficiency without significantly hurting output quality.

Link: https://arxiv.org/abs/2505.16210
Authors: Zhihang Cai,Xingjun Zhang,Zhendong Tan,Zheng Wei
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 9 figures

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.
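
To illustrate the core idea, per-block quantile quantization under a normality assumption, here is a minimal NumPy sketch. The midpoint-quantile codebook approximates the quantile scheme the paper describes; the block size, bit width, and storage layout are illustrative, and the actual kernels are not reproduced.

```python
# Per-block quantile quantization sketch under the normality assumption.
import numpy as np
from statistics import NormalDist

def make_levels(bits):
    """Codebook at the quantile midpoints of a standard normal."""
    n = 2 ** bits
    qs = (np.arange(n) + 0.5) / n
    return np.array([NormalDist().inv_cdf(q) for q in qs])

def quantize_block(block, bits=4):
    mu, sigma = block.mean(), block.std() + 1e-8
    z = (block - mu) / sigma                         # standardize the block
    idx = np.abs(z[:, None] - make_levels(bits)[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), mu, sigma           # codes + per-block stats

def dequantize_block(idx, mu, sigma, bits=4):
    return make_levels(bits)[idx] * sigma + mu

block = np.random.randn(64).astype(np.float32)       # one toy KV-cache block
codes, mu, sigma = quantize_block(block)
err = np.abs(dequantize_block(codes, mu, sigma) - block).mean()
print(f"mean abs reconstruction error at 4 bits: {err:.4f}")
```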

[NLP-125] An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

【Quick Read】: This paper tackles the poor performance of multimodal sentiment analysis (MSA) under the zero-shot paradigm, i.e., current multimodal large language models (MLLMs) struggle to handle MSA without fine-tuning. The key to the solution is extending the zero-shot paradigm to in-context learning (ICL) and studying demonstration configuration in depth, optimizing the three key factors of demonstration retrieval, presentation, and distribution, while also discovering and effectively counteracting a sentiment prediction bias inherent in MLLMs.

Link: https://arxiv.org/abs/2505.16193
Authors: Daiqing Wu,Dongbao Yang,Sicheng Zhao,Can Ma,Yu Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competent as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations’ retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.

[NLP-126] The Language of Interoception: Examining Embodiment and Emotion Through a Corpus of Body Part Mentions

【Quick Read】: This paper presents the first systematic study of the relationship between emotion, embodiment, and everyday language in large-scale natural language data. The key to the solution is constructing corpora of body part mentions (BPMs) and, using word-emotion lexicons and human-annotated data, analyzing how BPMs are used in text and how they relate to emotional intensity and health outcomes. The study finds that BPMs are fairly common in personal narratives and tweets, that their usage patterns vary markedly over time and across geographic regions, and that body-related language is strongly and statistically significantly correlated with a variety of poorer health outcomes.

Link: https://arxiv.org/abs/2505.16189
Authors: Sophie Wu,Jan Philip Wahle,Saif M. Mohammad
Affiliations: McGill University; University of Göttingen; National Research Council Canada
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 26 figures

Click to view abstract

Abstract:This paper is the first investigation of the connection between emotion, embodiment, and everyday language in a large sample of natural language data. We created corpora of body part mentions (BPMs) in online English text (blog posts and tweets). This includes a subset featuring human annotations for the emotions of the person whose body part is mentioned in the text. We show that BPMs are common in personal narratives and tweets (~5% to 10% of posts include BPMs) and that their usage patterns vary markedly by time and geographic location. Using word-emotion association lexicons and our annotated data, we show that text containing BPMs tends to be more emotionally charged, even when the BPM is not explicitly used to describe a physical reaction to the emotion in the text. Finally, we discover a strong and statistically significant correlation between body-related language and a variety of poorer health outcomes. In sum, we argue that investigating the role of body-part related words in language can open up valuable avenues of future research at the intersection of NLP, the affective sciences, and the study of human wellbeing.
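
A tiny sketch of such a corpus pipeline, flagging posts that mention body parts and scoring them against a word-emotion lexicon, is shown below; both word lists are miniature illustrative stand-ins for the real BPM list and lexicons.

```python
# Corpus-pipeline sketch: detect body part mentions and score emotion words.
BODY_PARTS = {"hand", "hands", "heart", "head", "stomach", "eyes"}
EMOTION_LEXICON = {"afraid": 0.8, "happy": 0.7, "sick": 0.6, "calm": 0.2}

def analyze(post):
    tokens = [t.strip(".,!?") for t in post.lower().split()]
    has_bpm = any(t in BODY_PARTS for t in tokens)
    emotion = sum(EMOTION_LEXICON.get(t, 0.0) for t in tokens)
    return has_bpm, emotion

posts = ["My heart was racing, I was so afraid.", "Nice calm morning today."]
for p in posts:
    print(analyze(p), "<-", p)
```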

[NLP-127] SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

【Quick Read】: This paper addresses the unreliability of controlling large language model (LLM) behavior in open-ended generation settings. The key to the solution is a supervised steering method that operates in a sparse, interpretable representation space: sparse autoencoders (SAEs) yield sparse latent representations that disentangle semantic attributes; linear classifiers are trained to identify a small subspace of task-relevant dimensions; and supervised steering vectors constrained to this subspace are learned and optimized to align with target behaviors. Experiments show higher success rates across tasks with minimal degradation in generation quality.

Link: https://arxiv.org/abs/2505.16188
Authors: Zirui He,Mingyu Jin,Bo Shen,Ali Payani,Yongfeng Zhang,Mengnan Du
Affiliations: NJIT; Rutgers University; Cisco
Subjects: Computation and Language (cs.CL)
Comments: 30 pages, 24 figures, 12 tables

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and politics polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.
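
The pipeline (sparse latents, then a linear probe, then a subspace-constrained steering vector) can be sketched on synthetic data as follows; the class-mean probe, latent width, subspace size, and steering strength are illustrative simplifications of the paper's trained classifier and optimized vector.

```python
# Subspace-constrained steering sketch on synthetic SAE latents.
import numpy as np

rng = np.random.default_rng(0)
d = 128                                        # assumed SAE latent width
pos = rng.normal(size=(200, d)); pos[:, :5] += 2.0   # planted relevant dims
neg = rng.normal(size=(200, d))

w = pos.mean(0) - neg.mean(0)                  # cheap probe direction (stand-in)
subspace = np.argsort(np.abs(w))[-8:]          # small task-relevant subspace

steer = np.zeros(d)
steer[subspace] = w[subspace]                  # vector supported on the subspace
steer /= np.linalg.norm(steer)

latent = rng.normal(size=d)                    # an SAE latent at inference time
steered = latent + 4.0 * steer                 # intervene, then decode via SAE
print("steered dims:", sorted(int(i) for i in subspace))
```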

[NLP-128] SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

【Quick Read】: This paper targets the safety of large reasoning models (LRMs) against harmful queries and adversarial attacks, in particular the weak generalization of existing supervised fine-tuning (SFT)-based methods to unseen jailbreak prompts. The key to the solution is identifying a "safety aha moment", which typically appears in the model's "key sentence" following its query-understanding process and can activate safety reasoning that leads to a safe response. To strengthen this mechanism, the authors propose SafeKey with two complementary objectives: (1) a Dual-Path Safety Head that amplifies the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective that sharpens the model's attention to query understanding, improving its ability to pick up safety cues.

Link: https://arxiv.org/abs/2505.16186
Authors: Kaiwen Zhou,Xuandong Zhao,Gaowen Liu,Jayanth Srinivasa,Aosong Feng,Dawn Song,Xin Eric Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While recent mainstream safety efforts on LRMs, supervised fine-tuning (SFT), improve safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs’ generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows models’ query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model’s internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the models’ attention on its query understanding, which has important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.

[NLP-129] Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics

【Quick Read】: This paper addresses the joint evaluation of visual semantics and linguistic pragmatics in image-caption assessment, a complexity most existing metrics do not fully capture. The key to the solution is Redemption Score, a hybrid framework that triangulates three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) BERTScore for contextual text similarity against human references. A calibrated fusion of these signals lets Redemption Score provide a more holistic assessment.

Link: https://arxiv.org/abs/2505.16180
Authors: Ashim Dahal,Ankit Ghimire,Saydul Akbar Murad,Nick Rahimi
Affiliations: The University of Southern Mississippi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Evaluating image captions requires cohesive assessment of both visual semantics and language pragmatics, which is often not entirely captured by most metrics. We introduce Redemption Score, a novel hybrid framework that ranks image captions by triangulating three complementary signals: (1) Mutual Information Divergence (MID) for global image-text distributional alignment, (2) DINO-based perceptual similarity of cycle-generated images for visual grounding, and (3) BERTScore for contextual text similarity against human references. A calibrated fusion of these signals allows Redemption Score to offer a more holistic assessment. On the Flickr8k benchmark, Redemption Score achieves a Kendall-τ of 56.43, outperforming twelve prior methods and demonstrating superior correlation with human judgments without requiring task-specific training. Our framework provides a more robust and nuanced evaluation by effectively redeeming image semantics and linguistic interpretability indicated by strong transfer of knowledge in the Conceptual Captions and MS COCO datasets.
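
A minimal sketch of how three such signals might be calibrated and fused to rank captions is shown below; the min-max calibration and the weights are illustrative assumptions, and the component scorers (MID, cycle-image DINO similarity, BERTScore) are assumed to be computed upstream.

```python
# Three-signal fusion sketch for ranking image captions.
import numpy as np

def fuse(scores, weights=(0.4, 0.3, 0.3)):
    """scores: caption -> (MID, DINO similarity, BERTScore), higher = better."""
    names = list(scores)
    m = np.array([scores[n] for n in names], dtype=float)     # captions x 3
    m = (m - m.min(0)) / (m.max(0) - m.min(0) + 1e-8)         # calibrate signals
    return dict(zip(names, np.round(m @ np.array(weights), 3)))

candidates = {"caption A": (0.62, 0.71, 0.88), "caption B": (0.40, 0.55, 0.90)}
print(fuse(candidates))   # higher fused value = better-ranked caption
```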

[NLP-130] Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

【Quick Read】: This paper addresses language models' (LMs) weakness in fact recall, i.e., generalizable retrieval of specific factual knowledge from large training corpora. The common two-stage training strategy easily pushes models toward rote memorization rather than genuine fact understanding, and while mixed training is empirically effective, its underlying mechanism remains unclear. The key to the solution is introducing a cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples, revealing that mixed training improves fact recall by enlarging and centralizing this set of shared parameters.

Link: https://arxiv.org/abs/2505.16178
Authors: Ying Zhang,Benjamin Heinzerling,Dongyuan Li,Ryoma Ishigaki,Yuta Hitomi,Kentaro Inui
Affiliations: RIKEN Center for Advanced Intelligence Project; Tohoku University; The University of Tokyo; Tokyo Denki University; Alt Inc; MBZUAI
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior with two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tending to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of such shared parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.
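
The cross-task gradient trace can be sketched as an intersection of top-k parameter sets by accumulated gradient magnitude; the synthetic magnitudes below stand in for gradients accumulated over fact-storing and fact-recalling batches.

```python
# Cross-task gradient trace sketch: parameters strongly influenced by BOTH
# example types count as "shared". The |gradient| vectors are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_params = 10_000
g_store = np.abs(rng.normal(size=n_params))    # |grad| from fact statements
g_recall = np.abs(rng.normal(size=n_params))   # |grad| from QA pairs
g_store[:500] = 5.0                            # pretend some params serve both
g_recall[:500] = 5.0

def top_k(grad, k=500):
    return set(np.argsort(grad)[-k:])

shared = top_k(g_store) & top_k(g_recall)
print(f"shared parameters: {len(shared)} of the top-500 per task")
```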

[NLP-131] Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning

【Quick Read】: This paper addresses the lack of adaptability in traditional data-selection methods for reasoning tasks, which rely on external, predefined static metrics (such as difficulty and diversity) that are designed for supervised fine-tuning (SFT) and fit poorly with continuous training. The key to the solution is the SAI-DPO algorithm, which dynamically selects training data by continuously assessing the model's stage-specific reasoning abilities across training phases and adaptively adjusting data selection with real-time performance feedback, improving both data-utilization efficiency and final task performance.

Link: https://arxiv.org/abs/2505.16176
Authors: Jun Rao,Xuebo Liu,Hexuan Deng,Zepeng Lin,Zixiong Yu,Jiansheng Wei,Xiaojun Meng,Min Zhang
Affiliations: Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; Huawei Noah’s Ark Lab
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:In the realm of data selection for reasoning tasks, existing approaches predominantly rely on externally predefined static metrics such as difficulty and diversity, which are often designed for supervised fine-tuning (SFT) and lack adaptability to continuous training processes. A critical limitation of these methods is their inability to dynamically align with the evolving capabilities of models during online training, a gap that becomes increasingly pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). To address this, we introduce SAI-DPO, an algorithm that dynamically selects training data by continuously assessing a model’s stage-specific reasoning abilities across different training phases. By integrating real-time model performance feedback, SAI-DPO adaptively adjusts data selection to the evolving strengths and weaknesses of the model, thus enhancing both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including challenging competition-level datasets (e.g., AIME24 and AMC23), demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with particularly notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results highlight the superiority of dynamic, model-adaptive data selection over static, externally defined strategies in advancing reasoning.

[NLP-132] Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss

【Quick Read】: This paper addresses the comprehension problems caused when generative AI algorithms delete crucial information while simplifying health information. The key to the solution is identifying the entities missing from simplified text and adding them back; experiments show that reinserting all missing entities regenerates text of better quality than adding only top-ranked entities or randomly chosen entities or words. Current tools can identify these entities but remain weak at ranking their importance.

Link: https://arxiv.org/abs/2505.16172
Authors: Abhay Kumara Sri Krishna Nandiraju,Gondy Leroy,David Kauchak,Arif Ahmed
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Understanding health information is essential in achieving and maintaining a healthy life. We focus on simplifying health information for better understanding. With the availability of generative AI, the simplification process has become efficient and of reasonable quality, however, the algorithms remove information that may be crucial for comprehension. In this study, we use generative AI to detect missing information in simplified text, evaluate its importance, and fix the text with the missing information. We collected 50 health information texts and simplified them using gpt-4-0613. We compare five approaches to identify missing elements and regenerate the text by inserting the missing elements. These five approaches involve adding missing entities and missing words in various ways: 1) adding all the missing entities, 2) adding all missing words, 3) adding the top-3 entities ranked by gpt-4-0613, and 4, 5) serving as controls for comparison, adding randomly chosen entities or words. We use cosine similarity and ROUGE scores to evaluate the semantic similarity and content overlap between the original, simplified, and reconstructed simplified text. We do this for both summaries and full text. Overall, we find that adding missing entities improves the text. Adding all the missing entities resulted in better text regeneration, which was better than adding the top-ranked entities or words, or random words. Current tools can identify these entities, but are not valuable in ranking them.

[NLP-133] When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

【Quick Read】: This paper studies whether large language models (LLMs) can acknowledge and retract previously generated answers when they should know better, i.e., the mechanisms and factors behind "retraction" behavior. The key insight is how LLMs' internal belief shapes whether they retract wrong answers, with steering experiments verifying that internal belief causally influences retraction. The study further shows that simple supervised fine-tuning can substantially improve retraction performance by helping models form more accurate internal beliefs.

Link: https://arxiv.org/abs/2505.16170
Authors: Yuqing Yang,Robin Jia
Affiliations: University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as “retraction” and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models’ internal belief: models fail to retract wrong answers that they “believe” to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on this https URL.

[NLP-134] Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task

【Quick Read】: This paper probes the limits of generative AI in simulating human cognitive-behavioral variability, focusing on whether language models can reproduce individual differences in the phonemic fluency task. By comparing the outputs of 34 model configurations (varying prompt specificity, sampling temperature, and model type) with responses from 106 human participants, the study finds that although some configurations (e.g., Claude 3.7 Sonnet) match human averages and lexical preferences, none reproduce the diversity of human behavior. The key finding is that generative AI outputs fall short in diversity and structural flexibility and differ fundamentally from humans in retrieval structure.

Link: https://arxiv.org/abs/2505.16164
Authors: Mengyang Qiu,Zoe Brisebois,Siena Sun
Affiliations: Trent University; Saint Elizabeth University
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.

[NLP-135] KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

【Quick Read】: This paper addresses the sharp degradation of speculative decoding (SD) acceleration under domain shift, namely the sensitivity of layer-skipping drafting to changes in input domain. The key to the solution is the KNN-SSD algorithm, which uses K-Nearest Neighbor (KNN) search to match different skipped-layer sets with inputs from various domains, improving the domain generalizability of this paradigm.

Link: https://arxiv.org/abs/2505.16162
Authors: Mingbo Song,Heming Xia,Jun Zhang,Chak Tou Leong,Qiancheng Xu,Wenjie Li,Sujian Li
Affiliations: Peking University; The Hong Kong Polytechnic University; Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments: 8 pages

Click to view abstract

Abstract:Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm across various models and multiple tasks, observing that its application leads to a 1.3x-1.6x speedup in LLM inference.
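
The matching step can be sketched as a nearest-neighbor lookup from a prompt embedding to a stored domain's skipped-layer set (k=1 here for brevity); the embeddings, domains, and layer sets below are random, illustrative stand-ins.

```python
# Nearest-domain lookup sketch: route a prompt embedding to the skipped-layer
# set of its closest stored domain.
import numpy as np

rng = np.random.default_rng(0)
domains = {
    "news": (rng.normal(size=32), [3, 7, 11, 15]),   # (centroid, skip set)
    "code": (rng.normal(size=32), [2, 5, 9, 21]),
    "math": (rng.normal(size=32), [4, 8, 13, 19]),
}

def pick_skip_layers(prompt_emb):
    nearest = min(domains, key=lambda d: np.linalg.norm(prompt_emb - domains[d][0]))
    return nearest, domains[nearest][1]

query = domains["code"][0] + 0.1 * rng.normal(size=32)   # a code-like prompt
print(pick_skip_layers(query))                           # -> ('code', [2, 5, 9, 21])
```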

[NLP-136] EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

【Quick Read】: This paper addresses the under-exploration and under-optimization of large language models in educational scenarios. The key to the solution is constructing the first diverse benchmark tailored to education, covering 9 major scenarios and over 4,000 distinct educational contexts, together with a set of multi-dimensional evaluation metrics spanning 12 critical aspects relevant to both teachers and students, and human annotation to validate the model-generated evaluation responses. The authors also train a relatively small-scale model on the constructed dataset and show it performs comparably to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set.

Link: https://arxiv.org/abs/2505.16160
Authors: Bin Xu,Yu Bai,Huashan Sun,Yiguan Lin,Siming Liu,Xinyue Liang,Yaolin Li,Yang Gao,Heyan Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at this https URL.

[NLP-137] When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification

【Quick Read】: This paper addresses noisy and missing labels in image-classification benchmark datasets, which lead to misleading and unfair model comparison and evaluation. The key to the solution is REVEAL, a comprehensive framework that integrates advanced pre-trained vision-language models (e.g., LLaVA, BLIP, Janus, Qwen) with machine/human label-curation methods (e.g., Docta, Cleanlab, MTurk): it detects potential noisy and omitted labels, aggregates predictions from multiple methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering.

Link: https://arxiv.org/abs/2505.16149
Authors: Zirui Pang,Haosheng Tan,Yuhan Pu,Zhijie Deng,Zhouan Shen,Keyu Hu,Jiaheng Wei
Affiliations: University of Illinois Urbana-Champaign; University of Glasgow; Boston University; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Image classification benchmark datasets such as CIFAR, MNIST, and ImageNet serve as critical tools for model evaluation. However, despite the cleaning efforts, these datasets still suffer from pervasive noisy labels and often contain missing labels due to the co-existing image pattern where multiple classes appear in an image sample. This results in misleading model comparisons and unfair evaluations. Existing label cleaning methods focus primarily on noisy labels, but the issue of missing labels remains largely overlooked. Motivated by these challenges, we present a comprehensive framework named REVEAL, integrating state-of-the-art pre-trained vision-language models (e.g., LLaVA, BLIP, Janus, Qwen) with advanced machine/human label curation methods (e.g., Docta, Cleanlab, MTurk), to systematically address both noisy labels and missing label detection in widely-used image classification test sets. REVEAL detects potential noisy labels and omissions, aggregates predictions from various methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering. Additionally, we provide a thorough analysis of state-of-the-art vision-language models and pre-trained image classifiers, highlighting their strengths and limitations within the context of dataset renovation by revealing 10 observations. Our method effectively reveals missing labels from public datasets and provides soft-labeled results with likelihoods. Through human verifications, REVEAL significantly improves the quality of 6 benchmark test sets, aligning closely with human judgments and enabling more accurate and meaningful comparisons in image classification.
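
The consensus-based filtering step can be sketched as simple majority voting with an agreement threshold; the votes and the 0.75 threshold below are illustrative, not the paper's exact aggregation rule.

```python
# Majority-vote consensus sketch for multi-model label aggregation.
from collections import Counter

def consensus(votes, min_agree=0.75):
    label, n = Counter(votes).most_common(1)[0]
    share = n / len(votes)
    return (label if share >= min_agree else None), share

votes = {
    "img_001": ["cat", "cat", "cat", "cat"],      # confident consensus
    "img_002": ["dog", "cat", "deer", "dog"],     # stays flagged for review
}
for img, v in votes.items():
    print(img, consensus(v))
```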

[NLP-138] NAN: A Training-Free Solution to Coefficient Estimation in Model Merging

【Quick Read】: This paper addresses the limited scalability and generality of model merging caused by heuristic choices of merging coefficients. The key to the solution is revisiting model merging through the lens of least-squares optimization and proposing NAN, which estimates merging coefficients via the inverse of each model's parameter norm, yielding a training-free, plug-and-play scheme applicable to a wide range of merging strategies.

Link: https://arxiv.org/abs/2505.16148
Authors: Chongjie Si,Kangtao Lv,Jingjing Jiang,Yadao Wang,Yongwei Wang,Xiaokang Yang,Wenbo Su,Bo Zheng,Wei Shen
Affiliations: Shanghai Jiao Tong University; Zhejiang University; Alibaba Group
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Model merging offers a training-free alternative to multi-task learning by combining independently fine-tuned models into a unified one without access to raw data. However, existing approaches often rely on heuristics to determine the merging coefficients, limiting their scalability and generality. In this work, we revisit model merging through the lens of least-squares optimization and show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. Based on this insight, we propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies. Extensive experiments show that NAN consistently improves the performance of baseline methods.
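
Because the rule is so simple, it fits in a few lines: coefficients proportional to the inverse parameter norm of each fine-tuned model. Normalizing the coefficients to sum to one is our assumption, and the toy parameter vectors are illustrative.

```python
# NAN sketch: inverse-parameter-norm merging coefficients on toy models.
import numpy as np

rng = np.random.default_rng(0)
models = [rng.normal(size=1000) * s for s in (0.5, 1.0, 2.0)]  # 3 task models

norms = np.array([np.linalg.norm(m) for m in models])
coeffs = (1.0 / norms) / (1.0 / norms).sum()     # inverse-norm weights
merged = sum(c * m for c, m in zip(coeffs, models))

print("coefficients:", coeffs.round(3))          # smaller norm => larger weight
```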

[NLP-139] Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

【Quick Read】: This paper targets hallucinations in large vision-language models (LVLMs) on multimodal tasks, i.e., generated text inconsistent with the visual input, which poses serious risks in real-world applications. The key to the solution is using sparse autoencoders (SAEs) to identify semantic directions closely tied to hallucination or faithfulness, yielding more precise and direct hallucination-related representations. Intervening along these directions can mitigate hallucinations without harming normal semantics. Building on this, the authors propose SSL, a training-free method that steers LVLMs via SAE latent directions; experiments show it outperforms existing decoding methods at reducing hallucinations while remaining transferable across model architectures with negligible extra overhead.

Link: https://arxiv.org/abs/2505.16146
Authors: Zhenglin Hua,Jinghan He,Zijun Yao,Tianxu Han,Haiyun Guo,Yuheng Jia,Junfeng Fang
Affiliations: Southeast University; Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University; Wuhan University of Technology; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs’ internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.

[NLP-140] Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

【Quick Read】: This paper addresses a problem in distilling reasoning paths from teacher models into smaller large language models (LLMs) via supervised fine-tuning (SFT): teacher-generated reasoning paths reflect only surface-level traces and miss the complex structure of authentic reasoning. The key to the solution is RLKD, a reinforcement learning (RL)-based distillation framework whose core is a novel Generative Structure Reward Model (GSRM) that converts reasoning paths into multiple meta-reasoning-solving steps and computes structural-alignment rewards, guiding the student model to internalize the teacher's implicit multi-branch reasoning structure rather than merely imitating fixed output paths.

Link: https://arxiv.org/abs/2505.16142
Authors: Shicheng Xu,Liang Pang,Yunchang Zhu,Jia Gu,Zihao Wei,Jingcheng Deng,Feiyang Pan,Huawei Shen,Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Huawei Inc.
Subjects: Computation and Language (cs.CL)
Comments: 15 pages

Click to view abstract

Abstract:Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher’s reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher’s implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

[NLP-141] Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models

【Quick Read】: This paper studies positional bias in large language models (LLMs), the systematic neglect of information at specific context positions, and its interplay with linguistic diversity. The key to the solution is a cross-linguistic study that relates positional bias to model uncertainty, syntactic structure, and prompt engineering, revealing the model-driven nature of positional bias and its variation across languages, and examining how aligning context placement with a model's bias affects its behavior.

Link: https://arxiv.org/abs/2505.16134
Authors: Menschikov Mikhail,Alexander Kharitonov,Maiia Kotyga,Vadim Porvatov,Anna Zhukovskaya,David Kagramanyan,Egor Shvetsov,Evgeny Burnaev
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models exhibit positional bias – systematic neglect of information at specific context positions – yet its interplay with linguistic diversity remains poorly understood. We present a cross-linguistic study across five typologically distinct languages (English, Russian, German, Hindi, Vietnamese), examining how positional bias interacts with model uncertainty, syntax, and prompting. Key findings: (1) Positional bias is model-driven, with language-specific variations – Qwen2.5-7B favors late positions, challenging assumptions of early-token bias; (2) Explicit positional guidance (e.g., "the correct context is at position X") reduces accuracy across languages, undermining prompt-engineering practices; (3) Aligning context with positional bias increases entropy, yet minimal entropy does not predict accuracy. (4) We further uncover that LLMs impose dominant word order differently in free-word-order languages like Hindi.

[NLP-142] LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods ACL

【Quick Read】: This paper addresses the low segment-level correlation between direct scoring methods and human judgments in machine translation quality estimation (MTQE). The key to the solution is a generation-based evaluation paradigm: decoder-only large language models (LLMs) first generate high-quality references, and the hypothesis is then scored by semantic similarity using sentence embeddings, improving evaluation accuracy.

Link: https://arxiv.org/abs/2505.16129
Authors: Hyang Cui
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures, 2 tables. Conforms to the ACL Rolling Review (ARR) short paper track. Code and data available at: this https URL

Click to view abstract

Abstract:Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
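
The recipe reduces to generate-a-reference, then embed-and-compare. A minimal sketch follows; `generate_reference` and `embed` are hypothetical stand-ins for a decoder-only LLM call and a sentence-embedding model, with toy functions at the bottom so the sketch runs end to end.

```python
# Generate-then-compare sketch for reference-free MT quality estimation.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def quality_estimate(source, hypothesis, generate_reference, embed):
    reference = generate_reference(source)    # LLM produces a fluent reference
    return cosine(embed(hypothesis), embed(reference))

toy_embed = lambda s: np.array([len(s), s.count(" "), s.count("e")], dtype=float)
toy_ref = lambda src: "the cat sat on the mat"
print(quality_estimate("die Katze sass auf der Matte",
                       "the cat sat on a mat", toy_ref, toy_embed))
```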

[NLP-143] Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning ACL2025

【Quick Read】: This paper examines the implicit biases large language models (LLMs) display toward different demographic groups, particularly when judging the veracity of solutions. The key lies in revealing two forms of veracity bias through multi-task experiments: Attribution Bias, where models disproportionately attribute correct solutions to certain demographic groups, and Evaluation Bias, where assessments of identical solutions vary with perceived demographic authorship. The study further verifies that these biases are deeply embedded in the models' reasoning processes, underscoring the potential risks of deploying LLMs in educational and evaluation settings.

Link: https://arxiv.org/abs/2505.16128
Authors: Yue Zhou,Barbara Di Eugenio
Affiliations: University of Illinois Chicago
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 (Main)

Click to view abstract

Abstract:Despite LLMs’ explicit alignment against demographic stereotypes, they have been shown to exhibit biases under various social contexts. In this work, we find that LLMs exhibit concerning biases in how they associate solution veracity with demographics. Through experiments across five human value-aligned LLMs on mathematics, coding, commonsense, and writing problems, we reveal two forms of such veracity biases: Attribution Bias, where models disproportionately attribute correct solutions to certain demographic groups, and Evaluation Bias, where models’ assessment of identical solutions varies based on perceived demographic authorship. Our results show pervasive biases: LLMs consistently attribute fewer correct solutions and more incorrect ones to African-American groups in math and coding, while Asian authorships are least preferred in writing evaluation. In additional studies, we show LLMs automatically assign racially stereotypical colors to demographic groups in visualization code, suggesting these biases are deeply embedded in models’ reasoning processes. Our findings indicate that demographic bias extends beyond surface-level stereotypes and social context provocations, raising concerns about LLMs’ deployment in educational and evaluation settings.

[NLP-144] KoBALT: Korean Benchmark For Advanced Linguistic Tasks

【Quick Read】: This paper addresses the lack of linguistic depth and typological grounding in existing benchmarks for evaluating large language models (LLMs), especially for Korean, a morphologically rich language. The key to the solution is an expert-curated, linguistically motivated benchmark, KoBALT, comprising 700 multiple-choice questions covering five linguistic domains (syntax, semantics, pragmatics, phonetics/phonology, and morphology), built with minimal n-gram overlap with standard Korean corpora to reduce the risk of data contamination and to assess models' true language understanding more accurately.

Link: https://arxiv.org/abs/2505.16125
Authors: Hyopil Shin,Sangah Lee,Dongjun Jang,Wooseok Song,Jaeyoon Kim,Chaeyoung Oh,Hyemi Jo,Youngchae Ahn,Sihyun Oh,Hyohyeong Chang,Sunkyoung Kim,Jinsik Lee
Affiliations: Seoul National University; LG AI Research
Subjects: Computation and Language (cs.CL)
Comments: Under Review

Click to view abstract

Abstract:We introduce KoBALT (Korean Benchmark for Advanced Linguistic Tasks), a comprehensive linguistically-motivated benchmark comprising 700 multiple-choice questions spanning 24 phenomena across five linguistic domains: syntax, semantics, pragmatics, phonetics/phonology, and morphology. KoBALT is designed to advance the evaluation of large language models (LLMs) in Korean, a morphologically rich language, by addressing the limitations of conventional benchmarks that often lack linguistic depth and typological grounding. It introduces a suite of expert-curated, linguistically motivated questions with minimal n-gram overlap with standard Korean corpora, substantially mitigating the risk of data contamination and allowing a more robust assessment of true language understanding. Our evaluation of 20 contemporary LLMs reveals significant performance disparities, with the highest-performing model achieving 61% general accuracy but showing substantial variation across linguistic domains - from stronger performance in semantics (66%) to considerable weaknesses in phonology (31%) and morphology (36%). Through human preference evaluation with 95 annotators, we demonstrate a strong correlation between KoBALT scores and human judgments, validating our benchmark’s effectiveness as a discriminative measure of Korean language understanding. KoBALT addresses critical gaps in linguistic evaluation for typologically diverse languages and provides a robust framework for assessing genuine linguistic competence in Korean language models.

[NLP-145] Semiotic Reconstruction of Destination Expectation Constructs: An LLM-Driven Computational Paradigm for Social Media Tourism Analytics

【Quick Read】: This paper addresses the lack of scalable methods for analyzing user-generated content (UGC) in tourism decision-making, aiming to better quantify user expectations. The key to the solution is a dual-method large language model (LLM) framework that combines unsupervised expectation extraction from UGC with survey-informed supervised fine-tuning, enabling precise identification and quantification of user expectations and informing improved tourism-analytics methodology and personalized-experience strategies.

Link: https://arxiv.org/abs/2505.16118
Authors: Haotian Lan,Yao Gao,Yujun Cheng,Wei Yuan,Kun Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Applications (stat.AP)
Comments: 33 pages, 6 figures

Click to view abstract

Abstract:Social media’s rise establishes user-generated content (UGC) as pivotal for travel decisions, yet analytical methods lack scalability. This study introduces a dual-method LLM framework: unsupervised expectation extraction from UGC paired with survey-informed supervised fine-tuning. Findings reveal leisure/social expectations drive engagement more than foundational natural/emotional factors. By establishing LLMs as precision tools for expectation quantification, we advance tourism analytics methodology and propose targeted strategies for experience personalization and social travel promotion. The framework’s adaptability extends to consumer behavior research, demonstrating computational social science’s transformative potential in marketing optimization.

[NLP-146] MPL: Multiple Programming Languages with Large Language Models for Information Extraction ACL2025

【Quick Read】: This paper notes that existing information extraction (IE) work relies mainly on Python to simulate code-style inputs, overlooking the potential of other widely used programming languages (such as C++ and Java) during the supervised fine-tuning (SFT) phase. The key to the solution is MPL (Multiple Programming Languages with large language models for information extraction), a framework that explores incorporating multiple programming languages in the SFT phase and introduces a function-prompt with virtual running to simulate code-style inputs more effectively and efficiently.

Link: https://arxiv.org/abs/2505.16107
Authors: Bo Li,Gexiang Fang,Wei Ye,Zhenghua Xu,Jinglei Zhang,Hao Cheng,Shikun Zhang
Affiliations: Hebei University of Technology; Peking University; School of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: Findings of ACL2025

Click to view abstract

Abstract:Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.
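
To show what a code-style function-prompt might look like in more than one programming language, here is an illustrative template builder; the exact prompt format MPL uses is not given in the abstract, so both templates are assumptions.

```python
# Illustrative builder for code-style "function-prompts" for IE.
def function_prompt(sentence: str, language: str = "java") -> str:
    templates = {
        "python": (
            'def extract_entities(text: str) -> list:\n'
            '    """Return (entity, type) pairs found in text."""\n'
            '    text = "{s}"\n'
            '    return '
        ),
        "java": (
            'List<Pair<String, String>> extractEntities(String text) {{\n'
            '    // Return (entity, type) pairs found in text.\n'
            '    text = "{s}";\n'
            '    return '
        ),
    }
    return templates[language].format(s=sentence)

print(function_prompt("Barack Obama was born in Hawaii.", language="python"))
print(function_prompt("Barack Obama was born in Hawaii.", language="java"))
```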

[NLP-147] Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models ACL2025

【Quick Read】: This paper addresses the drop in safety performance that large vision-language models (LVLMs) suffer after network-pruning compression. The key to the solution is a lightweight Hierarchical Safety Realignment (HSR) method, which first quantifies each attention head's contribution to safety, identifies the most critical heads, and then selectively restores the neurons within them that play a pivotal role in maintaining safety, thereby realigning the safety of pruned models step by step from the attention-head level down to the neuron level.

Link: https://arxiv.org/abs/2505.16104
Authors: Yue Li,Xin Yi,Dongsheng Shi,Gerard de Melo,Xiaoling Wang,Linlin Wang
Affiliations: East China Normal University; University of Potsdam
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ACL 2025 Findings

Click to view abstract

Abstract:With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.

[NLP-148] Continually Self-Improving Language Models for Bariatric Surgery Question–Answering

【Quick Read】: This paper addresses how healthcare disparities, such as uneven access to providers and reliable information, limit outcomes for patients with severe obesity undergoing bariatric and metabolic surgery (MBS). The key to the solution is bRAGgen, an adaptive retrieval-augmented generation (RAG) model that automatically integrates real-time medical evidence whenever response confidence falls below a dynamic threshold, keeping outputs timely and accurate and reducing the risk of misinformation. The work also introduces bRAGq, the first large-scale, domain-specific benchmark for MBS care.

Link: https://arxiv.org/abs/2505.16102
Authors: Yash Kumar Atri,Thomas H Shin,Thomas Hartvigsen
Affiliations: University of Virginia; School of Data Science; Department of Surgery; University of Virginia School of Medicine
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery–related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)–based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.
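
The confidence-gated retrieval loop can be sketched in a few lines; `answer_with_confidence`, `retrieve`, and the fixed 0.7 threshold below are hypothetical stand-ins for the paper's model, real-time medical sources, and dynamic thresholding.

```python
# Confidence-gated retrieval sketch in the spirit of bRAGgen.
def adaptive_answer(question, answer_with_confidence, retrieve, threshold=0.7):
    answer, conf = answer_with_confidence(question, evidence=None)
    if conf < threshold:                      # low confidence -> augment
        evidence = retrieve(question)         # fetch current evidence
        answer, conf = answer_with_confidence(question, evidence=evidence)
    return answer, conf

# Toy stand-ins so the control flow runs end to end:
fake_llm = lambda q, evidence=None: ("answer", 0.9) if evidence else ("answer", 0.5)
fake_retrieve = lambda q: ["guideline snippet"]
print(adaptive_answer("What should I eat after surgery?", fake_llm, fake_retrieve))
```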

[NLP-149] BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research

【速读】: 该论文试图解决在生物医学研究中验证科学假设的挑战,特别是人工智能(AI)代理在处理现实世界数据复杂性及证据解释时所面临的困难。其解决方案的关键在于提出BioDSA-1K基准,该基准包含1,029个以假设为中心的任务和1,177个分析计划,来源于300多篇已发表的生物医学研究,旨在反映真实研究流程中的结构与推理。每个任务均包含从原始研究结论中提取的结构化假设,并配有基于实证数据表的支持性证据,同时支持通过标准统计或机器学习方法进行测试。此外,该基准还包含不可验证的假设,以反映现实科学中常见但尚未充分探索的数据不足场景。

链接: https://arxiv.org/abs/2505.16100
作者: Zifeng Wang,Benjamin Danek,Jimeng Sun
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:


Abstract:Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study’s conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable using standard statistical or machine learning methods. The benchmark enables evaluation along four axes: (1) hypothesis decision accuracy, (2) alignment between evidence and conclusion, (3) correctness of the reasoning process, and (4) executability of the AI-generated analysis code. Importantly, BioDSA-1K includes non-verifiable hypotheses: cases where the available data are insufficient to support or refute a claim, reflecting a common yet underexplored scenario in real-world science. We propose BioDSA-1K as a foundation for building and evaluating generalizable, trustworthy AI agents for biomedical discovery.

[NLP-150] A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization

【速读】: 该论文试图解决如何利用生成式 AI (Generative AI) 在分子发现中的应用问题,特别是在分子生成和分子优化两个核心任务上的挑战。其解决方案的关键在于通过构建一个针对这些问题的分类体系,分析各类代表性技术,并探讨如何在不同的学习设置中充分利用大语言模型 (Large Language Models, LLMs) 的能力,同时结合常用的数据集和评估协议,为该领域的发展提供系统性的综述与指导。

链接: https://arxiv.org/abs/2505.16094
作者: Ziqing Wang,Kexin Zhang,Zihan Zhao,Yibo Wen,Abhishek Pandey,Han Liu,Kaize Ding
机构: Northwestern University (西北大学); AbbVie (艾伯维)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Under review


Abstract:Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language, symbolic notations, with emerging extensions to incorporate multi-modal inputs. To advance the new field of LLM for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at this https URL.

[NLP-151] Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance MICRO

【速读】: 该论文试图解决生成式 AI(Generative AI)在金融文本情感分析中的准确性和可靠性问题,特别是在处理具有策略性模糊性的财务披露内容时的挑战。解决方案的关键在于通过基准测试 Microsoft 的 Copilot、OpenAI 的 ChatGPT、Google 的 Gemini 以及传统机器学习模型,评估其情感分析性能,并利用提示工程优化结果,同时通过可视化手段分析情感一致性与股票表现之间的关联性。

链接: https://arxiv.org/abs/2505.16090
作者: Dominick Kubica,Dylan T. Gordon,Nanami Emura,Derleen Saini,Charlie Goldenberg
机构: Santa Clara University - Leavey School of Business(圣克拉拉大学-利维商学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages, 4 figures. Research conducted as part of a Microsoft-sponsored Capstone Project at Santa Clara University


Abstract:As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, these models often struggle with the nuanced, strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making it difficult even for human analysts to interpret consistently, let alone AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft’s Copilot, OpenAI’s ChatGPT, Google’s Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results. Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft’s lines of business to determine which segments exert the greatest influence.

[NLP-152] Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

【速读】: 该论文旨在解决现代字节对编码(BPE)分词器在处理日历日期时将其分割为无意义片段的问题,这导致了标记数量增加并掩盖了用于鲁棒时间推理的固有结构。其解决方案的关键在于引入一种称为“日期碎片化比率”的简单且可解释的度量标准,以评估分词器保留多位日期组件的准确性;同时,通过层级探测和因果注意力跳跃分析,揭示出大语言模型(LLM)中出现的日期抽象机制,该机制能够将月份、日期和年份的片段拼接起来进行时间推理。

链接: https://arxiv.org/abs/2505.16088
作者: Gagan Bhatia,Maxime Peyrard,Wei Zhao
机构: University of Aberdeen (阿伯丁大学); Université Grenoble Alpes & CNRS (格勒诺布尔阿尔卑斯大学与法国国家科学研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).
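摘要中的“日期碎片化比率”可以用如下示意代码帮助理解:统计年、月、日等组件中有多少被分词器切碎。精确定义请以论文为准,这里仅以“未保持完整的组件占比”的补数作近似:

```python
def date_fragmentation_ratio(tokenizer, date_str, components):
    """示意性实现:衡量分词器切碎日期组件(年、月、日)的程度。"""
    tokens = tokenizer.tokenize(date_str)  # 假设为 HuggingFace 风格分词器接口
    intact = sum(1 for c in components if c in tokens)  # 组件恰好对应单个 token
    return 1.0 - intact / len(components)

# 用法示意:若 "20250312" 被切为 ["202", "503", "12"],
# 则年与月两个组件被破坏,ratio = 1 - 1/3 ≈ 0.67
# ratio = date_fragmentation_ratio(tok, "20250312", ["2025", "03", "12"])
```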

[NLP-153] Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的多智能体系统在复杂任务中的优化难题,特别是在需要专家协作的软件开发任务中。其关键解决方案是提出一种两步代理提示优化流程:首先利用自然语言反馈识别表现不佳的智能体并分析其失败原因,随后根据这些失败解释优化被识别智能体的系统提示。该方法通过评估不同优化设置对系统性能的影响,为多智能体系统的群体行为提供了实用的见解。

链接: https://arxiv.org/abs/2505.16086
作者: Ming Shen,Raphael Shu,Anurag Pratik,James Gung,Yubin Ge,Monica Sunkara,Yi Zhang
机构: Arizona State University (亚利桑那州立大学); Amazon Web Services (亚马逊网络服务)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:


Abstract:We have seen remarkable progress in large language models (LLMs) empowered multi-agent systems solving complex tasks necessitating cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompts optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

[NLP-154] BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators

【速读】: 该论文试图解决政治新闻中感知意识形态偏见的可解释性检测问题,旨在通过标注数据支持透明且社会敏感的自然语言处理(NLP)系统的发展。其解决方案的关键在于构建BiasLab数据集,该数据集包含300篇政治新闻文章,每篇文章由众包工作者在两个独立尺度上标注对民主党与共和党的情感倾向,并附有理由说明。此外,通过引入基于GPT-4o的模式约束标注模拟,揭示了与人类标注相呼应的镜像不对称(mirrored asymmetries)现象,特别是对略偏右倾内容的误分类问题。该研究还定义了两种建模任务:感知漂移预测和理由类型分类,并提供了基线性能以展示可解释偏见检测的挑战。

链接: https://arxiv.org/abs/2505.16081
作者: KMA Solaiman
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校)
类目: Computation and Language (cs.CL)
备注: Under review


Abstract:We present BiasLab, a dataset of 300 political news articles annotated for perceived ideological bias. These articles were selected from a curated 900-document pool covering diverse political events and source biases. Each article is labeled by crowdworkers along two independent scales, assessing sentiment toward the Democratic and Republican parties, and enriched with rationale indicators. The annotation pipeline incorporates targeted worker qualification and was refined through pilot-phase analysis. We quantify inter-annotator agreement, analyze misalignment with source-level outlet bias, and organize the resulting labels into interpretable subsets. Additionally, we simulate annotation using schema-constrained GPT-4o, enabling direct comparison to human labels and revealing mirrored asymmetries, especially in misclassifying subtly right-leaning content. We define two modeling tasks: perception drift prediction and rationale type classification, and report baseline performance to illustrate the challenge of explainable bias detection. BiasLab’s rich rationale annotations provide actionable interpretations that facilitate explainable modeling of political bias, supporting the development of transparent, socially aware NLP systems. We release the dataset, annotation schema, and modeling code to encourage research on human-in-the-loop interpretability and the evaluation of explanation effectiveness in real-world settings.

[NLP-155] Small Language Models in the Real World: Insights from Industrial Text Classification

【速读】: 该论文试图解决在工业场景下,小型语言模型是否能够有效处理文本分类任务的问题,特别是针对生成式AI(Generative AI)在推理效率、提示质量依赖性以及GPU资源消耗方面的局限性。解决方案的关键在于对提示工程和监督微调方法进行系统评估,并重点关注模型在视频随机存取存储器(VRAM)利用效率方面的表现,以期为紧凑模型在实际部署中的应用提供有价值的参考。

链接: https://arxiv.org/abs/2505.16078
作者: Lujun Li,Lama Sleem,Niccolo’ Gentile,Geoffrey Nichil,Radu State
机构: University of Luxembourg (卢森堡大学); Foyer S.A. (Foyer S.A.)
类目: Computation and Language (cs.CL)
备注:


Abstract:With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inefficiency on inference due to token-by-token generation, and their effectiveness in text classification tasks heavily depends on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Thus, the question of whether smaller language models are capable of effectively handling text classification tasks emerges as a topic of significant interest. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.

[NLP-156] Merge to Mix: Mixing Datasets via Model Merging

【速读】: 该论文试图解决在微调大型语言模型(Large Language Models, LMs)时,如何高效组合数据集以最大化下游任务性能的问题。传统方法依赖启发式策略和试错过程,通常需要多次微调运行才能获得理想结果。论文提出的解决方案是“Merge to Mix”,其关键在于利用模型合并(model merging)技术,通过少量简单的算术运算将多个针对不同数据集微调的模型能力整合为一个模型,从而有效替代对整个数据集混合进行完整微调的需求,显著加速数据集混合的选择过程。

链接: https://arxiv.org/abs/2505.16066
作者: Zhixu Silvia Tao,Kasper Vinken,Hao-Wei Yeh,Avi Cooper,Xavier Boix
机构: Princeton University (普林斯顿大学); Fujitsu Research of America (富士通美国研究院); Fujitsu Limited (富士通有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:


Abstract:Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. However, composing effective dataset mixtures typically relies on heuristics and trial-and-error, often requiring multiple fine-tuning runs to achieve the desired outcome. We propose a novel method, Merge to Mix, that accelerates composing dataset mixtures through model merging. Model merging is a recent technique that combines the abilities of multiple individually fine-tuned LMs into a single LM by using a few simple arithmetic operations. Our key insight is that merging models individually fine-tuned on each dataset in a mixture can effectively serve as a surrogate for a model fine-tuned on the entire mixture. Merge to Mix leverages this insight to accelerate selecting dataset mixtures without requiring full fine-tuning on each candidate mixture. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
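Merge to Mix 所依赖的“少量简单算术运算”,核心就是对各个单独微调模型的参数做(加权)平均。以下为一个最小示意,其中均匀加权为示意性假设,论文也可能采用其他合并算子:

```python
import torch

def merge_models(state_dicts, weights=None):
    """把在各数据集上分别微调的模型参数平均,近似"在混合数据集上整体微调"的代理模型。"""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# 用法:评估候选混合 {D1, D3, D4} 时,无需重新微调,
# 直接用 merge_models([sd1, sd3, sd4]) 得到代理模型再测下游性能。
```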

[NLP-157] Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation

【速读】: 该论文旨在解决嵌入式检索(Embedding-Based Retrieval, EBR)模型在训练过程中因搜索日志数据缺乏多样性与细节而导致的性能受限问题。其解决方案的关键在于提出Aug2Search框架,该框架利用生成式AI(Generative AI, GenAI)模型生成高质量的合成数据,并通过多模态和多任务的方法优化查询与产品之间的相关性。实验表明,Llama模型在生成具有高一致性、相关性和多样性的合成数据方面表现出色,且在使用1亿条合成数据样本时,ROC_AUC指标提升了最高4%,验证了该方法的有效性。

链接: https://arxiv.org/abs/2505.16065
作者: Ruijie Xi,He Ba,Hao Yuan,Rishu Agrawal,Arul Prakash
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:


Abstract:Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic match between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and details needed for effective EBR model training, limiting the models’ ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework leveraging synthetic data generated by Generative AI (GenAI) models, in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzing its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: sampled engagement data or original data (e.g., “Click” and “Listing Interactions”), synthetic data, and a mixture of both engagement and synthetic data to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach. Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or a mixture of original and synthetic data.

[NLP-158] Internal and External Impacts of Natural Language Processing Papers ACL2025

【速读】: 该论文试图解决的问题是分析自然语言处理(Natural Language Processing, NLP)研究在顶级会议(如ACL、EMNLP和NAACL)上发表后,其影响力在学术界和更广泛社会中的分布情况。解决方案的关键在于通过分析来自科研论文及专利、媒体和政策文件等外部来源的引用情况,评估不同NLP主题在内部学术圈和外部应用领域的接受度与影响范围。研究揭示了语言建模在内部和外部均具有最广泛的影响,而语言学基础相关主题的影响相对较低,并指出伦理、偏见和公平性等议题在政策文件中受到更多关注,但在学术引用中较少。

链接: https://arxiv.org/abs/2505.16061
作者: Yu Zhang
机构: Texas A&M University (德克萨斯A&M大学)
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: 7 pages; Accepted to ACL 2025


Abstract:We investigate the impacts of NLP research published in top-tier conferences (i.e., ACL, EMNLP, and NAACL) from 1979 to 2024. By analyzing citations from research articles and external sources such as patents, media, and policy documents, we examine how different NLP topics are consumed both within the academic community and by the broader public. Our findings reveal that language modeling has the widest internal and external influence, while linguistic foundations have lower impacts. We also observe that internal and external impacts generally align, but topics like ethics, bias, and fairness show significant attention in policy documents with much fewer academic citations. Additionally, external domains exhibit distinct preferences, with patents focusing on practical NLP applications and media and policy documents engaging more with the societal implications of NLP models.

[NLP-159] Causal LLM Routing: End-to-End Regret Minimization from Observational Data

【速读】: 该论文试图解决语言模型(Language Model, LM)路由问题,即在多个语言模型中为每个查询选择最优模型,以平衡准确性和成本等性能指标。现有方法通常采用解耦策略,先预测性能指标再进行模型选择,但该方法容易累积误差,并依赖于全反馈数据,这在实际中成本高昂。论文的关键解决方案是基于观察性数据学习因果端到端框架,通过最小化决策后悔来学习路由策略,并引入两个理论支持的替代目标函数以实现高效优化,同时通过区间条件架构处理异质成本偏好。实验表明,该方法在公开基准上优于现有基线,达到了最先进的性能。

链接: https://arxiv.org/abs/2505.16037
作者: Asterios Tsiourvas,Wei Sun,Georgia Perakis
机构: MIT(麻省理工学院); IBM Research(IBM研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:


Abstract:LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
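摘要提到的“softmax 加权后悔近似”可写成如下损失示意。其中各模型的后悔估计假设已经由因果方法(例如逆倾向加权)从观察数据中得到,记号均为示意性假设:

```python
import torch.nn.functional as F

def softmax_regret_loss(router_logits, regret_estimates):
    """最小化期望后悔 E_x[ Σ_m π(m|x) · regret(x, m) ] 的示意。
    router_logits:    路由器对各候选 LLM 的打分,形状 [batch, n_models]
    regret_estimates: 各模型后悔值的因果估计,形状同上
    收敛时 softmax 策略会把概率质量集中到低后悔的模型上。"""
    policy = F.softmax(router_logits, dim=-1)
    return (policy * regret_estimates).sum(dim=-1).mean()

# 训练循环中:loss = softmax_regret_loss(router(x), regrets); loss.backward()
```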

[NLP-160] OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models

【速读】: 该论文试图解决当前生成式 AI(Generative AI)在伦理评估方面存在的不足,包括评估维度狭窄、语言和模型多样性有限的问题。其解决方案的关键在于构建一个涵盖鲁棒性(robustness)、可靠性(reliability)、安全性和公平性的全面伦理评估框架,并通过 LLM-as-a-Judge 方法对29个开源大语言模型进行评估,同时覆盖英语和低资源语言土耳其语,以提升评估的广度、语言覆盖范围和模型多样性。

链接: https://arxiv.org/abs/2505.16036
作者: Burak Erinç Çetin,Yıldırım Özen,Elif Naz Demiryılmaz,Kaan Engür,Cagri Toraman
机构: Middle East Technical University (中东技术大学)
类目: Computation and Language (cs.CL)
备注:


Abstract:Generative large language models present significant potential but also raise critical ethical concerns. Most studies focus on narrow ethical dimensions, and also limited diversity of languages and models. To address these gaps, we conduct a broad ethical evaluation of 29 recent open-source large language models using a novel data collection including four ethical aspects: Robustness, reliability, safety, and fairness. We analyze model behavior in both a commonly used language, English, and a low-resource language, Turkish. Our aim is to provide a comprehensive ethical assessment and guide safer model development by filling existing gaps in evaluation breadth, language coverage, and model diversity. Our experimental results, based on LLM-as-a-Judge, reveal that optimization efforts for many open-source models appear to have prioritized safety and fairness, and demonstrated good robustness while reliability remains a concern. We demonstrate that ethical evaluation can be effectively conducted independently of the language used. In addition, models with larger parameter counts tend to exhibit better ethical performance, with Gemma and Qwen models demonstrating the most ethical behavior among those evaluated.

[NLP-161] Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild

【速读】: 该论文试图解决用户在复杂写作工作流中与大型语言模型(Large Language Models, LLMs)进行多轮交互时,如何有效协作以生成更符合需求文本的问题。其解决方案的关键在于识别并分析用户与LLMs互动中的原型人类-人工智能协作行为(Prototypical Human-AI Collaboration Behaviors, PATHs),并通过这些PATHs揭示用户意图如何影响其协作行为,从而为LLM的对齐提供理论支持与实践指导。

链接: https://arxiv.org/abs/2505.16023
作者: Sheshera Mysore,Debarati Das,Hancheng Cao,Bahareh Sarrafzadeh
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Pre-print under-review


Abstract:As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users’ intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.

[NLP-162] NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

【速读】: 该论文试图解决传统激励训练(incentive training)依赖外部验证器(verifier)的问题,这限制了其在数学和编程等领域的应用。解决方案的关键在于提出NOVER(NO-VERifier Reinforcement Learning),这是一个无需外部验证器的通用强化学习框架,仅需标准监督微调数据即可实现激励训练,从而扩展了激励训练的应用范围并提升了模型性能。

链接: https://arxiv.org/abs/2505.16022
作者: Wei Liu,Siya Qi,Xinyu Wang,Chen Qian,Yali Du,Yulan He
机构: King’s College London (国王学院); The Alan Turing Institute (艾伦·图灵研究所); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 5 tables, 12 figures


Abstract:Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model’s output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.

[NLP-163] Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

【速读】: 该论文试图解决传统检索增强生成(Retrieval-Augmented Generation, RAG)流水线中依赖基于相似性的检索和重排序所带来的可解释性差、可解释性不足以及对对抗性内容鲁棒性弱的问题。其解决方案的关键在于提出一种新的方法METEORA,该方法通过基于理由的选段策略替代传统的重排序过程,利用经过偏好调优的通用大语言模型生成与输入查询相关的理由,并据此指导证据片段的选择,从而实现更准确、可解释且鲁棒的生成过程。

链接: https://arxiv.org/abs/2505.16014
作者: Yash Saxena,Anpur Padia,Mandar S Chaudhary,Kalpa Gunaratna,Srinivasan Parthasarathy,Manas Gaur
机构: UMBC(马里兰大学巴尔的摩县分校); eBay Inc.(eBay公司); Samsung Research America (SRA)(三星美国研究院); Ohio State University(俄亥俄州立大学)
类目: Computation and Language (cs.CL)
备注:


Abstract:Traditional Retrieval-Augmented Generation (RAG) pipelines rely on similarity-based retrieval and re-ranking, which depend on heuristics such as top-k, and lack explainability, interpretability, and robustness against adversarial content. To address this gap, we propose a novel method METEORA that replaces re-ranking in RAG with a rationale-driven selection approach. METEORA operates in two stages. First, a general-purpose LLM is preference-tuned to generate rationales conditioned on the input query using direct preference optimization. These rationales guide the evidence chunk selection engine, which selects relevant chunks in three stages: pairing individual rationales with corresponding retrieved chunks for local relevance, global selection with elbow detection for adaptive cutoff, and context expansion via neighboring chunks. This process eliminates the need for top-k heuristics. The rationales are also used for consistency check using a Verifier LLM to detect and filter poisoned or misleading content for safe generation. The framework provides explainable and interpretable evidence flow by using rationales consistently across both selection and verification. Our evaluation across six datasets spanning legal, financial, and academic research domains shows that METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods. In adversarial settings, METEORA significantly improves the F1 score from 0.10 to 0.44 over the state-of-the-art perplexity-based defense baseline, demonstrating strong resilience to poisoning attacks. Code available at: this https URL
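METEORA 中“全局选择 + 肘部检测”的自适应截断思想可用如下代码示意:将证据块按相关性降序排列,在相邻得分落差最大处截断,从而免去固定 top-k。以最大落差定肘点只是常见近似之一,并非论文原始算法:

```python
def elbow_cutoff(scores):
    """对降序得分做肘部检测,返回应保留的证据块数量。"""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    elbow = gaps.index(max(gaps))  # 最大落差出现的位置即肘点
    return elbow + 1               # 保留肘点之前(含肘点)的所有块

# 例:scores = [0.91, 0.88, 0.85, 0.42, 0.40] -> 返回 3,仅保留前三个证据块
```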

[NLP-164] LAGO: Few-shot Crosslingual Embedding Inversion Attacks via Language Similarity-Aware Graph Optimization

【速读】: 该论文试图解决多语言自然语言处理系统中由于嵌入空间逆向攻击(embedding inversion attacks)引发的隐私漏洞问题,特别是针对少样本(few-shot)跨语言场景下的攻击有效性问题。解决方案的关键在于提出LAGO(Language Similarity-Aware Graph Optimization),该方法通过图优化框架显式建模语言间的语义关系,利用句法和词法相似性作为边约束,实现相关语言间的协同参数学习,从而提升攻击在不同语言间的迁移能力。其核心创新在于将Frobenius范数正则化与线性不等式或总变分约束相结合,确保在极少量数据(每种语言仅10个样本)情况下仍能实现跨语言嵌入空间的稳健对齐。

链接: https://arxiv.org/abs/2505.16008
作者: Wenrui Yu,Yiyi Chen,Johannes Bjerva,Sokol Kosta,Qiongxiu Li
机构: Aalborg University (奥尔堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:


Abstract:We propose LAGO - Language Similarity-Aware Graph Optimization - a novel approach for few-shot cross-lingual embedding inversion attacks, addressing critical privacy vulnerabilities in multilingual NLP systems. Unlike prior work in embedding inversion attacks that treat languages independently, LAGO explicitly models linguistic relationships through a graph-based constrained distributed optimization framework. By integrating syntactic and lexical similarity as edge constraints, our method enables collaborative parameter learning across related languages. Theoretically, we show this formulation generalizes prior approaches, such as ALGEN, which emerges as a special case when similarity constraints are relaxed. Our framework uniquely combines Frobenius-norm regularization with linear inequality or total variation constraints, ensuring robust alignment of cross-lingual embedding spaces even with extremely limited data (as few as 10 samples per language). Extensive experiments across multiple languages and embedding models demonstrate that LAGO substantially improves the transferability of attacks with 10-20% increase in Rouge-L score over baselines. This work establishes language similarity as a critical factor in inversion attack transferability, urging renewed focus on language-aware privacy-preserving multilingual embeddings.
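为帮助理解 LAGO 如何把语言相似度作为图上的边约束融入优化,下面给出一个与摘要描述相符的目标函数示意(记号与具体形式均为示意性假设,请以原文为准):

```latex
\min_{\{W_\ell\}}\;
\sum_{\ell \in \mathcal{L}} \bigl\| X_\ell W_\ell - Y_\ell \bigr\|_F^2
\;+\; \lambda \sum_{(\ell,\ell') \in E} s_{\ell\ell'}\,
\bigl\| W_\ell - W_{\ell'} \bigr\|_F^2
\quad \text{s.t.} \quad A\,\operatorname{vec}(W_\ell) \le b \;\;(\forall \ell)
```

其中 W_ℓ 是语言 ℓ 的嵌入对齐映射,s_{ℓℓ'} 是由句法/词法相似度给出的边权,不等式约束亦可替换为总变差约束;当相似度约束被放松时各语言退化为独立求解,这与摘要所述“ALGEN 是其特例”一致。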

[NLP-165] Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

【速读】: 该论文试图解决稀疏自编码器(Sparse Autoencoders, SAEs)在生成概念表示时对输入扰动的鲁棒性不足的问题,这一问题在现有评估中被忽视。研究认为,概念表示的鲁棒性是其可靠性的关键指标,而现有的评估主要关注重构-稀疏性权衡、人类(自动)可解释性和特征解耦等指标。论文的关键解决方案是将鲁棒性量化为输入空间的优化问题,并构建了一个包含对抗性扰动场景的全面评估框架,以测试SAE概念表示在面对微小输入扰动时的稳定性。

链接: https://arxiv.org/abs/2505.16004
作者: Aaron J. Li,Suraj Srinivas,Usha Bhalla,Himabindu Lakkaraju
机构: Harvard University(哈佛大学); Bosch Research(博世研究)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:


Abstract:Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
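摘要中“把鲁棒性量化为输入空间优化问题”的做法,大致可写成如下 PGD 风格的对抗扰动示意(sae、model_hidden 等接口均为示意性假设):

```python
import torch

def sae_concept_attack(embed, model_hidden, sae, concept_idx,
                       eps=0.01, steps=10, lr=0.005):
    """在 ε-无穷范数球内寻找扰动 δ,最大化某个 SAE 概念特征的激活偏移。"""
    delta = torch.zeros_like(embed, requires_grad=True)
    base = sae.encode(model_hidden(embed))[..., concept_idx].detach()
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        act = sae.encode(model_hidden(embed + delta))[..., concept_idx]
        loss = -(act - base).abs().mean()   # 最大化概念激活的变化量
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)          # 投影回 ε 球,保持扰动微小
    return (embed + delta).detach()
```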

[NLP-166] SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

【速读】: 该论文试图解决当前基于大语言模型作为评估者(LLM-as-a-Judge)的校准方法在真实世界开放任务中表现不佳的问题,尤其是在这些任务中,最先进的校准评估者与人类判断之间的相关性较弱甚至为负。解决方案的关键在于提出SLMEval,这是一种基于小规模人类偏好数据进行熵最大化校准的新方法,通过估计模型质量的潜在分布并重新加权评估者分数,从而实现与人类评估的高度相关性,并显著降低评估成本。

链接: https://arxiv.org/abs/2505.16003
作者: Roland Daynauth,Christopher Clarke,Krisztian Flautner,Lingjia Tang,Jason Mars
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that SOTA calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and the public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-eval.

[NLP-167] Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions

【速读】: 该论文试图解决语言学理论如何从大规模语言模型(Large Language Models, LLMs)中获取更有价值的证据的问题,特别是如何揭示LLMs所学习到的抽象机制。解决方案的关键在于应用因果可解释性方法(causal interpretability methods),通过分析LLMs对英语填充-缺口依赖结构(filler-gap dependency constructions)的处理,揭示其内部工作机制,并发现可能影响标准语言学理论的新因素,如频率、填充词类型和上下文等。

链接: https://arxiv.org/abs/2505.16002
作者: Sasha Boguraev,Christopher Potts,Kyle Mahowald
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 19 figures, 11 tables


Abstract:Large Language Models (LLMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LLMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LLMs learn to use. Our empirical focus is a set of English filler-gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based in Distributed Interchange Interventions, we show that LLMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors – relating to frequency, filler type, and surrounding context – that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LLMs can push linguistic theory forward.

[NLP-168] Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

【速读】: 该论文试图解决在低资源语言如波斯语中,小型语言模型在医疗领域知识不足的问题(Small language models struggle with specialized domains in low-resource languages like Persian)。解决方案的关键在于利用可获取的在线数据,包括从医学杂志中爬取的语料库和真实医生-患者问答对的数据集,对基线模型进行微调,以提升其医疗知识水平。

链接: https://arxiv.org/abs/2505.16000
作者: Mehrdad ghassabi,Pedram Rostami,Hamidreza Baradaran Kashani,Amirhossein Poursina,Zahra Kazemi,Milad Tavakoli
机构: University of Isfahan (伊斯法罕大学); University of Tehran (德黑兰大学); Isfahan University of Medical Sciences (伊斯法罕医科大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures


Abstract:The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available making ours the first of its kind. This study explores the enhancement of medical knowledge in a small language model by leveraging accessible online data, including a crawled corpus from medical magazines and a dataset of real doctor-patient QA pairs. We fine-tuned a baseline model using our curated data to improve its medical knowledge. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and provides better responses compared to its baseline. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments.

[NLP-169] Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在人机协作决策中的可信性、渐进性和定制化解释能力不足的问题。研究通过评估五种LLMs在解决和解释六宫格数独谜题中的表现,揭示了当前模型在提供反映策略推理或直觉问题解决过程的解释方面存在显著缺陷。解决方案的关键在于提升LLMs的解释能力,使其能够生成更清晰、更具逻辑性和可理解性的过程说明,从而增强其作为人机协作决策有效伙伴的潜力。

链接: https://arxiv.org/abs/2505.15993
作者: Anirudh Maiya,Razan Alghamdi,Maria Leonor Pacheco,Ashutosh Trivedi,Fabio Somenzi
机构: University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2025


Abstract:The success of Large Language Models (LLMs) in human-AI collaborative decision-making hinges on their ability to provide trustworthy, gradual, and tailored explanations. Solving complex puzzles, such as Sudoku, offers a canonical example of this collaboration, where clear and customized explanations often hold greater importance than the final solution. In this study, we evaluate the performance of five LLMs in solving and explaining 6×6 Sudoku puzzles. While one LLM demonstrates limited success in solving puzzles, none can explain the solution process in a manner that reflects strategic reasoning or intuitive problem-solving. These findings underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.

[NLP-170] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

【速读】: 该论文试图解决视觉密集型任务中传统链式思维推理仅限于文本空间而导致效果受限的问题。其解决方案的关键在于引入像素空间(pixel-space)中的推理机制,使视觉语言模型(VLMs)能够通过诸如缩放和选择帧等视觉推理操作直接处理和分析视觉证据,从而提升视觉任务的推理准确性。为克服模型在像素空间推理能力上的不足,作者采用两阶段训练方法:第一阶段通过合成推理轨迹进行指令微调以熟悉新操作,第二阶段利用基于好奇心的强化学习奖励机制平衡像素空间与文本空间推理的探索。

链接: https://arxiv.org/abs/2505.15966
作者: Alex Su,Haozhe Wang,Weimin Ren,Fangzhen Lin,Wenhu Chen
机构: University of Waterloo(滑铁卢大学); HKUST(香港科技大学); USTC(中国科学技术大学); Vector Institute(矢量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Haozhe Wang and Alex Su contributed equally and listed alphabetically


Abstract:Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model’s initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
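以 zoom-in 为例,这类像素空间视觉操作本质上就是按模型输出的坐标裁剪并放大图像区域,再作为新的视觉证据送回 VLM。以下为示意(坐标格式与输出尺寸均为示意性假设):

```python
from PIL import Image

def zoom_in(image: Image.Image, box, out_size=(448, 448)) -> Image.Image:
    """裁剪归一化坐标 box=[left, top, right, bottom] 指定的区域并放大。"""
    left, top, right, bottom = box
    w, h = image.size
    crop = image.crop((int(left * w), int(top * h),
                       int(right * w), int(bottom * h)))
    return crop.resize(out_size, Image.BICUBIC)

# 用法示意:当模型在推理链中请求 zoom_in(box=[0.4, 0.1, 0.9, 0.5]) 时,
# 框架执行该操作并把结果图追加到对话上下文,供后续推理步骤查验。
```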

[NLP-171] OViP: Online Vision-Language Preference Learning

【速读】: 该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即模型生成的内容与视觉输入不一致。现有方法虽然通过多模态直接偏好优化(Multi-modal Direct Preference Optimization, DPO)来缓解幻觉,但通常依赖于预定义或随机编辑的负样本,无法反映模型的真实错误,限制了训练效果。该论文提出的在线视觉语言偏好学习框架(Online Vision-language Preference Learning, OViP)的关键在于动态构建对比训练数据,基于模型自身的幻觉输出进行语义差异分析,并利用扩散模型合成负样本图像,从而实时生成更相关的监督信号,实现文本和视觉偏好的自适应对齐。

链接: https://arxiv.org/abs/2505.15963
作者: Shujun Liu,Siyuan Wang,Zejun Li,Jianxiang Wang,Cheng Zeng,Zhongyu Wei
机构: Fudan University (复旦大学); University of Southern California (南加州大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 22 pages, 10 figures, 8 tables


Abstract:Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model’s own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.

[NLP-172] Pre-training Large Memory Language Models with Internal and External Knowledge

【速读】: 该论文试图解决神经语言模型作为黑箱系统,其语言模式和事实知识分散在数十亿个不透明参数中的问题,这使得难以可靠地检查、验证或更新特定事实。解决方案的关键在于提出一种新型语言模型——大内存语言模型(Large Memory Language Models, LMLM),其通过预训练策略将事实知识同时存储在内部权重和外部数据库中,并在训练过程中战略性地屏蔽从外部检索到的事实值,从而教导模型进行有针对性的查找而非依赖模型权重中的记忆。

链接: https://arxiv.org/abs/2505.15962
作者: Linxi Zhao,Sofian Zalouk,Christian K. Belardi,Justin Lovelace,Jin Peng Zhou,Kilian Q. Weinberger,Yoav Artzi,Jennifer J. Sun
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:


Abstract:Neural language models are black-boxes – both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
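LMLM“对外部检索到的事实值屏蔽训练损失”的关键一步,可用如下损失函数示意(掩码 is_external_fact 的构造方式为示意性假设,需在数据预处理阶段标注):

```python
import torch.nn.functional as F

def lmlm_loss(logits, labels, is_external_fact):
    """对来自外部数据库的事实值 token 不计损失,促使模型学会"查询"而非"死记"。
    logits: [batch, seq, vocab];labels 与 is_external_fact: [batch, seq]"""
    labels = labels.clone()
    labels[is_external_fact] = -100   # PyTorch 交叉熵中 -100 表示忽略该位置
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```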

[NLP-173] Training Step-Level Reasoning Verifiers with Formal Verification Tools

【速读】: 该论文试图解决生成式 AI (Generative AI) 在过程奖励模型 (Process Reward Models, PRMs) 训练中面临的两个关键问题:一是收集准确的步骤级错误标签需要高昂的人工标注成本,二是现有 PRMs 仅限于数学推理任务。解决方案的关键在于提出 FoVer 方法,通过形式化验证工具(如 Z3 和 Isabelle)自动标注步骤级错误标签,从而无需人工标注即可构建训练数据集,并实现 PRMs 在多样化推理任务中的跨任务泛化能力。

链接: https://arxiv.org/abs/2505.15960
作者: Ryo Kamoi,Yusen Zhang,Nan Zhang,Sarkar Snigdha Sarathi Das,Rui Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Datasets, models, and code are provided at this https URL. Please also refer to our project website at this https URL


Abstract:Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at this https URL.
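FoVer 用形式化验证工具自动标注步骤级标签的思路,可以用 Z3 写成一个最小示例:若“前提成立且结论不成立”不可满足,则该步骤的蕴含关系成立,自动标为正确。示例命题为虚构,仅演示机制:

```python
from z3 import Int, Solver, And, Not, sat

def step_is_valid(premise, conclusion):
    """若 premise ∧ ¬conclusion 不可满足,则前提蕴含结论,该推理步骤有效。"""
    s = Solver()
    s.add(And(premise, Not(conclusion)))
    return s.check() != sat

x = Int("x")
# 某步推理:"由 x > 3 且 x < 10,可得 x > 2" -> 有效,自动标注为正确
print(step_is_valid(And(x > 3, x < 10), x > 2))   # True
# 错误步骤:"由 x > 3,可得 x > 5" -> 存在反例 x = 4,标注为错误
print(step_is_valid(x > 3, x > 5))                # False
```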

[NLP-174] Citation Parsing and Analysis with Language Models

【速读】: 该论文试图解决全球知识生产与传播中不平等的问题,特别是由于缺乏有效的工具导致对南方国家知识共享网络的信息不足,进而导致该地区研究人员在索引服务中的边缘化。解决方案的关键在于利用开源大语言模型对学术手稿的引用进行标注,以生成可索引的格式。研究发现,当前的语言模型在识别引用组成部分方面表现出色,甚至优于现有最先进方法,表明通过微调可以开发出小型、稳健的引用解析模型,从而提升引用网络的准确性,改善科研索引与发现,以及促进元科学的研究。

链接: https://arxiv.org/abs/2505.15948
作者: Parth Sarin,Juan Pablo Alperin
机构: 未知
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Social and Information Networks (cs.SI)
备注: Presented at the Workshop on Open Citations Open Scholarly Metadata 2025


Abstract:A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today’s language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in 2–5 passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.

[NLP-175] MAPS: A Multilingual Benchmark for Global Agent Performance and Security

【速读】: 该论文试图解决当前基于大型语言模型(Large Language Models, LLMs)的代理型人工智能(Agentic AI)系统在多语言环境下的性能下降与安全性问题。现有评估基准仅针对英语环境,忽视了多语言场景下的系统表现,导致非英语用户可能面临不可靠或存在安全风险的代理行为。解决方案的关键在于提出MAPS,一个覆盖多种语言和任务的多语言评估基准套件,通过将四个主流代理基准数据集翻译为十种语言,构建了805个独特任务和8,855个语言特定实例,从而系统分析多语言环境对代理性能和鲁棒性的影响,并提供针对性的开发与评估建议。

链接: https://arxiv.org/abs/2505.15935
作者: Omer Hofman,Oren Rachmil,Shamik Bose,Vikas Pahuja,Jonathan Brokman,Toshiya Shimizu,Trisha Starostina,Kelly Marchisio,Seraphina Goldfarb-Tarrant,Roman Vainshtein
机构: Fujitsu Research of Europe (富士通欧洲研究院); Fujitsu Limited (富士通有限公司); Cohere (Cohere)
类目: Databases (cs.DB); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:


Abstract:Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the global accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus exclusively on English, leaving multilingual settings unexplored. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks - GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into ten diverse languages, resulting in 805 unique tasks and 8,855 total language-specific instances. Our benchmark suite enables a systematic analysis of how multilingual contexts affect agent performance and robustness. Empirically, we observe consistent degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. Building on these findings, we provide actionable recommendations to guide agentic AI systems development and assessment under multilingual settings. This work establishes a standardized evaluation framework, encouraging future research towards equitable, reliable, and globally accessible agentic AI. MAPS benchmark suite is publicly available at this https URL

[NLP-176] ViQAgent : Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

【速读】: 该论文旨在解决视频问答(VideoQA)中对象定位(grounding)随时间跟踪的挑战以及基于推理的决策问题,以更好地将对象引用与语言模型输出对齐。其解决方案的关键在于构建一个结合思维链(Chain-of-Thought)框架与定位推理的大型语言模型(LLM)驱动代理,并融合YOLO-World以增强对象跟踪和对齐能力,从而在多个基准测试中实现了新的最先进性能。

链接: https://arxiv.org/abs/2505.15928
作者: Tony Montes,Fernando Lozano
机构: Universidad de los Andes(安第斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:


Abstract:Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant improvements remain in tracking objects for grounding over time and decision-making based on reasoning to better align object references with language model outputs, as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at this https URL.

[NLP-177] Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

【速读】: 该论文试图解决对话智能体对齐问题,即如何在仅获得单次会话级反馈信号的情况下,生成高质量的对话响应。解决方案的关键在于利用预训练的大语言模型(Large Language Model, LLM)的推理能力,将全局的会话级反馈分解为细粒度的局部隐式奖励,从而实现对话策略的优化。通过文本和多模态两种变体,分别基于对话转录文本或结合行为线索(如语调、眼神和面部情感)进行奖励分解,并将推断出的回合级奖励提炼为轻量级奖励模型,用于强化学习(Reinforcement Learning, RL)驱动的对话生成微调。

链接: https://arxiv.org/abs/2505.15922
作者: Dong Won Lee,Hae Won Park,Cynthia Breazeal,Louis-Philippe Morency
机构: MIT(麻省理工学院); CMU(卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 3 tables


Abstract:We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.
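“由冻结 LLM 将会话级反馈分解为逐轮隐式奖励”的流程大致如下。提示词、JSON 解析与归一化兜底均为示意性假设,并非论文原始实现:

```python
import json

def decompose_session_reward(llm, turns, session_score):
    """让冻结的 LLM 为每轮助手回复分配贡献分,使其和等于会话级评分。"""
    dialog = "\n".join(f"[{i}] {t}" for i, t in enumerate(turns))
    prompt = (f"以下对话获得的总体评分为 {session_score}(0-10)。"
              f"请为每一轮助手回复打出贡献分,并仅输出 JSON 数组,"
              f"使各轮分数之和等于总体评分。\n\n{dialog}")
    turn_rewards = json.loads(llm.generate(prompt))  # 例如 [1.5, 0.5, 3.0, 2.0]
    total = sum(turn_rewards) or 1.0
    # 数值兜底:若 LLM 输出之和偏离会话分,按比例重新归一
    return [r * session_score / total for r in turn_rewards]
```

这些逐轮奖励随后可蒸馏进一个轻量级奖励模型,供 RL 微调对话生成时使用。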

[NLP-178] Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

【速读】: 该论文试图解决如何利用生成式 AI (Generative AI) 中固有的概率知识来为事件及其相互关系构建贝叶斯网络(Bayesian Network, BN)的问题,特别是在缺乏充足真实数据的情况下提升概率建模的准确性。解决方案的关键在于通过查询大型语言模型(Large Language Models, LLMs)获取条件概率信息,并将其作为专家先验知识用于优化从少量数据中提取的概率分布,从而减少系统性偏差,实现贝叶斯网络的自动构建。

链接: https://arxiv.org/abs/2505.15918
作者: Aliakbar Nafar,Kristen Brent Venable,Zijun Cui,Parisa Kordjamshidi
机构: Michigan State University (密歇根州立大学); Florida Institute for Human and Machine Cognition (佛罗里达人类与机器认知研究所); University of West Florida (西佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Large Language Models (LLMs) have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. This paper investigates using probabilistic knowledge inherent in LLMs to derive probability estimates for statements concerning events and their interrelationships captured via a Bayesian Network (BN). Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from minimal data, significantly reducing systematic biases. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with small amounts of real-world data. Additionally, we evaluate several prompting strategies for eliciting probabilistic knowledge from LLMs and establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.
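向 LLM 查询条件概率并填充贝叶斯网络条件概率表(CPT)的过程可示意如下。llm.ask 接口与提示词均为示意性假设;如摘要所述,实践中还应与少量真实数据融合以减小系统性偏差:

```python
import itertools

def fill_cpt(llm, node, parents, states):
    """对父节点状态的每种组合,向 LLM 询问子节点各取值的条件概率并归一化。"""
    cpt = {}
    for assignment in itertools.product(*(states[p] for p in parents)):
        cond = ", ".join(f"{p}={v}" for p, v in zip(parents, assignment))
        probs = []
        for s in states[node]:
            ans = llm.ask(f"在 {cond} 的条件下,{node}={s} 的概率是多少?"
                          f"只回答一个 0 到 1 之间的数字。")
            probs.append(float(ans))
        z = sum(probs) or 1.0
        cpt[assignment] = [p / z for p in probs]   # 归一化为合法概率分布
    return cpt
```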

[NLP-179] BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law including case law

【速读】: 该论文旨在解决在巴西个人所得税法律背景下,基于引用进行问答(Question Answering with References)的问题。其关键解决方案是构建了一个名为BR-TaxQA-R的新数据集,该数据集整合了巴西国家税务局2024年官方问答文档中的715个问题,并补充了Conselho Administrativo de Recursos Fiscais(CARF)的法规和行政裁决。此外,研究者还实现了基于OpenAI嵌入的检索增强生成(Retrieval-Augmented Generation, RAG)管道,并使用GPT-4o-mini进行答案生成,以提升法律问答的准确性和相关性。

链接: https://arxiv.org/abs/2505.15916
作者: Juvenal Domingos Júnior,Augusto Faria,E. Seiti de Oliveira,Erick de Brito,Matheus Teotonio,Andre Assumpção,Diedre Carmo,Roberto Lotufo,Jayr Pereira
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents BR-TaxQA-R, a novel dataset designed to support question answering with references in the context of Brazilian personal income tax law. The dataset contains 715 questions from the 2024 official Q&A document published by Brazil’s Internal Revenue Service, enriched with statutory norms and administrative rulings from the Conselho Administrativo de Recursos Fiscais (CARF). We implement a Retrieval-Augmented Generation (RAG) pipeline using OpenAI embeddings for searching and GPT-4o-mini for answer generation. We compare different text segmentation strategies and benchmark our system against commercial tools such as ChatGPT and this http URL using RAGAS-based metrics. Results show that our custom RAG pipeline outperforms commercial systems in Response Relevancy, indicating stronger alignment with user queries, while commercial models achieve higher scores in Factual Correctness and fluency. These findings highlight a trade-off between legally grounded generation and linguistic fluency. Crucially, we argue that human expert evaluation remains essential to ensure the legal validity of AI-generated answers in high-stakes domains such as taxation. BR-TaxQA-R is publicly available at this https URL.
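下面给出一个可离线运行的极简RAG草稿:embed 用词袋哈希向量代替论文中的OpenAI嵌入,检索Top-k条款后拼入提示词;实际系统应替换为真实嵌入与生成接口(如GPT-4o-mini),此处条款内容亦为虚构示例。

```python
# Minimal offline RAG sketch; swap `embed` for a real embedding API in practice.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0                 # toy bag-of-words embedding
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    q = embed(query)
    scores = [float(q @ embed(c)) for c in chunks]
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

chunks = [  # made-up stand-ins for statutory norms / rulings
    "Art. 7: medical expenses are deductible from taxable income.",
    "Art. 12: capital gains on real estate are taxed at 15%.",
]
context = "\n".join(retrieve("Can I deduct medical expenses?", chunks))
print(f"Answer citing the articles below.\n{context}\nQuestion: ...")
```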

[NLP-180] GRIT: Teaching MLLMs to Think with Images

【速读】: 该论文旨在解决现有开放源码视觉推理模型在生成推理内容时仅依赖自然语言而缺乏显式视觉信息整合的问题,从而限制了其生成清晰且具有视觉基础的推理链的能力。其解决方案的关键在于提出一种名为GRIT的新方法,该方法引入了一种基于视觉的推理范式,使模型在生成推理链时能够交错使用自然语言和显式的边界框坐标,以指明模型在推理过程中参考的输入图像区域,并结合一种基于GRPO算法的强化学习方法GRPO-GR,通过关注最终答案准确性和推理输出格式的鲁棒奖励机制,无需依赖带有推理链标注或显式边界框标签的数据,从而实现了高效的数据利用。

链接: https://arxiv.org/abs/2505.15879
作者: Yue Fan,Xuehai He,Diji Yang,Kaizhi Zheng,Ching-Chen Kuo,Yuting Zheng,Sravana Jyothi Narayanaraju,Xinze Guan,Xin Eric Wang
机构: UC Santa Cruz(加州大学圣克鲁兹分校); eBay(易贝)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
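下面给出GRPO-GR风格奖励函数的一个示意实现:格式项检查推理链中是否交错出现合法的边界框坐标,结果项检查最终答案是否正确;边界框的文本格式与加权系数均为笔者假设,并非论文规范。

```python
# Illustrative GRPO-GR-style reward: format term (well-formed boxes in the
# reasoning chain) + answer term. Box syntax and weights are assumptions.
import re

BOX = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")

def grounded_reward(reasoning: str, answer: str, gold: str,
                    w_format: float = 0.5) -> float:
    boxes = BOX.findall(reasoning)
    valid = all(int(x1) < int(x2) and int(y1) < int(y2)
                for x1, y1, x2, y2 in boxes)
    format_score = 1.0 if boxes and valid else 0.0
    answer_score = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    return w_format * format_score + (1 - w_format) * answer_score

print(grounded_reward("the cat sits at [10, 20, 50, 80]", "cat", "cat"))  # 1.0
```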

[NLP-181] Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

【速读】: 该论文旨在解决文本到图像(T2I)检索中针对特定视觉属性的查询处理效果不佳的问题。现有基于CLIP等模型的检索器在处理此类查询时表现较差且不平衡,可能是因为其图像嵌入更关注全局语义和主体而忽略了其他细节。尽管近期基于多模态大语言模型(MLLM)的检索器具有更大的输出维度,但仍存在类似局限性。因此,作者提出使用由多模态检索器生成的可提示图像嵌入作为解决方案,通过突出所需属性来提升性能。该方法的关键在于利用可提示嵌入,从而更有效地捕捉与查询相关的视觉属性。

链接: https://arxiv.org/abs/2505.15877
作者: Siting Li,Xiang Gao,Simon Shaolei Du
机构: University of Washington (华盛顿大学); IIIS, Tsinghua University (清华大学交叉信息研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 25 pages, 5 figures

点击查看摘要

Abstract:While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.

[NLP-182] Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines

【速读】: 该论文试图解决从自然语言(Natural Language, NL)指令生成数据准备(Data Preparation, DP)管道的问题,旨在降低数据预处理的技术门槛。其解决方案的关键在于提出Text-to-Pipeline任务,并开发了PARROT基准以支持系统评估,同时引入Pipeline-Agent方法,通过迭代预测和执行操作并结合中间表反馈来提升性能。

链接: https://arxiv.org/abs/2505.15874
作者: Yuhang Ge,Yachuan Liu,Yuren Mao,Yunjun Gao
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Data preparation (DP) transforms raw data into a form suitable for downstream applications, typically by composing operations into executable pipelines. Building such pipelines is time-consuming and requires sophisticated programming skills. If we can build the pipelines with natural language (NL), the technical barrier of DP will be significantly reduced. However, constructing DP pipelines from NL instructions remains underexplored. To fill the gap, we introduce Text-to-Pipeline, a new task that translates NL data preparation instructions into DP pipelines. Furthermore, we develop a benchmark named PARROT to support systematic evaluation. To simulate realistic DP scenarios, we mined transformation patterns from production pipelines and instantiated them on 23,009 real-world tables collected from six public sources. The resulting benchmark comprises ~18,000 pipelines covering 16 core DP operators. We evaluated cutting-edge large language models on PARROT and observed that they only solved 72.86% of the cases, revealing notable limitations in instruction understanding and multi-step reasoning. To address this, we propose Pipeline-Agent, a stronger baseline that iteratively predicts and executes operations with intermediate table feedback, achieving the best performance of 76.17%. Despite this improvement, there remains substantial room for progress on Text-to-Pipeline. Our data, codes, and evaluation tools are available at this https URL.
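下面用pandas给出Pipeline-Agent式循环的一个极简草稿:每步预测一个操作并立即执行,再把中间表的预览回传给预测器;predict_next_op 是LLM调用的占位函数,操作集合为示意性假设。

```python
# Sketch of an iterative predict-execute loop with intermediate-table feedback.
# `predict_next_op` stands in for an LLM; the operator set is illustrative.
import pandas as pd

OPS = {
    "dropna": lambda df, _: df.dropna(),
    "rename": lambda df, a: df.rename(columns=a),
    "filter": lambda df, a: df.query(a["expr"]),
}

def run_pipeline(df: pd.DataFrame, predict_next_op) -> pd.DataFrame:
    while True:
        op = predict_next_op(df.head().to_string())   # table preview as feedback
        if op is None:                                # agent decides it is done
            return df
        name, args = op
        df = OPS[name](df, args)

plan = iter([("dropna", None), ("filter", {"expr": "age >= 18"}), None])
df = pd.DataFrame({"age": [17.0, 25.0, None], "name": ["a", "b", "c"]})
print(run_pipeline(df, lambda preview: next(plan)))   # keeps only the adult row
```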

[NLP-183] InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

【速读】: 该论文试图解决现有基准在评估自主性检索增强生成(Agentic RAG)系统时存在的不足,这些问题包括静态检索环境、有限语料库、简单查询以及依赖预定义文档集的评估协议,无法适应真实世界动态网络环境下的开放性和复杂性。解决方案的关键在于提出InfoDeepSeek基准,通过系统化的方法构建符合确定性、难度和多样性标准的挑战性问题,并开发首个针对动态自主信息检索的评估框架,包含对信息检索结果准确性、实用性和紧凑性的细粒度指标。

链接: https://arxiv.org/abs/2505.15872
作者: Yunjia Xi,Jianghao Lin,Menghui Zhu,Yongzhao Xiao,Zhuoying Ou,Jiaqi Liu,Tong Wan,Bo Chen,Weiwen Liu,Yasheng Wang,Ruiming Tang,Weinan Zhang,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.

[NLP-184] UltraEdit: Training-, Subject- and Memory-Free Lifelong Editing in Large Language Models

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在持续学习(Lifelong Learning)过程中面临的高效、广泛更新同时保持已有能力并确保可靠部署的问题。其解决方案的关键在于提出ULTRAEDIT,这是一种无需训练、无需特定主题且无记忆依赖的编辑方法,通过仅使用轻量级线性代数运算计算参数变化,实现了快速且一致的参数修改,具有极低的计算开销。此外,ULTRAEDIT采用持续归一化策略,以适应分布变化并保持长期一致性,从而在实际应用中具备超大规模的可扩展性。

链接: https://arxiv.org/abs/2505.14679
作者: Xiaojie Gu,Guangxu Chen,Jungang Li,Jia-Chen Gu,Xuming Hu,Kai Zhang
机构: The Hong Kong University of Science and Technology (Guangzhou)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT-a fundamentally new editing solution that is training-, subject- and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method-which was also the fastest known approach-while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH-the largest dataset in the field to date, with over 2M editing pairs-and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: this https URL.
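下面是"仅用轻量线性代数计算参数变化 + 跨轮次持续归一化"这一思路的概念性示意:rank_one_edit 采用经典的最小二乘秩一更新,RunningNorm 用Welford算法维护特征统计;两者均为示意,ULTRAEDIT 的实际公式可能不同。

```python
# Conceptual sketch: closed-form rank-one edit plus lifelong feature
# normalization. ULTRAEDIT's actual equations may differ.
import numpy as np

def rank_one_edit(W: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Shift W minimally (in Frobenius norm) so that W_new @ k == v."""
    return W + np.outer(v - W @ k, k) / (k @ k)

class RunningNorm:
    """Track feature mean/variance across edit turns (Welford's algorithm)."""
    def __init__(self, dim: int):
        self.n, self.mean, self.m2 = 0, np.zeros(dim), np.zeros(dim)
    def update(self, x: np.ndarray):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def normalize(self, x: np.ndarray) -> np.ndarray:
        return (x - self.mean) / np.sqrt(self.m2 / max(self.n, 1) + 1e-6)

W, k, v = np.random.randn(8, 4), np.random.randn(4), np.random.randn(8)
W = rank_one_edit(W, k, v)
print(np.allclose(W @ k, v))   # True: the edited key now maps to the new value
```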

[NLP-185] Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning

【速读】: 该论文旨在解决传统基于上下文学习(In-context learning, ICL)方法在选择和排序示范例子时存在的不足,即依赖简单特征衡量例子间的相关性,而无法准确反映例子之间的内在联系。其解决方案的关键在于提出一种基于问题求解逻辑的课程式ICL策略,通过分析问题求解逻辑来选择示范例子,并依据求解步骤的数量评估难度,按由易到难的顺序排列,以提升大型语言模型(Large Language Models, LLMs)的复杂推理能力。

链接: https://arxiv.org/abs/2502.15401
作者: Xuetao Ma,Wenbin Jiang,Hua Huang
机构: Beijing Normal University (北京师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.
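下面给出课程式ICL示例选择与排序的一个极简示意:先按求解逻辑相似度挑选示例,再按求解步骤数由易到难排序;similarity 为论文中微调后的逻辑分析模型的占位函数。

```python
# Sketch: pick demos by problem-solving-logic similarity, then order them
# easy-to-hard by step count. `similarity` stands in for the fine-tuned analyzer.
def build_curriculum_prompt(query_logic, pool, k=3, similarity=None):
    similarity = similarity or (lambda a, b: len(set(a) & set(b)))
    ranked = sorted(pool, key=lambda ex: similarity(query_logic, ex["logic"]),
                    reverse=True)[:k]
    ranked.sort(key=lambda ex: len(ex["logic"]))      # curriculum: easy -> hard
    return "\n\n".join(ex["text"] for ex in ranked)

pool = [
    {"text": "Q1 ... A1", "logic": ["select", "compare"]},
    {"text": "Q2 ... A2", "logic": ["select", "compare", "aggregate"]},
    {"text": "Q3 ... A3", "logic": ["lookup"]},
]
print(build_curriculum_prompt(["select", "compare"], pool, k=2))
```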

[NLP-186] Meta-PerSER: Few-Shot Listener Personalized Speech Emotion Recognition via Meta-learning INTERSPEECH2025

【速读】: 该论文试图解决传统语音情感识别(Speech Emotion Recognition, SER)系统依赖聚合标注而忽视个体差异导致预测不一致的问题。其解决方案的关键在于提出一种新型元学习框架Meta-PerSER,该框架采用模型无关元学习(Model-Agnostic Meta-Learning, MAML)方法,并结合联合集元训练、导数退火以及每层每步的学习率策略,从而在仅有少量标注样本的情况下实现快速适应。通过整合预训练自监督模型的鲁棒表示,该框架首先捕捉通用情感线索,再进一步微调以适配个人标注风格,显著提升了在已见和未见数据场景下的性能。

链接: https://arxiv.org/abs/2505.16220
作者: Liang-Yeh Shen,Shi-Xin Fang,Yi-Cheng Lin,Huang-Cheng Chou,Hung-yi Lee
机构: National Taiwan University (国立台湾大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted by INTERSPEECH 2025. 7 pages, including 2 pages of appendix

点击查看摘要

Abstract:This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener’s unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, which often overlook individual subtleties and lead to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
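下面给出"带每层每步学习率的MAML内循环"的玩具级示意(PyTorch):以两层线性网络代替真实的SER骨干,维度与初始学习率均为假设,完整的元训练外循环从略。

```python
# Toy MAML-style inner loop with learnable per-layer, per-step learning rates.
# Model size and initial LRs are assumptions; the meta (outer) loop is omitted.
import torch

def forward(params, x):                 # tiny 2-layer stand-in for the SER model
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

params = [torch.randn(16, 32, requires_grad=True), torch.zeros(32, requires_grad=True),
          torch.randn(32, 4, requires_grad=True), torch.zeros(4, requires_grad=True)]
n_steps = 3
inner_lrs = torch.full((n_steps, len(params)), 0.01, requires_grad=True)

def adapt(support_x, support_y):
    fast = params
    for step in range(n_steps):
        loss = torch.nn.functional.cross_entropy(forward(fast, support_x), support_y)
        grads = torch.autograd.grad(loss, fast, create_graph=True)
        fast = [p - inner_lrs[step, i] * g          # per-layer, per-step LR
                for i, (p, g) in enumerate(zip(fast, grads))]
    return fast

x, y = torch.randn(5, 16), torch.randint(0, 4, (5,))
adapted = adapt(x, y)   # listener-personalized parameters after 3 inner steps
```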

[NLP-187] Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

【速读】: 该论文试图解决当前大型音频语言模型(Large Audio-Language Models, LALMs)评估体系碎片化、缺乏系统分类的问题。其解决方案的关键在于提出一个基于四个维度的系统性评估分类框架,即:(1) 通用听觉感知与处理、(2) 知识与推理、(3) 对话能力、(4) 公平性、安全性与可信度,通过这一结构化框架对LALMs进行全面调研与分析,并为该领域的发展提供清晰的指导方向。

链接: https://arxiv.org/abs/2505.15957
作者: Chih-Kai Yang,Neo S. Ho,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Project Website: this https URL

点击查看摘要

Abstract:With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs’ performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

计算机视觉

[CV-0] ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark UAI

【速读】:该论文试图解决当前大型多模态模型(Large Multimodal Models, LMMs)在非英语语言中的推理能力评估不足的问题,特别是针对阿拉伯语这类具有丰富语言和文化背景的语言。解决方案的关键在于引入首个专注于阿拉伯语的多模态推理基准(Comprehensive Arabic Multimodal Reasoning Benchmark, ARB),该基准覆盖11个不同领域,并包含1,356个多模态样本及5,119条人工标注的推理步骤与对应操作,旨在系统评估模型在文本和视觉模态上的逐步推理能力。

链接: https://arxiv.org/abs/2505.17021
作者: Sara Ghaboura,Ketan More,Wafa Alghallabi,Omkar Thawakar,Jorma Laaksonen,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer
机构: Mohamed bin Zayed University of AI (穆罕默德·本·扎耶德人工智能大学); Australian National University (澳大利亚国立大学); Aalto University (阿尔托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Github : this https URL , Huggingface: this https URL

点击查看摘要

Abstract:As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: this https URL

[CV-1] CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在处理长视频序列时因输入复杂度增加而导致的计算成本呈二次增长的问题,特别是在视频标记(token)压缩方面如何在保持性能完整性的同时实现高效压缩。其解决方案的关键在于提出CrossLMM框架,通过双交叉注意力机制将长视频序列与LMM解耦,首先利用池化方法显著减少预训练视觉编码器中的标记数量,随后在语言模型层中引入视觉到视觉的交叉注意力机制,使压缩后的视觉标记作为查询与原始视觉标记集进行交互,从而提升标记利用率并保留细粒度信息;此外,还引入了文本到视觉的交叉注意力机制,增强文本标记对视觉内容的理解。

链接: https://arxiv.org/abs/2505.17020
作者: Shilin Yan,Jiaming Han,Joey Tsai,Hongwei Xue,Rongyao Fang,Lingyi Hong,Ziyu Guo,Ray Zhang
机构: Fudan(复旦大学); CUHK MMLab(香港中文大学多媒体实验室); Tsinghua(清华大学); CUHK MiuLar Lab(香港中文大学MiuLar实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically increasing computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.
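下面给出"以池化后的少量视觉token作为query、对原始token做交叉注意力"这一核心思想的示意模块(PyTorch);维度、头数与池化规模均为示意性设置,并非论文原始结构。

```python
# Sketch of the pooled-query idea: a few pooled tokens attend back over the
# full visual token set. All dimensions here are illustrative.
import torch
import torch.nn as nn

class PooledCrossAttention(nn.Module):
    def __init__(self, dim=512, n_queries=16, heads=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(n_queries)   # N tokens -> n_queries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens):                 # (B, N, D)
        queries = self.pool(visual_tokens.transpose(1, 2)).transpose(1, 2)
        out, _ = self.attn(queries, visual_tokens, visual_tokens)
        return out                                    # (B, n_queries, D)

x = torch.randn(2, 1024, 512)              # e.g., many frames' patch tokens
print(PooledCrossAttention()(x).shape)     # torch.Size([2, 16, 512])
```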

[CV-2] Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework

【速读】:该论文试图解决图像隐喻理解中的上下文缺失问题,即现有多模态大语言模型(MLLMs)在处理图像隐含意义时难以准确捕捉视觉元素之间的关系及其抽象含义。解决方案的关键在于提出了一种名为Let Androids Dream (LAD)的三阶段框架,通过感知、搜索和推理三个步骤,将视觉信息转化为多层次文本表示,迭代地整合跨领域知识以消除歧义,并通过显式推理生成上下文对齐的图像隐含意义,从而有效弥补上下文缺口。

链接: https://arxiv.org/abs/2505.17019
作者: Chenhao Zhang,Yazhe Niu
机构: Shanghai AI Laboratory (上海人工智能实验室); Huazhong University of Science and Technology (华中科技大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 16 pages, 9 figures. Code & Dataset: this https URL

点击查看摘要

Abstract:Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answer (VQA) tasks, they struggle with a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich and multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implication via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a huge improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.

[CV-3] SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在基于规则的强化学习(Rule-based Reinforcement Learning, RL)中缺乏对推理过程监督的问题,这可能导致模型学习到次优的推理策略,从而影响其泛化能力。解决方案的关键在于提出SophiaVL-R1,通过训练一个评估整个思考过程质量的思考奖励模型,并引入Trust-GRPO方法,该方法在训练过程中为思考奖励分配可信度权重,以减轻潜在不可靠奖励的影响;此外,还设计了退火训练策略,逐步降低思考奖励的权重,使模型在后期训练阶段更依赖准确的规则结果奖励。

链接: https://arxiv.org/abs/2505.17018
作者: Kaixuan Fan,Kaituo Feng,Haoming Lyu,Dongzhan Zhou,Xiangyu Yue
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at this https URL.
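下面给出"可信度加权的思考奖励 + 随训练退火"的一个示意性实现:trust 用正确/错误回答的思考奖励均值差近似,思考奖励权重线性退火;具体公式均为笔者假设,并非Trust-GRPO原文。

```python
# Illustrative combination of outcome reward and a trust-weighted, annealed
# thinking reward. The trust and annealing formulas are assumptions.
import numpy as np

def trust_weight(think_r_correct, think_r_wrong) -> float:
    """Higher trust when thinking rewards separate correct from wrong answers."""
    gap = float(np.mean(think_r_correct) - np.mean(think_r_wrong))
    return min(max(gap, 0.0), 1.0)

def total_reward(outcome_r, thinking_r, trust, step, total_steps, w0=0.5):
    w = w0 * (1 - step / total_steps)     # linearly anneal the thinking term
    return outcome_r + w * trust * thinking_r

t = trust_weight([0.9, 0.8], [0.4, 0.3])
print(total_reward(1.0, 0.8, t, step=100, total_steps=1000))
```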

[CV-4] Interactive Post-Training for Vision-Language-Action Models

【速读】:该论文试图解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在低数据环境下适应新任务和环境的能力受限的问题,这些问题主要源于现有训练流程高度依赖离线专家演示数据和监督模仿学习。解决方案的关键在于提出RIPT-VLA,这是一种基于强化学习的交互式后训练范式,通过动态滚动采样和留一法优势估计实现稳定策略优化,仅使用稀疏二进制成功奖励即可对预训练VLA模型进行微调,从而显著提升模型的适应性和效率。

链接: https://arxiv.org/abs/2505.17016
作者: Shuhan Tan,Kairan Dou,Yue Zhao,Philipp Krähenbühl
机构: UT Austin(德州大学奥斯汀分校); Nankai University(南开大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one demonstration, RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models through minimal supervision.
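摘要中提到的留一法优势估计在稀疏二进制成功奖励下可以写得非常简单:A_i = r_i 减去其余 K-1 条rollout奖励的均值。下面是一个极简实现:

```python
# Leave-one-out advantage over K rollouts with binary success rewards.
import numpy as np

def leave_one_out_advantage(rewards: np.ndarray) -> np.ndarray:
    """A_i = r_i - mean of the other K-1 rewards."""
    k = len(rewards)
    return rewards - (rewards.sum() - rewards) / (k - 1)

rollouts = np.array([1.0, 0.0, 0.0, 1.0, 1.0])   # sparse binary task success
print(leave_one_out_advantage(rollouts))
```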

[CV-5] When Are Concepts Erased From Diffusion Models?

【速读】:该论文试图解决扩散模型中概念擦除(concept erasure)的彻底性问题,即如何有效防止模型生成特定概念。其解决方案的关键在于提出两种概念擦除机制的理论模型:一是降低生成目标概念的可能性,二是干扰模型内部的引导机制。同时,论文引入了一套全面的评估框架,包括对抗攻击、新型探测技术以及对替代生成内容的分析,以更准确地判断目标概念是否被真正擦除。

链接: https://arxiv.org/abs/2505.17013
作者: Kevin Lu,Nicky Kriplani,Rohit Gandikota,Minh Pham,David Bau,Chinmay Hegde,Niv Cohen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model’s internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model’s alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.

[CV-6] SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在三维空间感知与理解能力方面的不足问题。其关键解决方案是提出SpatialScore,这是一个迄今为止最全面且多样化的多模态空间理解基准,整合了VGBench及其他11个现有数据集的相关数据,涵盖28K个样本,并包含一个精心设计的挑战性子集SpatialScore-Hard,用于评估和推动MLLMs在空间推理任务中的性能提升。

链接: https://arxiv.org/abs/2505.17012
作者: Haoning Wu,Xiao Huang,Yaohui Chen,Ya Zhang,Yanfeng Wang,Weidi Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report; Project Page: this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.

[CV-7] Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

【速读】:该论文试图解决视频生成与重建中如何在有限的token预算下实现高效且内容感知的token分配问题。解决方案的关键在于提出AdapTok,一种自适应时间因果视频分词器,其核心包括块级掩码策略和块因果评分器,能够在训练阶段随机丢弃每个块的尾部token,并通过评分器预测不同token数量下的视频帧重建质量;在推理阶段,基于整数线性规划的自适应token分配策略根据预测得分调整token使用,从而实现在可控总预算下的样本级、内容感知且时间动态的token分配。

链接: https://arxiv.org/abs/2505.17011
作者: Yan Li,Changyao Tian,Renqiu Xia,Ning Liao,Weiwei Guo,Junchi Yan,Hongsheng Li,Jifeng Dai,Hao Li,Xue Yang
机构: Shanghai Jiao Tong University (上海交通大学); MMLab, The Chinese University of Hong Kong (多媒体实验室,香港中文大学); OpenGVLab, Shanghai AI Laboratory (开放视觉实验室,上海人工智能实验室); Tongji University (同济大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
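下面给出"在总token预算下按预测质量分数分配各块token数"的一个贪心示意;论文用整数线性规划求解,此处以边际收益贪心作为接口相同的简化替代,分数矩阵为随机生成的示例。

```python
# Budgeted token allocation from block-wise quality scores. The paper uses
# integer linear programming; greedy marginal gain is a simpler stand-in.
import numpy as np

def allocate_tokens(scores: np.ndarray, budget: int) -> np.ndarray:
    """scores[b, t] = predicted quality of block b when given t+1 tokens."""
    n_blocks, max_tokens = scores.shape
    alloc = np.ones(n_blocks, dtype=int)          # at least one token per block
    for _ in range(budget - n_blocks):
        gains = [scores[b, alloc[b]] - scores[b, alloc[b] - 1]
                 if alloc[b] < max_tokens else -np.inf
                 for b in range(n_blocks)]
        alloc[int(np.argmax(gains))] += 1
    return alloc

scores = np.sort(np.random.rand(4, 8), axis=1)    # quality rises with token count
print(allocate_tokens(scores, budget=16))         # content-aware split of 16 tokens
```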

[CV-8] Deep mineralogical segmentation of thin section images based on QEMSCAN maps

【速读】:该论文旨在解决碳酸盐岩薄片图像中矿物学特征自动识别的问题,传统的人工分析方法存在主观性强和耗时的缺点,而现有的技术如QEMSCAN®虽然能实现自动化,但成本高昂且分析过程繁琐。其解决方案的关键在于提出一种基于卷积神经网络(Convolutional Neural Network)的模型,利用U-Net语义分割架构,通过平面和交叉偏振光薄片图像以及对应的QEMSCAN地图进行训练,实现对碳酸盐岩薄片图像的自动矿物学分割,从而在低成本、通用性和效率方面取得突破。

链接: https://arxiv.org/abs/2505.17008
作者: Jean Pablo Vieira de Mello,Matheus Augusto Alves Cuglieri,Leandro P. de Figueiredo,Fernando Bordignon,Marcelo Ramalho Albuquerque,Rodrigo Surmas,Bruno Cavalcanti de Paula
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoirs evaluation. However, human analysis tends to be subjective and laborious. Technologies like QEMSCAN® are designed to automate the mineralogical mapping process, but also suffer from limitations like high monetary costs and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low-cost, generalized and efficient manner. For this, the U-Net semantic segmentation architecture is trained on plane and cross polarized thin section images using the corresponding QEMSCAN maps as target, which is an approach not widely explored. The model was instructed to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, Pores and the remaining mineral phases as a unique class named “Others”, while it was validated on rock facies both seen and unseen during training, in order to address its generalization capability. Since the images and maps are provided in different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation is very much dependent on these resolution differences and on the variety of learnable rock textures. However, it shows promising results, especially with regard to the proper delineation of minerals boundaries on solid textures and precise estimation of the minerals distributions, describing a nearly linear relationship between expected and predicted distributions, with coefficient of determination (R^2) superior to 0.97 for seen facies and 0.88 for unseen.

[CV-9] CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

【速读】:该论文旨在解决从互联网视频中学习具有信息量的连续运动表示的问题,现有离散潜在动作方法存在信息丢失以及难以处理复杂和细粒度动态的缺陷。其解决方案的关键在于提出CoMo框架,该框架通过早期时间特征差异机制防止模型崩溃并抑制静态外观噪声,同时基于信息瓶颈原理约束潜在运动嵌入的维度,以在保留足够动作相关信息与最小化无关外观噪声之间取得更好的平衡。

链接: https://arxiv.org/abs/2505.17006
作者: Jiange Yang,Yansong Shi,Haoyi Zhu,Mingyu Liu,Kaijing Ma,Yating Wang,Gangshan Wu,Tong He,Limin Wang
机构: Nanjing University (南京大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of Science and Technology of China (中国科学技术大学); Zhejiang University (浙江大学); Xi’an Jiaotong University (西安交通大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 18 pages, 7 figures

点击查看摘要

Abstract:Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we also introduce two new metrics for more robustly and affordably evaluating motion and guiding motion learning methods development: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.

[CV-10] PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association INTERSPEECH2025

【速读】:该论文试图解决人脸与语音关联学习中的嵌入空间对齐与融合问题(face-voice association),现有方法存在依赖远距离间隔参数以及需要精心设计的负样本挖掘过程的问题。解决方案的关键在于学习一个联合嵌入空间,并在其中施加正交性约束以对齐人脸和语音的嵌入特征,随后通过增强的门控融合策略进行融合,从而提升关联性能。

链接: https://arxiv.org/abs/2505.17002
作者: Abdul Hannan,Muhammad Arslan Manzoor,Shah Nawaz,Muhammad Irzam Liaqat,Markus Schedl,Mubashir Noman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at InterSpeech 2025

点击查看摘要

Abstract:We study the task of learning association between faces and voices, which is gaining interest in the multimodal community lately. Existing methods suffer from the deliberate crafting of negative mining procedures as well as the reliance on the distant margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, embedding spaces of faces and voices possess different characteristics and require spaces to be aligned before fusing them. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
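下面给出"先对齐、再门控融合、并对融合嵌入施加正交性约束"这一思路的示意模块(PyTorch):语音嵌入先线性对齐到人脸空间,再做门控加权融合,并用低相似度惩罚推开不同身份的融合嵌入;层结构与损失形式均为示意性假设。

```python
# Sketch of aligned, gated face-voice fusion with an orthogonality-style
# penalty between different identities. Layer sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.align = nn.Linear(dim, dim)          # align voice space to face space
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, face, voice):
        voice = self.align(voice)
        g = torch.sigmoid(self.gate(torch.cat([face, voice], dim=-1)))
        return g * face + (1 - g) * voice

def orthogonality_loss(z, labels):
    z = F.normalize(z, dim=-1)
    sim = z @ z.t()
    diff = labels[:, None] != labels[None, :]     # pairs from different identities
    return (sim[diff] ** 2).mean()

fuse = GatedFusion()
z = fuse(torch.randn(8, 256), torch.randn(8, 256))
print(orthogonality_loss(z, torch.randint(0, 4, (8,))))
```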

[CV-11] Seeing through Satellite Images at Street Views ICCV2023

【速读】:该论文试图解决卫星图像到逼真街景全景图像合成的问题(SatStreet-view synthesis),即根据任意卫星图像和指定的相机位置或轨迹生成高保真的街景全景图像和视频。解决方案的关键在于通过学习卫星视图与街景视图配对图像中的神经辐射场,克服了由于视图稀疏性和卫星图像与街景图像之间极端视角变化带来的挑战。研究提出了一种新的方法Sat2Density++,其核心是通过建模街景视图特有的元素(如天空和光照效果)来实现逼真街景全景图像的渲染。

链接: https://arxiv.org/abs/2505.17001
作者: Ming Qian,Bin Tan,Qiuyu Wang,Xianwei Zheng,Hanjiang Xiong,Gui-Song Xia,Yujun Shen,Nan Xue
机构: Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL , journal extension of ICCV 2023 conference paper ‘Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs’, submitted to TPAMI

点击查看摘要

Abstract:This paper studies the task of SatStreet-view synthesis, which aims to render photorealistic street-view panorama images and videos given any satellite image and specified camera positions or trajectories. We formulate the task as learning a neural radiance field from paired images captured from satellite and street viewpoints, which comes to be a challenging learning problem due to the sparse-view nature and the extremely large viewpoint changes between satellite and street-view images. We tackle the challenges based on a task-specific observation that street-view specific elements, including the sky and illumination effects, are only visible in street-view panoramas, and present a novel approach Sat2Density++ to accomplish the goal of photorealistic street-view panorama rendering by modeling these street-view specific elements in neural networks. In the experiments, our method is evaluated on both urban and suburban scene datasets, demonstrating that Sat2Density++ is capable of rendering photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.

[CV-12] Native Segmentation Vision Transformers

【速读】:该论文试图解决传统视觉主干网络中通过均匀下采样降低空间分辨率所导致的语义信息丢失问题,进而提升分割性能。其解决方案的关键在于提出一种基于内容感知的空间分组层(content-aware spatial grouping layer),该层能够根据图像边界和语义内容动态地将标记分配到一个精简的集合中,从而在特征提取过程中自然地生成层次化分割结果,无需额外的分割专用头部。

链接: https://arxiv.org/abs/2505.16993
作者: Guillem Brasó,Aljoša Ošep,Laura Leal-Taixé
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer, that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages results in hierarchical segmentation that arises natively in the feature extraction process, resulting in our coined Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.

[CV-13] An Effective Training Framework for Light-Weight Automatic Speech Recognition Models INTERSPEECH2025

【速读】:该论文旨在解决在资源受限设备上部署大型自动语音识别(ASR)模型的问题,这些问题通常由于计算和内存限制而难以实现。现有方法如剪枝、知识蒸馏和层跳过等虽然能够将大模型转换为小模型,但往往导致性能显著下降或需要长时间训练小模型以获得更好效果。该论文提出的解决方案的关键在于一种基于两步表示学习的方法,能够从一个大型模型中生成多个小型模型,并在有限的训练轮次内实现显著更好的性能,从而有效提升了训练速度并降低了词错误率。

链接: https://arxiv.org/abs/2505.16991
作者: Abdul Hannan,Alessio Brutti,Shah Nawaz,Mubashir Noman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at InterSpeech 2025

点击查看摘要

Abstract:Recent advancements in deep learning have encouraged developing large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform the large models into smaller ones at the cost of significant performance degradation or require prolonged training of smaller models for better performance. To address these issues, we introduce an efficacious two-step representation learning based approach capable of producing several small-sized models from a single large model, ensuring considerably better performance in a limited number of epochs. Comprehensive experimentation on ASR benchmarks reveals the efficacy of our approach, achieving a three-fold training speed-up and up to 12.54% word error rate improvement.

[CV-14] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

【速读】:该论文试图解决纯离散扩散方法在训练过程中导致的显著训练不稳定、性能欠优以及严重长度偏差等问题。解决方案的关键在于设计了一种新的训练范式,将初始的自回归阶段与后续的扩散阶段相结合,从而提升了模型性能。通过这种范式,作者提出了Dimple-7B模型,并在相同数据集和训练流程下实现了比LLaVA-NEXT更高的性能提升。此外,为提高推理效率,还引入了自信解码策略,动态调整每步生成的标记数量,显著减少了生成迭代次数。

链接: https://arxiv.org/abs/2505.16990
作者: Runpeng Yu,Xinyin Ma,Xinchao Wang
机构: National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple is as low as one third of the response length. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple’s capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.
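下面给出自信解码(confident decoding)的一个玩具级示意:每步把置信度超过阈值 tau 的全部掩码位置一次性确定,使每步生成的token数随模型置信度动态变化;阈值与"至少提交一个"的回退策略均为笔者假设。

```python
# Toy sketch of confident decoding: commit every masked position whose
# top-token probability clears a threshold. Threshold/fallback are assumptions.
import torch

def confident_decode(logits_fn, length, mask_id=-1, tau=0.9, max_iters=64):
    tokens = torch.full((length,), mask_id)
    for _ in range(max_iters):
        probs = torch.softmax(logits_fn(tokens), dim=-1)   # (length, vocab)
        conf, pred = probs.max(dim=-1)
        masked = tokens == mask_id
        commit = masked & (conf >= tau)
        if not commit.any():                               # always commit one token
            commit[(conf * masked).argmax()] = True
        tokens[commit] = pred[commit]
        if not (tokens == mask_id).any():
            break
    return tokens

toy_model = lambda t: torch.randn(len(t), 50)              # stand-in for the DMLLM
print(confident_decode(toy_model, length=8, tau=0.5))
```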

[CV-15] Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

【速读】:该论文旨在解决在安全关键型应用中,如自动驾驶和机器人辅助手术,如何有效进行分布外(Out-of-distribution, OOD)检测与分割的问题。现有研究主要针对单模态图像数据,而实际应用中数据具有多模态特性,需要融合多种模态以提升OOD检测效果。其关键挑战在于未知数据缺乏监督信号,导致模型对OOD样本产生过度自信的预测。为此,作者提出了一种名为Feature Mixing的简单且快速的多模态异常合成方法,该方法具有理论支持,并可通过进一步优化帮助模型更好地区分分布内(In-distribution, ID)与OOD数据,且具有模态无关性,适用于多种模态组合。

链接: https://arxiv.org/abs/2505.16985
作者: Moru Liu,Hao Dong,Jessica Kelly,Olga Fink,Mario Trapp
机构: Technical University of Munich (慕尼黑工业大学); ETH Zürich (苏黎世联邦理工学院); Fraunhofer IKS (弗劳恩霍夫信息知识系统研究所); EPFL (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a 10× to 370× speedup. Our source code and dataset will be available at this https URL.
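下面给出Feature Mixing式多模态异常合成的一个极简示意:对来自不同模态、打乱配对的特征做凸组合,得到伪OOD特征作为额外训练信号;Beta分布参数与具体混合方式为示意性假设。

```python
# Minimal sketch of cross-modal feature mixing for pseudo-OOD synthesis.
# Beta parameters and the exact mixing rule are assumptions.
import torch

def mix_outliers(feat_a: torch.Tensor, feat_b: torch.Tensor, alpha: float = 0.4):
    """feat_a/feat_b: (B, D) features from two modalities (e.g., RGB and LiDAR)."""
    perm = torch.randperm(feat_a.size(0))                 # mismatch the pairing
    lam = torch.distributions.Beta(alpha, alpha).sample((feat_a.size(0), 1))
    return lam * feat_a + (1 - lam) * feat_b[perm]

rgb, lidar = torch.randn(16, 128), torch.randn(16, 128)
pseudo_ood = mix_outliers(rgb, lidar)                     # (16, 128) outlier features
print(pseudo_ood.shape)
```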

[CV-16] Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction CVPR2025

【速读】:该论文旨在解决视频虚拟试衣(video virtual try-on)中的时空姿态交互问题,即在保持服装视觉真实性的前提下,使服装能够动态适应主体的姿态和体型变化,同时避免视频序列中的时间不一致性。解决方案的关键在于提出一种名为动态姿态交互扩散模型(Dynamic Pose Interaction Diffusion Models, DPIDM)的框架,其核心是通过基于骨架的姿态适配器将同步的人体与服装姿态整合到去噪网络中,并设计分层注意力模块以建模帧内人体与服装的姿态交互以及跨帧的长期姿态动态,同时引入时序正则化注意力损失来增强时间一致性。

链接: https://arxiv.org/abs/2505.16980
作者: Dong Li,Wenqi Zhong,Wei Yu,Yingwei Pan,Dingwen Zhang,Ting Yao,Junwei Han,Tao Mei
机构: HiDream.ai Inc. (HiDream.ai 公司); Northwest Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: CVPR 2025

点击查看摘要

Abstract:Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves VFID score of 0.506 on VVT dataset, leading to 60.5% improvement over the state-of-the-art GPD-VVTO approach.

[CV-17] Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On ICLR2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 在虚拟试衣(Virtual Try-On, VTON)任务中难以保持给定衣物的形状和细节的问题,这一挑战源于扩散模型固有的随机性。解决方案的关键在于利用视觉对应关系作为先验信息,以控制扩散过程,而非简单地将整件衣物作为外观参考输入到 UNet 中。具体而言,作者将细粒度的外观和纹理细节解释为一组结构化的语义点,并通过局部流变形将这些点与目标人体上的语义点进行匹配,进而将二维点扩展为包含目标人体深度/法线图的3D感知线索,以此作为语义点匹配来监督扩散模型的训练,并设计了一种聚焦于点的扩散损失以充分利用语义点匹配的优势。

链接: https://arxiv.org/abs/2505.16977
作者: Siqi Wan,Jingwen Chen,Yingwei Pan,Ting Yao,Tao Mei
机构: HiDream.ai (HiDream.ai)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: ICLR 2025. Code is publicly available at: this https URL

点击查看摘要

Abstract:Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for the VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal maps of the target person. The correspondence mimics the way of putting clothing on the human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: this https URL.

[CV-18] Creatively Upscaling Images with Global-Regional Priors

【速读】:该论文旨在解决高分辨率图像生成中同时保持全局语义结构和生成创造性局部细节的问题。现有方法在提升图像分辨率时难以兼顾全局一致性与局部创新性,导致生成结果可能出现语义不连贯或局部重复的问题。解决方案的关键在于提出C-Upscale,其核心是利用给定的全局提示和通过多模态大语言模型(Multimodal LLM)估计的局部提示,构建全局-局部先验。通过识别低分辨率图像的低频成分作为全局结构先验,并结合局部注意力控制机制与丰富的局部描述性提示,有效提升了高分辨率图像的视觉保真度和局部细节的创造性。

链接: https://arxiv.org/abs/2505.16976
作者: Yurui Qian,Qi Cai,Yingwei Pan,Ting Yao,Tao Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: International Journal of Computer Vision (IJCV) 2025

点击查看摘要

Abstract:Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.
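下面给出"提取低分辨率图像的低频成分作为全局结构先验"的一个简单示意(FFT低通滤波);截止半径 radius 为示意性超参数,并非论文设定。

```python
# Sketch: low-pass the low-resolution image in the frequency domain to get a
# global structure prior. The cutoff radius is an illustrative hyperparameter.
import numpy as np

def low_frequency_prior(img: np.ndarray, radius: int = 16) -> np.ndarray:
    """img: (H, W) grayscale array; returns its low-frequency component."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

img = np.random.rand(128, 128)
print(low_frequency_prior(img).shape)   # (128, 128) smooth global-structure map
```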

[CV-19] OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning

[Quick Read]: This paper tackles a key problem in Open-Vocabulary Segmentation (OVS): in open-world settings, models struggle to distinguish similar categories, largely because existing methods lack an explicit reasoning process and contextual understanding, which hurts both the interpretability and the accuracy of the segmentation results. The key to the solution is OpenSeg-R, a step-by-step visual reasoning framework that uses Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation, generating both generic and image-specific reasoning steps and composing structured triplets that explain the visual evidence for each object, from which detailed description prompts are built to improve segmentation accuracy and interpretability.

Link: https://arxiv.org/abs/2505.16974
Authors: Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; Tianjin University; Nanjing University; Aalto University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS model to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse-to-fine manner. Based on these reasoning steps, we can compose detailed description prompts, and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.

[CV-20] UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation CVPR2025

[Quick Read]: This paper addresses inverse simulation when material properties are unknown: inferring an object's physical properties from its observed motion and then re-simulating it under new scenarios. The key to the solution is UniPhy, a shared latent-conditioned neural constitutive model whose joint training across materials improves both the robustness and the accuracy of the estimates; at inference, the scene-specific latent is optimized through differentiable simulation to match the observations, so material properties can be inferred without any user-specified material type information.

Link: https://arxiv.org/abs/2505.16971
Authors: Himangi Mittal, Peiye Zhuang, Hsin-Ying Lee, Shubham Tulsiani
Affiliations: Carnegie Mellon University; Snap Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. At inference UniPhy allows 'inverse simulation' i.e. inferring material properties by optimizing the scene-specific latent to match the available observations via differentiable simulation. In contrast to existing methods that treat such inference as system identification, UniPhy does not rely on user-specified material type information. Compared to prior neural constitutive modeling approaches which learn instance specific networks, the shared training across materials improves both, robustness and accuracy of the estimates. We train UniPhy using simulated trajectories across diverse geometries and materials – elastic, plasticine, sand, and fluids (Newtonian & non-Newtonian). At inference, given an object with unknown material properties, UniPhy can infer the material properties via latent optimization to match the motion observations, and can then allow re-simulating the object under diverse scenarios. We compare UniPhy against prior inverse simulation methods, and show that the inference from UniPhy enables more accurate replay and re-simulation under novel conditions.
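The latent-optimization idea behind this kind of inverse simulation can be sketched with a toy differentiable simulator. The PyTorch snippet below decodes a latent into one material parameter (spring stiffness), rolls out a damped spring, and fits only the scene-specific latent to an observed trajectory; the decoder, the simulator, and all sizes are illustrative stand-ins for UniPhy's actual neural constitutive model and simulation.

```python
import torch

torch.manual_seed(0)

decoder = torch.nn.Linear(8, 1)          # latent -> raw material parameter
z = torch.zeros(8, requires_grad=True)   # scene-specific latent

def simulate(stiffness, steps=100, dt=0.01):
    """Differentiable damped spring: x'' = -k * x - 0.1 * x'."""
    x, v = torch.tensor(1.0), torch.tensor(0.0)
    trajectory = []
    for _ in range(steps):
        a = -stiffness * x - 0.1 * v
        v = v + dt * a
        x = x + dt * v
        trajectory.append(x)
    return torch.stack(trajectory)

with torch.no_grad():                     # "observations" from a ground-truth material
    target = simulate(torch.tensor(4.0))

optimizer = torch.optim.Adam([z], lr=0.05)
for _ in range(300):
    k = torch.nn.functional.softplus(decoder(z)).squeeze()
    loss = torch.mean((simulate(k) - target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"recovered stiffness: {torch.nn.functional.softplus(decoder(z)).item():.2f}")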

[CV-21] Efficient Correlation Volume Sampling for Ultra-High-Resolution Optical Flow Estimation

[Quick Read]: This paper targets the quadratic computational and memory complexity caused by local cost sampling from dense all-pairs correlation volumes in existing optical flow methods. To keep memory usage manageable, prior methods typically process downsampled images and therefore lose fine-grained detail. The key to the solution is a more efficient implementation of all-pairs correlation volume sampling that matches the exact mathematical operator defined by RAFT while being up to 90% faster than on-demand cost computation and using up to 95% less memory. This substantially reduces overall runtime, saving up to 50% of end-to-end model inference time in memory-constrained environments.

Link: https://arxiv.org/abs/2505.16942
Authors: Karlis Martins Briedis, Markus Gross, Christopher Schroers
Affiliations: ETH Zürich; DisneyResearch|Studios
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is slower in practice and therefore prior methods typically process images at reduced resolutions, missing fine-grained details. To address this, we propose a more efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 90% while maintaining low memory usage, and performs on par with the default implementation with up to 95% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 50% savings for the total end-to-end model inference in memory-constrained environments. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an additional inference-time modification of the recent SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and efficiency.
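To see where the quadratic memory cost comes from, here is a toy dense baseline in NumPy: the full (H·W)×(H·W) correlation volume is materialized and then sampled in a local window around flow-displaced positions. RAFT actually samples bilinearly over a multi-scale correlation pyramid; this nearest-neighbor, single-scale version is a simplification for illustration, not the paper's optimized implementation.

```python
import numpy as np

def correlation_volume(f1, f2):
    """All-pairs correlation of two (H, W, C) feature maps -> (H*W, H, W)."""
    h, w, c = f1.shape
    return (f1.reshape(-1, c) @ f2.reshape(-1, c).T).reshape(-1, h, w)

def sample_local_costs(corr, flow, radius=1):
    """Gather, per source pixel, the costs in a (2r+1)^2 window around its
    flow-displaced target location (nearest-neighbor lookup for brevity)."""
    hw, h, w = corr.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ty = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int).ravel()
    tx = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int).ravel()
    out = np.empty((hw, (2 * radius + 1) ** 2), dtype=corr.dtype)
    k = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy = np.clip(ty + dy, 0, h - 1)
            xx = np.clip(tx + dx, 0, w - 1)
            out[:, k] = corr[np.arange(hw), yy, xx]
            k += 1
    return out

f1, f2 = np.random.rand(32, 32, 64), np.random.rand(32, 32, 64)
corr = correlation_volume(f1, f2)  # O((H*W)^2) memory: the bottleneck
costs = sample_local_costs(corr, np.zeros((32, 32, 2)))
print(costs.shape)  # (1024, 9)
```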

[CV-22] Backdoor Cleaning without External Guidance in MLLM Fine-tuning

[Quick Read]: This paper addresses backdoor attacks introduced through malicious fine-tuning of Multimodal Large Language Models (MLLMs) in fine-tuning-as-a-service (FTaaS) settings. The key to the solution is the observation, and exploitation, of attention collapse: backdoor triggers systematically disrupt cross-modal processing, causing attention to concentrate abnormally on non-semantic regions. Building on this insight, the authors propose Believe Your Eyes (BYE), a data filtering framework that uses attention entropy patterns as self-supervised signals to identify and filter backdoor samples, requiring no clean supervision, auxiliary labels, or model modifications.

Link: https://arxiv.org/abs/2505.16916
Authors: Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye
Affiliations: Wuhan University; Munich Research Center; Huawei Technologies; Nanyang Technological University
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions–a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE requires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE’s effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.
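A simplified stand-in for BYE's entropy-based filtering, assuming NumPy: attention entropy is scored per sample, and a 1-D two-means split separates a low-entropy (suspicious) cluster. The paper's actual pipeline, with layer profiling, bimodal separation, and unsupervised clustering, is richer than this sketch, and the token counts below are arbitrary.

```python
import numpy as np

def attention_entropy(attn_row):
    """Shannon entropy of one attention distribution; collapsed (spiky)
    attention yields low entropy."""
    p = attn_row / attn_row.sum()
    return -np.sum(p * np.log(p + 1e-12))

def filter_suspicious(entropies, iters=10):
    """1-D 2-means split of per-sample entropy scores; the low-entropy
    cluster is flagged as potentially backdoored."""
    e = np.asarray(entropies, dtype=float)
    lo, hi = e.min(), e.max()
    for _ in range(iters):
        assign = np.abs(e - lo) < np.abs(e - hi)
        lo, hi = e[assign].mean(), e[~assign].mean()
    return assign  # True -> suspicious (low-entropy) samples

# Toy scores: clean samples spread attention; triggered ones collapse it.
rng = np.random.default_rng(0)
clean = [attention_entropy(rng.dirichlet(np.ones(196))) for _ in range(50)]
spiky = [attention_entropy(rng.dirichlet(np.full(196, 0.05))) for _ in range(5)]
flags = filter_suspicious(clean + spiky)
print(flags.sum(), "samples flagged")
```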

[CV-23] DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

[Quick Read]: This paper addresses the significant performance degradation of text-to-image (T2I) models on long, detail-rich prompts. Existing models handle the complex compositional requirements of long inputs in professional scenarios poorly, and there has been no systematic benchmark for evaluating this. The key to the solution is DetailMaster, a comprehensive benchmark specifically designed to evaluate T2I models' ability to handle long textual inputs, covering four critical evaluation dimensions: character attributes, structured character locations, multi-dimensional scene attributes, and explicit spatial/interactive relationships. Validated with high-quality expert annotations, it exposes systemic deficiencies in structural comprehension and detail handling.

Link: https://arxiv.org/abs/2505.16915
Authors: Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li
Affiliations: Sun Yat-Sen University; Alibaba Group; Worcester Polytechnic Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 8 figures, 10 tables

Abstract:While recent text-to-image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, their performance significantly degrades when confronted with long, detail-intensive prompts required in professional applications. We present DetailMaster, the first comprehensive benchmark specifically designed to evaluate T2I models’ systematical abilities to handle extended textual inputs that contain complex compositional requirements. Our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Explicit Spatial/Interactive Relationships. The benchmark comprises long and detail-rich prompts averaging 284.89 tokens, with high quality validated by expert annotators. Evaluation on 7 general-purpose and 5 long-prompt-optimized T2I models reveals critical performance limitations: state-of-the-art models achieve merely ~50% accuracy in key dimensions like attribute binding and spatial reasoning, while all models showing progressive performance degradation as prompt length increases. Our analysis highlights systemic failures in structural comprehension and detail overload handling, motivating future research into architectures with enhanced compositional reasoning. We open-source the dataset, data curation code, and evaluation tools to advance detail-rich T2I generation and enable broad applications that would otherwise be infeasible due to the lack of a dedicated benchmark.

[CV-24] RealEngine: Simulating Autonomous Driving in Realistic Context

[Quick Read]: This paper addresses the fact that existing driving simulators and benchmarks fail to comprehensively meet practical requirements for multi-modal sensing, closed-loop evaluation, traffic scenario diversity, multi-agent cooperation, and computational efficiency. The key to the solution is RealEngine, a framework that integrates 3D scene reconstruction and novel view synthesis to reconstruct the background and foreground traffic participants of driving scenes separately, enabling highly diverse and realistic traffic scenarios, and that leverages multi-modal sensor data for high-fidelity rendering across modalities, providing a reliable and comprehensive benchmark for evaluating driving agents.

Link: https://arxiv.org/abs/2505.16902
Authors: Junzhe Jiang, Nan Song, Jingyu Li, Xiatian Zhu, Li Zhang
Affiliations: Fudan University; University of Surrey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.

[CV-25] Tracking the Flight: Exploring a Computational Framework for Analyzing Escape Responses in Plains Zebra (Equus quagga) CVPR2025

[Quick Read]: This paper addresses the technical challenge of separating animal motion from drone motion in drone footage of animal movement. The key to the solution is applying computer vision techniques, including a bioimaging-based registration technique, a Structure-from-Motion (SfM) pipeline, and a hybrid interpolation method, to enable precise behavioral analysis. After evaluating these approaches, the best-performing method is used to extract individual trajectories and identify key behavioral patterns, offering a practical route to deeper studies of collective animal behavior.

Link: https://arxiv.org/abs/2505.16882
Authors: Isla Duporge, Sofia Minano, Nikoloz Sirmpilatze, Igor Tatarnikov, Scott Wolf, Adam L. Tyson, Daniel Rubenstein
Affiliations: Princeton University; Sainsbury Wellcome Centre, University College London; Gatsby Computational Neuroscience Unit, University College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the CV4Animals workshop at CVPR 2025

Abstract:Ethological research increasingly benefits from the growing affordability and accessibility of drones, which enable the capture of high-resolution footage of animal movement at fine spatial and temporal scales. However, analyzing such footage presents the technical challenge of separating animal movement from drone motion. While non-trivial, computer vision techniques such as image registration and Structure-from-Motion (SfM) offer practical solutions. For conservationists, open-source tools that are user-friendly, require minimal setup, and deliver timely results are especially valuable for efficient data interpretation. This study evaluates three approaches: a bioimaging-based registration technique, an SfM pipeline, and a hybrid interpolation method. We apply these to a recorded escape event involving 44 plains zebras, captured in a single drone video. Using the best-performing method, we extract individual trajectories and identify key behavioral patterns: increased alignment (polarization) during escape, a brief widening of spacing just before stopping, and tighter coordination near the group’s center. These insights highlight the method’s effectiveness and its potential to scale to larger datasets, contributing to broader investigations of collective animal behavior.

[CV-26] T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

[Quick Read]: This paper addresses the forgetting of pretrained knowledge and the degradation of zero-shot compositionality that occur during continual post-training of text-to-image diffusion models. The key to the solution is T2I-ConBench, a unified benchmark for evaluating the continual post-training of text-to-image models, covering two practical scenarios, item customization and domain enhancement, and assessing four dimensions: retention of generality, target-task performance, catastrophic forgetting, and cross-task generalization.

Link: https://arxiv.org/abs/2505.16875
Authors: Zhehao Huang, Yuhang Liu, Yixin Lou, Zhengbao He, Mingzhen He, Wenxing Zhou, Tao Li, Kehan Li, Zeyi Huang, Xiaolin Huang
Affiliations: Shanghai Jiao Tong University; Huawei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint “oracle” training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.

[CV-27] Training-Free Efficient Video Generation via Dynamic Token Carving

[Quick Read]: This paper addresses the practical deployment limits of video Diffusion Transformer (DiT) models imposed by their enormous computational demands, namely the quadratic complexity of self-attention in token length and the multi-step nature of diffusion. The key to the solution is Jenga, an inference pipeline that combines dynamic attention carving with progressive resolution generation, based on two insights: early denoising steps do not need high-resolution latents, and later steps do not need dense attention. Jenga introduces a block-wise attention mechanism based on 3D space-filling curves and a progressive resolution strategy that gradually increases latent resolution during generation, substantially accelerating generation while preserving quality.

Link: https://arxiv.org/abs/2505.16864
Authors: Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Affiliations: CUHK; HKUST; Kuaishou Technology; SmartMore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL, 24 pages

Abstract:Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83× speedup with 0.01% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds – without requiring model retraining. Code: this https URL
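The role of 3D space-filling curves can be illustrated with a Morton (Z-order) key: sorting video tokens by this key groups spatio-temporal neighbors into contiguous blocks that block-wise attention can operate on. This is a generic Z-curve sketch; Jenga's actual curve construction and block selection are more involved.

```python
def part1by2(n: int) -> int:
    """Spread the bits of n so they occupy every third position."""
    n &= 0x3FF                      # supports 10-bit coordinates
    n = (n | (n << 16)) & 0x030000FF
    n = (n | (n << 8)) & 0x0300F00F
    n = (n | (n << 4)) & 0x030C30C3
    n = (n | (n << 2)) & 0x09249249
    return n

def morton3d(t: int, y: int, x: int) -> int:
    """Z-order (Morton) key for a 3D latent token at (t, y, x)."""
    return (part1by2(t) << 2) | (part1by2(y) << 1) | part1by2(x)

# Reorder a tiny 4x4x4 grid of video tokens so that spatio-temporal
# neighbors land in the same contiguous block, then chunk into blocks.
coords = [(t, y, x) for t in range(4) for y in range(4) for x in range(4)]
order = sorted(coords, key=lambda c: morton3d(*c))
block_size = 8
blocks = [order[i:i + block_size] for i in range(0, len(order), block_size)]
print(blocks[0])  # first block: one 2x2x2 spatio-temporal neighborhood
```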

[CV-28] Conditional Panoramic Image Generation via Masked Autoregressive Modeling

[Quick Read]: This paper addresses two key problems in existing panoramic image generation: first, diffusion-based methods are ill-suited to equirectangular projection (ERP) panoramas because the spherical mapping violates the i.i.d. Gaussian noise assumption; second, text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) are usually treated as separate tasks that rely on distinct architectures and task-specific data. The proposed solution is a unified framework, the Panoramic AutoRegressive model (PAR), whose key idea is to use masked autoregressive modeling to sidestep the i.i.d. assumption and to integrate text and image conditioning into a single architecture, enabling seamless generation across tasks.

Link: https://arxiv.org/abs/2505.16862
Authors: Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong
Affiliations: Peking University; Insta360 Research; National University of Singapore; Jilin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
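The circular padding used for seam coherence is simple to illustrate. A minimal NumPy sketch, assuming an (H, W, C) panorama latent and wrapping only along the width (longitude) axis, so the left and right edges of the 360° image see each other:

```python
import numpy as np

def circular_pad_width(latent: np.ndarray, pad: int) -> np.ndarray:
    """Pad an (H, W, C) panorama latent along width by wrapping, so the
    left and right edges see each other and the 360-degree seam stays
    coherent; height is left untouched."""
    left, right = latent[:, -pad:], latent[:, :pad]
    return np.concatenate([left, latent, right], axis=1)

lat = np.random.rand(8, 32, 4)
padded = circular_pad_width(lat, pad=2)
print(padded.shape)                             # (8, 36, 4)
assert np.allclose(padded[:, :2], lat[:, -2:])  # wrapped from the right edge
```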

[CV-29] Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

[Quick Read]: This paper addresses the high computational cost and redundancy of vision-language models (VLMs) that generate a full reasoning trace for every question. The core challenge is letting the model reason only when appropriate rather than producing exhaustive reasoning for all inputs. The key to the solution is TON, a two-stage training strategy: the first stage uses supervised fine-tuning (SFT) with a 'thought dropout' operation to introduce a think-or-not choice, and the second stage uses Group Relative Policy Optimization (GRPO) to let the model freely explore when to reason, reducing reasoning length substantially while maintaining or even improving performance.

Link: https://arxiv.org/abs/2505.16854
Authors: Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
Affiliations: The Chinese University of Hong Kong; Show Lab, National University of Singapore
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective ‘thought dropout’ operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at this https URL.
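Thought dropout from stage (i) can be sketched in a few lines: with some probability, the reasoning trace in an SFT sample is replaced by an empty thought. The <think> tag format, field names, and dropout rate below are hypothetical, not taken from the paper.

```python
import random

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def thought_dropout(sample: dict, p: float = 0.5) -> dict:
    """With probability p, replace the reasoning trace by an empty thought,
    teaching the model that skipping reasoning is a valid option."""
    if random.random() < p:
        sample = dict(sample)
        sample["response"] = f"{THINK_OPEN}\n\n{THINK_CLOSE}\n{sample['answer']}"
    return sample

example = {
    "question": "How many chairs are in the image?",
    "response": f"{THINK_OPEN}I count the chairs one by one...{THINK_CLOSE}\n3",
    "answer": "3",
}
print(thought_dropout(example, p=1.0)["response"])
```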

[CV-30] LaViDa: A Large Diffusion Language Model for Multimodal Understanding

[Quick Read]: This paper addresses the slow inference and limited generation controllability of existing autoregressive (AR) vision-language models (VLMs) in real-world applications. The key to the solution is an architecture built on discrete diffusion models (DMs): a vision encoder is coupled with a diffusion model and the combination is jointly fine-tuned for multimodal instruction following, enabling faster parallel decoding, bidirectional context, and more flexible generation control. The paper also introduces several techniques, such as complementary masking, prefix KV caching, and timestep shifting, to improve training effectiveness and inference efficiency.

Link: https://arxiv.org/abs/2505.16839
Authors: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
Affiliations: UCLA; Panasonic AI Research; Adobe Research; Salesforce Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 8 figures

Abstract:Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs’ potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

[CV-31] Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning

[Quick Read]: This paper addresses the challenges posed by the rapid spread of multimodal misinformation on social media, in particular the limits placed on video misinformation detection research by the lack of large-scale, diverse datasets; existing methods also tend to overfit rigid templates and lack deep reasoning about deceptive content. The key to the solution is FakeVV, a large-scale benchmark of over 100,000 video-text pairs, together with Fact-R1, a framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained in three stages: misinformation long chain-of-thought (CoT) instruction tuning, preference alignment via Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO) with a novel verifiable reward function, exhibiting emergent reasoning behaviors comparable to text-based reinforcement learning systems in the more complex multimodal misinformation setting.

Link: https://arxiv.org/abs/2505.16836
Authors: Fanrui Zhang, Dian Li, Qiang Zhang, Chenjun, sinbadliu, Junxiong Lin, Jiahong Yan, Jiawei Liu, Zheng-Jun Zha
Affiliations: University of Science and Technology of China; Tencent QQ; Shanghai Innovation Institute; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 27 figures

Abstract:The rapid spread of multimodal misinformation on social media has raised growing concerns, while research on video misinformation detection remains limited due to the lack of large-scale, diverse datasets. Existing methods often overfit to rigid templates and lack deep reasoning over deceptive content. To address these challenges, we introduce FakeVV, a large-scale benchmark comprising over 100,000 video-text pairs with fine-grained, interpretable annotations. In addition, we further propose Fact-R1, a novel framework that integrates deep reasoning with collaborative rule-based reinforcement learning. Fact-R1 is trained through a three-stage process: (1) misinformation long-Chain-of-Thought (CoT) instruction tuning, (2) preference alignment via Direct Preference Optimization (DPO), and (3) Group Relative Policy Optimization (GRPO) using a novel verifiable reward function. This enables Fact-R1 to exhibit emergent reasoning behaviors comparable to those observed in advanced text-based reinforcement learning systems, but in the more complex multimodal misinformation setting. Our work establishes a new paradigm for misinformation detection, bridging large-scale video understanding, reasoning-guided alignment, and interpretable verification.

[CV-32] Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts

[Quick Read]: This paper addresses the lack of character-driven dialogue and speech in scene-based video generation, a dimension that is crucial for storytelling but remains underexplored. The key to the solution is a modular pipeline that converts action-level prompts into visually and auditorily grounded narrative dialogue: a pretrained vision-language encoder extracts key semantic features, structured prompts guide a large language model to generate natural, character-consistent dialogue, and a Recursive Narrative Bank maintains contextual consistency across scenes.

Link: https://arxiv.org/abs/2505.16819
Authors: Taewon Kang, Ming C. Lin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 5 figures

Abstract:Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling – character-driven dialogue and speech – remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character’s behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.

[CV-33] Perceptual Quality Assessment for Embodied AI

[Quick Read]: This paper addresses the fact that various real-world distortions make embodied AI hard to deploy effectively, and in particular the absence of Image Quality Assessment (IQA) methods that evaluate how usable an image is for embodied tasks. The key to the solution is the proposal of "IQA for Embodied AI": building a perception-cognition-decision-execution pipeline, constructing the Embodied-IQA database with over 36k reference/distorted image pairs, and annotating it at fine granularity with Vision Language Models, Vision Language Action models, and real-world robots, in order to validate mainstream IQA methods in this setting and drive the development of more accurate quality indicators.

Link: https://arxiv.org/abs/2505.16815
Authors: Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Shanghai AI Lab; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the Real-world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) based on the Mertonian system and meta-cognitive theory, constructed a perception-cognition-decision-execution pipeline and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5m fine-grained annotations provided by Vision Language Models/Vision Language Action-models/Real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the Real-world. Project page: this https URL

[CV-34] Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining CVPR2025

[Quick Read]: This paper addresses the poor generalization of existing rain-removal methods for video, which rely on paired data and therefore suffer from the gap between synthetic and real rain effects. The key to the solution is a dual-branch spatio-temporal state-space model: spatial and temporal state-space layers extract spatial features and capture temporal dependencies across frames, respectively; a dynamic stacking filter improves multi-frame feature fusion; and a median stacking loss enables semi-supervised learning by generating pseudo-clean patches, improving rain streak removal and the model's usefulness for downstream tasks.

Link: https://arxiv.org/abs/2505.16811
Authors: Shangquan Sun, Wenqi Ren, Juxiang Zhou, Shu Wang, Jianhou Gan, Xiaochun Cao
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University; MoE Key Laboratory of Information Technology; Guangdong Provincial Key Laboratory of Information Security Technology; Key Laboratory of Educational Information for Nationalities, Yunnan Normal University; School of Mechanical Engineering and Automation, Fuzhou University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures, CVPR 2025 Oral Presentation

Abstract:Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.
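The median stacking idea rests on rain being sparse and temporally uncorrelated: a per-pixel temporal median over aligned frames suppresses streaks and yields a pseudo-clean target. A NumPy sketch under the assumption that frame alignment is already done; the frame count and corruption rate are illustrative:

```python
import numpy as np

def median_stack_pseudo_clean(frames: np.ndarray) -> np.ndarray:
    """Median over a temporal stack of aligned rainy frames (T, H, W).
    Because rain streaks are sparse and move between frames, the per-pixel
    temporal median suppresses them, giving a pseudo-clean target for
    semi-supervised training."""
    return np.median(frames, axis=0)

# Toy example: a static scene corrupted by random bright streaks per frame.
rng = np.random.default_rng(0)
scene = rng.random((16, 16)) * 0.5
frames = np.stack([scene.copy() for _ in range(7)])
for f in frames:  # each frame gets its own sparse "rain"
    mask = rng.random(f.shape) < 0.05
    f[mask] = 1.0
pseudo_clean = median_stack_pseudo_clean(frames)
print(np.abs(pseudo_clean - scene).max())  # close to 0: streaks removed
```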

[CV-35] Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities

[Quick Read]: This paper addresses performance degradation in multimodal MRI segmentation when some modalities are missing in clinical practice, and the inefficiency of retraining when new modalities become available, which risks overfitting and damaging previously learned knowledge. The key to the solution is Replay-based Hypergraph Domain Incremental Learning (ReHyDIL), which uses Domain Incremental Learning (DIL) so the segmentation model can learn newly acquired MRI modalities without forgetting existing knowledge, and improves segmentation across diverse patient scenarios with the Cross-Patient Hypergraph Segmentation Network (CHSNet) and a Tversky-Aware Contrastive (TAC) loss.

Link: https://arxiv.org/abs/2505.16809
Authors: Junze Wang(1),Lei Fan(2,3),Weipeng Jing(1),Donglin Di(4),Yang Song(3),Sidong Liu(5),Cong Cong(5) ((1) College of Computer and Control Engineering, Northeast Forestry University, Harbin, China, (2) The Centre for Healthy Brain Ageing (CHeBA), University of New South Wales, Sydney, Australia, (3) School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, (4) Space AI, Li Auto, Beijing, China, (5) Centre for Health Informatics, Macquarie University, Sydney, Australia)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Existing methods for multimodal MRI segmentation with missing modalities typically assume that all MRI modalities are available during training. However, in clinical practice, some modalities may be missing due to the sequential nature of MRI acquisition, leading to performance degradation. Furthermore, retraining models to accommodate newly available modalities can be inefficient and may cause overfitting, potentially compromising previously learned knowledge. To address these challenges, we propose Replay-based Hypergraph Domain Incremental Learning (ReHyDIL) for brain tumor segmentation with missing modalities. ReHyDIL leverages Domain Incremental Learning (DIL) to enable the segmentation model to learn from newly acquired MRI modalities without forgetting previously learned information. To enhance segmentation performance across diverse patient scenarios, we introduce the Cross-Patient Hypergraph Segmentation Network (CHSNet), which utilizes hypergraphs to capture high-order associations between patients. Additionally, we incorporate Tversky-Aware Contrastive (TAC) loss to effectively mitigate information imbalance both across and within different modalities. Extensive experiments on the BraTS2019 dataset demonstrate that ReHyDIL outperforms state-of-the-art methods, achieving an improvement of over 2% in the Dice Similarity Coefficient across various tumor regions. Our code is available at ReHyDIL.

[CV-36] SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving CVPR2025

[Quick Read]: This paper addresses the difficulty of efficiently integrating vision-language models into autonomous driving systems for real-time decision-making given their heavy computational demands. The key to the solution is SOLVE, a framework that shares knowledge between the VLM and the end-to-end (E2E) model at the feature level through a shared visual encoder to strengthen vehicle planning, introduces a Trajectory Chain-of-Thought (T-CoT) paradigm to progressively refine trajectory predictions, and uses a temporal decoupling strategy to align high-quality VLM outputs with E2E real-time performance.

Link: https://arxiv.org/abs/2505.16805
Authors: Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, Hongsheng Li
Affiliations: MMLab, CUHK; Institute of Artificial Intelligence, Beihang University; Voyager Research, Didi Chuxing; CPII under InnoHK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.

[CV-37] V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

[Quick Read]: This paper addresses the storage and I/O bottlenecks of event-camera training datasets, as well as the scarcity of real data that limits model scaling and generalization. The key to the solution is Video-to-Voxel (V2V), which converts conventional video frames directly into event-based voxel grid representations, bypassing storage-intensive event stream generation entirely; this reduces storage requirements by a factor of 150 and supports on-the-fly parameter randomization to improve model robustness.

Link: https://arxiv.org/abs/2505.16797
Authors: Hanyue Lou, Jinxiu Liang, Minggui Teng, Yi Wang, Boxin Shi
Affiliations: Peking University; Shanghai Innovation Institute; Shanghai AI Laboratory; National Institute of Informatics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours–an order of magnitude larger than existing event datasets, yielding substantial improvements.
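The frame-to-voxel conversion can be approximated by thresholding log-intensity changes between consecutive frames and accumulating signed events directly into temporal bins. The sketch below is a simplification of V2V, which models event generation and parameter randomization more carefully; the threshold and bin count are assumptions.

```python
import numpy as np

def video_to_voxel(frames: np.ndarray, bins: int = 5, thresh: float = 0.15):
    """Convert (T, H, W) grayscale frames into a (bins, H, W) event voxel grid.

    Log-intensity changes beyond +/- thresh between consecutive frames are
    treated as events (+1 / -1) and accumulated into temporal bins, skipping
    the explicit event-stream representation entirely.
    """
    logf = np.log(frames.astype(np.float64) + 1e-4)
    voxel = np.zeros((bins, *frames.shape[1:]))
    n_steps = len(frames) - 1
    for t in range(n_steps):
        diff = logf[t + 1] - logf[t]
        polarity = np.sign(diff) * (np.abs(diff) > thresh)
        b = min(int(t / n_steps * bins), bins - 1)
        voxel[b] += polarity
    return voxel

frames = np.random.rand(11, 32, 32)
grid = video_to_voxel(frames)
print(grid.shape)  # (5, 32, 32)
```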

[CV-38] REOBench: Benchmarking Robustness of Earth Observation Foundation Models

[Quick Read]: This paper addresses the insufficiently studied robustness of Earth observation foundation models to real-world perturbations: although these models generalize well across many Earth observation tasks, their behavior under realistic image degradations has not been systematically evaluated. The key to the solution is REOBench, the first comprehensive robustness benchmark for Earth observation foundation models, covering six tasks and twelve corruption types, including appearance-based and geometric perturbations, and focusing on high-resolution optical remote sensing imagery for realistic, fine-grained evaluation. A systematic study of models from multiple pretraining paradigms reveals how performance degrades under input corruptions and what factors drive the degradation, offering guidance for building more robust and reliable models.

Link: https://arxiv.org/abs/2505.16793
Authors: Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang
Affiliations: University of Reading, UK; University of Exeter, UK; South China Normal University, China; Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates; Technical University of Munich, Germany; Eindhoven University of Technology, NL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages

Abstract:Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

[CV-39] REPA Works Until It Doesn't: Early-Stopped Holistic Alignment Supercharges Diffusion Training

[Quick Read]: This paper addresses the slow training of Diffusion Transformers (DiTs), and in particular the issue that existing remedies such as Representation Alignment (REPA) accelerate the early phase of training but plateau or even degrade performance later. The key to the solution is HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule: phase I applies a holistic alignment loss that simultaneously distills the teacher's attention maps and feature projections into the DiT's mid-level layers for rapid convergence; phase II performs a one-shot termination of the alignment loss once a preset trigger is hit, freeing the DiT to focus on denoising and exploit its generative capacity, which markedly improves training efficiency without architecture changes.

Link: https://arxiv.org/abs/2505.16792
Authors: Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You
Affiliations: NUS HPC-AI Lab; Shanghai AI Laboratory; UT Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages

Abstract:Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy – representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) – dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher’s lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss, once a simple trigger such as a fixed iteration is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA’s best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL .
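The two-phase schedule is essentially a loss switch. A minimal sketch, assuming a dict of precomputed loss terms and an iteration-count trigger; the 0.5 weights and field names are illustrative, not values from the paper:

```python
def haste_loss(step: int, stop_step: int, losses: dict) -> float:
    """Two-phase HASTE-style objective (weights and trigger are illustrative).

    Phase I (step < stop_step): denoising loss plus holistic alignment,
    i.e. attention-map and feature-projection distillation from the teacher.
    Phase II: one-shot termination -- alignment terms are dropped for good.
    """
    total = losses["denoise"]
    if step < stop_step:
        total = total + 0.5 * losses["attn_distill"] + 0.5 * losses["feat_align"]
    return total

# Sketch of a training loop around the schedule.
for step in range(4):
    losses = {"denoise": 1.0, "attn_distill": 0.3, "feat_align": 0.2}
    print(step, haste_loss(step, stop_step=2, losses=losses))
```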

[CV-40] Four Eyes Are Better Than Two: Harnessing the Collaborative Potential of Large Models via Differentiated Thinking and Complementary Ensembles

[Quick Read]: This paper addresses how to use multimodal large models effectively for video understanding, particularly improving performance through few-shot learning and model ensembling. The key to the solution is a systematic exploration and evaluation of diverse prompt styles and processing paradigms to steer the attention of large models and fully exploit their generalization and adaptability. With this carefully designed approach, directly using a single multimodal model already surpasses the previous state-of-the-art (SOTA) method, and an additional stage that enables cooperation and ensembling of periodic results brings further significant gains.

Link: https://arxiv.org/abs/2505.16784
Authors: Jun Xie, Xiongjun Guan, Yingjian Zhu, Zhaoran Zhao, Xinming Wang, Feng Chen, Zhepeng Wang
Affiliations: Lenovo Research; Tsinghua University; School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we present the runner-up solution for the Ego4D EgoSchema Challenge at CVPR 2025 (Confirmed on May 20, 2025). Inspired by the success of large models, we evaluate and leverage leading accessible multimodal large models and adapt them to video understanding tasks via few-shot learning and model ensemble strategies. Specifically, diversified prompt styles and process paradigms are systematically explored and evaluated to effectively guide the attention of large models, fully unleashing their powerful generalization and adaptability abilities. Experimental results demonstrate that, with our carefully designed approach, directly utilizing an individual multimodal model already outperforms the previous state-of-the-art (SOTA) method which includes several additional processes. Besides, an additional stage is further introduced that facilitates the cooperation and ensemble of periodic results, which achieves impressive performance improvements. We hope this work serves as a valuable reference for the practical application of large models and inspires future research in the field.

[CV-41] Single Domain Generalization for Few-Shot Counting via Universal Representation Matching CVPR2025

[Quick Read]: This paper addresses the poor generalization of few-shot counting models under domain shift. Existing methods typically follow a standard pipeline that extracts object prototypes from a few annotated exemplars and matches them against image features to build a correlation map, but they overlook the importance of learning highly generalized prototypes. The key to the solution is to incorporate universal vision-language representations distilled from a large-scale pretrained vision-language model into the correlation construction process, which substantially improves robustness to domain shift while preserving in-domain performance.

Link: https://arxiv.org/abs/2505.16778
Authors: Xianing Chen, Si Huo, Borui Jiang, Hailin Hu, Xinghao Chen
Affiliations: Huawei Noah's Ark Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods to generalize to unseen scenarios. This falls into the realm of single domain generalization that remains unexplored in few-shot counting. To solve this problem, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extract the object prototypes from exemplars and then match them with image feature to construct the correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations distilled from a large scale pretrained vision-language model into the correlation construction process substantially improves robustness to domain shifts without compromising in domain performance. As a result, URM achieves state-of-the-art performance on both in domain and the newly introduced domain generalization setting.

[CV-42] Mitigating Overfitting in Medical Imaging: Self-Supervised Pretraining vs. ImageNet Transfer Learning for Dermatological Diagnosis

[Quick Read]: This paper addresses the failure of models pretrained on natural-image datasets (such as ImageNet) to capture domain-specific characteristics in medical imaging. The key to the solution is an unsupervised learning framework that trains a Variational Autoencoder (VAE) from scratch on a proprietary dermatological dataset to extract high-value dermatological features, yielding a structured, clinically relevant latent space for more effective domain-specific feature extraction.

Link: https://arxiv.org/abs/2505.16773
Authors: Iván Matas, Carmen Serrano, Miguel Nogales, David Moreno, Lara Ferrándiz, Teresa Ojeda, Begoña Acha
Affiliations: University of Seville; Hospital Universitario Virgen Macarena; Hospital Quirón Salud Sevilla
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 tables, 2 figures

Abstract:Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.

[CV-43] RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

【速读】:该论文旨在解决当前多模态模型在视觉不可替代性推理能力评估方面的不足,即现有基准主要关注多模态输入和文本推理,而忽视了通过多模态输出进行推理的重要性。其解决方案的关键在于构建一个名为RBench-V的基准,专门用于评估模型在需要图像生成、辅助线构建等多模态输出操作的视觉思维过程中的推理能力,从而更全面地衡量多模态模型的综合表现。

链接: https://arxiv.org/abs/2505.16770
作者: Meng-Hao Guo,Xuanyu Chu,Qianrui Yang,Zhe-Han Mo,Yiqing Shen,Pei-lin Li,Xinjie Lin,Jinnian Zhang,Xin-Sheng Chen,Yi Zhang,Kiyohiro Nakayama,Zhengyang Geng,Houwen Peng,Han Hu,Shi-Nin Hu
机构: Tsinghua University (清华大学); Tencent Hunyuan X (腾讯混元X); Stanford University (斯坦福大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models’ vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at this https URL

[CV-44] Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

[Quick Read]: This paper addresses prompt optimization for text-to-image generation: turning a simple user prompt into a more elaborate, expressive prompt that yields high-quality images. Existing methods rely on large amounts of manually annotated data and trained aesthetic assessment models for supervised learning, which introduces data-scale dependence and model bias. The key to the solution is to use large vision-language models (LVLMs) both as the solver that rewrites the user prompt and as the reward model that scores the aesthetics and alignment of the generated images, replacing human feedback with AI feedback so the model can iterate and improve itself through reinforcement learning.

Link: https://arxiv.org/abs/2505.16763
Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen
Affiliations: University of Macau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.

[CV-45] Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning

[Quick Read]: This paper addresses the data biases and low-quality outputs of existing pretrained 3D mesh generation models, and the inability of global reinforcement learning methods, which rely on object-level rewards, to capture local structural details. The key to the solution is Mesh-RFT, a fine-grained reinforcement fine-tuning framework that uses Masked Direct Preference Optimization (M-DPO) with quality-aware face masking for localized refinement, together with a topology-aware scoring system that evaluates geometric integrity and topological regularity at both the object and face level, optimizing mesh quality at the granularity of individual faces and improving local accuracy as well as global coherence.

Link: https://arxiv.org/abs/2505.16761
Authors: Jian Liu, Jing Xu, Song Guo, Jing Li, Jingfeng Guo, Jiaao Yu, Haohan Weng, Biwen Lei, Xianghui Yang, Zhuo Chen, Fangqi Zhu, Tao Han, Chunchao Guo
Affiliations: Tencent Hunyuan; Hong Kong University of Science and Technology; University of Science and Technology of China; South China University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Existing pretrained models for 3D mesh generation often suffer from data biases and produce low-quality results, while global reinforcement learning (RL) methods rely on object-level rewards that struggle to capture local structure details. To address these challenges, we present Mesh-RFT, a novel fine-grained reinforcement fine-tuning framework that employs Masked Direct Preference Optimization (M-DPO) to enable localized refinement via quality-aware face masking. To facilitate efficient quality evaluation, we introduce an objective topology-aware scoring system to evaluate geometric integrity and topological regularity at both object and face levels through two metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first method to optimize mesh quality at the granularity of individual faces, resolving localized errors while preserving global coherence. Experiment results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6% and improves Topology Score (TS) by 3.8% over pre-trained models, while outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain. These results demonstrate Mesh-RFT’s ability to improve geometric integrity and topological regularity, achieving new state-of-the-art performance in production-ready mesh generation. Project Page: this https URL.
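The Boundary Edge Ratio can be illustrated by counting how many mesh edges belong to exactly one face. The sketch below is one plausible reading of BER, not necessarily the paper's exact formula:

```python
from collections import Counter

def boundary_edge_ratio(faces: list[tuple[int, int, int]]) -> float:
    """BER-style score: fraction of mesh edges used by exactly one face.
    A watertight mesh has ratio 0; open or broken surfaces score higher."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[(min(u, v), max(u, v))] += 1
    boundary = sum(1 for n in edges.values() if n == 1)
    return boundary / len(edges)

# A single triangle is all boundary; a closed tetrahedron has none.
print(boundary_edge_ratio([(0, 1, 2)]))                                   # 1.0
print(boundary_edge_ratio([(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]))  # 0.0
```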

[CV-46] Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval

【速读】:该论文旨在解决遥感图像-文本检索(RSITR)任务中,由于文本模态的强区分性可能主导优化过程并抑制图像表示学习,导致跨模态优化不平衡的问题。其解决方案的关键在于提出一种表示差异桥接(RDB)方法,该方法通过设计一种跨模态非对称适配器(CMAA)实现模态特定优化和特征对齐,CMAA包含视觉增强适配器(VEA)和文本语义适配器(TSA),分别利用差分注意力(DA)机制挖掘细粒度图像特征和通过分层注意力(HA)机制识别关键文本语义;同时,将传统单任务检索框架扩展为双任务优化框架,并引入双任务一致性损失(DTCL),通过自适应加权组合跨模态、分类和指数移动平均一致性约束来提升跨模态对齐的鲁棒性。

链接: https://arxiv.org/abs/2505.16756
作者: Hailong Ning,Siying Wang,Tao Lei,Xiaopeng Cao,Huanmin Dou,Bin Zhao,Asoke K. Nandi,Petia Radeva
机构: School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China; Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an 710121, China; Xi’an Key Laboratory of Big Data and Intelligent Computing, Xi’an 710121, China; Shaanxi Joint Laboratory of Artificial Intelligence and the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China; School of Computer Engineering, Weifang University, Shandong 261061, China; Shanghai Artificial Intelligence Laboratory, Shanghai, 200003, China; Department of Electronic and Electrical Engineering, Brunel University of London, Uxbridge, UB8 3PH, United Kingdom; Dept. Matemàtiques i Informàtica and Institute of Neuroscience, Univeritat de Barcelona, Barcelona, 08007, Spain
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures for exploring cross-modal correlations. However, the strong discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to enhancing model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). VEA mines fine-grained image features by a Differential Attention (DA) mechanism, while TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
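摘要提到 DTCL 以“自适应加权”组合三项约束,但未给出具体公式。下面以 Kendall 等人提出的不确定性加权(可学习 log 方差)作为一种可能的实现方式写出最小草图;该加权机制是本文之外引入的假设,仅用于说明“自适应组合多项损失”的形态。

```python
import torch
import torch.nn as nn

class DualTaskConsistencyLoss(nn.Module):
    """示意性的 DTCL:以可学习权重自适应地组合三项约束。
    具体加权方式为假设(论文摘要未给出公式),此处采用不确定性加权。"""
    def __init__(self):
        super().__init__()
        # 每项损失对应一个可学习的 log 方差,用于自适应平衡各约束
        self.log_vars = nn.Parameter(torch.zeros(3))

    def forward(self, l_cross_modal, l_cls, l_ema_consistency):
        losses = torch.stack([l_cross_modal, l_cls, l_ema_consistency])
        precisions = torch.exp(-self.log_vars)
        # 加权和加上 log 方差正则项,防止权重退化为 0
        return (precisions * losses + self.log_vars).sum()

# 用法示意
dtcl = DualTaskConsistencyLoss()
loss = dtcl(torch.tensor(0.8), torch.tensor(0.5), torch.tensor(0.3))
```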
zh

[CV-47] Robust Vision-Based Runway Detection through Conformal Prediction and Conformal mAP

【速读】:该论文旨在解决视觉着陆系统(Vision-Based Landing System, VLS)中跑道检测的统计不确定性保障问题,以提升系统可靠性并支持机器学习系统的航空认证。解决方案的关键在于应用共形预测(Conformal Prediction)技术,结合微调后的YOLOv5和YOLOv6模型,在用户定义的风险水平下量化定位可靠性,并引入一种新的评估指标——共形平均精度(Conformal mean Average Precision, C-mAP),以将目标检测性能与共形保证相匹配。

链接: https://arxiv.org/abs/2505.16740
作者: Alya Zouzou,Léo andéol,Mélanie Ducoffe,Ryma Boumazouza
机构: Airbus(空中客车); Paul Valery University (保罗·瓦莱里大学); Institut de Mathématiques de Toulouse (图卢兹数学研究所); SNCF(法国国家铁路公司); IRT Saint Exupéry(圣埃克苏佩里工业研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We explore the use of conformal prediction to provide statistical uncertainty guarantees for runway detection in vision-based landing systems (VLS). Using fine-tuned YOLOv5 and YOLOv6 models on aerial imagery, we apply conformal prediction to quantify localization reliability under user-defined risk levels. We also introduce Conformal mean Average Precision (C-mAP), a novel metric aligning object detection performance with conformal guarantees. Our results show that conformal prediction can improve the reliability of runway detection by quantifying uncertainty in a statistically sound way, increasing on-board safety and paving the way for the certification of ML systems in the aerospace domain.
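作为理解,下面给出 split conformal 应用于检测框的最小草图:在校准集上把预测框与真实框坐标的最大偏差作为非一致性分数,取带有限样本修正的分位数作为推理时的安全外扩裕量。分数定义与坐标格式为演示假设,并非论文的 C-mAP 实现。

```python
import numpy as np

def conformal_box_margin(pred_boxes, true_boxes, alpha=0.1):
    """Split conformal 的示意:在校准集上计算框坐标的最大绝对偏差
    作为非一致性分数,返回风险水平 alpha 下的外扩裕量。"""
    scores = np.abs(pred_boxes - true_boxes).max(axis=1)
    n = len(scores)
    # 带有限样本修正的 (1 - alpha) 分位数
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, level, method="higher")

# 校准集:N x 4 的 [x1, y1, x2, y2] 框(示例数据)
rng = np.random.default_rng(0)
true_b = rng.uniform(0, 100, (200, 4))
pred_b = true_b + rng.normal(0, 2, (200, 4))
margin = conformal_box_margin(pred_b, true_b, alpha=0.1)
# 推理时:将预测框按 margin 外扩,即可获得约 1 - alpha 的覆盖保证
```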
zh

[CV-48] Masked Conditioning for Deep Generative Models

【速读】:该论文试图解决工程领域数据集通常规模小、标签稀疏且包含数值和类别条件,同时计算资源有限导致生成模型难以应用于工程任务的问题。解决方案的关键在于提出一种新的掩码条件方法,通过在训练过程中对条件进行掩码以模拟推理时的稀疏条件,并探索多种稀疏性调度策略,同时引入一种灵活的嵌入机制处理数值和类别条件,从而提升生成模型在混合类型稀疏数据上的表现。

链接: https://arxiv.org/abs/2505.16725
作者: Phillip Mueller,Jannik Wiese,Sebastian Mueller,Lars Mikelsons
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally, computational resources are typically limited in practical applications, which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.
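下列草图示意“训练时随机掩码条件 + 数值/类别混合嵌入”的核心思路:数值条件经线性投影、类别条件经查表,被掩码的条目替换为可学习的 [MASK] 向量。固定掩码概率 p_mask 只是最简单的设置,论文中探索了多种稀疏度调度;维度大小与结构均为演示假设。

```python
import torch
import torch.nn as nn

class MaskedConditionEmbedding(nn.Module):
    """示意:混合类型条件嵌入 + 训练期条件掩码。"""
    def __init__(self, num_categories, dim=64):
        super().__init__()
        self.num_proj = nn.Linear(1, dim)                 # 数值条件:线性投影
        self.cat_emb = nn.Embedding(num_categories, dim)  # 类别条件:查表
        self.mask_token = nn.Parameter(torch.zeros(dim))  # 可学习 [MASK] 向量

    def forward(self, num_cond, cat_cond, p_mask=0.5):
        h_num = self.num_proj(num_cond.unsqueeze(-1))     # (B, dim)
        h_cat = self.cat_emb(cat_cond)                    # (B, dim)
        h = torch.stack([h_num, h_cat], dim=1)            # (B, 2, dim)
        if self.training:
            # 按概率 p_mask 掩码单个条件,模拟推理时的稀疏条件
            drop = torch.rand(h.shape[:2], device=h.device) < p_mask
            h = torch.where(drop.unsqueeze(-1), self.mask_token.expand_as(h), h)
        return h.flatten(1)                               # (B, 2 * dim)

emb = MaskedConditionEmbedding(num_categories=10)
emb.train()
out = emb(torch.randn(4), torch.randint(0, 10, (4,)))
```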
zh

[CV-49] SEDD-PCC: A Single Encoder-Dual Decoder Framework For End-To-End Learned Point Cloud Compression

【速读】:该论文旨在解决点云压缩中几何与属性编码分离导致的计算复杂度高及共享特征利用不充分的问题。其解决方案的关键在于提出SEDD-PCC框架,该框架通过单一编码器将几何与属性特征统一映射到潜在空间,并采用双专用解码器依次重建几何与属性,同时引入知识蒸馏以提升特征表示学习,从而实现高效的联合压缩。

链接: https://arxiv.org/abs/2505.16709
作者: Kai Hsiang Hsieh,Monyneath Yim,Jui Chiu Chiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:To encode point clouds containing both geometry and attributes, most learning-based compression schemes treat geometry and attribute coding separately, employing distinct encoders and decoders. This not only increases computational complexity but also fails to fully exploit shared features between geometry and attributes. To address this limitation, we propose SEDD-PCC, an end-to-end learning-based framework for lossy point cloud compression that jointly compresses geometry and attributes. SEDD-PCC employs a single encoder to extract shared geometric and attribute features into a unified latent space, followed by dual specialized decoders that sequentially reconstruct geometry and attributes. Additionally, we incorporate knowledge distillation to enhance feature representation learning from a teacher model, further improving coding efficiency. With its simple yet effective design, SEDD-PCC provides an efficient and practical solution for point cloud compression. Comparative evaluations against both rule-based and learning-based methods demonstrate its competitive performance, highlighting SEDD-PCC as a promising AI-driven compression approach.
zh

[CV-50] KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

【速读】:该论文试图解决当前多模态生成模型在基于知识的推理编辑任务中的能力不足问题,尽管这些模型能够生成视觉上合理的输出,但在涉及知识推理的编辑任务中表现尚不明确。解决方案的关键是提出KRIS-Bench(Knowledge-based Reasoning in Image-editing Systems Benchmark),这是一个基于认知理论设计的诊断基准,通过分类编辑任务为事实性、概念性和程序性三种基础知识类型,并构建了涵盖7个推理维度的22个代表性任务及1,267个高质量标注的编辑实例。同时,该研究引入了一种新的知识合理性度量方法,并结合知识提示和人类研究进行校准,以支持细粒度评估。

链接: https://arxiv.org/abs/2505.16707
作者: Yongliang Wu,Zonghui Li,Xinting Hu,Xinyu Ye,Xianfang Zeng,Gang Yu,Wenbo Zhu,Bernt Schiele,Ming-Hsuan Yang,Xu Yang
机构: Southeast University (东南大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Shanghai Jiao Tong University (上海交通大学); StepFun (步履科技); University of California, Berkeley (加州大学伯克利分校); University of California, Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 39 pages, 36 figures

点击查看摘要

Abstract:Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
zh

[CV-51] One-Step Diffusion-Based Image Compression with Semantic Distillation

【速读】:该论文试图解决基于扩散模型的生成式图像编解码器在迭代采样过程中引入的高延迟问题。其解决方案的关键在于提出OneDC,一种单步扩散生成图像编解码器(One-step Diffusion-based generative image Codec),通过将潜在压缩模块与单步扩散生成器相结合,避免了多步采样的必要性,从而显著提升了编码速度并降低了比特率。

链接: https://arxiv.org/abs/2505.16687
作者: Naifu Xue,Zhaoyang Jia,Jiahao Li,Bin Li,Yuan Zhang,Yan Lu
机构: Communication University of China (中国传媒大学); University of Science and Technology of China (中国科学技术大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.
zh

[CV-52] On the use of Graphs for Satellite Image Time Series

【速读】:该论文试图解决卫星图像时间序列(SITS)数据在时空遥感分析中的处理与应用问题,特别是如何有效利用这些数据进行模式识别、分类和回归等任务。解决方案的关键在于引入基于图(graph-based)的方法,通过构建时空图来捕捉地物对象之间的空间和时间交互关系,从而提升对复杂地表过程的理解与建模能力。

链接: https://arxiv.org/abs/2505.16685
作者: Corentin Dufourg,Charlotte Pelletier,Stéphane May,Sébastien Lefèvre
机构: Université Bretagne Sud (布列塔尼南方大学); IRISA (IRISA); CNRS (法国国家科学研究中心); Centre National d’Études Spatiales (国家空间研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The Earth’s surface is subject to complex and dynamic processes, ranging from large-scale phenomena such as tectonic plate movements to localized changes associated with ecosystems, agriculture, or human activity. Satellite images enable global monitoring of these processes with extensive spatial and temporal coverage, offering advantages over in-situ methods. In particular, resulting satellite image time series (SITS) datasets contain valuable information. To handle their large volume and complexity, some recent works focus on the use of graph-based techniques that abandon the regular Euclidean structure of satellite data to work at an object level. Besides, graphs enable modelling spatial and temporal interactions between identified objects, which are crucial for pattern detection, classification and regression tasks. This paper is an effort to examine the integration of graph-based methods in spatio-temporal remote-sensing analysis. In particular, it aims to present a versatile graph-based pipeline to tackle SITS analysis. It focuses on the construction of spatio-temporal graphs from SITS and their application to downstream tasks. The paper includes a comprehensive review and two case studies, which highlight the potential of graph-based approaches for land cover mapping and water resource forecasting. It also discusses numerous perspectives to resolve current limitations and encourage future developments.
zh

[CV-53] Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds

【速读】:该论文试图解决传统3D物体压缩方法在高压缩率下出现的纹理伪影、多边形数量减少和网格间隙等问题,这些问题限制了其在极端压缩率下的性能。解决方案的关键在于采用语义压缩(semantic compression),即忽略结构信息,直接对物体的核心概念进行操作,并利用自然语言作为存储格式,从而实现更高的压缩率同时保持较好的质量。该方法通过先进的深度生成模型预测缺失信息,构建了一个基于公开生成模型的3D语义压缩流程,并在Objaverse数据集上验证了其在约100倍压缩率区域的优越性。

链接: https://arxiv.org/abs/2505.16679
作者: Jordan Dotzel,Tony Montes,Mohamed S. Abdelfattah,Zhiru Zhang
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: First two authors have equal contribution

点击查看摘要

Abstract:Traditional methods for 3D object compression operate only on structural information within the object vertices, polygons, and textures. These methods are effective at compression rates up to 10x for standard object sizes but quickly deteriorate at higher compression rates with texture artifacts, low-polygon counts, and mesh gaps. In contrast, semantic compression ignores structural information and operates directly on the core concepts to push to extreme levels of compression. In addition, it uses natural language as its storage format, which makes it natively human-readable and a natural fit for emerging applications built around large-scale, collaborative projects within augmented and virtual reality. It deprioritizes structural information like location, size, and orientation and predicts the missing information with state-of-the-art deep generative models. In this work, we construct a pipeline for 3D semantic compression from public generative models and explore the quality-compression frontier for 3D object compression. We apply this pipeline to achieve rates as high as 105x for 3D objects taken from the Objaverse dataset and show that semantic compression can outperform traditional methods in the important quality-preserving region around 100x compression.
zh

[CV-54] Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge

【速读】:该论文旨在解决电池热图像中异常检测的问题,特别是在缺乏大量标注数据的情况下,传统深度学习方法难以有效应用。其解决方案的关键在于利用基于视觉问答(Visual Question Answering, VQA)的零样本学习方法,通过引入先验知识和文本提示,使模型能够在不依赖特定电池数据训练的情况下识别异常,从而提升模型的泛化能力和实用性。

链接: https://arxiv.org/abs/2505.16674
作者: Marcella Astrid,Abdelrahman Shabayek,Djamila Aouada
机构: Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg (卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in EUSIPCO 2025

点击查看摘要

Abstract:Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies, due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of fine-tuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained with the battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.
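零样本方案的核心是把“正常电池热行为”的先验写进提示词。下面是一个提示构造与判定的最小草图:温度阈值、措辞以及 vqa_model 接口(接受图像与问题、返回文本)均为假设占位,不对应任一具体模型的真实 API。

```python
def build_anomaly_prompt(max_normal_temp_c=45):
    """示意:将电池热行为先验知识写入文本提示(阈值数值为假设)。"""
    return (
        "You are inspecting a battery thermal image. "
        f"Under normal operation the surface stays below {max_normal_temp_c} C "
        "and the heat distribution is spatially uniform. "
        "Does this image show an anomaly such as a localized hotspot? "
        "Answer 'yes' or 'no', then explain briefly."
    )

def detect_anomaly(image, vqa_model):
    """vqa_model 为占位:任何接受 (image, question) 并返回文本的 VQA 接口。"""
    answer = vqa_model(image, build_anomaly_prompt())
    return answer.strip().lower().startswith("yes")
```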
zh

[CV-55] CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation

【速读】:该论文旨在解决具身导航中多模态数据融合的挑战,特别是如何有效整合2D图像、3D点云和文本指令以提升导航性能。其解决方案的关键在于提出CoNav框架,通过预训练的3D-text模型向图像-文本导航代理提供结构化的空间语义知识,从而在导航过程中解决歧义问题。该框架通过跨模态信念对齐机制,将3D-text模型生成的文本假设传递给导航代理,使其能够结合视觉线索与空间语义信息,实现更有效的具身导航推理。

链接: https://arxiv.org/abs/2505.16663
作者: Haihong Hao,Mingfei Han,Changlin Li,Zhihui Li,Xiaojun Chang
机构: University of Science and Technology of China (中国科学技术大学); MBZUAI (MBZUAI); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges in the limited availability of triple-modality data and difficulty resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework where a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model with the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation Success Rates, CoNav often generates shorter paths than other methods (as measured by SPL), showcasing the potential and challenges of fusing data from different modalities in embodied navigation. Project Page: this https URL
zh

[CV-56] SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images

【速读】:该论文旨在解决在医疗异常检测(Medical Anomaly Detection, MAD)中,由于高质量医学影像数据获取受限(如隐私问题和数据孤岛)而导致的少样本学习挑战,同时针对多类别异常识别中忽略异常类别间差异的问题。其解决方案的关键在于提出一种两阶段的基于征兆驱动的少样本多异常检测框架(SD-MAD),通过增强异常间的差异性来对齐放射学征兆与异常类别,并在推理阶段采用自动征兆选择策略以缓解因数据不足导致的欠拟合和不确定样本问题。

链接: https://arxiv.org/abs/2505.16659
作者: Kaiyu Guo,Tan Pan,Chen Jiang,Zijian Wang,Brian C. Lovell,Limei Han,Yuan Cheng,Mahsa Baktashmotlagh
机构: Shanghai Academy of AI for Science (上海人工智能科学研究院); University of Queensland (昆士兰大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a Large Language model, under the assumption that different anomalies in medical images may share common radiological signs in each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) Radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) Aligned signs are further selected at inference, via an automatic sign selection strategy, to mitigate the under-fitting and uncertain-sample issues caused by limited medical data. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.
zh

[CV-57] Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control

【速读】:该论文旨在解决高光谱图像融合(hyperspectral pansharpening)中存在的光谱保真度不一致问题,特别是在处理高光谱数据时面临的挑战,如波段数量庞大、特定光谱范围内的噪声显著、全色与高光谱成分之间的显著光谱失配以及通常较高的分辨率比。其解决方案的关键在于提出一种轻量级神经网络方法,该网络能够实时适应每个波段的权重,并通过动态调整空间损失以确保光谱损失快速收敛至期望水平,同时重新定义空间损失以考虑全色与光谱波段之间的非线性关系,从而实现统一的光谱质量。

链接: https://arxiv.org/abs/2505.16658
作者: Giuseppe Guarino,Matteo Ciotola,Gemine Vivone,Giovanni Poggi,Giuseppe Scarpa
机构: University Federico II (那不勒斯费德里科二世大学); National Research Council (国家研究委员会); University Parthenope (帕尔特诺佩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Hyperspectral pansharpening has received much attention in recent years due to technological and methodological advances that open the door to new application scenarios. However, research on this topic is only now gaining momentum. The most popular methods are still borrowed from the more mature field of multispectral pansharpening and often overlook the unique challenges posed by hyperspectral data fusion, such as i) the very large number of bands, ii) the overwhelming noise in selected spectral ranges, iii) the significant spectral mismatch between panchromatic and hyperspectral components, iv) a typically high resolution ratio. Imprecise data modeling especially affects spectral fidelity. Even state-of-the-art methods perform well in certain spectral ranges and much worse in others, failing to ensure consistent quality across all bands, with the risk of generating unreliable results. Here, we propose a hyperspectral pansharpening method that explicitly addresses this problem and ensures uniform spectral quality. To this end, a single lightweight neural network is used, with weights that adapt on the fly to each band. During fine-tuning, the spatial loss is turned on and off to ensure a fast convergence of the spectral loss to the desired level, according to a hysteresis-like dynamic. Furthermore, the spatial loss itself is appropriately redefined to account for nonlinear dependencies between panchromatic and spectral bands. Overall, the proposed method is fully unsupervised, with no prior training on external data, flexible, and low-complexity. Experiments on a recently published benchmarking toolbox show that it ensures excellent sharpening quality, competitive with the state-of-the-art, consistently across all bands. The software code and the full set of results are shared online on this https URL.
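摘要描述了按“滞回(hysteresis)”动态开关空间损失的机制。下面用一个两阈值开关给出最小草图:光谱损失越过上阈值则暂停空间损失、回落到下阈值以下再恢复,从而推动光谱损失快速收敛到目标区间。阈值数值与状态机形式均为演示假设,并非论文的精确实现。

```python
class HysteresisGate:
    """示意:滞回开关,根据光谱损失所处区间决定是否启用空间损失。"""
    def __init__(self, low=0.01, high=0.02):
        self.low, self.high = low, high
        self.spatial_on = True

    def step(self, spectral_loss):
        if self.spatial_on and spectral_loss > self.high:
            self.spatial_on = False   # 光谱误差过大:暂停空间损失,优先收敛光谱项
        elif not self.spatial_on and spectral_loss < self.low:
            self.spatial_on = True    # 光谱误差达标:恢复空间损失
        return self.spatial_on

gate = HysteresisGate()
for l_spec in [0.05, 0.03, 0.015, 0.008, 0.012, 0.025]:
    use_spatial = gate.step(l_spec)
    # total_loss = l_spec + (l_spat if use_spatial else 0)
```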
zh

[CV-58] Seeing Far and Clearly: Mitigating Hallucinations in MLLM s with Attention Causal Decoding CVPR2025

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉问答任务中存在的幻觉(hallucination)问题,特别是初始幻觉和雪球式幻觉。其解决方案的关键在于通过因果掩码(causal mask)干预多模态标记之间的信息传播过程,以减少异常标记(outlier tokens)对上下文推理的干扰。核心方法是设计一种有效的标记传播机制,通过注意力注册结构动态分配注意力,并引入位置感知编码方法,提升模型对远距离前序标记的关注能力,从而增强上下文推理能力并降低幻觉发生率。

链接: https://arxiv.org/abs/2505.16652
作者: Feilong Tang,Chengzhi Liu,Zhongxing Xu,Ming Hu,Zelin Peng,Zhiwei Yang,Jionglong Su,Minquan Lin,Yifan Peng,Xuelian Cheng,Imran Razzak,Zongyuan Ge
机构: Monash University (莫纳什大学); MBZUAI (MBZUAI); XJTLU (西交利物浦大学); Shanghai Jiaotong University (上海交通大学); Fudan University (复旦大学); University of Minnesota (明尼苏达大学); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Clarification note for the CVPR 2025 paper (FarSight). Prepared by a subset of the original authors; remaining co-authors are acknowledged in the text

点击查看摘要

Abstract:Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
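FarSight 的核心是只修改因果掩码来优化 token 传播。下面的草图仅示意其中一个简化片段:在标准因果掩码上放开若干“寄存器”列,使全部 query 都能关注这些位置以吸收被异常 token 分散的注意力。寄存器位置的选取方式,以及论文中的动态注意力分配与递减掩码率,此处均未建模,属于简化假设。

```python
import torch

def causal_mask_with_registers(seq_len, register_idx):
    """示意:标准因果掩码之上,放开若干“注意力寄存器”列。
    返回布尔掩码,True 表示允许注意。"""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:, register_idx] = True   # 寄存器列对全部 query 可见
    return mask

m = causal_mask_with_registers(8, register_idx=[0, 1])
```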
zh

[CV-59] Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images

【速读】:该论文旨在解决因连接设备数量增加而带来的安全问题检测需求,以及由于通信流量庞大所带来的数据处理挑战,同时应对设备异构性导致的计算能力差异问题。其解决方案的关键在于提出一种基于图像的网络流量表示方法,该方法能够在1秒的时间窗口内生成当前网络状态的紧凑摘要,并通过突出异常情况来减少对复杂处理架构的依赖。此外,还引入了一种无监督学习方法以有效检测异常。

链接: https://arxiv.org/abs/2505.16650
作者: Michael Neri,Sara Baldoni
机构: Tampere University (坦佩雷大学); University of Padova (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: Accepted for publication in EUSIPCO 2025

点击查看摘要

Abstract:Due to the recent increase in the number of connected devices, the need to promptly detect security issues is emerging. Moreover, the high number of communication flows creates the necessity of processing huge amounts of data. Furthermore, the connected devices are heterogeneous in nature, having different computational capacities. For this reason, in this work we propose an image-based representation of network traffic which provides a compact summary of the current network conditions over 1-second time windows. The proposed representation highlights the presence of anomalies, thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach which effectively detects the presence of anomalies. The code and the dataset are available at this https URL.
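“以图像表示 1 秒窗口内的网络流量”可以用一个二维直方图来直观理解。下面的草图把(源端口,包长)栅格化成 64×64 的图像并做对数压缩;坐标轴的选择、字段名与分箱方式均为演示假设,并非论文实际使用的表示。

```python
import numpy as np

def flows_to_image(flows, t0, win=1.0, bins=64):
    """示意:将时间窗 [t0, t0+win) 内的流记录栅格化为 2D 直方图“图像”。
    flows 为结构化数组,字段 ts(时间戳)、sport(源端口)、plen(包长)为假设。"""
    m = (flows["ts"] >= t0) & (flows["ts"] < t0 + win)
    x = flows["sport"][m] % bins                              # 端口折叠到 0..bins-1
    y = np.clip(flows["plen"][m], 0, 1500) * bins // 1501     # 包长分箱到 0..bins-1
    img, _, _ = np.histogram2d(x, y, bins=bins, range=[[0, bins], [0, bins]])
    return np.log1p(img)   # 对数压缩,突出异常峰值

dt = np.dtype([("ts", "f8"), ("sport", "i4"), ("plen", "i4")])
flows = np.zeros(1000, dtype=dt)
rng = np.random.default_rng(0)
flows["ts"] = rng.uniform(0, 2, 1000)
flows["sport"] = rng.integers(0, 65535, 1000)
flows["plen"] = rng.integers(40, 1500, 1000)
img = flows_to_image(flows, t0=0.0)   # 64 x 64 的“流量图像”
```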
zh

[CV-60] Point Detect Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models

【速读】:该论文试图解决多任务医学图像理解中的检测、定位和计数问题,旨在通过微调视觉-语言模型(Vision-Language Models, VLMs)提升诊断的准确性和效率。其解决方案的关键在于利用指令微调(instruction-tuned)的VLMs,并通过多任务学习框架将不同任务转化为基于指令的提示(instruction-based prompts),以实现对医学图像中多种病灶的联合推理。研究采用LoRA(Low-Rank Adaptation)方法对Qwen2.5-VL-7B-Instruct模型进行微调,验证了多任务训练在提升模型鲁棒性和准确性方面的有效性,同时揭示了在边缘情况下可能出现的可靠性下降问题。

链接: https://arxiv.org/abs/2505.16647
作者: Sushant Gautam,Michael A. Riegler,Pål Halvorsen
机构: Simula Metropolitan Center for Digital Engineering (SimulaMet); Oslo Metropolitan University (OsloMet); Simula Research Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as a full paper at the 38th IEEE International Symposium on Computer-Based Medical Systems (CBMS) 2025

点击查看摘要

Abstract:We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding, focusing on detection, localization, and counting of findings in medical images. Our objective is to evaluate whether instruction-tuned VLMs can simultaneously improve these tasks, with the goal of enhancing diagnostic accuracy and efficiency. Using MedMultiPoints, a multimodal dataset with annotations from endoscopy (polyps and instruments) and microscopy (sperm cells), we reformulate each task into instruction-based prompts suitable for vision-language reasoning. We fine-tune Qwen2.5-VL-7B-Instruct using Low-Rank Adaptation (LoRA) across multiple task combinations. Results show that multi-task training improves robustness and accuracy. For example, it reduces the Count Mean Absolute Error (MAE) and increases Matching Accuracy in the Counting + Pointing task. However, trade-offs emerge, such as more zero-case point predictions, indicating reduced reliability in edge cases despite overall performance gains. Our study highlights the potential of adapting general-purpose VLMs to specialized medical tasks via prompt-driven fine-tuning. This approach mirrors clinical workflows, where radiologists simultaneously localize, count, and describe findings - demonstrating how VLMs can learn composite diagnostic reasoning patterns. The model produces interpretable, structured outputs, offering a promising step toward explainable and versatile medical AI. Code, model weights, and scripts will be released for reproducibility at this https URL.
zh

[CV-61] From Evaluation to Defense: Advancing Safety in Video Large Language Models

【速读】:该论文旨在解决视频基础的大语言模型(Video LLMs)在安全性能上的系统性风险问题,相较于图像基础的模型,视频模态的引入导致安全性能平均下降42.3%,暴露出多模态攻击利用的漏洞。其解决方案的关键在于提出VideoSafety-R1框架,该框架包含两个创新点:一是通过Alarm Token-Guided Safety Fine-Tuning(AT-SFT)在视觉和文本序列中注入可学习的警报标记,实现跨模态的显式危害感知;二是通过Safety-Guided GRPO增强防御推理,利用双模态验证生成的规则奖励进行动态策略优化,从而将安全对齐从被动的危害识别转变为主动推理。

链接: https://arxiv.org/abs/2505.16643
作者: Yiwei Sun,Peiqi Jiang,Chuanbin Liu,Luohao Lin,Zhiying Lu,Hongtao Xie
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 49 pages, 12 figures, 17 tables

点击查看摘要

Abstract:While the safety risks of image-based large language models have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyBench (VSB-77k) - the first large-scale, culturally diverse benchmark for Video LLM safety, which comprises 77,646 video-query pairs and spans 19 principal risk categories across 10 language communities. We reveal that integrating the video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations: (1) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (2) Then, Safety-Guided GRPO enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. These components synergize to shift safety alignment from passive harm recognition to active reasoning. The resulting framework achieves a 65.1% improvement on VSB-Eval-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our codes are available in the supplementary materials. Warning: This paper contains examples of harmful language and videos, and reader discretion is recommended.
zh

[CV-62] owards Texture- And Shape-Independent 3D Keypoint Estimation in Birds

【速读】:该论文试图解决多只鸽子的3D关节位置估计与跟踪问题,尤其关注于减少对纹理信息的依赖。解决方案的关键在于扩展现有的3D-MuPPET框架,通过使用分割方法生成个体轮廓,进而估计2D关键点,并利用这些关键点进行三角化以推断3D姿态,同时在首帧中匹配身份并在后续帧中进行2D跟踪,从而实现一种与纹理无关的3D姿态估计方法。

链接: https://arxiv.org/abs/2505.16633
作者: Valentin Schmuker,Alex Hoi Hang Chan,Bastian Goldluecke,Urs Waldmann
机构: University of Konstanz (康斯坦茨大学); Max Planck Institute of Animal Behavior (动物行为马克斯·普朗克研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present a texture-independent approach to estimate and track 3D joint positions of multiple pigeons. For this purpose, we build upon the existing 3D-MuPPET framework, which estimates and tracks the 3D poses of up to 10 pigeons using a multi-view camera setup. We extend this framework by using a segmentation method that generates silhouettes of the individuals, which are then used to estimate 2D keypoints. Following 3D-MuPPET, these 2D keypoints are triangulated to infer 3D poses, and identities are matched in the first frame and tracked in 2D across subsequent frames. Our proposed texture-independent approach achieves comparable accuracy to the original texture-dependent 3D-MuPPET framework. Additionally, we explore our approach’s applicability to other bird species. To do so, we infer the 2D joint positions of four bird species without additional fine-tuning of the model trained on pigeons and obtain promising preliminary results. Thus, we think that our approach serves as a solid foundation and inspires the development of more robust and accurate texture-independent pose estimation frameworks.
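从多视角 2D 关键点恢复 3D 关节位置通常采用标准的 DLT 三角化,下面给出其最小实现草图(通用方法,非该论文的专有代码);投影矩阵数值仅为示例。

```python
import numpy as np

def triangulate_point(proj_mats, pts2d):
    """示意:多视角 DLT 三角化,由各相机 3x4 投影矩阵与
    对应 2D 关键点求解 3D 点(齐次最小二乘,SVD 求解)。"""
    A = []
    for P, (u, v) in zip(proj_mats, pts2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]   # 齐次坐标归一化

# 两个相机的投影矩阵(示例数值)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = triangulate_point([P1, P2], [(0.2, 0.1), (0.05, 0.1)])
```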
zh

[CV-63] SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding

【速读】:该论文旨在解决传统体育分析方法在捕捉比赛全貌和复杂动态方面的局限性,这些问题通常源于孤立数据流的使用。其解决方案的关键在于提出SoccerChat,一个融合视觉与文本数据的多模态对话式人工智能框架,通过整合丰富的SoccerNet数据集(包含球衣颜色标注和自动语音识别转录文本)以及结构化视频指令数据集进行微调,从而提升对比赛的理解、事件分类和裁判决策的准确性。

链接: https://arxiv.org/abs/2505.16630
作者: Sushant Gautam,Cise Midoglu,Vajira Thambawita,Michael A. Riegler,Pål Halvorsen,Mubarak Shah
机构: SimulaMet(西穆拉Met); OsloMet(奥斯陆Met); Forzasys(福尔扎系统); University of Central Florida(中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. this https URL
zh

[CV-64] Background Matters: A Cross-view Bidirectional Modeling Framework for Semi-supervised Medical Image Segmentation

【速读】:该论文旨在解决半监督医学图像分割(Semi-supervised Medical Image Segmentation, SSMIS)中对背景区域建模不足的问题,当前主流方法主要关注前景区域的建模,而忽略了显式建模背景区域可能带来的优势。论文提出的关键解决方案是跨视角双向建模(Cross-view Bidirectional Modeling, CVBM)框架,其核心在于通过引入背景建模作为辅助视角,提供互补的监督信号以提升前景建模的置信度,并通过双向一致性机制确保前景预测与背景引导预测之间的相互对齐。

链接: https://arxiv.org/abs/2505.16625
作者: Luyang Cao,Jianwei Li,Yinghuan Shi
机构: State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China; College of Physics and Information Engineering, Fuzhou University, Fuzhou, Fujian, China; National Institute of Healthcare Data Science, Nanjing University, Nanjing, Jiangsu, China; National Key Laboratory for Novel Software Technology, National Institute of Healthcare Data Science, Nanjing University, and Nanjing Drum Tower Hospital, Nanjing, Jiangsu, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Semi-supervised medical image segmentation (SSMIS) leverages unlabeled data to reduce reliance on manually annotated images. However, current SOTA approaches predominantly focus on foreground-oriented modeling (i.e., segmenting only the foreground region) and have largely overlooked the potential benefits of explicitly modeling the background region. Our study theoretically and empirically demonstrates that highly certain predictions in background modeling enhance the confidence of corresponding foreground modeling. Building on this insight, we propose the Cross-view Bidirectional Modeling (CVBM) framework, which introduces a novel perspective by incorporating background modeling to improve foreground modeling performance. Within CVBM, background modeling serves as an auxiliary perspective, providing complementary supervisory signals to enhance the confidence of the foreground model. Additionally, CVBM introduces an innovative bidirectional consistency mechanism, which ensures mutual alignment between foreground predictions and background-guided predictions. Extensive experiments demonstrate that our approach achieves SOTA performance on the LA, Pancreas, ACDC, and HRF datasets. Notably, on the Pancreas dataset, CVBM outperforms fully supervised methods (i.e., DSC: 84.57% vs. 83.89%) while utilizing only 20% of the labeled data. Our code is publicly available at this https URL.
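双向一致性机制可粗略理解为“前景概率”与“1 − 背景概率”互为监督。下面的草图用双向 MSE(各自 detach 对方作为目标)表达这一约束;具体的对齐形式、置信度筛选等细节为简化假设,并非论文的完整实现。

```python
import torch
import torch.nn.functional as F

def bidirectional_consistency_loss(fg_logits, bg_logits):
    """示意:前景预测与背景引导预测的双向一致性约束。"""
    p_fg = torch.sigmoid(fg_logits)
    p_bg = torch.sigmoid(bg_logits)
    p_fg_from_bg = 1.0 - p_bg   # 背景引导的前景预测
    # 双向 MSE:两个视角互为监督目标(detach 避免梯度互相抵消)
    return (F.mse_loss(p_fg, p_fg_from_bg.detach()) +
            F.mse_loss(p_fg_from_bg, p_fg.detach()))

l = bidirectional_consistency_loss(torch.randn(2, 1, 64, 64),
                                   torch.randn(2, 1, 64, 64))
```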
zh

[CV-65] MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

【速读】:该论文旨在解决从第一视角RGB图像、文本和初始手部姿态中生成物理上合理的手-物体交互动作的问题,这一问题在沉浸式AR/VR和机器人模仿中具有重要意义,但受限于不稳定视角、自遮挡、透视失真和噪声自我运动等因素而具有挑战性。其解决方案的关键在于提出MEgoHand框架,该框架采用双层架构:高层“大脑”利用视觉语言模型(VLM)从视觉-文本上下文中推断运动先验,并结合单目深度估计器进行与物体无关的空间推理;低层则基于DiT的流匹配策略生成细粒度轨迹,并通过时间正交滤波提高稳定性。此外,为解决数据集不一致性问题,还设计了数据集构建范式,包括逆向MANO重定向网络和虚拟RGB-D渲染器,从而构建了一个统一的大规模数据集。

链接: https://arxiv.org/abs/2505.16602
作者: Bohan Zhou,Yi Zhan,Zhongbin Zhang,Zongqing Lu
机构: Peking University (北京大学); Tsinghua University (清华大学); BeingBeyond
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which restricts their generalizability to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level “cerebrum” leverages a vision language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.
zh

[CV-66] Decoupled Geometric Parameterization and its Application in Deep Homography Estimation

【速读】:该论文试图解决传统基于四角位置偏移的参数化方法在几何可解释性上的不足以及需要求解线性系统的问题。其解决方案的关键在于提出了一种新的基于相似性-核-相似性(Similarity-Kernel-Similarity, SKS)分解的单应性几何参数化方法,通过将四个几何参数解耦为相似变换和核变换两组独立参数,并推导出核变换参数与角度偏移之间的线性关系,从而实现了无需求解线性系统即可直接通过矩阵乘法进行单应性估计。

链接: https://arxiv.org/abs/2505.16599
作者: Yao Huang,Si-Yuan Cao,Yaqing Ding,Hao Yin,Shibin Xie,Shuting Wang,Zhijun Fang,Jiachun Wang,Shen Cai,Junchi Yan,Shuhan Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Planar homography, with eight degrees of freedom (DOFs), is fundamental in numerous computer vision tasks. While the positional offsets of four corners are widely adopted (especially in neural network predictions), this parameterization lacks geometric interpretability and typically requires solving a linear system to compute the homography matrix. This paper presents a novel geometric parameterization of homographies, leveraging the similarity-kernel-similarity (SKS) decomposition for projective transformations. Two independent sets of four geometric parameters are decoupled: one for a similarity transformation and the other for the kernel transformation. Additionally, the geometric interpretation linearly relating the four kernel transformation parameters to angular offsets is derived. Our proposed parameterization allows for direct homography estimation through matrix multiplication, eliminating the need for solving a linear system, and achieves performance comparable to the four-corner positional offsets in deep homography estimation.
zh

[CV-67] mporal Object Captioning for Street Scene Videos from LiDAR Tracks

【速读】:该论文试图解决视频描述生成模型在捕捉和利用时间语义以进行有效时间特征提取方面的不足,特别是在高级驾驶辅助系统(Advanced Driver Assistance Systems, ADAS)的背景下。解决方案的关键在于提出一种基于LiDAR的自动化描述生成流程,该流程通过规则系统从目标轨迹中提取关键细节(如车道位置和相对运动),并结合模板生成描述,从而为模型提供具有细粒度时间行为的监督信号。实验表明,使用此类模板生成的描述对SwinBERT模型进行训练,能够显著提升其时间理解能力。

链接: https://arxiv.org/abs/2505.16594
作者: Vignesh Gopinathan,Urs Zimmermann,Michael Arnold,Matthias Rottmann
机构: Aptiv(艾普利特); University of Wuppertal(伍珀塔尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Video captioning models have seen notable advancements in recent years, especially with regard to their ability to capture temporal information. While many research efforts have focused on architectural advancements, such as temporal attention mechanisms, there remains a notable gap in understanding how models capture and utilize temporal semantics for effective temporal feature extraction, especially in the context of Advanced Driver Assistance Systems. We propose an automated LiDAR-based captioning procedure that focuses on the temporal dynamics of traffic participants. Our approach uses a rule-based system to extract essential details such as lane position and relative motion from object tracks, followed by template-based caption generation. Our findings show that training SwinBERT, a video captioning model, using only front camera images and supervised with our template-based captions, specifically designed to encapsulate fine-grained temporal behavior, leads to improved temporal understanding consistently across three datasets. In conclusion, our results clearly demonstrate that integrating LiDAR-based caption supervision significantly enhances temporal understanding, effectively addressing and reducing the inherent visual/static biases prevalent in current state-of-the-art model architectures.
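基于规则 + 模板的描述生成可以用几条阈值规则直观示意:从轨迹属性(车道偏移、相对速度)映射到措辞片段,再套入模板。阈值、字段与模板文案均为假设,论文的规则系统远比这细致。

```python
def caption_from_track(lane_offset_m, rel_speed_mps, obj_class="car"):
    """示意:从 LiDAR 轨迹属性到模板描述的规则映射(阈值为假设)。"""
    if lane_offset_m < -0.5:
        lane = "in the left lane"
    elif lane_offset_m > 0.5:
        lane = "in the right lane"
    else:
        lane = "in the ego lane"
    if rel_speed_mps > 0.5:
        motion = "pulling away from the ego vehicle"
    elif rel_speed_mps < -0.5:
        motion = "approaching the ego vehicle"
    else:
        motion = "keeping a constant distance"
    return f"A {obj_class} {lane} is {motion}."

print(caption_from_track(0.1, -1.2))
# -> "A car in the ego lane is approaching the ego vehicle."
```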
zh

[CV-68] Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态空间推理任务中表现不足的问题,现有方法主要局限于文本或静态视觉领域,难以应对动态环境中的复杂推理需求。其解决方案的关键在于提出D2R(Dynamic Draft-Augmented Reasoning)框架,该框架无需模型微调,能够将文本推理链与动态视觉草图无缝整合到MLLM中,从而显著提升动态空间推理性能。

链接: https://arxiv.org/abs/2505.16579
作者: Siqu Ou,Hongcheng Liu,Pingjie Wang,Yusheng Liao,Chuan Xuan,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI lab (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 8 figures

点击查看摘要

Abstract:While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at this https URL.
zh

[CV-69] M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

【速读】:该论文试图解决单目视频到立体视频的转换问题(monocular-to-stereo video conversion),其核心挑战在于生成高质量的右视图视频。解决方案的关键在于提出一种新颖的架构,用于对通过基于深度的重投影得到的扭曲右视图进行修复和优化。该方法扩展了Stable Video Diffusion (SVD)模型,利用输入左视图视频、扭曲右视图视频以及遮挡掩码作为条件输入,生成高质量的右摄像机视图。此外,为有效利用相邻帧的信息进行修复,修改了SVD中的注意力层以对未遮挡像素计算全注意力,从而提升生成效果。

链接: https://arxiv.org/abs/2505.16565
作者: Nina Shvetsova,Goutam Bhat,Prune Truong,Hilde Kuehne,Federico Tombari
机构: Google(谷歌); Tuebingen AI Center/University of Tuebingen(图宾根人工智能中心/图宾根大学); Goethe University Frankfurt(法兰克福歌德大学); MPI for Informatics, Saarland Informatics Campus(马克斯·普朗克信息研究所,萨尔兰信息学园区); Technical University of Munich(慕尼黑技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.
zh

[CV-70] Auto-nnU-Net: Towards Automated Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割(Medical Image Segmentation, MIS)中模型配置自动化与资源约束之间的矛盾,特别是在传统方法如nnU-Net中固定超参数和启发式设计选择带来的局限性。其解决方案的关键在于提出一种全自动化机器学习框架Auto-nnU-Net,该框架支持超参数优化(HPO)、神经网络架构搜索(NAS)以及分层NAS(HNAS),并通过引入Regularized PriorBand方法在模型精度与训练计算资源之间实现平衡,从而有效应对实际医疗场景中的资源限制问题。

链接: https://arxiv.org/abs/2505.16561
作者: Jannis Becktepe,Leona Hennig,Steffen Oeltze-Jafra,Marius Lindauer
机构: Institute of AI, Leibniz University Hannover(莱布尼茨汉诺威大学人工智能研究所); Lamarr Institute for Machine Learning and Artificial Intelligence(拉玛尔机器学习与人工智能研究所); Peter L. Reichertz Institute for Medical Informatics, Hannover Medical School(汉诺威医学院彼得·L·赖歇茨医学信息学研究所); CAIMed: Lower Saxony Center for AI & Causal Methods in Medicine(CAIMed:下萨克森州人工智能与因果方法中心); L3S Research Center( L3S 研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 31 pages, 19 figures. Accepted for publication at AutoML 2025

点击查看摘要

Abstract:Medical Image Segmentation (MIS) includes diverse tasks, from bone to organ segmentation, each with its own challenges in finding the best segmentation model. The state-of-the-art AutoML-related MIS-framework nnU-Net automates many aspects of model configuration but remains constrained by fixed hyperparameters and heuristic design choices. As a full-AutoML framework for MIS, we propose Auto-nnU-Net, a novel nnU-Net variant enabling hyperparameter optimization (HPO), neural architecture search (NAS), and hierarchical NAS (HNAS). Additionally, we propose Regularized PriorBand to balance model accuracy with the computational resources required for training, addressing the resource constraints often faced in real-world medical settings that limit the feasibility of extensive training procedures. We evaluate our approach across diverse MIS datasets from the well-established Medical Segmentation Decathlon, analyzing the impact of AutoML techniques on segmentation performance, computational efficiency, and model design choices. The results demonstrate that our AutoML approach substantially improves the segmentation performance of nnU-Net on 6 out of 10 datasets and performs on par on the remaining datasets while maintaining practical resource requirements. Our code is available at this https URL.
zh

[CV-71] xtureSAM: Towards a Texture Aware Foundation Model for Segmentation

【速读】:该论文旨在解决Segment Anything Models (SAM)在纹理主导场景下的分割性能不足问题,特别是其对形状特征的偏好导致在医学影像、材料分类和遥感等依赖纹理变化定义物体边界的领域中表现受限。解决方案的关键在于提出一种新的纹理感知基础模型TextureSAM,通过引入纹理增强技术进行微调,逐步调整训练图像以突出纹理特征,并利用经过纹理变换的ADE20K数据集引导模型优先关注由纹理定义的区域,从而缓解原始SAM模型中的形状偏差。

链接: https://arxiv.org/abs/2505.16540
作者: Inbal Cohen,Boaz Meivar,Peihan Tu,Shai Avidan,Gal Oren
机构: Tel Aviv University (特拉维夫大学); University of Maryland, College Park (马里兰大学学院公园分校); Stanford University (斯坦福大学); Technion (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large-scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM’s bias toward semantics over textures and introduce a new texture-aware foundation model, TextureSAM, which performs superior segmentation in texture-dominant scenarios. To achieve this, we employ a novel fine-tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture-alteration of the ADE20K dataset, we guide TextureSAM to prioritize texture-defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM-2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture-based segmentation datasets. The code and texture-augmented dataset will be publicly available.
zh

[CV-72] SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion

【速读】:该论文旨在解决动态三维场景重建中的时空表示效率与渲染质量之间的平衡问题,特别是在处理稀疏视角动态输入时的鲁棒性不足。其关键解决方案是提出一种融合显式三平面形变场、视图条件化的规范辐射场以及时间感知潜在扩散先验的框架,通过三组正交二维特征平面的时序演化实现高效紧凑的时空表征,并利用球谐函数注意力机制替代传统MLP解码器以提升可解释性与渲染效率,同时引入Transformer引导的潜在扩散模块,在压缩潜在空间中优化三平面与形变特征,从而增强场景表示的保真度与时间一致性。

链接: https://arxiv.org/abs/2505.16535
作者: Asrar Alruwayqi
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.
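三平面(tri-plane)表示的取特征操作很适合用代码说明:把 3D 坐标分别投到 XY/XZ/YZ 三个平面,用双线性插值取特征后聚合。下面是以求和聚合的最小草图;聚合方式与平面约定为常见做法,不保证与论文完全一致。

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """示意:从三个正交 2D 特征平面双线性采样并求和得到点特征。
    planes: (3, C, H, W),依次对应 XY/XZ/YZ 平面;xyz: (N, 3),取值 [-1, 1]。"""
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = 0
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                     # (1, N, 1, 2)
        f = F.grid_sample(plane.unsqueeze(0), grid,
                          align_corners=True)           # (1, C, N, 1)
        feats = feats + f.squeeze(0).squeeze(-1).t()    # (N, C)
    return feats

planes = torch.randn(3, 16, 32, 32)
pts = torch.rand(100, 3) * 2 - 1
f = sample_triplane(planes, pts)   # (100, 16)
```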
zh

[CV-73] Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction

【速读】:该论文旨在解决在线自由视角视频(online free-viewpoint video, FVV)重建中现有方法因点对点建模导致的存储需求过高的问题(3D Gaussian Splatting, 3DGS)。其解决方案的关键在于提出了一种名为ComGS(Compact Gaussian Streaming)的框架,该框架通过关键点驱动的运动表示来建模物体一致的高斯点运动,利用动态场景中运动的局部性和一致性,仅传输关键点属性以实现更高效的存储。

链接: https://arxiv.org/abs/2505.16533
作者: Jiacong Chen,Qingyu Mao,Youneng Bao,Xiandong Meng,Fanyang Meng,Ronggang Wang,Yongsheng Liang
机构: Shenzhen University (深圳大学); Shenzhen Technology University (深圳技术大学); City University of Hong Kong (香港城市大学); Pengcheng Laboratory (鹏城实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face prohibitive storage requirements, primarily due to point-wise modeling that fails to exploit the motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework, leveraging the locality and consistency of motion in dynamic scenes, that models object-consistent Gaussian point motion through keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159 X compared to 3DGStream and 14 X compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed. Our code will be released.
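“关键点驱动 + 空间影响场”的传播步骤可以示意如下:以高斯径向核给出每个高斯点对各关键点的归一化影响权重,再加权合成运动场。核函数形式与 sigma 取值为假设,论文中的影响场由网络自适应预测,此处仅作直观演示。

```python
import torch

def propagate_keypoint_motion(gauss_xyz, kp_xyz, kp_motion, sigma=0.1):
    """示意:用高斯径向权重把关键点运动传播到邻近高斯点。"""
    d2 = torch.cdist(gauss_xyz, kp_xyz) ** 2              # (N_gauss, N_kp)
    w = torch.exp(-d2 / (2 * sigma ** 2))
    w = w / (w.sum(dim=1, keepdim=True) + 1e-8)           # 归一化影响权重
    return w @ kp_motion                                  # (N_gauss, 3) 运动场

gauss = torch.rand(1000, 3)
kps = torch.rand(16, 3)
kp_delta = torch.randn(16, 3) * 0.01
motion = propagate_keypoint_motion(gauss, kps, kp_delta)
```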
zh
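
为便于理解“关键点驱动的运动表示”,下面给出一个假设性示意:以高斯径向基函数作为空间影响场,把稀疏关键点的位移传播到邻近高斯点。影响场的具体形式与带宽参数均为本文示例的假设,并非 ComGS 的原始实现:

```python
import numpy as np

def propagate_motion(points, keypoints, kp_disp, sigma=0.1):
    """
    points:    (N, 3) 高斯点位置
    keypoints: (K, 3) 运动敏感关键点位置
    kp_disp:   (K, 3) 关键点位移
    以高斯RBF权重(假设形式)把关键点运动传播到邻近点。
    """
    d2 = ((points[:, None, :] - keypoints[None, :, :]) ** 2).sum(-1)  # (N, K) 平方距离
    w = np.exp(-d2 / (2 * sigma ** 2))
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)                     # 归一化权重
    return points + w @ kp_disp                                        # 加权位移

pts = np.random.rand(5000, 3)
kps = np.random.rand(16, 3)
disp = 0.05 * np.random.randn(16, 3)
moved = propagate_motion(pts, kps, disp)
```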

[CV-74] CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving

【速读】:该论文旨在解决自主驾驶系统在动态和不可预测的测试时条件下保持鲁棒的3D感知能力的问题,尤其是针对高方差任务如3D目标检测中现有测试时自适应(TTA)方法因优化不稳定和尖锐极小值而失效的问题。其解决方案的关键在于提出一种轻量且可扩展的模型融合框架CodeMerge,该框架通过在紧凑的潜在空间中操作,避免了传统方法对完整模型加载和多次前向传递的计算开销,利用源模型倒数第二层特征生成低维指纹,并基于这些指纹计算融合系数,从而实现高效的模型组合而不牺牲适应质量。

链接: https://arxiv.org/abs/2505.16524
作者: Huitong Yang,Zhuoxiao Chen,Fengyi Zhang,Zi Huang,Yadan Luo
机构: The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model’s penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection by 14.9% NDS on nuScenes-C and LiDAR-based detection by over 7.6% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction and planning even without training. Code and pretrained models are released in the supplementary material.
zh
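
摘要提到在低维指纹上用“岭杠杆分数(ridge leverage scores)”计算融合系数。下面按其标准定义 $\tau_i = f_i^\top (F^\top F + \lambda I)^{-1} f_i$ 给出示意实现,指纹矩阵为随机示例,归一化为凸组合系数的方式是本文的假设:

```python
import numpy as np

def ridge_leverage_scores(F, lam=1.0):
    """F: (M, d),每行是一个检查点的低维指纹;返回每行的岭杠杆分数。"""
    G = F.T @ F + lam * np.eye(F.shape[1])
    Ginv = np.linalg.inv(G)
    # tau_i = f_i^T Ginv f_i,即 diag(F @ Ginv @ F^T)
    return np.einsum('id,de,ie->i', F, Ginv, F)

def merge_weights(F, lam=1.0):
    tau = ridge_leverage_scores(F, lam)
    return tau / tau.sum()   # 假设:归一化为凸组合系数

fingerprints = np.random.randn(8, 64)   # 8 个检查点、64 维指纹(假设规模)
w = merge_weights(fingerprints)
# 参数合并示意: theta_merged = sum_i w[i] * theta_i
print(w, w.sum())
```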

[CV-75] ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在机器人操作任务中依赖昂贵的人工标注数据集导致泛化能力不足以及在分布外(out-of-domain, OOD)场景下表现不佳的问题。其解决方案的关键在于提出一种名为ManipLVM-R1的强化学习框架,该框架采用可验证奖励的强化学习(Reinforcement Learning using Verifiable Rewards, RLVR)替代传统监督学习,通过直接优化任务对齐的结果来提升模型的泛化能力和物理推理能力,同时消除了对人工标注数据的依赖。

链接: https://arxiv.org/abs/2505.16517
作者: Zirui Song,Guangxian Ouyang,Mingzhe Li,Yuheng Ji,Chenxi Wang,Zixiang Xu,Zeyu Zhang,Xiaoqing Zhang,Qian Jiang,Zhenhao Chen,Zhongzhi Li,Rui Yan,Xiuying Chen
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); ByteDance (字节跳动); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); The Australia National University (澳大利亚国立大学); Renmin University of China (中国人民大学); Wuhan University (武汉大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
zh
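
摘要中的两类基于规则的可验证奖励可以用很简单的函数形式示意:可供性感知奖励可用预测交互区域与标注区域的 IoU 表示,轨迹匹配奖励可用预测轨迹与参考轨迹距离的指数衰减表示。以下实现是按这一思路写的假设性草图,具体奖励形式以论文为准:

```python
import numpy as np

def affordance_reward(pred_box, gt_box):
    """IoU 作为可供性定位奖励(假设形式);box 格式为 (x1, y1, x2, y2)。"""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def trajectory_reward(pred_traj, ref_traj, scale=0.1):
    """按逐点平均距离做指数衰减(假设形式);traj: (T, 2) 或 (T, 3)。"""
    d = np.linalg.norm(pred_traj - ref_traj, axis=-1).mean()
    return float(np.exp(-d / scale))

print(affordance_reward((0, 0, 2, 2), (1, 1, 3, 3)))       # 部分重叠
print(trajectory_reward(np.zeros((10, 3)), np.zeros((10, 3))))  # 完全匹配 -> 1.0
```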

[CV-76] Detailed Evaluation of Modern Machine Learning Approaches for Optic Plastics Sorting

【速读】:该论文试图解决当前废弃物回收中塑料回收率低的问题,特别是针对物料回收设施(MRFs)中自动化分拣系统的效率不足。其解决方案的关键在于评估光学识别技术在实际场景下的有效性,通过构建大规模数据集并应用先进的计算机视觉方法,如Mask R-CNN和YOLO等目标检测算法,以分析光学检测在塑料分类中的能力与局限性。研究发现,现有光学识别方法在真实环境中的分拣效果有限,主要受限于对颜色和形状等物理特性的依赖。

链接: https://arxiv.org/abs/2505.16513
作者: Vaishali Maheshkar,Aadarsh Anantha Ramakrishnan,Charuvahan Adhivarahan,Karthik Dantu
机构: University at Buffalo(布法罗大学); NIT Trichy(印度国立理工学院蒂鲁吉拉帕利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 2024 REMADE Circular Economy Tech Summit and Conference, this https URL

点击查看摘要

Abstract:According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near-Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two-stage detectors like Mask R-CNN use region proposals and classification with deep backbones like ResNet. Single-stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optic detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad-CAM, saliency maps, and confusion matrices were employed to interpret model behavior. We perform this analysis on our custom trained models from the compiled datasets. To conclude, our findings are that optic recognition methods have limited success in accurate sorting of real-world plastics at MRFs, primarily because they rely on physical properties such as color and shape.
zh

[CV-77] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

【速读】:该论文旨在解决基于扩散模型的数字人生成技术对现有检测策略带来的严重挑战,特别是其高真实感和隐蔽性导致的检测效果显著下降问题。解决方案的关键在于构建了一个大规模多模态数字人伪造数据集DigiFakeAV,并提出了一种基于时空与跨模态融合的检测基线DigiShield,通过联合建模视频的3D时空特征和音频的语义-声学特征,实现了对合成视频中隐蔽伪影的细粒度分析与有效识别。

链接: https://arxiv.org/abs/2505.16512
作者: Jiaxin Liu,Jia Wang,Saihui Hou,Min Ren,Huijia Wu,Zhaofeng He
机构: Beijing University of Posts and Telecommunications(北京邮电大学); Beijing Normal University(北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods (Sonic, Hallo, etc.) and a voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.
zh

[CV-78] ALTo: Adaptive-Length Tokenizer for Autoregressive Mask Generation

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理视觉对象和形状时因固定令牌表示而受到的限制问题。其关键解决方案是提出ALTo,一种用于自回归掩码生成的自适应长度分词器,通过设计新颖的令牌长度预测器、长度正则化项以及可微分的令牌分块策略,实现对不同复杂度的视觉内容进行自适应的令牌分配。

链接: https://arxiv.org/abs/2505.16495
作者: Lingfeng Wang,Hualing Lin,Senda Chen,Tao Wang,Changxu Cheng,Yangyang Zhong,Dong Zheng,Wuyue Zhao
机构: Uni-Ubi; Zhejiang University(浙江大学); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While humans effortlessly draw visual objects and shapes by adaptively allocating attention based on their complexity, existing multimodal large language models (MLLMs) remain constrained by rigid token representations. Bridging this gap, we propose ALTo, an adaptive length tokenizer for autoregressive mask generation. To achieve this, a novel token length predictor is designed, along with a length regularization term and a differentiable token chunking strategy. We further build ALToLLM that seamlessly integrates ALTo into MLLM. Preferences on the trade-offs between mask quality and efficiency is implemented by group relative policy optimization (GRPO). Experiments demonstrate that ALToLLM achieves state-of-the-art performance with adaptive token cost on popular segmentation benchmarks. Code and models are released at this https URL.
zh

[CV-79] Implicit Neural Shape Optimization for 3D High-Contrast Electrical Impedance Tomography

【速读】:该论文旨在解决高对比度电导率在材料界面处存在突变的三维电阻抗断层成像(Electrical Impedance Tomography, EIT)中的重建问题,此类问题在金属植入物监测和工业缺陷检测中较为常见,传统方法因严重不适定性而难以有效处理。其解决方案的关键在于将形状优化与隐式神经表示相结合,核心创新包括基于形状导数的优化方案,该方案显式地纳入了高对比度界面条件,以及一种高效的潜在空间表示,用于降低变量维度。

链接: https://arxiv.org/abs/2505.16487
作者: Junqing Chen,Haibo Liu
机构: Tsinghua University (清华大学)
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a novel implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), addressing scenarios where conductivity exhibits sharp discontinuities across material interfaces. These high-contrast cases, prevalent in metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods due to severe ill-posedness. Our approach synergizes shape optimization with implicit neural representations, introducing key innovations including a shape derivative-based optimization scheme that explicitly incorporates high-contrast interface conditions and an efficient latent space representation that reduces variable dimensionality. Through rigorous theoretical analysis of algorithm convergence and extensive numerical experiments, we demonstrate substantial performance improvements, establishing our framework as promising for practical applications in medical imaging with metallic implants and industrial non-destructive testing.
zh

[CV-80] InspectionV3: Enhancing Tobacco Quality Assessment with Deep Convolutional Neural Networks for Automated Workshop Management

【速读】:该论文旨在解决烟草加工车间在烘烤质量、供应一致性、排程不规律及监管缺失等方面的问题,这些问题导致成本增加和产品质量下降。其解决方案的关键在于提出InspectionV3系统,该系统基于定制化的深度卷积神经网络(Deep Convolutional Neural Network)架构,通过构建包含21,113张图像的标注数据集,覆盖20个质量等级,实现对烟叶的自动化分级。该系统通过多层CNN结合批量归一化技术,捕捉如渗透性和湿度斑点等细微特征,从而提升工作流程的智能化水平,并支持实时分级与数据分析,最终达到提高准确率、精确率、召回率、F1分数和AUC等性能指标的目标。

链接: https://arxiv.org/abs/2505.16485
作者: Yao Wei,Muhammad Usman,Hazrat Bilal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 15 figures, 2 Tables

点击查看摘要

Abstract:The problems that tobacco workshops encounter include poor curing, inconsistencies in supplies, irregular scheduling, and a lack of oversight, all of which drive up expenses and worsen quality. Large quantities make manual examination costly, sluggish, and unreliable. Deep convolutional neural networks have recently made strides in capabilities that transcend those of conventional methods. Nevertheless, extensive customization is needed to account for subtle variations in tobacco grade. This study introduces InspectionV3, an integrated solution for automated flue-cured tobacco grading that makes use of a customized deep convolutional neural network architecture. A scope that covers color, maturity, and curing subtleties is established via a labelled dataset consisting of 21,113 images spanning 20 quality classes. Expert annotators performed preprocessing on the tobacco leaf images, including cleaning, labelling, and augmentation. Multi-layer CNN factors use batch normalization to describe domain properties such as permeability and moisture spots, and so account for the subtleties of the workshop. Its expertise lies in converting visual patterns into useful information for enhancing workflow. Fast notifications are made possible by real-time, on-the-spot grading that matches human expertise. Image-powered analytics dashboards facilitate the tracking of yield projections, inventories, bottlenecks, and the optimization of data-driven choices. More labelled images are assimilated after further retraining, improving representational capacities and enabling adaptations for seasonal variability. Metrics demonstrate 97% accuracy, 95% precision and recall, 96% F1-score and AUC, and 95% specificity, validating real-world viability.
zh

[CV-81] Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration

【速读】:该论文旨在解决多天气条件影响下的夜间图像恢复问题,该问题在现实世界中因多种天气条件与夜间光照效应共存而显得复杂且研究不足。其解决方案的关键在于提出ClearNight框架,该框架通过提取基于Retinex的双先验,并分别引导网络关注不均匀光照区域和固有纹理内容,从而提升夜间场景下的恢复效果。此外,引入了天气感知的动态特异性-共性协作方法,以识别天气退化并自适应选择与特定天气类型相关的最优候选单元。

链接: https://arxiv.org/abs/2505.16479
作者: Yuetong Liu,Yunqiu Xu,Yang Wei,Xiuli Bi,Bin Xiao
机构: Chongqing University of Posts and Telecommunications (重庆邮电大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 20 figures

点击查看摘要

Abstract:Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, as multiple weather conditions often coexist in the real world alongside various lighting effects at night. This paper first explores the challenging multi-weather nighttime image restoration task, where various types of weather degradations are intertwined with flare effects. To support the research, we contribute the AllWeatherNight dataset, featuring large-scale high-quality nighttime images with diverse compositional degradations, synthesized using our introduced illumination-aware degradation generation. Moreover, we present ClearNight, a unified nighttime image restoration framework, which effectively removes complex degradations in one go. Specifically, ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, thereby enhancing restoration effectiveness in nighttime scenarios. In order to better represent the common and unique characters of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects optimal candidate units associated with specific weather types. Our ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of AllWeatherNight dataset as well as the effectiveness of ClearNight. Project page: this https URL
zh

[CV-82] Consistent World Models via Foresight Diffusion

【速读】:该论文试图解决基于扩散模型的世界建模中样本一致性不足的问题,特别是在需要与真实轨迹对齐的场景下,传统扩散模型由于条件理解与目标去噪在共享架构中的耦合导致预测能力欠佳。解决方案的关键在于提出Foresight Diffusion(ForeDiff),通过解耦条件理解与目标去噪过程,引入独立的确定性预测流来处理条件输入,并利用预训练预测器提取有助于生成的表征,从而提升预测精度与样本一致性。

链接: https://arxiv.org/abs/2505.16474
作者: Yu Zhang,Xingzhuo Guo,Haoran Xu,Mingsheng Long
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
zh

[CV-83] AnchorFormer: Differentiable Anchor Attention for Efficient Vision Transformer

【速读】:该论文旨在解决视觉变压器(Vision Transformer, ViT)在处理图像时因计算全局自注意力而产生的二次复杂度问题,以及由于输入图像分割粒度较小导致的高时间成本问题。同时,论文还关注到关键信息通常集中在输入图像的少数区域,部分标记对下游任务无显著贡献。其解决方案的关键在于引入基于锚点的高效视觉变压器(AnchorFormer),通过使用锚点标记来学习关键信息并加速推理过程,具体通过估计锚点与标记之间的二部注意力,将复杂度从 $\mathcal{O}(n^2)$ 降低至 $\mathcal{O}(mn)$,其中 $m$ 为锚点数量且 $m \ll n$,并通过神经网络层中的神经元表示锚点,实现可微学习及通过马尔可夫过程近似全局自注意力。

链接: https://arxiv.org/abs/2505.16463
作者: Jiquan Shan,Junxiao Wang,Lifeng Zhao,Liang Cai,Hongyuan Zhang,Ioannis Liritzis
机构: PetroChina Changqing Oilfield Company (中国石油长庆油田公司); The University of Hong Kong (香港大学); Alma Mater Europaea University (阿尔玛·马特拉欧洲大学); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, vision transformers (ViTs) have achieved excellent performance on vision tasks by measuring the global self-attention among the image patches. Given n patches, they will have quadratic complexity of $\mathcal{O}(n^2)$, and the time cost is high when splitting the input image with a small granularity. Meanwhile, the pivotal information is often randomly gathered in a few regions of an input image, and some tokens may not be helpful for the downstream tasks. To handle this problem, we introduce an anchor-based efficient vision transformer (AnchorFormer), which employs the anchor tokens to learn the pivotal information and accelerate the inference. Firstly, by estimating the bipartite attention between the anchors and tokens, the complexity will be reduced from $\mathcal{O}(n^2)$ to $\mathcal{O}(mn)$, where m is the number of anchors and $m \ll n$. Notably, by representing the anchors with the neurons in a neural layer, we can differentiably learn these distributions and approximate global self-attention through the Markov process. Moreover, we extend the proposed model to three downstream tasks including classification, detection, and segmentation. Extensive experiments show the effectiveness of our AnchorFormer, e.g., achieving up to a 9.0% higher accuracy or 46.7% FLOPs reduction on ImageNet classification, 81.3% higher mAP on COCO detection under comparable FLOPs, as compared to the current baselines.
zh
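
下面用一个简化的注意力模块说明“锚点-标记二部注意力”如何把复杂度从 $\mathcal{O}(n^2)$ 降到 $\mathcal{O}(mn)$:先让 m 个可学习锚点对 n 个标记做注意力聚合,再把聚合结果广播回标记。这只是按摘要思想写的示意,并非 AnchorFormer 的真实结构:

```python
import torch
import torch.nn as nn

class AnchorAttention(nn.Module):
    """示意:m 个可学习锚点与 n 个标记之间的二部注意力,复杂度 O(mn)。"""
    def __init__(self, dim, num_anchors=64):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim))  # 锚点即神经元
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (B, n, dim)
        B = x.size(0)
        a = self.anchors.unsqueeze(0).expand(B, -1, -1)
        # 第一步:锚点从标记聚合信息,注意力矩阵形状 (B, m, n)
        attn1 = (self.q(a) @ self.k(x).transpose(1, 2)) * self.scale
        ctx = attn1.softmax(-1) @ self.v(x)       # (B, m, dim)
        # 第二步:标记从锚点读取全局信息,注意力矩阵形状 (B, n, m)
        attn2 = (self.q(x) @ self.k(ctx).transpose(1, 2)) * self.scale
        return attn2.softmax(-1) @ self.v(ctx)    # (B, n, dim)

x = torch.randn(2, 196, 128)
print(AnchorAttention(128, 64)(x).shape)  # torch.Size([2, 196, 128])
```

两次二部注意力的代价均为 O(mn),当 m 远小于 n 时显著低于标准自注意力的 O(n²)。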

[CV-84] MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM

【速读】:该论文试图解决动态3D内容生成中物理一致性不足的问题,即现有视频生成模型在追求视觉真实感的同时,往往忽视了物理合理性,导致物体运动不符合物理规律。其解决方案的关键在于提出MAGIC框架,该框架无需训练即可实现单图像的物理属性推断与动态生成,通过整合预训练的图像到视频扩散模型与迭代的大型语言模型(LLM)推理,结合基于置信度驱动的LLM反馈机制,引导扩散模型生成符合物理规律的运动。此外,引入了一个可微分的MPM(Material Point Method)模拟器,直接在单图像重建的3D高斯分布上运行,从而生成具有物理基础且可控制的动态输出。

链接: https://arxiv.org/abs/2505.16456
作者: Siwei Meng,Yawei Luo,Ping Liu
机构: University of Nevada Reno (内华达大学里诺分校); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in static 3D generation have intensified the demand for physically consistent dynamic 3D content. However, existing video generation models, including diffusion-based methods, often prioritize visual realism while neglecting physical plausibility, resulting in implausible object dynamics. Prior approaches for physics-aware dynamic generation typically rely on large-scale annotated datasets or extensive model fine-tuning, which imposes significant computational and data collection burdens and limits scalability across scenarios. To address these challenges, we present MAGIC, a training-free framework for single-image physical property inference and dynamic generation, integrating pretrained image-to-video diffusion models with iterative LLM-based reasoning. Our framework generates motion-rich videos from a static image and closes the visual-to-physical gap through a confidence-driven LLM feedback loop that adaptively steers the diffusion model toward physics-relevant motion. To translate visual dynamics into controllable physical behavior, we further introduce a differentiable MPM simulator operating directly on 3D Gaussians reconstructed from the single image, enabling physically grounded, simulation-ready outputs without any supervision or model tuning. Experiments show that MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than state-of-the-art video diffusion models.
zh

[CV-85] CMRINet: Joint Groupwise Registration and Segmentation for Cardiac Function Quantification from Cine-MRI

【速读】:该论文旨在解决传统心脏功能评估方法在准确性和可重复性上的局限性,特别是左心室射血分数(LVEF)受多种因素影响以及在某些心脏疾病中可能无法准确反映心肌收缩功能的问题。其解决方案的关键在于提出一种端到端的深度学习(DL)模型,该模型能够联合估计群体配准(GW)和分割,从而实现对心脏电影磁共振成像(cine-MRI)图像的自动化分析,提高评估效率与准确性。

链接: https://arxiv.org/abs/2505.16452
作者: Mohamed S. Elmahdy,Marius Staring,Patrick J. H. de Koning,Samer Alabed,Mahan Salehi,Faisal Alandejani,Michael Sharkey,Ziad Aldabbagh,Andrew J. Swift,Rob J. van der Geest
机构: Leiden University Medical Center(莱顿大学医学中心); University of Sheffield(谢菲尔德大学); INSIGNEO, Institute for In Silico Medicine(INSIGNEO,数字医学研究所); Sheffield Teaching Hospitals(谢菲尔德教学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 7 figures, 1 appendix

点击查看摘要

Abstract:Accurate and efficient quantification of cardiac function is essential for the estimation of prognosis of cardiovascular diseases (CVDs). One of the most commonly used metrics for evaluating cardiac pumping performance is left ventricular ejection fraction (LVEF). However, LVEF can be affected by factors such as inter-observer variability and varying pre-load and after-load conditions, which can reduce its reproducibility. Additionally, cardiac dysfunction may not always manifest as alterations in LVEF, such as in heart failure and cardiotoxicity diseases. An alternative measure that can provide a relatively load-independent quantitative assessment of myocardial contractility is myocardial strain and strain rate. By using LVEF in combination with myocardial strain, it is possible to obtain a thorough description of cardiac function. Automated estimation of LVEF and other volumetric measures from cine-MRI sequences can be achieved through segmentation models, while strain calculation requires the estimation of tissue displacement between sequential frames, which can be accomplished using registration models. These tasks are often performed separately, potentially limiting the assessment of cardiac function. To address this issue, in this study we propose an end-to-end deep learning (DL) model that jointly estimates groupwise (GW) registration and segmentation for cardiac cine-MRI images. The proposed anatomically-guided Deep GW network was trained and validated on a large dataset of 4-chamber view cine-MRI image series of 374 subjects. A quantitative comparison with conventional GW registration using elastix and two DL-based methods showed that the proposed model improved performance and substantially reduced computation time.
zh

[CV-86] AT-VPR: Ternary Adaptive Transformer for Dynamic and Efficient Visual Place Recognition

【速读】:该论文旨在解决视觉SLAM(同步定位与建图)中回环检测的精度与效率平衡问题,特别是在资源受限的微小型无人机(micro-UAV)和嵌入式SLAM系统上的应用。其解决方案的关键在于提出一种三值化Transformer模型TAT-VPR,通过将三值权重与学习到的激活稀疏性门控机制相结合,在运行时实现高达40%的计算量缩减,同时保持性能(Recall@1)不下降,此外还采用两阶段知识蒸馏流程以保留描述符质量,从而在保证定位精度的同时适应嵌入式平台的计算限制。

链接: https://arxiv.org/abs/2505.16447
作者: Oliver Grainge,Michael Milford,Indu Bodala,Sarvapali D. Ramchurn,Shoaib Ehsan
机构: University of Southampton (南安普顿大学); Queensland University of Technology (昆士兰科技大学); University of Essex (埃塞克斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:TAT-VPR is a ternary-quantized transformer that brings dynamic accuracy-efficiency trade-offs to visual SLAM loop-closure. By fusing ternary weights with a learned activation-sparsity gate, the model can control computation by up to 40% at run-time without degrading performance (Recall@1). The proposed two-stage distillation pipeline preserves descriptor quality, letting it run on micro-UAV and embedded SLAM stacks while matching state-of-the-art localization accuracy.
zh
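
摘要中的“三值权重”通常指把权重量化到 {-1, 0, +1} 再乘以缩放因子。下面给出常见的阈值三值化写法(0.7×平均绝对值的阈值是 TWN 等工作中的常用启发式,这里仅作假设),并配一个激活稀疏性门控的示意,均非 TAT-VPR 的原始实现:

```python
import torch

def ternarize(w):
    """把权重量化到 {-alpha, 0, +alpha}(常见阈值启发式,非原论文实现)。"""
    delta = 0.7 * w.abs().mean()                 # 三值化阈值
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)  # 非零项的缩放因子
    return alpha * torch.sign(w) * mask

def sparsity_gate(act, keep_ratio=0.6):
    """示意:按幅值保留 top-k 激活、其余置零,以在运行时削减计算量。"""
    k = max(1, int(keep_ratio * act.numel()))
    thresh = act.abs().flatten().kthvalue(act.numel() - k + 1).values
    return act * (act.abs() >= thresh)

w = torch.randn(256, 256)
print(ternarize(w).unique().numel())   # 约为 3 个取值
print((sparsity_gate(torch.randn(1, 1024)) != 0).float().mean())  # ≈ keep_ratio
```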

[CV-87] MAFE R-CNN: Selecting More Samples to Learn Category-aware Features for Small Object Detection

【速读】:该论文旨在解决复杂环境中小目标检测的难题,其核心问题在于检测器难以有效学习小目标的判别性特征,同时在训练过程中难以选择高质量的小目标样本。解决方案的关键在于提出MAFE R-CNN框架,该框架包含两个核心组件:多线索样本选择(Multi-Clue Sample Selection, MCSS)策略和类别感知特征增强机制(Category-aware Feature Enhancement Mechanism, CFEM)。MCSS通过结合交并比(IoU)距离、预测类别置信度和真实框尺寸作为信息线索,实现多样化正样本的选择与目标尺寸分布的平衡;CFEM则通过引入一个简单有效的类别感知记忆模块,探索物体特征之间的关系,并通过类别感知特征与候选框的交互增强物体特征表示。

链接: https://arxiv.org/abs/2505.16442
作者: Yichen Li,Qiankun Liu,Zhenchao Jin,Jiuzhe Wei,Jing Nie,Ying Fu
机构: Beijing Institute of Technology (北京理工大学); University of Science and Technology Beijing (北京科技大学); University of Hong Kong (香港大学); Beijing Institute of Space Mechanics and Electricity (北京航天力学与电子技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Small object detection in intricate environments has consistently represented a major challenge in the field of object detection. In this paper, we identify that this difficulty stems from the detectors’ inability to effectively learn discriminative features for objects of small size, compounded by the complexity of selecting high-quality small object samples during training, which motivates the proposal of the Multi-Clue Assignment and Feature Enhancement R-CNN (MAFE R-CNN). Specifically, MAFE R-CNN integrates two pivotal components. The first is the Multi-Clue Sample Selection (MCSS) strategy, in which the Intersection over Union (IoU) distance, predicted category confidence, and ground truth region sizes are leveraged as informative clues in the sample selection process. This methodology facilitates the selection of diverse positive samples and ensures a balanced distribution of object sizes during training, thereby promoting effective model learning. The second is the Category-aware Feature Enhancement Mechanism (CFEM), where we propose a simple yet effective category-aware memory module to explore the relationships among object features. Subsequently, we enhance the object feature representation by facilitating the interaction between category-aware features and candidate box features. Extensive experiments conducted on the large-scale small object dataset SODA validate the effectiveness of the proposed method. The code will be made publicly available.
zh

[CV-88] Ranked Entropy Minimization for Continual Test-Time Adaptation ICML2025

【速读】:该论文旨在解决在持续测试时适应(continual test-time adaptation)场景下,基于熵最小化(entropy minimization)的方法容易出现模型崩溃(model collapse)的问题,即模型趋于对所有图像预测单一类别。解决方案的关键在于提出排序熵最小化(ranked entropy minimization),通过渐进式掩码策略显式构建预测难度结构,在保持熵的排名顺序的同时,逐步对齐不同预测难度层次上的模型概率分布,从而提升方法的稳定性与适用性。

链接: https://arxiv.org/abs/2505.16441
作者: Jisu Han,Jaemin Na,Wonjun Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICML 2025

点击查看摘要

Abstract:Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continuous scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model’s probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results. Our code is available at this https URL
zh
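
为说明“在保持熵的排名顺序的同时做渐进式掩码”的思路,下面给出一个高度简化的示意:对一个批次的预测按熵排序,只对熵较低(更可靠)的样本施加熵最小化损失,并让掩码比例随训练进度逐步放宽。具体损失形式与调度策略均为本文假设,并非论文原始算法:

```python
import torch

def ranked_entropy_loss(logits, keep_ratio):
    """
    logits: (B, C);keep_ratio ∈ (0, 1],随训练进度逐步增大(渐进式掩码)。
    仅最小化熵最低的前 keep_ratio 部分样本的熵,保留整体的熵排名结构。
    """
    probs = logits.softmax(-1)
    ent = -(probs * probs.clamp(min=1e-8).log()).sum(-1)   # 每个样本的熵 (B,)
    k = max(1, int(keep_ratio * ent.numel()))
    idx = ent.argsort()[:k]                                 # 熵最低的 k 个样本
    return ent[idx].mean()

logits = torch.randn(32, 10, requires_grad=True)
loss = ranked_entropy_loss(logits, keep_ratio=0.25)        # 训练早期仅用最可靠样本
loss.backward()
```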

[CV-89] Joint Flow And Feature Refinement Using Attention For Video Restoration

【速读】:该论文旨在解决视频修复中由于使用退化输入视频帧而导致的时间一致性难以保持的问题。其解决方案的关键在于提出一种名为联合流与特征精炼注意力机制(JFFRA)的框架,该框架通过流(对齐)与修复任务之间的协同迭代增强数据,利用先前增强的特征来精炼流信息,反之亦然,从而高效地利用时间信息进行特征增强,并在多尺度上执行流与修复之间的交互,降低对精确光流估计的依赖。

链接: https://arxiv.org/abs/2505.16434
作者: Ranjith Merugu,Mohammad Sameer Suhail,Akshay P Sarashetti,Venkata Bharath Reddy Reddem,Pankaj Kumar Bajpai,Amit Satish Unde
机构: Samsung(三星)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. Compared with static images, the performance of video restoration significantly depends on efficient exploitation of temporal correlations among successive video frames. Numerous techniques make use of temporal information via flow-based strategies or recurrent architectures. However, these methods often encounter difficulties in preserving temporal consistency as they utilize degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). The proposed JFFRA is based on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By leveraging previously enhanced features to refine flow and vice versa, JFFRA enables efficient feature enhancement using temporal information. This interplay between flow and restoration is executed at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to enhance the network’s capability in eliminating flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across various restoration tasks such as denoising, deblurring, and super-resolution. Our method demonstrates a remarkable performance improvement of up to 1.62 dB compared to state-of-the-art approaches.
zh

[CV-90] Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach

【速读】:该论文试图解决移动设备自动控制中因每一步可用环境信息有限(主要通过视觉观察获取)而导致的复杂多步骤任务执行效率低下的问题。现有方法通常依赖于仅基于即时观察的反应式策略,容易导致次优决策。解决方案的关键在于提出一种名为Foresighted Planning with World Model-Driven Code Execution (FPWC)的框架,该框架通过在任务开始时构建面向任务、可细化的世界模型(world model),优先提升智能体对环境的全局理解,并通过在该世界模型内迭代规划生成前瞻性行动,以可执行代码的形式进行任务执行。

链接: https://arxiv.org/abs/2505.16422
作者: Xiaoran Yin,Xu Luo,Hao Wu,Lianli Gao,Jingkuan Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The automatic control of mobile devices is essential for efficiently performing complex tasks that involve multiple sequential steps. However, these tasks pose significant challenges due to the limited environmental information available at each step, primarily through visual observations. As a result, current approaches, which typically rely on reactive policies, focus solely on immediate observations and often lead to suboptimal decision-making. To address this problem, we propose Foresighted Planning with World Model-Driven Code Execution (FPWC), a framework that prioritizes natural language understanding and structured reasoning to enhance the agent’s global understanding of the environment by developing a task-oriented, refinable world model at the outset of the task. Foresighted actions are subsequently generated through iterative planning within this world model, executed in the form of executable code. Extensive experiments conducted in simulated environments and on real mobile devices demonstrate that our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate compared to the state-of-the-art in the simulated environment. Code and demo are provided in the supplementary material.
zh

[CV-91] Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

【速读】:该论文试图解决人类如何通过学习机制获得物体内部表征的问题,以及深度神经网络(DNN)在多大程度上能够模仿人类的物体表征。其解决方案的关键在于采用一种基于Gromov-Wasserstein最优传输的无监督对齐方法,该方法能够估计人类与模型之间物体表征的最优细粒度映射,从而在细粒度和粗粒度层面上比较两者的一致性。

链接: https://arxiv.org/abs/2505.16419
作者: Soh Takahashi,Masaru Sasaki,Ken Takeda,Masafumi Oizumi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 34 pages, 6 figures

点击查看摘要

Abstract:The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional representational similarity analysis is that it estimates optimal fine-grained mappings between the representation of each object in human and model representations. We used this unsupervised alignment method to assess the extent to which the representation of each object in humans is correctly mapped to the corresponding representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
zh
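
摘要中的无监督对齐可以用 POT(Python Optimal Transport)库直接实验:给定人类与模型各自的物体间相异度矩阵,Gromov-Wasserstein 最优传输会估计两空间之间的软对应矩阵,其对角占优程度即可反映“同一物体是否被正确映射”。以下为最小示例,数据为随机生成,仅演示调用方式:

```python
import numpy as np
import ot  # pip install pot

n = 50
# 假设:两组物体的特征(真实场景中分别来自人类相似度判断与模型表征)
X, Y = np.random.rand(n, 16), np.random.rand(n, 16)
C1 = ot.dist(X, X)          # 人类侧相异度矩阵
C2 = ot.dist(Y, Y)          # 模型侧相异度矩阵
p = q = ot.unif(n)          # 均匀边缘分布

# Gromov-Wasserstein 最优传输:估计两空间之间的软对应矩阵 T
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')

# 细粒度匹配率:每个物体的最优传输质量是否落在对应物体上
matching_rate = (T.argmax(axis=1) == np.arange(n)).mean()
print(f"one-to-one matching rate: {matching_rate:.2f}")
```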

[CV-92] Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

【速读】:该论文试图解决在将旋转位置编码(Rotary Position Embedding, RoPE)扩展到大型视觉语言模型(Large Vision-Language Models, LVLMs)时引入的跨模态位置偏差问题。具体而言,现有方法在文本标记索引与图像标记之间强制相对位置依赖关系,导致虚假对齐。解决方案的关键在于提出Per-Token Distance(PTD)作为量化跨模态位置编码独立性的有效指标,并引入Circle-RoPE编码方案,将图像标记索引映射到与文本标记索引线性路径正交的圆形轨迹,形成类似圆锥的结构,从而确保每个文本标记与所有图像标记保持等距,减少人工跨模态偏差的同时保留图像内的空间信息。

链接: https://arxiv.org/abs/2505.16416
作者: Chengcheng Wang,Jianyuan Guo,Hongguang Li,Yuchuan Tian,Ying Nie,Chang Xu,Kai Han
机构: Huawei Noah’s Ark Lab. (华为诺亚方舟实验室); City University of Hong Kong (香港城市大学); University of Sydney (悉尼大学); State Key Lab of General AI, School of Intelligence Science and Technology, Peking University (人工智能通用国家重点实验室,北京大学智能科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model’s overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at this https URL.
zh
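
Circle-RoPE 的几何直觉是:文本标记的索引沿直线递增,图像标记的索引映射到与该直线正交平面内的圆周上,使每个文本标记到所有图像标记等距。下面仅演示这一位置映射本身(半径与角度分配方式为假设,不涉及完整的 RoPE 旋转实现):

```python
import numpy as np

def circle_positions(num_text, num_image, radius=1.0):
    """
    文本标记:沿 x 轴的线性位置 (t, 0, 0)。
    图像标记:位于 x = x0 平面内的圆周上,与文本轴正交(锥形结构)。
    """
    text_pos = np.stack([np.arange(num_text, dtype=float),
                         np.zeros(num_text), np.zeros(num_text)], axis=1)
    theta = 2 * np.pi * np.arange(num_image) / num_image
    x0 = num_text / 2.0                          # 假设:圆心放在文本轴中点
    img_pos = np.stack([np.full(num_image, x0),
                        radius * np.cos(theta),
                        radius * np.sin(theta)], axis=1)
    return text_pos, img_pos

t, im = circle_positions(8, 16)
d = np.linalg.norm(t[0] - im, axis=1)
print(d.std() < 1e-9)  # 任一文本标记到所有图像标记等距 -> True
```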

[CV-93] Pose-invariant face recognition via feature-space pose frontalization

【速读】:该论文旨在解决姿态不变的人脸识别(pose-invariant face recognition)问题,即在野外采集的侧脸图像与数据库中注册的正面人脸图像之间进行匹配。其解决方案的关键在于提出一种新的特征空间姿态正视化模块(feature space pose frontalization module, FSPFM),该模块能够将任意角度的侧脸图像转换为正面图像,并结合一种包含预训练和注意力引导微调阶段的新训练范式,以充分发挥FSPFM的潜力并提升性能。

链接: https://arxiv.org/abs/2505.16412
作者: Nikolay Stanishev,Yuhang Lu,Touradj Ebrahimi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods perform face frontalization via either generative models or learning a pose robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that not only our method outperforms the state-of-the-art in the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.
zh

[CV-94] Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression

【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在生成文本时产生的“幻觉”问题,即生成的内容与视觉上下文不一致。解决方案的关键在于提出SPIN,一种任务无关的注意力引导头抑制策略,该策略能够在推理过程中无缝集成,而不会显著增加计算或延迟开销。通过分析发现,幻觉可归因于每层中动态的注意力头子集,SPIN针对每个文本查询词元,选择性地抑制对图像词元关注较低的注意力头,同时保留Top-K注意力头,从而有效降低幻觉分数并提升吞吐量。

链接: https://arxiv.org/abs/2505.16411
作者: Sreetama Sarkar,Yue Che,Alex Gavin,Peter A. Beerel,Souvik Kundu
机构: University of Southern California (南加州大学); Intel Labs (英特尔实验室); Harvard-Westlake High School (哈佛-韦斯特莱克高中)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from “hallucinations”, generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at this https URL.
zh
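
SPIN 的核心操作可概括为:对文本查询词元,统计各注意力头分配给图像词元的注意力质量,仅保留其中 Top-K,其余头置零。下面按此思路写一个对注意力图做后处理的示意函数;真实实现需嵌入模型前向过程,Top-K 的选择粒度也以论文为准:

```python
import torch

def suppress_heads(attn, image_mask, top_k):
    """
    attn:       (H, L, L) 某层的注意力图(H 个头)
    image_mask: (L,) 布尔向量,标记哪些位置是图像词元
    对最后一个文本查询位置,仅保留对图像词元关注度最高的 top_k 个头。
    """
    img_attn = attn[:, -1, image_mask].sum(-1)   # 每个头分给图像词元的注意力 (H,)
    keep = img_attn.topk(top_k).indices
    mask = torch.zeros(attn.size(0), dtype=torch.bool)
    mask[keep] = True
    out = attn.clone()
    out[~mask, -1, :] = 0                         # 抑制其余头在该查询位置的贡献
    return out

attn = torch.rand(32, 128, 128).softmax(-1)
image_mask = torch.zeros(128, dtype=torch.bool); image_mask[:64] = True
pruned = suppress_heads(attn, image_mask, top_k=8)
```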

[CV-95] AdvReal: Adversarial Patch Generation Framework with Application to Adversarial Safety Evaluation of Object Detection Systems

【速读】:该论文旨在解决自主车辆中基于深度学习的感知方法在物理世界中对对抗样本的脆弱性问题,从而导致安全事故发生。其关键解决方案是提出一个统一的联合对抗训练框架,用于处理2D和3D样本,以应对现实场景中的类内多样性与环境变化问题,并引入一种结合非刚性表面建模和真实3D匹配机制的对抗样本现实增强方法。

链接: https://arxiv.org/abs/2505.16402
作者: Yuanhao Huang,Yilong Ren,Jinlei Wang,Lujia Huo,Xuesong Bai,Jinchuan Zhang,Haiyan Yu
机构: Beihang University (北京航空航天大学); Inner Mongolia University of Technology (内蒙古工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous vehicles are typical complex intelligent systems with artificial intelligence at their core. However, perception methods based on deep learning are extremely vulnerable to adversarial samples, resulting in safety accidents. How to generate effective adversarial examples in the physical world and evaluate object detection systems is a huge challenge. In this study, we propose a unified joint adversarial training framework for both 2D and 3D samples to address the challenges of intra-class diversity and environmental variations in real-world scenarios. Building upon this framework, we introduce an adversarial sample reality enhancement approach that incorporates non-rigid surface modeling and a realistic 3D matching mechanism. We compare with 5 advanced adversarial patches and evaluate their attack performance on 8 object detectors, including single-stage, two-stage, and transformer-based models. Extensive experiment results in digital and physical environments demonstrate that the adversarial textures generated by our method can effectively mislead the target detection model. Moreover, the proposed method demonstrates excellent robustness and transferability under multi-angle attacks, varying lighting conditions, and different distances in the physical world. The demo video and code can be obtained at this https URL.
zh

[CV-96] Sketchy Bounding-box Supervision for 3D Instance Segmentation CVPR2025

【速读】:该论文旨在解决在弱监督3D实例分割中,由于获取精确边界框(bounding box)存在挑战,导致模型性能受限的问题。其解决方案的关键在于提出一种名为Sketchy-3DIS的新型弱监督3D实例分割框架,该框架通过联合学习伪标签生成器(pseudo labeler)和分割器(segmentator),在不精确边界框(sketchy bounding box)监督下提升分割性能。具体而言,该方法首先引入自适应的从边界框到点的伪标签生成器,以生成紧凑且纯净的伪实例标签;随后设计了从粗到细的实例分割器,逐步细化分割结果;最终通过伪标签对分割器进行监督,实现高质量实例的逐步生成。

链接: https://arxiv.org/abs/2505.16399
作者: Qian Deng,Le Hui,Jin Xie,Jian Yang
机构: Nankai University(南开大学); Northwestern Polytechnical University(西北工业大学); Nanjing University(南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore the inaccurate bounding box, named sketchy bounding box, which is imitated through perturbing ground truth bounding box by adding scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly 3D instance segmentation framework, which jointly learns pseudo labeler and segmentator to improve the performance under the sketchy bounding-box supervisions. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapped parts between two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the region of coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods using sketchy bounding boxes. Code is available at this https URL.
zh
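
“sketchy bounding box”由对真值框施加缩放、平移与旋转扰动得到。下面给出一个 3D 框扰动的最小示意,扰动幅度与噪声形式均为本文假设:

```python
import numpy as np

def sketchify_box(center, size, yaw, s=0.1):
    """对 3D 真值框加入缩放/平移/旋转噪声,模拟不精确标注(幅度 s 为假设值)。"""
    rng = np.random.default_rng()
    new_size = size * (1 + rng.uniform(-s, s, 3))          # 缩放扰动
    new_center = center + rng.uniform(-s, s, 3) * size     # 按框尺寸比例平移
    new_yaw = yaw + rng.uniform(-s, s) * np.pi             # 旋转扰动
    return new_center, new_size, new_yaw

c, sz, yaw = np.array([1.0, 2.0, 0.5]), np.array([0.8, 0.8, 1.6]), 0.3
print(sketchify_box(c, sz, yaw))
```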

[CV-97] Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)

【速读】:该论文试图解决在端到端自动驾驶(end-to-end autonomous driving, E2E-AD)中,基于模仿学习(imitation learning, IL)存在的因果混淆和分布偏移问题,以及模型基础强化学习(model-based reinforcement learning, MBRL)在实际应用中对特权信息(privileged information)的依赖问题。解决方案的关键在于设计一种双流MBRL方法——Raw2Drive,通过高效训练一个辅助的特权世界模型与使用特权信息的神经规划器,并引入一种引导机制来训练原始传感器世界模型,确保其与特权世界模型在推演过程中的一致性,最终结合特权世界模型头部嵌入的先验知识,有效指导原始传感器策略的训练。

链接: https://arxiv.org/abs/2505.16394
作者: Zhenjie Yang,Xiaosong Jia,Qifeng Li,Xue Yang,Maoqing Yao,Junchi Yan
机构: Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学计算机科学与人工智能学院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) can mitigate the causal confusion and distribution shift inherent to imitation learning (IL). However, applying RL to end-to-end autonomous driving (E2E-AD) remains an open problem for its training difficulty, and IL is still the mainstream paradigm in both academia and industry. Recently, Model-based Reinforcement Learning (MBRL) has demonstrated promising results in neural planning; however, these methods typically require privileged information as input rather than raw sensor data. We fill this gap by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently train an auxiliary privileged world model paired with a neural planner that uses privileged information as input. Subsequently, we introduce a raw sensor world model trained via our proposed Guidance Mechanism, which ensures consistency between the raw sensor world model and the privileged world model during rollouts. Finally, the raw sensor world model combines the prior knowledge embedded in the heads of the privileged world model to effectively guide the training of the raw sensor policy. Raw2Drive is so far the only RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it achieves state-of-the-art performance.
zh

[CV-98] MAGE: A Multi-task Architecture for Gaze Estimation with an Efficient Calibration Module

【速读】:该论文旨在解决现有眼动估计方法仅能预测眼动方向或屏幕上的注视点(Point-of-Gaze, PoG),无法实现对三维空间中六自由度(6-DoF)眼动信息的全面分析的问题,同时应对个体间眼球形状和结构差异对模型泛化能力的限制。其解决方案的关键在于提出一种多任务眼动估计架构MAGE,该架构通过编码面部图像中的方向与位置特征,并利用专用信息流和多个解码器进行预测,同时引入高效的校准模块Easy-Calibration,以主体特定数据微调基础模型,从而减少个体差异的影响,且无需依赖屏幕进行校准。

链接: https://arxiv.org/abs/2505.16384
作者: Haoming Huang,Musen Zhang,Jianxin Yang,Zhen Li,Jinkai Li,Yao Guo
机构: Shanghai Jiao Tong University (上海交通大学); Chinese University of Hong Kong (深圳) (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Under review

点击查看摘要

Abstract:Eye gaze can provide rich information on human psychological activities, and has garnered significant attention in the field of Human-Robot Interaction (HRI). However, existing gaze estimation methods merely predict either the gaze direction or the Point-of-Gaze (PoG) on the screen, failing to provide sufficient information for a comprehensive six Degree-of-Freedom (DoF) gaze analysis in 3D space. Moreover, the variations of eye shape and structure among individuals also impede the generalization capability of these methods. In this study, we propose MAGE, a Multi-task Architecture for Gaze Estimation with an efficient calibration module, to predict the 6-DoF gaze information that is applicable to real-world HRI. Our basic model encodes both the directional and positional features from facial images, and predicts gaze results with dedicated information flow and multiple decoders. To reduce the impact of individual variations, we propose a novel calibration module, namely Easy-Calibration, to fine-tune the basic model with subject-specific data, which is efficient to implement without the need of a screen. Experimental results demonstrate that our method achieves state-of-the-art performance on the public MPIIFaceGaze, EYEDIAP, and our built IMRGaze datasets.
zh

[CV-99] DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos CVPR2025

【速读】:该论文旨在解决长视频时间定位(Long Video Temporal Grounding, LVTG)中由于将视频分割为多个片段并使用全规模专家编码器进行处理而导致的计算成本过高、难以扩展的问题。其解决方案的关键在于提出DeCafNet,该方法采用“委托与征服”策略,通过引入一个轻量级的辅助编码器(sidekick encoder)高效地提取所有视频片段的密集特征,并生成显著性图以筛选出最相关的片段交由专家编码器进行详细处理,从而在不牺牲定位性能的前提下显著降低计算开销。

链接: https://arxiv.org/abs/2505.16376
作者: Zijia Lu,A S M Iftekhar,Gaurav Mittal,Tianjian Meng,Xiawei Wang,Cheng Zhao,Rohith Kukkala,Ehsan Elhamifar,Mei Chen
机构: Microsoft(微软); Northeastern University(东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to prohibitive computational costs of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a “delegate-and-conquer” strategy to achieve computation efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at this https URL.
zh

[CV-100] mporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition

【速读】:该论文试图解决动态微表情识别(DMER)中由于微表情的瞬时性和高度局部性导致的识别准确率低的问题,其准确率甚至低至50%。解决方案的关键在于提出一种基于时空特征融合的框架(TSFmicro),该框架结合了保留网络(RetNet)和基于Transformer的DMER网络,以捕捉和融合时间与空间关系,并通过一种新颖的并行时-空融合方法,在高维特征空间中融合时空信息,从而在语义层面实现互补的“何处-如何”关系,提升模型的语义信息表达能力。

链接: https://arxiv.org/abs/2505.16372
作者: Feng Liu,Bingyu Nan,Xuezhong Qian,Xiaolan Fu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages

点击查看摘要

Abstract:When emotions are repressed, an individual’s true feelings may be revealed through micro-expressions. Consequently, micro-expressions are regarded as a genuine source of insight into an individual’s authentic emotions. However, the transient and highly localised nature of micro-expressions poses a significant challenge to their accurate recognition, with the accuracy rate of micro-expression recognition being as low as 50%, even for professionals. In order to address these challenges, it is necessary to explore the field of dynamic micro expression recognition (DMER) using multimodal fusion techniques, with special attention to the diverse fusion of temporal and spatial modal features. In this paper, we propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). This framework integrates a Retention Network (RetNet) and a transformer-based DMER network, with the objective of efficient micro-expression recognition through the capture and fusion of temporal and spatial relations. Meanwhile, we propose a novel parallel time-space fusion method from the perspective of modal fusion, which fuses spatio-temporal information in high-dimensional feature space, resulting in complementary “where-how” relationships at the semantic level and providing richer semantic information for the model. The experimental results demonstrate the superior performance of the TSFmicro method in comparison to other contemporary state-of-the-art methods. This is evidenced by its effectiveness on three well-recognised micro-expression datasets.
zh

[CV-101] Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

【速读】:该论文试图解决合成数据训练的语义分割模型在真实世界图像上表现不佳的问题,特别是在缺乏标注数据的恶劣条件下,由于领域差异(domain gap)导致性能下降。其解决方案的关键在于利用扩散模型进行语义一致的风格迁移,提出两种新技术:基于类别的自适应实例归一化和交叉注意力(CACTI)及其扩展的有选择性注意力过滤(CACTIF),通过语义类别选择性地应用统计归一化并过滤交叉注意力图,从而在保持语义边界和结构连贯性的同时迁移风格特征,有效缩小合成到真实领域的差距。

链接: https://arxiv.org/abs/2505.16360
作者: Estelle Chigot,Dennis G. Wilson,Meriem Ghrib,Thomas Oberlin
机构: ISAE-Supaero, University of Toulouse, France(法国图卢兹ISAE-Supaero大学); Airbus(空客)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models enable to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models when learned on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.
zh
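
为直观说明摘要中“基于类别的自适应实例归一化”的思路,下面给出一个最小的 Python 草图:对每个语义类别,把内容特征的均值/方差对齐到风格特征同类区域的统计量。张量形状、接口与细节均为本文的示意性假设,并非 CACTI 官方实现:

```python
import numpy as np

def classwise_adain(content, style, content_mask, style_mask, eps=1e-5):
    """按语义类别做 AdaIN:content/style 为 (C, H, W) 特征图,
    mask 为 (H, W) 的整数类别图。示意性实现。"""
    out = content.copy()
    shared = np.intersect1d(np.unique(content_mask), np.unique(style_mask))
    for c in shared:
        cm = content_mask == c            # 内容图中类别 c 的像素
        sm = style_mask == c              # 风格图中类别 c 的像素
        if cm.sum() < 2 or sm.sum() < 2:  # 统计量不可靠时跳过该类
            continue
        c_feat = content[:, cm]                              # (C, Nc)
        s_feat = style[:, sm]                                # (C, Ns)
        mu_c = c_feat.mean(axis=1, keepdims=True)
        std_c = c_feat.std(axis=1, keepdims=True) + eps
        mu_s = s_feat.mean(axis=1, keepdims=True)
        std_s = s_feat.std(axis=1, keepdims=True) + eps
        out[:, cm] = (c_feat - mu_c) / std_c * std_s + mu_s
    return out

# 用法:随机特征与两类语义掩码
feat_c = np.random.randn(64, 32, 32).astype(np.float32)
feat_s = np.random.randn(64, 32, 32).astype(np.float32)
mask_c = np.random.randint(0, 2, (32, 32))
mask_s = np.random.randint(0, 2, (32, 32))
stylized = classwise_adain(feat_c, feat_s, mask_c, mask_s)
```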

[CV-102] Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification

【速读】:该论文旨在解决皮肤病变在皮肤镜图像中的准确分类问题,这对于皮肤癌的诊断和治疗具有重要意义。其解决方案的关键在于利用一种专门针对皮肤病学领域的基础模型PanDerm,并与两种Vision Transformer(ViT)架构(ViT base和Swin Transformer V2 base)进行比较。研究通过冻结PanDerm提取的特征,并采用非线性探测方法(包括多层感知机、XGBoost和TabNet分类器)进行分类,而对ViT模型则进行全量微调以优化分类性能。实验结果表明,基于PanDerm的MLP模型在性能上可与微调后的Swin Transformer模型相媲美,且通过融合PanDerm与Swin Transformer的预测结果可进一步提升性能。

链接: https://arxiv.org/abs/2505.16338
作者: Amirreza Mahbod,Rupert Ecker,Ramona Woitek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Accurate classification of skin lesions from dermatoscopic images is essential for diagnosis and treatment of skin cancer. In this study, we investigate the utility of a dermatology-specific foundation model, PanDerm, in comparison with two Vision Transformer (ViT) architectures (ViT base and Swin Transformer V2 base) for the task of skin lesion classification. Using frozen features extracted from PanDerm, we apply non-linear probing with three different classifiers, namely, multi-layer perceptron (MLP), XGBoost, and TabNet. For the ViT-based models, we perform full fine-tuning to optimize classification performance. Our experiments on the HAM10000 and MSKCC datasets demonstrate that the PanDerm-based MLP model performs comparably to the fine-tuned Swin transformer model, while fusion of PanDerm and Swin Transformer predictions leads to further performance improvements. Future work will explore additional foundation models, fine-tuning strategies, and advanced fusion techniques.
zh

[CV-103] FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

【速读】:该论文旨在解决视觉自回归(VAR)模型在边缘设备部署中因参数规模大和计算成本高而导致的内存与计算效率问题。其解决方案的关键在于提出FPQVAR框架,该框架通过算法与硬件协同设计实现高效的后训练浮点(FP)量化。在算法层面,引入了双格式量化以处理输入激活的极端不平衡性,并采用分组哈达玛变换及GHT感知可学习变换来应对时变异常通道;在硬件层面,设计了首个基于FPGA的低比特浮点量化器与乘法器,并提出了具有低比特浮点计算和两级流水线结构的VAR加速器,从而显著提升了性能与能效。

链接: https://arxiv.org/abs/2505.16335
作者: Renjie Wei,Songqiang Xu,Qingyu Guo,Meng Li
机构: Peking University (北京大学); Beijing Advanced Innovation Center for Integrated Circuits (北京集成电路先进创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fréchet Inception Distance (FID) from 10.83 to 3.58, Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 image/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and GPU baseline, respectively.
zh
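
摘要中的“分组哈达玛变换”(Group-wise Hadamard Transformation)可把离群通道的能量分散到组内其他通道,从而便于低比特浮点量化。下面是基于这一思路的最小 NumPy 示意(分组大小与接口均为假设,非 FPQVAR 官方代码);由于 Hadamard 矩阵正交,该变换可被精确逆转:

```python
import numpy as np

def hadamard(n):
    """构造 n x n 归一化 Hadamard 矩阵(n 须为 2 的幂)。"""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def groupwise_hadamard(x, group=16):
    """x: (tokens, channels),channels 须能被 group 整除。
    对每个通道组施加正交旋转,平滑离群通道。"""
    t, c = x.shape
    assert c % group == 0
    H = hadamard(group)
    return (x.reshape(t, c // group, group) @ H.T).reshape(t, c)

x = np.random.randn(8, 64).astype(np.float32)
x[:, 3] *= 50                      # 人为制造一个离群通道
y = groupwise_hadamard(x)          # 离群能量被分散到组内 16 个通道
```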

[CV-104] Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

【速读】:该论文试图解决图像的全景描述(panoptic captioning)问题,即生成能够全面涵盖图像中所有实体、其位置与属性、实体间关系以及图像整体语义的最小文本等价描述。其解决方案的关键在于提出一种名为PancapEngine的数据生成引擎和一种名为PancapChain的新型方法。PancapEngine通过精细的检测套件检测图像中的多种实体,并利用实体感知提示生成所需的全景描述;而PancapChain则将复杂的全景描述任务分解为多个阶段,逐步生成描述内容。此外,研究还引入了PancapScore度量标准和人工标注的测试集以确保模型评估的可靠性。

链接: https://arxiv.org/abs/2505.16334
作者: Kun-Yu Lin,Hongjun Wang,Weining Ren,Kai Han
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image semantics. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: this https URL
zh

[CV-105] TensorAR: Refinement is All You Need in Autoregressive Image Generation

【速读】:该论文试图解决自回归(Autoregressive, AR)图像生成模型在生成质量上的局限性,即由于缺乏对先前预测的修正机制,导致生成效果不如扩散模型。其解决方案的关键在于引入TensorAR,这是一种将图像生成从逐个token预测转化为逐张量(tensor)预测的新范式,通过滑动窗口生成重叠的图像块(tensors),实现对已生成内容的迭代优化,并采用离散张量去噪方案防止训练过程中的信息泄露。

链接: https://arxiv.org/abs/2505.16324
作者: Cheng Cheng,Lin Song,Yicheng Xiao,Yuxin Chen,Xuchong Zhang,Hongbin Sun,Ying Shan
机构: Xi’an JiaoTong University (西安交通大学); ARC Lab, Tencent PCG (腾讯PCG实验室); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
zh
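
TensorAR 用“离散张量去噪”避免训练时下一张量预测直接照抄输入。其基本操作可示意为:按一定比例把输入 token 随机替换为码本中的其他索引。噪声比例与接口均为本文假设:

```python
import numpy as np

def discrete_tensor_noise(tokens, codebook_size, noise_ratio=0.3, rng=None):
    """tokens: 任意形状的整型索引数组;返回被扰动后的副本。"""
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = tokens.copy()
    mask = rng.random(tokens.shape) < noise_ratio      # 被扰动的位置
    noisy[mask] = rng.integers(0, codebook_size, int(mask.sum()))
    return noisy

tokens = np.random.default_rng(1).integers(0, 1024, (4, 16, 16))
noisy = discrete_tensor_noise(tokens, codebook_size=1024)
```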

[CV-106] Efficient Motion Prompt Learning for Robust Visual Tracking ICML2025

【速读】:该论文试图解决视频数据中时间一致性被视觉判别性所忽视的问题,从而导致跟踪器在处理时间信息时存在挑战。其解决方案的关键在于提出一种轻量级且可插拔的运动提示跟踪方法,通过引入运动编码器将长期运动轨迹编码到视觉嵌入空间,并结合融合解码器和自适应权重机制动态融合视觉与运动特征,从而在不显著增加训练成本和计算开销的情况下提升基于视觉的跟踪器的鲁棒性。

链接: https://arxiv.org/abs/2505.16321
作者: Jie Zhao,Xin Chen,Yongsheng Yuan,Michael Felsberg,Dong Wang,Huchuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2025

点击查看摘要

Abstract:Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at this https URL.
zh
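
摘要中的“自适应权重机制”动态融合视觉与运动特征,可用一个可学习门控来示意。以下 PyTorch 草图中的结构与维度均为本文假设,并非论文官方实现:

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """逐维门控:w 接近 1 时偏向视觉特征,接近 0 时偏向运动特征。"""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis, mot):
        w = self.gate(torch.cat([vis, mot], dim=-1))
        return w * vis + (1 - w) * mot

fusion = AdaptiveFusion(dim=256)
out = fusion(torch.randn(2, 256), torch.randn(2, 256))   # (2, 256)
```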

[CV-107] SuperPure: Efficient Purification of Localized and Distributed Adversarial Patches via Super-Resolution GAN Models

【速读】:该论文旨在解决视觉基础机器学习模型在自主和网络物理系统中面临的安全问题,特别是针对物理领域的对抗性补丁攻击。现有防御方法在应对高密度集中式补丁攻击时表现出色,但在两个关键方面存在不足:一是对低噪声分布式补丁攻击(如DorPatch攻击)的脆弱性;二是计算资源和时间消耗大,难以满足实时性要求。为了解决这些问题,本文提出了一种新的防御策略——SuperPure,其关键在于开发了一种像素级的遮蔽方案,能够有效抵御分布式和集中式补丁攻击,通过基于生成对抗网络(GAN)的超分辨率技术逐步净化图像中的对抗性补丁。

链接: https://arxiv.org/abs/2505.16318
作者: Hossein Khalili,Seongbin Park,Venkat Bollapragada,Nader Sehatbakhsh
机构: University of California, Los Angeles (UCLA)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:As vision-based machine learning models are increasingly integrated into autonomous and cyber-physical systems, concerns about (physical) adversarial patch attacks are growing. While state-of-the-art defenses can achieve certified robustness with minimal impact on utility against highly-concentrated localized patch attacks, they fall short in two important areas: (i) State-of-the-art methods are vulnerable to low-noise distributed patches where perturbations are subtly dispersed to evade detection or masking, as shown recently by the DorPatch attack; (ii) Achieving high robustness with state-of-the-art methods is extremely time and resource-consuming, rendering them impractical for latency-sensitive applications in many cyber-physical systems. To address both robustness and latency issues, this paper proposes a new defense strategy for adversarial patch attacks called SuperPure. The key novelty is developing a pixel-wise masking scheme that is robust against both distributed and localized patches. The masking involves leveraging a GAN-based super-resolution scheme to gradually purify the image from adversarial patches. Our extensive evaluations using ImageNet and two standard classifiers, ResNet and EfficientNet, show that SuperPure advances the state-of-the-art in three major directions: (i) it improves the robustness against conventional localized patches by more than 20%, on average, while also improving top-1 clean accuracy by almost 10%; (ii) It achieves 58% robustness against distributed patch attacks (as opposed to 0% in state-of-the-art method, PatchCleanser); (iii) It decreases the defense end-to-end latency by over 98% compared to PatchCleanser. Our further analysis shows that SuperPure is robust against white-box attacks and different patch sizes. Our code is open-source.
zh

[CV-108] NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

【速读】:该论文旨在解决文本到图像(Text to Image, T2I)生成模型的质量评估问题,特别是针对细粒度的质量评估需求。其解决方案的关键在于从两个方面对T2I模型进行评估:图像与文本的一致性(image-text alignment)和图像结构失真检测(image structural distortion detection),分别对应于对齐赛道(alignment track)和结构赛道(structural track)。挑战赛基于大规模AI生成图像(AIGIs)数据集展开,收集了大量参赛者的模型与结果,验证了现有方法在质量评估任务上的有效性,并展示了优胜方法在预测性能上的优越性。

链接: https://arxiv.org/abs/2505.16314
作者: Shuhao Han,Haotian Fan,Fangyuan Kong,Wenjie Liao,Chunle Guo,Chongyi Li,Radu Timofte,Liang Li,Tao Li,Junhui Cui,Yunqiu Wang,Yang Tai,Jingwei Sun,Jianhui Sun,Xinli Yue,Tianyi Wang,Huan Hou,Junda Lu,Xinyang Huang,Zitang Zhou,Zijian Zhang,Xuhui Zheng,Xuecheng Wu,Chong Peng,Xuezhi Cao,Trong-Hieu Nguyen-Mau,Minh-Hoang Le,Minh-Khoa Le-Phan,Duy-Nam Ly,Hai-Dang Nguyen,Minh-Triet Tran,Yukang Lin,Yan Hong,Chuanbiao Song,Siyuan Li,Jun Lan,Zhichao Zhang,Xinyue Li,Wei Sun,Zicheng Zhang,Yunhao Li,Xiaohong Liu,Guangtao Zhai,Zitong Xu,Huiyu Duan,Jiarui Wang,Guangji Ma,Liu Yang,Lu Liu,Qiang Hu,Xiongkuo Min,Zichuan Wang,Zhenchen Tang,Bo Peng,Jing Dong,Fengbin Guan,Zihao Yu,Yiting Lu,Wei Luo,Xin Li,Minhao Lin,Haofeng Chen,Xuanxuan He,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Bo-Cheng Qiu,Chih-Chung Hsu,Chia-ming Lee,Yu-Fan Lin,Bo Yu,Zehao Wang,Da Mu,Mingxiu Chen,Junkang Fang,Huamei Sun,Wending Zhao,Zhiyu Wang,Wang Liu,Weikang Yu,Puhong Duan,Bin Sun,Xudong Kang,Shutao Li,Shuai He,Lingzhi Fu,Heng Cong,Rongyu Zhang,Jiarong He,Zhishan Qiao,Yongqing Huang,Zewen Chen,Zhe Pang,Juan Wang,Jian Guo,Zhizhuo Shao,Ziyu Feng,Bing Li,Weiming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion mask. A total of 211 participants have registered in the structure track. A total of 1155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.
zh

[CV-109] Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings

【速读】:该论文旨在解决在黑盒设置下,针对特定目标类别的对抗样本生成问题,此类攻击由于目标类别决策区域狭窄而具有较高难度。现有方法主要依赖于源图像与目标图像之间决策边界的几何特性,而非直接利用图像本身的特征信息。论文提出的解决方案是Targeted Edge-informed Attack (TEA),其关键在于利用目标图像的边缘信息对图像进行精确扰动,从而生成更接近源图像且能实现预期目标分类的对抗样本。该方法在低查询次数场景下表现出色,显著减少了所需查询次数,并为基于几何特性的攻击提供了更优的目标初始化策略。

链接: https://arxiv.org/abs/2505.16313
作者: Arjhun Swaminathan,Mete Akgün
机构: University of Tübingen (图宾根大学); Medical Data Privacy and Privacy-preserving Machine Learning (MDPPML) (医学数据隐私与隐私保护机器学习); Institute for Bioinformatics and Medical Informatics (IBMI) (生物信息学与医学信息学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper contains 11 pages, 7 figures and 3 tables. For associated supplementary code, see this https URL

点击查看摘要

Abstract:Deep neural networks for image classification remain vulnerable to adversarial examples – small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
zh
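
TEA 的核心是利用目标图像的边缘信息引导扰动。下面用 Sobel 边缘图控制逐像素混合强度来示意这一思想:边缘处更多保留目标内容(维持目标类证据),平坦区更接近源图。这只是依据摘要写的假设性草图,并非论文的具体攻击算法:

```python
import numpy as np
from scipy import ndimage

def edge_informed_blend(source, target, alpha=0.7):
    """source/target: (H, W, 3),取值范围 [0, 1]。"""
    gray = target.mean(axis=2)
    edge = np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))
    edge = edge / (edge.max() + 1e-8)        # 归一化边缘强度
    w = alpha * (1.0 - edge[..., None])      # 平坦区的源图权重更高
    return np.clip((1.0 - w) * target + w * source, 0.0, 1.0)

src = np.random.rand(64, 64, 3)
tgt = np.random.rand(64, 64, 3)
x_adv = edge_informed_blend(src, tgt)        # 更接近 src 且保留 tgt 边缘
```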

[CV-110] Paired and Unpaired Image to Image Translation using Generative Adversarial Networks

【速读】:该论文旨在解决跨多个图像域的成对与不成对图像翻译问题,即在保持图像特征属性的同时,将输入图像从一个领域转换到另一个领域。其解决方案的关键在于采用条件生成对抗网络(conditional GAN)处理成对任务,并通过循环一致性损失(cycle consistency loss)训练模型以应对不成对任务,同时探索了多种损失函数、Patch-GAN尺寸和模型架构,以提升图像翻译的质量与稳定性。

链接: https://arxiv.org/abs/2505.16310
作者: Gaurav Kumar,Soham Satyadharma,Harpreet Singh
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 6 pages

点击查看摘要

Abstract:Image to image translation is an active area of research in the field of computer vision, enabling the generation of new images with different styles, textures, or resolutions while preserving their characteristic properties. Recent architectures leverage Generative Adversarial Networks (GANs) to transform input images from one domain to another. In this work, we focus on the study of both paired and unpaired image translation across multiple image domains. For the paired task, we used a conditional GAN model, and for the unpaired task, we trained it using cycle consistency loss. We experimented with different types of loss functions, multiple Patch-GAN sizes, and model architectures. New quantitative metrics - precision, recall, and FID score - were used for analysis. In addition, a qualitative study of the results of different experiments was conducted.
zh

[CV-111] SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation

【速读】:该论文旨在解决自动化心脏MRI分割中复杂病理特征提取的挑战。其解决方案的关键在于提出了一种名为SAMba-UNet的创新双编码器架构,通过整合视觉基础模型SAM2、状态空间模型Mamba和经典UNet,实现跨模态特征协同学习。此外,通过设计动态特征融合修正器和异构全注意力收敛模块(HOACM),有效缓解了医学图像与自然图像之间的领域差异,并增强了小病灶特征提取能力以及局部位置语义与长程依赖建模的融合效果。

链接: https://arxiv.org/abs/2505.16304
作者: Guohao Huo,Ruiting Dai,Hao Tang
机构: University of Electronic Science and Technology of China(电子科技大学); Peking University(北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To address the challenge of complex pathological feature extraction in automated cardiac MRI segmentation, this study proposes an innovative dual-encoder architecture named SAMba-UNet. The framework achieves cross-modal feature collaborative learning by integrating the vision foundation model SAM2, the state-space model Mamba, and the classical UNet. To mitigate domain discrepancies between medical and natural images, a Dynamic Feature Fusion Refiner is designed, which enhances small lesion feature extraction through multi-scale pooling and a dual-path calibration mechanism across channel and spatial dimensions. Furthermore, a Heterogeneous Omni-Attention Convergence Module (HOACM) is introduced, combining global contextual attention with branch-selective emphasis mechanisms to effectively fuse SAM2’s local positional semantics and Mamba’s long-range dependency modeling capabilities. Experiments on the ACDC cardiac MRI dataset demonstrate that the proposed model achieves a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm, significantly outperforming existing methods, particularly in boundary localization for complex pathological structures such as right ventricular anomalies. This work provides an efficient and reliable solution for automated cardiac disease diagnosis, and the code will be open-sourced.
zh

[CV-112] Self-Classification Enhancement and Correction for Weakly Supervised Object Detection IJCAI2025

【速读】:该论文旨在解决弱监督目标检测(WSOD)中因忽略两个多类分类任务(MCC)之间的潜在分类模糊性而导致的性能局限问题。其解决方案的关键在于引入一种自分类增强模块,该模块通过集成类内二分类(ICBC)任务来弥合两个MCC任务之间的差距,从而提升网络对正样本与误定位样本的区分能力,并与MCC任务形成相互促进的关系;同时,在推理阶段提出一种自分类校正算法,结合两个MCC任务的结果以有效减少误分类预测。

链接: https://arxiv.org/abs/2505.16294
作者: Yufei Yin,Lechao Cheng,Wengang Zhou,Jiajun Deng,Zhou Yu,Houqiang Li
机构: Hangzhou Dianzi University (杭州电子科技大学); Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学); Australian Institute for Machine Learning (澳大利亚机器学习研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between these two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. For one thing, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network’s discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. For another, we propose a self-classification correction algorithm during inference, which combines the results of both MCC tasks to effectively reduce the mis-classified predictions. Extensive experiments on the prevalent VOC 2007 and 2012 datasets demonstrate the superior performance of our framework.
zh

[CV-113] Efficient Prototype Consistency Learning in Medical Image Segmentation via Joint Uncertainty and Data Augmentation

【速读】:该论文旨在解决半监督医学图像分割中由于标记数据稀缺导致的原型表达能力不足问题,进而影响类别嵌入的完整表示。其解决方案的关键在于提出一种基于Mean-Teacher框架的高效原型一致性学习方法(EPCL-JUDA),通过联合不确定性量化和数据增强技术,利用原始和增强的标记数据生成具有表现力的原型,并通过不确定性量化优化伪标签,分别生成原始和增强未标记数据的可靠原型。最终通过融合标记与未标记原型形成高质量全局原型,实现原型与特征的一致性学习,同时引入原型网络以降低增强数据带来的高内存需求。

链接: https://arxiv.org/abs/2505.16283
作者: Lijian Li,Yuanpeng He,Chi-Man Pun
机构: University of Macau (澳门大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2404.10717

点击查看摘要

Abstract:Recently, prototype learning has emerged in semi-supervised medical image segmentation and achieved remarkable performance. However, the scarcity of labeled data limits the expressiveness of prototypes in previous methods, potentially hindering the complete representation of prototypes for class embedding. To overcome this issue, we propose an efficient prototype consistency learning via joint uncertainty quantification and data augmentation (EPCL-JUDA) to enhance the semantic expression of prototypes based on the framework of Mean-Teacher. The concatenation of original and augmented labeled data is fed into student network to generate expressive prototypes. Then, a joint uncertainty quantification method is devised to optimize pseudo-labels and generate reliable prototypes for original and augmented unlabeled data separately. High-quality global prototypes for each class are formed by fusing labeled and unlabeled prototypes, which are utilized to generate prototype-to-features to conduct consistency learning. Notably, a prototype network is proposed to reduce high memory requirements brought by the introduction of augmented data. Extensive experiments on Left Atrium, Pancreas-NIH, Type B Aortic Dissection datasets demonstrate EPCL-JUDA’s superiority over previous state-of-the-art approaches, confirming the effectiveness of our framework. The code will be released soon.
zh
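
原型学习通常用掩码平均池化从特征图提取类别原型,再把有标签与无标签数据的原型加权融合为全局原型。下面给出这一流程的最小示意;融合权重等细节为本文假设,论文中该权重由不确定性量化得到:

```python
import numpy as np

def masked_prototype(features, mask):
    """features: (C, H, W);mask: (H, W) 布尔。返回 (C,) 原型向量。"""
    if mask.sum() == 0:
        return np.zeros(features.shape[0], dtype=features.dtype)
    return features[:, mask].mean(axis=1)

def fuse_prototypes(p_labeled, p_unlabeled, reliability=0.5):
    """按可靠性权重融合两类原型,得到该类别的全局原型。"""
    return reliability * p_labeled + (1.0 - reliability) * p_unlabeled

feats_l = np.random.randn(32, 16, 16)   # 有标签数据的特征图
feats_u = np.random.randn(32, 16, 16)   # 无标签数据的特征图(伪标签掩码)
mask = np.zeros((16, 16), dtype=bool)
mask[4:8, 4:8] = True
p_global = fuse_prototypes(masked_prototype(feats_l, mask),
                           masked_prototype(feats_u, mask))
```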

[CV-114] ARPO:End-to-End Policy Optimization for GUI Agents with Experience Replay

【速读】:该论文旨在解决在复杂环境中训练大型语言模型(Large Language Models, LLMs)作为交互式代理以控制图形用户界面(Graphical User Interfaces, GUIs)时,优化长周期动作序列的挑战。其关键解决方案是提出一种端到端的强化学习方法——代理回放策略优化(Agentic Replay Policy Optimization, ARPO),该方法通过引入经验回放缓冲区来复用成功经验,并结合基于基线代理性能的任务选择策略以稳定训练过程,从而提升LLMs在复杂、长周期计算机任务中的表现。

链接: https://arxiv.org/abs/2505.16282
作者: Fanbin Lu,Zhisheng Zhong,Shu Liu,Chi-Wing Fu,Jiaya Jia
机构: The Chinese University of Hong Kong (香港中文大学); SmartMore (智慧能源); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training large language models (LLMs) as interactive agents for controlling graphical user interfaces (GUIs) presents a unique challenge to optimize long-horizon action sequences with multimodal feedback from complex environments. While recent works have advanced multi-turn reinforcement learning (RL) for reasoning and tool-using capabilities in LLMs, their application to GUI-based agents remains relatively underexplored due to the difficulty of sparse rewards, delayed feedback, and high rollout costs. In this paper, we investigate end-to-end policy optimization for vision-language-based GUI agents with the aim of improving performance on complex, long-horizon computer tasks. We propose Agentic Replay Policy Optimization (ARPO), an end-to-end RL approach that augments Group Relative Policy Optimization (GRPO) with a replay buffer to reuse the successful experience across training iterations. To further stabilize the training process, we propose a task selection strategy that filters tasks based on baseline agent performance, allowing the agent to focus on learning from informative interactions. Additionally, we compare ARPO with offline preference optimization approaches, highlighting the advantages of policy-based methods in GUI environments. Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results, establishing a new performance baseline for LLM-based GUI agents trained via reinforcement learning. Our findings underscore the effectiveness of reinforcement learning for training multi-turn, vision-language GUI agents capable of managing complex real-world UI interactions. Codes and models: this https URL.
zh

[CV-115] MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing INTERSPEECH2025

【速读】:该论文旨在解决当前电影配音技术在适应多种配音风格、有效处理对话、旁白和独白以及考虑说话者年龄和性别等细微细节方面的不足。其解决方案的关键在于引入一种多模态生成框架,该框架首先利用多模态大视觉-语言模型(multi-modal large vision-language model)分析视觉输入,以识别配音类型和细粒度属性;其次通过大型语音生成模型,借助多模态输入生成高质量的配音。此外,还构建了一个包含配音类型和细微细节标注的电影配音数据集,以提升电影理解并优化配音质量。

链接: https://arxiv.org/abs/2505.16279
作者: Junjie Zheng,Zihao Chen,Chaofan Ding,Yunming Liang,Yihan Fan,Huan Yang,Lei Xie,Xinhan Di
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, accepted by Interspeech 2025

点击查看摘要

Abstract:Current movie dubbing technology can produce the desired speech using a reference voice and input video, maintaining perfect synchronization with the visuals while effectively conveying the intended emotions. However, crucial aspects of movie dubbing, including adaptation to various dubbing styles, effective handling of dialogue, narration, and monologues, as well as consideration of subtle details such as speaker age and gender, remain insufficiently explored. To tackle these challenges, we introduce a multi-modal generative framework. First, it utilizes a multi-modal large vision-language model (VLM) to analyze visual inputs, enabling the recognition of dubbing types and fine-grained attributes. Second, it produces high-quality dubbing using large speech generation models, guided by multi-modal inputs. Additionally, a movie dubbing dataset with annotations for dubbing types and subtle details is constructed to enhance movie understanding and improve dubbing quality for the proposed multi-modal framework. Experimental results across multiple benchmark datasets show superior performance compared to state-of-the-art (SOTA) methods. In detail, the LSE-D, SPK-SIM, EMO-SIM, and MCD exhibit improvements of up to 1.09%, 8.80%, 19.08%, and 18.74%, respectively.
zh

[CV-116] DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶(E2E-AD)中多视角传感数据处理效率不足以及复杂驾驶场景下稀有操作(如激进转弯)应对能力薄弱的问题。其解决方案的关键在于引入基于专家混合(MoE)架构的DriveMoE框架,通过场景专用视觉MoE和技能专用动作MoE实现参数专业化,从而提升模型在多样化场景下的鲁棒性和性能。该设计模仿人类驾驶认知,使系统能够动态选择关键视觉线索并激活特定行为专家模块,避免现有模型因模式平均导致的性能下降。

链接: https://arxiv.org/abs/2505.16278
作者: Zhenjie Yang,Yilin Chai,Xiaosong Jia,Qifeng Li,Yuqian Shao,Xuekai Zhu,Haisheng Su,Junchi Yan
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project Page: this https URL

点击查看摘要

Abstract:End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our \pi_0 Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive- \pi_0 . Specifically, we add Vision MoE to Drive- \pi_0 by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive- \pi_0 .
zh
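
Vision MoE 的要点是训练一个路由器,按驾驶上下文动态选择相关相机视角。下面是一个 top-k 相机路由器的 PyTorch 草图;维度、相机数与结构均为示意性假设,并非 DriveMoE 官方实现:

```python
import torch
import torch.nn as nn

class CameraRouter(nn.Module):
    def __init__(self, dim, num_cams, k=3):
        super().__init__()
        self.score = nn.Linear(dim, num_cams)   # 由上下文为每个相机打分
        self.k = k

    def forward(self, context, cam_feats):
        # context: (B, dim);cam_feats: (B, num_cams, dim)
        topk = self.score(context).topk(self.k, dim=-1)
        w = torch.softmax(topk.values, dim=-1)  # 被选相机的归一化权重
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, cam_feats.size(-1))
        picked = torch.gather(cam_feats, 1, idx)        # (B, k, dim)
        return (w.unsqueeze(-1) * picked).sum(dim=1)    # 融合后的视觉特征

router = CameraRouter(dim=256, num_cams=6)
fused = router(torch.randn(2, 256), torch.randn(2, 6, 256))   # (2, 256)
```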

[CV-117] LINEA: Fast and Accurate Line Detection Using Scalable Transformers

【速读】:该论文试图解决基于Transformer的线检测方法在推理速度上显著低于卷积神经网络(CNN)方法,以及需要在大规模数据集(如COCO或Object360)上预训练注意力机制的问题,这些问题限制了其在低延迟视频分析中的应用。解决方案的关键在于提出一种新的机制——可变形线注意力(Deformable Line Attention, DLA),通过该机制无需在大规模数据集上预训练注意力模块即可实现高效的线检测,从而提升了模型的推理速度并保持了较高的性能。

链接: https://arxiv.org/abs/2505.16264
作者: Sebastian Janampa,Marios Pattichis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Line detection is a basic digital image processing operation used by higher-level processing methods. Recently, transformer-based methods for line detection have proven to be more accurate than methods based on CNNs, at the expense of significantly lower inference speeds. As a result, video analysis methods that require low latencies cannot benefit from current transformer-based methods for line detection. In addition, current transformer-based models require pretraining attention mechanisms on large datasets (e.g., COCO or Object360). This paper develops a new transformer-based method that is significantly faster without requiring pretraining the attention mechanism on large datasets. We eliminate the need to pre-train the attention mechanism using a new mechanism, Deformable Line Attention (DLA). We use the term LINEA to refer to our new transformer-based method based on DLA. Extensive experiments show that LINEA is significantly faster and outperforms previous models on sAP in out-of-distribution dataset testing.
zh

[CV-118] DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor

【速读】:该论文试图解决多模态数据压缩中现有学习型无损压缩器缺乏灵活性和模态特异性适应的问题,以及多模态大语言模型(MLLM)在实际部署中因复杂度过高而受限的问题。其解决方案的关键在于提出DualComp,这是首个统一且轻量级的学习型双模态无损压缩器,通过引入模态统一的分词、模态切换的上下文学习和模态路由的专家混合结构,结合参数重参数化训练策略,实现了高效的参数利用与近实时推理能力。

链接: https://arxiv.org/abs/2505.16256
作者: Yan Zhao,Zhengxue Cheng,Junxuan Zhang,Qunshan Gu,Qi Wang,Li Song
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 18 pages, 11 figures, 7 tables

点击查看摘要

Abstract:Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200KB/s) on desktop CPUs. With much fewer parameters, DualComp achieves compression performance on par with the SOTA LLM-based methods for both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.
zh

[CV-119] Swin Transformer for Robust CGI Images Detection: Intra- and Inter-Dataset Analysis across Multiple Color Spaces

【速读】:该论文试图解决在不同颜色空间(RGB、YCbCr和HSV)中区分计算机生成图像(CGI)与真实数字图像的挑战。其解决方案的关键在于提出一种基于Swin Transformer的模型,该模型利用其分层架构来捕捉局部和全局特征,从而实现自然图像与合成图像的准确区分。

链接: https://arxiv.org/abs/2505.16253
作者: Preeti Mehta,Aman Sagar,Suchi Kumari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2409.04734

点击查看摘要

Abstract:This study aims to address the growing challenge of distinguishing computer-generated imagery (CGI) from authentic digital images across three different color spaces: RGB, YCbCr, and HSV. Given the limitations of existing classification methods in handling the complexity and variability of CGI, this research proposes a Swin Transformer-based model for accurate differentiation between natural and synthetic images. The proposed model leverages the Swin Transformer’s hierarchical architecture to capture local and global features for distinguishing CGI from natural images. Its performance was assessed through intra- and inter-dataset testing across three datasets: CiFAKE, JSSSTU, and Columbia. The model was evaluated individually on each dataset (D1, D2, D3) and on the combined datasets (D1+D2+D3) to test its robustness and domain generalization. To address dataset imbalance, data augmentation techniques were applied. Additionally, t-SNE visualization was used to demonstrate the feature separability achieved by the Swin Transformer across the selected color spaces. The model’s performance was tested across all color schemes, with the RGB color scheme yielding the highest accuracy for each dataset. As a result, RGB was selected for domain generalization analysis and compared with other CNN-based models, VGG-19 and ResNet-50. The comparative results demonstrate the proposed model’s effectiveness in detecting CGI, highlighting its robustness and reliability in both intra-dataset and inter-dataset evaluations. The findings of this study highlight the Swin Transformer model’s potential as an advanced tool for digital image forensics, particularly in distinguishing CGI from natural images. The model’s strong performance indicates its capability for domain generalization, making it a valuable asset in scenarios requiring precise and reliable image classification.
zh

[CV-120] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

【速读】:该论文旨在解决扩散模型在真实场景视频超分辨率(VSR)任务中推理速度慢的问题,尤其是由于其需要数十个采样步骤导致的效率低下。解决方案的关键在于提出DOVE,一个高效的单步扩散模型,通过微调预训练的视频扩散模型(如CogVideoX)实现,并引入了潜在-像素训练策略以有效适应VSR任务,同时设计了高质量的视频处理流水线以构建适用于VSR的HQ-VSR数据集,从而提升模型的恢复能力并显著提高推理效率。

链接: https://arxiv.org/abs/2505.16239
作者: Zheng Chen,Zichen Zou,Kewei Zhang,Xiongfei Su,Xin Yuan,Yong Guo,Yulun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Westlake University (西湖大学); Huawei Consumer Business Group (华为消费者业务集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require, make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28× speed-up over existing methods such as MGLD-VSR. Code is available at: this https URL.
zh

[CV-121] CT-Agent : A Multimodal-LLM Agent for 3D CT Radiology Question Answering

【速读】:该论文旨在解决CT影像放射学报告生成中的挑战,特别是针对CT影像的解剖复杂性和跨切片空间关系难以捕捉的问题。现有视觉问答(VQA)系统在处理CT影像放射学问答(CTQA)任务时表现不足。为了解决这些问题,本文提出了CT-Agent,这是一种多模态代理框架。CT-Agent的关键在于采用解剖独立工具来分解解剖复杂性,并通过全局-局部令牌压缩策略高效捕捉跨切片的空间关系。

链接: https://arxiv.org/abs/2505.16229
作者: Yuren Mao,Wenyi Xu,Yuyang Qin,Yunjun Gao
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computed Tomography (CT) scan, which produces 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices), provides detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists’ questions about some anatomical regions on the CT scan and even automatically generate a radiology report is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for: (1) anatomic complexity makes CT images difficult to understand; (2) spatial relationships across hundreds of slices are difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomic complexity; furthermore, it efficiently captures the across-slice spatial relationship with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.
zh

[CV-122] A Shape-Aware Total Body Photography System for In-focus Surface Coverag e Optimization

【速读】:该论文旨在解决现有全身摄影(Total Body Photography, TBP)系统在自动检测和分析可疑皮肤病变方面的不足,尤其是图像分辨率和清晰度的问题。其解决方案的关键在于提出一种形状感知的TBP系统,通过结合深度相机和RGB相机、三维人体形状估计以及聚焦表面优化方法,根据相机姿态选择最佳对焦距离,从而在复杂的人体三维几何结构上实现更高分辨率和清晰度的图像捕获。

链接: https://arxiv.org/abs/2505.16228
作者: Wei-Lun Huang,Joshua Liu,Davood Tashayyod,Jun Kang,Amir Gandjbakhche,Misha Kazhdan,Mehran Armand
机构: Johns Hopkins University (约翰霍普金斯大学); Eunice Kennedy Shriver National Institute of Child Health and Human Development (尤妮丝·肯尼迪·施里弗国家儿童健康与人类发展研究所); Institute for Integrative and Innovative Research (整合创新研究院); University of Arkansas (阿肯色大学); Johns Hopkins School of Medicine (约翰霍普金斯医学院); Department of Mechanical Engineering (机械工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to JBHI

点击查看摘要

Abstract:Total Body Photography (TBP) is becoming a useful screening tool for patients at high risk for skin cancer. While much progress has been made, existing TBP systems can be further improved for automatic detection and analysis of suspicious skin lesions, which is in part related to the resolution and sharpness of acquired images. This paper proposes a novel shape-aware TBP system automatically capturing full-body images while optimizing image quality in terms of resolution and sharpness over the body surface. The system uses depth and RGB cameras mounted on a 360-degree rotary beam, along with 3D body shape estimation and an in-focus surface optimization method to select the optimal focus distance for each camera pose. This allows for optimizing the focused coverage over the complex 3D geometry of the human body given the calibrated camera poses. We evaluate the effectiveness of the system in capturing high-fidelity body images. The proposed system achieves an average resolution of 0.068 mm/pixel and 0.0566 mm/pixel with approximately 85% and 95% of surface area in-focus, evaluated on simulation data of diverse body shapes and poses as well as a real scan of a mannequin respectively. Furthermore, the proposed shape-aware focus method outperforms existing focus protocols (e.g. auto-focus). We believe the high-fidelity imaging enabled by the proposed system will improve automated skin lesion analysis for skin cancer screening.
zh

[CV-123] A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering

【速读】:该论文试图解决医学视觉问答(MedVQA)模型中存在的模态偏好偏差问题,即模型在预测时过度依赖某一模态(通常是问题)而忽视另一模态(通常是图像),从而无法有效学习多模态知识。解决方案的关键在于提出一种基于因果图的医学反事实视觉问答(MedCFVQA)模型,该模型通过引入因果图结构,在推理过程中消除模态偏好偏差。此外,为应对现有MedVQA数据集中问题与答案之间存在的强烈先验依赖关系,研究者重构了新的数据集,改变了问题与答案之间的先验依赖关系(CP),以更真实地评估模型性能。

链接: https://arxiv.org/abs/2505.16209
作者: Shuchang Ye,Usman Naseem,Mingyuan Meng,Dagan Feng,Jinman Kim
机构: The University of Sydney(悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical Visual Question Answering (MedVQA) is crucial for enhancing the efficiency of clinical diagnosis by providing accurate and timely responses to clinicians’ inquiries regarding medical images. Existing MedVQA models suffered from modality preference bias, where predictions are heavily dominated by one modality while overlooking the other (in MedVQA, usually questions dominate the answer but images are overlooked), thereby failing to learn multimodal knowledge. To overcome the modality preference bias, we proposed a Medical CounterFactual VQA (MedCFVQA) model, which trains with bias and leverages causal graphs to eliminate the modality preference bias during inference. Existing MedVQA datasets exhibit substantial prior dependencies between questions and answers, which results in acceptable performance even if the model significantly suffers from the modality preference bias. To address this issue, we reconstructed new datasets by leveraging existing MedVQA datasets and Changed their Prior dependencies (CP) between questions and their answers in the training and test set. Extensive experiments demonstrate that MedCFVQA significantly outperforms its non-causal counterpart on both SLAKE, RadVQA and SLAKE-CP, RadVQA-CP datasets.
zh
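
反事实推理去偏的常见形式是:用完整(图像+问题)分支的预测减去仅问题分支的反事实预测,从而削弱“问题主导答案”的偏差。以下示意沿用这一通用做法;MedCFVQA 的具体因果图与推理规则以论文原文为准,系数 lam 为假设性超参:

```python
import numpy as np

def counterfactual_debias(logits_vq, logits_q_only, lam=1.0):
    """logits_vq: 图像+问题分支的预测;logits_q_only: 仅问题分支的预测。
    返回去偏后的答案概率分布。"""
    debiased = logits_vq - lam * logits_q_only
    e = np.exp(debiased - debiased.max(axis=-1, keepdims=True))  # 数值稳定的 softmax
    return e / e.sum(axis=-1, keepdims=True)

probs = counterfactual_debias(np.random.randn(2, 10), np.random.randn(2, 10))
```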

[CV-124] SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

【速读】:该论文旨在解决机器人操作中政策模型空间理解能力不足的问题,即如何有效处理三维几何推理、物体关系以及机器人本体特性。现有方法在这一领域存在局限:三维点云模型缺乏语义抽象,而二维图像编码器则难以进行空间推理。论文提出的解决方案是SEM(Spatial Enhanced Manipulation model),其关键在于通过两个互补模块增强空间理解——空间增强器通过引入三维几何上下文来扩展视觉表征,而机器人状态编码器则通过图结构建模关节依赖关系来捕捉本体感知的结构。这种集成方式显著提升了空间理解能力,从而在多种任务中实现了更稳健和泛化的操作性能。

链接: https://arxiv.org/abs/2505.16196
作者: Xuewu Lin,Tianwei Lin,Lichao Huang,Hongyu Xie,Yiwei Jin,Keyu Li,Zhizhong Su
机构: Horizon Robotics(地平线机器人)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A key challenge in robot manipulation lies in developing policy models with strong spatial understanding, the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graph-based modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks that outperform existing baselines.
zh

[CV-125] VLM-R3: Region Recognition Reasoning and Refinement for Enhanced Multimodal Chain-of-Thought

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Model, MLLM)在处理需要动态、迭代关注和重新审视视觉区域以实现文本推理与视觉证据精确对齐的复杂任务时所面临的挑战。其解决方案的关键在于提出一种名为VLM-R^3的框架,该框架通过引入Region-Conditioned Reinforcement Policy Optimization (R-GRPO)训练范式,使模型能够判断何时需要额外的视觉证据、确定视觉区域的定位,并将相关子图像内容无缝融入到交错的思维链中。此外,研究者还构建了一个精心设计的Visuo-Lingual Interleaved Rationale (VLIR)语料库,以提供区域选择和文本论证的逐步监督,从而提升模型的视觉推理能力。

链接: https://arxiv.org/abs/2505.16192
作者: Chaoya Jiang,Yongrui Heng,Wei Ye,Han Yang,Haiyang Xu,Ming Yan,Ji Zhang,Fei Huang,Shikun Zhang
机构: National Engineering Research Center for Software Engineering, Peking University (国家软件工程研究中心,北京大学); Alibaba Group (阿里巴巴集团); ZEEKR Intelligent Technology Holding Limited (极氪智能科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R^3 (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R^3 sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
zh

[CV-126] Understanding Generative AI Capabilities in Everyday Image Editing Tasks

【速读】:该论文旨在探究用户在实际图像编辑请求中的偏好与需求,以及当前生成式 AI (Generative AI) 编辑器在处理这些请求时的表现与局限性。其核心问题包括:用户最常希望编辑的主体是什么?他们倾向于执行何种类型的编辑操作?用户更偏好精确可预测的编辑还是高度创造性的编辑?论文通过分析过去12年(2013-2025)Reddit社区中的83,000条请求及对应的305,000次专业修图师编辑,揭示了真实场景下的编辑特征,并评估了当前先进AI编辑器(如GPT-4o、Gemini-2.0-Flash、SeedEdit)的处理能力。研究发现,尽管AI在开放式任务中表现尚可,但在需要精准编辑的低创意请求中表现较差,且常无法保持人物或动物的身份一致性。解决方案的关键在于深入理解用户需求与AI能力之间的差距,从而为改进AI编辑器提供指导,并明确当前AI能够成功处理的请求类型。

链接: https://arxiv.org/abs/2505.16181
作者: Mohammad Reza Taesiri,Brandon Collins,Logan Bolton,Viet Dac Lai,Franck Dernoncourt,Trung Bui,Anh Totti Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code and qualitative examples are available at: this https URL

点击查看摘要

Abstract:Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, approximately only 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the other side of the table, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits. Code and qualitative examples are available at: this https URL
zh

[CV-127] QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

【速读】:该论文旨在解决长视频理解在实际应用中的计算瓶颈问题,具体表现为视频解码的高延迟和大规模预填充导致的高内存消耗。其解决方案的关键在于提出QuickVideo系统-算法协同设计,包含三个核心创新:QuickDecoder通过并行化CPU视频解码实现2-3倍的速度提升,QuickPrefill采用KV-cache剪枝技术提高内存效率,以及CPU视频解码与GPU推理的重叠机制,从而显著降低长视频输入的推理时间,实现在有限硬件上的可扩展、高质量视频理解。

链接: https://arxiv.org/abs/2505.16175
作者: Benjamin Schneider,Dongfu Jiang,Chao Du,Tianyu Pang,Wenhu Chen
机构: University of Waterloo (滑铁卢大学); SeaAI Lab (SeaAI 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
zh
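
QuickDecoder 将视频按关键帧对齐切分并行解码,并让 CPU 解码与 GPU 推理重叠进行。下面用标准库的线程/进程池给出这一调度思想的示意;decode_segment 为占位的假设性接口,实际应接入 FFmpeg/PyAV 等解码器:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def decode_segment(path, start, end):
    """解码 [start, end) 区间的帧;此处仅返回占位字符串。"""
    return [f"{path}:frame{i}" for i in range(start, end)]

def quickvideo_pipeline(path, keyframes, infer_fn):
    """keyframes: 关键帧索引列表,末尾含总帧数(至少两个元素)。
    各区间并行解码;上一段的推理与下一段的解码重叠执行。"""
    segments = list(zip(keyframes[:-1], keyframes[1:]))
    results, pending = [], None
    with ProcessPoolExecutor() as cpu, ThreadPoolExecutor(1) as gpu:
        futures = [cpu.submit(decode_segment, path, s, e) for s, e in segments]
        for fut in futures:
            frames = fut.result()              # 等待该段解码完成
            if pending is not None:
                results.append(pending.result())
            pending = gpu.submit(infer_fn, frames)
        results.append(pending.result())
    return results
```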

[CV-128] Erased or Dormant? Rethinking Concept Erasure Through Reversibility

【速读】:该论文试图解决当前概念擦除技术是否真正消除生成模型中目标概念的生成能力,还是仅实现表面的、特定提示的抑制问题。解决方案的关键在于系统评估两种代表性概念擦除方法(统一概念编辑和擦除稳定扩散)的鲁棒性和可逆性,通过实例级评估策略,利用轻量级微调显式测试被擦除概念的重新激活潜力,从而揭示现有方法在去除生成能力上的局限性。

链接: https://arxiv.org/abs/2505.16174
作者: Ping Liu,Chi Zhang
机构: University of Nevada Reno(内华达大学里诺分校); National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Dr. Chi Zhang is the corresponding author

点击查看摘要

Abstract:To what extent does concept erasure eliminate generative capacity in diffusion models? While prior evaluations have primarily focused on measuring concept suppression under specific textual prompts, we explore a complementary and fundamental question: do current concept erasure techniques genuinely remove the ability to generate targeted concepts, or do they merely achieve superficial, prompt-specific suppression? We systematically evaluate the robustness and reversibility of two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion, by probing their ability to eliminate targeted generative behaviors in text-to-image models. These methods attempt to suppress undesired semantic concepts by modifying internal model parameters, either through targeted attention edits or model-level fine-tuning strategies. To rigorously assess whether these techniques truly erase generative capacity, we propose an instance-level evaluation strategy that employs lightweight fine-tuning to explicitly test the reactivation potential of erased concepts. Through quantitative metrics and qualitative analyses, we show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation, indicating that current methods suppress latent generative representations without fully eliminating them. Our findings reveal critical limitations in existing concept erasure approaches and highlight the need for deeper, representation-level interventions and more rigorous evaluation standards to ensure genuine, irreversible removal of concepts from generative models.
zh

[CV-129] RAIL: Transferable Robust Adversarial Images via Latent diffusion

【速读】:该论文试图解决深度学习系统中基于无限制自然扰动的对抗攻击在不同模型间的迁移性受限问题,这一问题主要源于生成的对抗特征分布与真实数据分布之间的不匹配。解决方案的关键在于提出一种称为TRAIL(Transferable Robust Adversarial Images via Latent Diffusion)的测试时适应框架,该框架通过结合对抗目标和感知约束来更新扩散U-Net的权重,从而在生成对抗样本时保持图像的真实性并增强跨模型的迁移性。

链接: https://arxiv.org/abs/2505.16166
作者: Yuhao Xue,Zhifei Zhang,Xinyang Jiang,Yifei Shen,Junyao Gao,Wentao Gu,Jiale Zhao,Miaojing Shi,Cairong Zhao
机构: Tongji University (同济大学); Microsoft Research Asia (亚洲微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the distribution shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address the challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images from a distribution that carries adversarial features while closely resembling the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net’s weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.
zh
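
下面用 PyTorch 勾勒 TRAIL 式“测试时适应”目标的组合方式:对抗损失误导受害模型,感知约束保持图像与目标相近。generator、victim 均为玩具占位模块,感知项用 MSE 代替 LPIPS,权重系数为假设值,仅作示意。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Conv2d(3, 3, 3, padding=1)     # 占位:实际方法中为扩散 U-Net
victim = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # 占位受害分类器
opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

target_img = torch.rand(1, 3, 32, 32)          # 希望对抗样本在外观上接近的目标图像
true_label = torch.tensor([3])                 # 假设的真实类别

for _ in range(20):                            # 测试时适应:逐步更新生成器权重
    adv = generator(target_img)
    adv_loss = -F.cross_entropy(victim(adv), true_label)  # 最大化分类损失 -> 误导受害模型
    percep_loss = F.mse_loss(adv, target_img)  # 感知约束的简化替身(实际可用 LPIPS)
    loss = adv_loss + 10.0 * percep_loss       # 10.0 为假设的权衡系数
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(percep_loss))
```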

[CV-130] RE-TRIP: Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition

【速读】:该论文旨在解决在移动机器人中基于激光雷达(LiDAR)的场景识别(Place Recognition, PR)任务中,现有方法仅依赖几何测量而忽略激光雷达提供的额外反射率(reflectivity)信息的问题。其解决方案的关键在于提出一种新的三维场景描述符RE-TRIP(REflectivity-instance augmented TRIangle descriPtor),该描述符同时利用几何测量和反射率信息,以提升在几何退化、高几何相似性和动态物体存在等挑战性场景下的鲁棒性。

链接: https://arxiv.org/abs/2505.16165
作者: Yechan Park,Gyuhyeon Pak,Euntai Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR-based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE-TRIP in real-world applications, we further propose (1) a keypoint extraction method, (2) a key instance segmentation method, (3) a RE-TRIP matching method, and (4) a reflectivity-combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE-TRIP. Applied to public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios such as long corridors, bridges, large-scale urban areas, and highly dynamic environments, our experimental results show that the proposed method outperforms existing state-of-the-art methods, including Scan Context, Intensity Scan Context, and STD.
zh
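
作为三角形描述符与反射率结合思路的示意,下面的 NumPy 草图对关键点三元组取排序后的三条边长(SE(3) 不变)并附加排序后的反射率。接口与字段均为笔者假设,并非论文实现。

```python
import numpy as np

def triangle_descriptor(pts, refl):
    """pts: (3, 3) 三个关键点坐标;refl: (3,) 对应反射率。"""
    # 三条边长排序后对顶点顺序不敏感,且在 SE(3) 变换下不变
    sides = sorted([
        np.linalg.norm(pts[0] - pts[1]),
        np.linalg.norm(pts[1] - pts[2]),
        np.linalg.norm(pts[2] - pts[0]),
    ])
    # 反射率同样排序以消除顶点顺序影响(简化处理)
    return np.concatenate([sides, np.sort(refl)])

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
refl = np.array([0.8, 0.2, 0.5])
print(triangle_descriptor(pts, refl))  # [1. 2. 2.236 0.2 0.5 0.8]
```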

[CV-131] Deep Learning-Driven Ultra-High-Definition Image Restoration: A Survey

【速读】:该论文旨在解决超高清(Ultra-High-Definition, UHD)图像复原中因质量退化而导致的图像清晰度和细节丢失问题。其解决方案的关键在于利用深度学习技术,通过改进数据集构建、网络架构设计、采样策略、先验知识融合以及损失函数等多个方面,推动UHD图像复原技术的发展。论文系统性地回顾了相关进展,并提出了基于网络架构和采样策略的分类框架,以更好地组织和理解现有方法。

链接: https://arxiv.org/abs/2505.16161
作者: Liyan Wang,Weixiang Zhou,Cong Wang,Kin-Man Lam,Zhixun Su,Jinshan Pan
机构: Dalian University of Technology (大连理工大学); Hong Kong Polytechnic University (香港理工大学); Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 12 figures

点击查看摘要

Abstract:Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at this https URL.
zh

[CV-132] Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention

【速读】:该论文旨在解决Transformer模型在图像恢复(Image Restoration, IR)任务中因自注意力机制的二次复杂度而难以应用于高分辨率图像的问题。现有方法通过稀疏或窗口化注意力缓解此问题,但限制了全局上下文建模能力。论文提出的关键解决方案是Rank Enhanced Linear Attention (RELA),其通过引入轻量级深度可分离卷积来增强特征表示,从而克服线性注意力机制中由于注意力图低秩性导致的性能下降问题。基于RELA,作者进一步提出了LAformer,该模型通过整合线性注意力与通道注意力实现有效的全局感知,并利用卷积门控前馈网络提升局部拟合能力,同时摒弃了硬件低效操作,从而实现了高效处理高分辨率图像的能力。

链接: https://arxiv.org/abs/2505.16157
作者: Yuang Ai,Huaibo Huang,Tao Wu,Qihang Fan,Ran He
机构: MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences (中科院自动化所多媒体与智能系统实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); School of Information Science and Technology, ShanghaiTech University (上海科技大学信息科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures, 12 tables

点击查看摘要

Abstract:Transformer-based models have made remarkable progress in image restoration (IR) tasks. However, the quadratic complexity of self-attention in Transformer hinders its applicability to high-resolution images. Existing methods mitigate this issue with sparse or window-based attention, yet inherently limit global context modeling. Linear attention, a variant of softmax attention, demonstrates promise in global context modeling while maintaining linear complexity, offering a potential solution to the above challenge. Despite its efficiency benefits, vanilla linear attention suffers from a significant performance drop in IR, largely due to the low-rank nature of its attention map. To counter this, we propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer. LAformer achieves effective global perception by integrating linear attention and channel attention, while also enhancing local fitting capabilities through a convolutional gated feed-forward network. Notably, LAformer eliminates hardware-inefficient operations such as softmax and window shifting, enabling efficient processing of high-resolution images. Extensive experiments across 7 IR tasks and 21 benchmarks demonstrate that LAformer outperforms SOTA methods and offers significant computational advantages.
zh
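
下面给出“秩增强线性注意力”核心思想的一个极简 PyTorch 草图:在线性注意力旁并联轻量深度可分离卷积分支,以缓解注意力图低秩带来的表达受限。核映射、归一化等细节为笔者假设,并非官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankEnhancedLinearAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # 深度可分离卷积分支
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C),其中 N = h * w
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1             # 常见的线性注意力核映射
        kv = torch.einsum('bnc,bnd->bcd', k, v)        # 先算 K^T V,复杂度 O(N*C^2)
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        out = torch.einsum('bnc,bcd->bnd', q, kv) * z  # 线性注意力输出
        B, N, C = x.shape
        local = self.dwconv(v.transpose(1, 2).reshape(B, C, h, w))  # 深度卷积补充高秩局部特征
        out = out + local.reshape(B, C, N).transpose(1, 2)
        return self.proj(out)

attn = RankEnhancedLinearAttention(32)
y = attn(torch.randn(2, 64, 32), h=8, w=8)   # 输出形状 (2, 64, 32)
print(y.shape)
```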

[CV-133] BadDepth: Backdoor Attacks Against Monocular Depth Estimation in the Physical World

【速读】:该论文试图解决深度学习驱动的单目深度估计(Monocular Depth Estimation, MDE)模型在面对后门攻击时的脆弱性问题。现有后门攻击方法无法直接应用于MDE模型,因为其标签形式为深度图而非传统分类标签。解决方案的关键在于提出BadDepth,该方法通过图像分割模型选择性地操纵目标物体的深度,并利用深度补全技术恢复周围区域,从而生成用于对象级后门攻击的污染数据集。此外,为提升物理世界场景下的鲁棒性,引入了数字到物理增强技术以弥合物理世界与数字域之间的领域差异。

链接: https://arxiv.org/abs/2505.16154
作者: Ji Guo,Long Zhou,Zhijin Wang,Jiaming He,Qiyang Song,Aiguo Chen,Wenbo Jiang
机构: University of Electronic Science and Technology of China (中国电子科技大学); Xinjiang University (新疆大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, deep learning-based Monocular Depth Estimation (MDE) models have been widely applied in fields such as autonomous driving and robotics. However, their vulnerability to backdoor attacks remains unexplored. To fill the gap in this area, we conduct a comprehensive investigation of backdoor attacks against MDE models. Crucially, existing backdoor attack methods cannot be applied directly to MDE models, because the label used in MDE takes the form of a depth map. To address this, we propose BadDepth, the first backdoor attack targeting MDE models. BadDepth overcomes this limitation by selectively manipulating the target object’s depth using an image segmentation model and restoring the surrounding areas via depth completion, thereby generating poisoned datasets for object-level backdoor attacks. To improve robustness in physical world scenarios, we further introduce digital-to-physical augmentation to adapt to the domain gap between the physical world and the digital domain. Extensive experiments on multiple models validate the effectiveness of BadDepth in both the digital domain and the physical world, without being affected by environmental factors.
zh
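
以下 NumPy 草图示意 BadDepth 的对象级深度篡改步骤:按分割掩码选中目标物体,整体偏移其深度标签,周边区域保持不变(论文中边界处还会用深度补全修复)。数值与接口均为笔者假设。

```python
import numpy as np

def poison_depth(depth, mask, shift=5.0):
    """depth: (H, W) 深度图;mask: (H, W) 布尔分割掩码;shift: 目标物体的深度偏移量(米)。"""
    poisoned = depth.copy()
    poisoned[mask] = depth[mask] + shift   # 仅篡改目标物体的深度标签
    return poisoned                        # 真实流程中,边界处还需深度补全模型平滑过渡

depth = np.full((4, 6), 10.0)               # 假设的深度图
mask = np.zeros((4, 6), dtype=bool)
mask[1:3, 2:5] = True                       # 假设的目标物体区域
print(poison_depth(depth, mask))
```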

[CV-134] Training-Free Reasoning and Reflection in MLLMs

【速读】:该论文旨在解决将推理能力从纯语言大模型(LLMs)迁移至多模态大模型(MLLMs)时所面临的高昂重训练成本和高质量、可验证的多模态推理数据集稀缺的问题。其解决方案的关键在于提出一种无需训练(training-Free)且无需梯度更新或额外监督的多模态模型——FRANK Model,通过解耦多模态大模型解码器层中的感知与推理过程,实现对现有多模态模型的推理与反思能力的增强。核心思想是利用浅层解码器关注视觉标记、深层解码器侧重文本语义的特性,采用分层权重融合方法,将视觉预训练的多模态模型与专门用于推理的语言模型进行整合。

链接: https://arxiv.org/abs/2505.16151
作者: Hongchen Wei,Zhenzhong Chen
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that compared to the deeper decoder layers, the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers while preserving visual grounding in shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline InternVL2.5-38B by +5.3, and even surpasses the proprietary GPT-4o model. Our project homepage is at: this http URL
zh
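
下面用简化的 state_dict 插值示意 FRANK 的分层融合直觉:浅层权重偏向视觉预训练 MLLM,深层逐渐偏向推理 LLM。论文中的融合系数来自泰勒展开的闭式解,此处以线性插值代替,参数命名均为假设。

```python
import torch

def layerwise_merge(visual_sd, reason_sd, num_layers):
    """浅层保留视觉模型权重,深层逐渐偏向推理模型权重(线性插值为假设的简化)。"""
    merged = {}
    for name, w_vis in visual_sd.items():
        # 假设参数名形如 "layers.{i}.xxx",据此取层号
        layer_id = int(name.split('.')[1]) if name.startswith('layers.') else 0
        alpha = layer_id / max(num_layers - 1, 1)   # 0 = 纯视觉,1 = 纯推理
        merged[name] = (1 - alpha) * w_vis + alpha * reason_sd[name]
    return merged

# 用两个玩具 state_dict 验证插值行为
visual_sd = {f'layers.{i}.w': torch.zeros(2, 2) for i in range(4)}
reason_sd = {f'layers.{i}.w': torch.ones(2, 2) for i in range(4)}
merged = layerwise_merge(visual_sd, reason_sd, num_layers=4)
print(merged['layers.0.w'][0, 0].item(), merged['layers.3.w'][0, 0].item())  # 0.0 1.0
```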

[CV-135] GMatch: Geometry-Constrained Feature Matching for RGB-D Object Pose Estimation

【速读】:该论文旨在解决稀疏特征匹配中常见的局部歧义问题,从而提升6DoF物体位姿估计的鲁棒性。其解决方案的关键在于提出GMatch,一种无需学习的特征匹配方法,通过引导式、逐步搜索的方式,在匹配过程中强制执行SE(3)-不变的几何一致性。GMatch利用了一组可证明完备的几何特征,这些特征能够唯一确定三维关键点配置,从而确保全局一致的对应关系,而无需依赖训练或GPU支持。

链接: https://arxiv.org/abs/2505.16144
作者: Ming Yang,Haoran Li
机构: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages + 3 pages references + 2 pages appendix; 6 figures; 1 table

点击查看摘要

Abstract:We present GMatch, a learning-free feature matcher designed for robust 6DoF object pose estimation, addressing common local ambiguities in sparse feature matching. Unlike traditional methods that rely solely on descriptor similarity, GMatch performs a guided, incremental search, enforcing SE(3)-invariant geometric consistency throughout the matching process. It leverages a provably complete set of geometric features that uniquely determine 3D keypoint configurations, ensuring globally consistent correspondences without the need for training or GPU support. When combined with classical descriptors such as SIFT, GMatch-SIFT forms a general-purpose pose estimation pipeline that offers strong interpretability and generalization across diverse objects and scenes. Experiments on the HOPE dataset show that GMatch outperforms both traditional and learning-based matchers, with GMatch-SIFT achieving or surpassing the performance of instance-level pose networks. On the YCB-Video dataset, GMatch-SIFT demonstrates high accuracy and low variance on texture-rich objects. These results not only validate the effectiveness of GMatch-SIFT for object pose estimation but also highlight the broader applicability of GMatch as a general-purpose feature matcher. Code will be released upon acceptance.
zh
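
以下 NumPy 草图示意 GMatch 的“引导式增量搜索 + 几何一致性”:候选对应按描述符相似度排序,仅当其与所有已接受对应的成对距离一致(SE(3) 不变量)时才被接受。阈值与流程为笔者假设的简化版本。

```python
import numpy as np

def incremental_match(src_pts, dst_pts, candidates, tol=0.05):
    """candidates: [(i, j), ...] 按描述符相似度排序的候选对应。
    成对距离在 SE(3) 变换下不变,可用来校验对应的全局一致性。"""
    accepted = []
    for i, j in candidates:
        ok = all(
            abs(np.linalg.norm(src_pts[i] - src_pts[a]) -
                np.linalg.norm(dst_pts[j] - dst_pts[b])) < tol
            for a, b in accepted
        )
        if ok:
            accepted.append((i, j))        # 与全部已接受对应几何一致才接受
    return accepted

src = np.array([[0.0, 0, 0], [1.0, 0, 0], [0.0, 1.0, 0]])
dst = src + np.array([5.0, 2.0, 0.0])      # 纯平移,成对距离保持不变
print(incremental_match(src, dst, [(0, 0), (1, 1), (2, 2)]))  # 三对全部接受
```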

[CV-136] An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection

【速读】:该论文试图解决脑部疾病在磁共振成像(MRI)数据中的分类问题,特别是针对诊断和治疗难度较大的脑部相关疾病。其解决方案的关键在于采用视觉Transformer(ViT)模型与迁移学习(TL)模型进行对比分析,并结合可解释人工智能(XAI)方法提升模型的透明度和可解释性。研究结果表明,ViT在分类精度上优于传统迁移学习模型,达到了94.39%的准确率,同时XAI方法的应用为医疗专业人员提供了更精确的诊断支持。

链接: https://arxiv.org/abs/2505.16039
作者: Shuvashis Sarker,Shamim Rahim Refat,Faika Fairuj Preotee,Shifat Islam,Tashreef Muhammad,Mohammad Ashraful Hoque
机构: Ahsanullah University of Science and Technology (阿罕默德科技大学); Bangladesh University of Engineering and Technology (孟加拉国工程与技术大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in 2024 27th International Conference on Computer and Information Technology (ICCIT)

点击查看摘要

Abstract:The brain is a highly complex organ that manages many important tasks, including movement, memory and thinking. Brain-related conditions, like tumors and degenerative disorders, can be hard to diagnose and treat. Magnetic Resonance Imaging (MRI) serves as a key tool for identifying these conditions, offering high-resolution images of brain structures. Despite this, interpreting MRI scans can be complicated. This study tackles this challenge by conducting a comparative analysis of Vision Transformer (ViT) and Transfer Learning (TL) models such as VGG16, VGG19, ResNet50V2, and MobileNetV2 for classifying brain diseases using MRI data from a Bangladesh-based dataset. ViTs, known for their ability to capture global relationships in images, are particularly effective for medical imaging tasks. Transfer learning helps to mitigate data constraints by fine-tuning pre-trained models. Furthermore, Explainable AI (XAI) methods such as GradCAM, GradCAM++, LayerCAM, ScoreCAM, and Faster-ScoreCAM are employed to interpret model predictions. The results demonstrate that the ViT surpasses the transfer learning models, achieving a classification accuracy of 94.39%. The integration of XAI methods enhances model transparency, offering crucial insights to aid medical professionals in diagnosing brain diseases with greater precision.
zh

[CV-137] An Approach Towards Identifying Bangladeshi Leaf Diseases through Transfer Learning and XAI

【速读】:该论文旨在解决植物叶片病害识别的问题,特别是在孟加拉国农业中,由于专家知识有限,农民难以有效管理植物健康,导致作物损失和生计影响。研究的关键在于利用深度学习(Deep Learning, DL)模型对六种植物的21种不同叶部病害进行分类,以提高病害检测的准确性并减少对专家的依赖。其中,VGG19和Xception模型分别达到了98.90%和98.66%的最高准确率,同时结合可解释人工智能(Explainable AI, XAI)技术如GradCAM、GradCAM++等,增强了模型决策的透明度,使农民能够理解预测结果并采取相应措施。

链接: https://arxiv.org/abs/2505.16033
作者: Faika Fairuj Preotee,Shuvashis Sarker,Shamim Rahim Refat,Tashreef Muhammad,Shifat Islam
机构: Ahsanullah University of Science and Technology (阿罕默德科技大学); Southeast University (东南大学); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in 2024 27th International Conference on Computer and Information Technology (ICCIT)

点击查看摘要

Abstract:Leaf diseases are harmful conditions that affect the health, appearance and productivity of plants, leading to significant plant loss and negatively impacting farmers’ livelihoods. These diseases cause visible symptoms such as lesions, color changes, and texture variations, making it difficult for farmers to manage plant health, especially in large or remote farms where expert knowledge is limited. The main motivation of this study is to provide an efficient and accessible solution for identifying plant leaf diseases in Bangladesh, where agriculture plays a critical role in food security. The objective of our research is to classify 21 distinct leaf diseases across six plants using deep learning models, improving disease detection accuracy while reducing the need for expert involvement. Deep Learning (DL) techniques, including CNN and Transfer Learning (TL) models like VGG16, VGG19, MobileNetV2, InceptionV3, ResNet50V2 and Xception are used. VGG19 and Xception achieve the highest accuracies, with 98.90% and 98.66% respectively. Additionally, Explainable AI (XAI) techniques such as GradCAM, GradCAM++, LayerCAM, ScoreCAM and FasterScoreCAM are used to enhance transparency by highlighting the regions of the models focused on during disease classification. This transparency ensures that farmers can understand the model’s predictions and take necessary action. This approach not only improves disease management but also supports farmers in making informed decisions, leading to better plant protection and increased agricultural productivity.
zh

[CV-138] Learning better representations for crowded pedestrians in offboard LiDAR-camera 3D tracking-by-detection

【速读】:该论文试图解决在高度拥挤的城市环境中基于学习的自主感知中行人检测的长尾问题,特别是针对3D真实标注生成速度慢且性能关键的挑战。其解决方案的关键在于构建一个离线自动标注系统,该系统通过LiDAR点云和多视角图像重建行人轨迹,并提出学习高分辨率表示方法,以增强对密集场景和小目标的泛化能力和性能。

链接: https://arxiv.org/abs/2505.16029
作者: Shichao Li,Peiliang Li,Qing Lian,Peng Yun,Xiaozhi Chen
机构: Zhuoyu Technology (卓越科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Perceiving pedestrians in highly crowded urban environments is a difficult long-tail problem for learning-based autonomous perception. Speeding up 3D ground truth generation for such crowded scenes is performance-critical yet highly nontrivial. The difficulties include the sparsity of the captured pedestrian point cloud and a lack of suitable benchmarks for a specific system design study. To tackle the challenges, we first collect a new multi-view LiDAR-camera 3D multiple-object-tracking benchmark of highly crowded pedestrians for in-depth analysis. We then build an offboard auto-labeling system that reconstructs pedestrian trajectories from LiDAR point clouds and multi-view images. To improve the generalization power for crowded scenes and the performance for small objects, we propose to learn high-resolution representations that are density-aware and relationship-aware. Extensive experiments validate that our approach significantly improves the 3D pedestrian tracking performance towards higher auto-labeling efficiency. The code will be publicly available at this https URL.
zh

[CV-139] CP-LLM : Context and Pixel Aware Large Language Model for Video Quality Assessment

【速读】:该论文旨在解决视频质量评估(Video Quality Assessment, VQA)中传统模型对像素级失真敏感但缺乏上下文理解,以及基于大语言模型(LLM)的模型在处理微小失真或分离质量评分与描述任务时表现不足的问题。其解决方案的关键在于提出一种名为CP-LLM的多模态大语言模型架构,该架构包含双视觉编码器,能够分别从高层(视频上下文)和低层(像素失真)粒度独立分析感知质量,并通过语言解码器推理两者之间的相互作用,从而实现对像素失真的增强敏感性及可解释的质量描述与稳健的质量评分。

链接: https://arxiv.org/abs/2505.16025
作者: Wen Wen,Yaohong Wu,Yue Sheng,Neil Birkbeck,Balu Adsumilli,Yilin Wang
机构: City University of Hong Kong (香港城市大学); Google Inc. (谷歌公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: Under review

点击查看摘要

Abstract:Video quality assessment (VQA) is a challenging research topic with broad applications. Effective VQA necessitates sensitivity to pixel-level distortions and a comprehensive understanding of video context to accurately determine the perceptual impact of distortions. Traditional hand-crafted and learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent LLM-based models struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context and Pixel aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). The model is trained via a multi-task pipeline optimizing for score prediction, description generation, and pairwise comparisons. Experiment results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on established VQA benchmarks and superior robustness to pixel distortions, confirming its efficacy for comprehensive and practical video quality assessment in real-world scenarios.
zh

[CV-140] GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection

【速读】:该论文旨在解决神经网络中的分布外(Out-of-Distribution, OOD)检测问题,即如何有效区分训练数据与测试阶段出现的未知类别数据。其解决方案的关键在于利用神经网络梯度的低秩结构,该结构由神经切线核(Neural Tangent Kernel, NTK)对齐所引发。通过将主成分分析(PCA)应用于梯度类均值,GradPCA方法在多个标准图像分类基准上表现出更一致的性能。研究还从理论角度分析了谱OOD检测的特性,揭示了特征质量(尤其是预训练与非预训练表示)对检测器性能的重要影响。

链接: https://arxiv.org/abs/2505.16017
作者: Mariia Seleznova,Hung-Hsu Chou,Claudio Mayrink Verdun,Gitta Kutyniok
机构: Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学); University of Pittsburgh (匹兹堡大学); Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce GradPCA, an Out-of-Distribution (OOD) detection method that exploits the low-rank structure of neural network gradients induced by Neural Tangent Kernel (NTK) alignment. GradPCA applies Principal Component Analysis (PCA) to gradient class-means, achieving more consistent performance than existing methods across standard image classification benchmarks. We provide a theoretical perspective on spectral OOD detection in neural networks to support GradPCA, highlighting feature-space properties that enable effective detection and naturally emerge from NTK alignment. Our analysis further reveals that feature quality – particularly the use of pretrained versus non-pretrained representations – plays a crucial role in determining which detectors will succeed. Extensive experiments validate the strong performance of GradPCA, and our theoretical framework offers guidance for designing more principled spectral OOD detectors.
zh
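
下面给出 GradPCA 思路的简化草图:对各类别梯度均值做 PCA,用测试梯度在主子空间外的投影残差作为 OOD 得分。梯度的获取方式与得分定义均为笔者假设的简化版本,并非论文的完整算法。

```python
import numpy as np

def fit_gradpca(class_mean_grads, k=2):
    """class_mean_grads: (C, D) 每个类别的梯度均值;返回前 k 个主方向。"""
    G = class_mean_grads - class_mean_grads.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    return vt[:k]                           # (k, D) 主子空间的正交基

def ood_score(grad, basis):
    """梯度在主子空间外的投影残差越大,越可能是 OOD 样本。"""
    proj = basis.T @ (basis @ grad)
    return float(np.linalg.norm(grad - proj))

rng = np.random.default_rng(0)
class_grads = rng.normal(size=(10, 16))     # 假设:10 个类别、16 维梯度均值
basis = fit_gradpca(class_grads, k=4)
print(ood_score(rng.normal(size=16), basis))
```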

[CV-141] Position: Agentic Systems Constitute a Key Component of Next-Generation Intelligent Image Processing

【速读】:该论文试图解决当前图像处理领域过度依赖模型中心范式所带来的泛化能力不足、适应性差以及现实问题解决灵活性有限的问题(model-centric paradigms)。其解决方案的关键在于发展智能代理系统(agentic systems),这些系统能够动态选择、组合和优化现有的图像处理工具,从而模拟人类专家在解决复杂问题时的战略性工具协调能力,克服单一模型的脆弱性。

链接: https://arxiv.org/abs/2505.16007
作者: Jinjin Gu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This position paper argues that the image processing community should broaden its focus from purely model-centric development to include agentic system design as an essential complementary paradigm. While deep learning has significantly advanced capabilities for specific image processing tasks, current approaches face critical limitations in generalization, adaptability, and real-world problem-solving flexibility. We propose that developing intelligent agentic systems, capable of dynamically selecting, combining, and optimizing existing image processing tools, represents the next evolutionary step for the field. Such systems would emulate human experts’ ability to strategically orchestrate different tools to solve complex problems, overcoming the brittleness of monolithic models. The paper analyzes key limitations of model-centric paradigms, establishes design principles for agentic image processing systems, and outlines different capability levels for such agents.
zh

[CV-142] Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

【速读】:该论文旨在解决图像到图像翻译(image-to-image translation)问题,即学习源域与目标域之间的映射,以实现如风格迁移、外观变换和领域适应等任务。其解决方案的关键在于采用基于扩散模型的框架,具体为对Diffusion Transformers (DiT)进行适配,该模型结合了扩散模型的去噪能力与Transformer的全局建模优势。为了引导翻译过程,模型通过预训练CLIP编码器提取的图像嵌入进行条件控制,从而实现细粒度且结构一致的翻译,而无需依赖文本或类别标签。此外,通过引入CLIP相似性损失和LPIPS感知损失,在训练过程中增强了语义一致性和视觉保真度。

链接: https://arxiv.org/abs/2505.16001
作者: Qiang Zhu,Kuan Lu,Menghao Huo,Yuxiao Li
机构: University of Houston (休斯顿大学); Santa Clara University (圣克拉拉大学); Cornell University (康奈尔大学); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.
zh
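
以下 PyTorch 片段示意该方法训练目标中两项附加损失的组合:CLIP 余弦相似度约束语义一致,感知距离(LPIPS)约束视觉保真。编码器输出与权重系数均为占位假设。

```python
import torch
import torch.nn.functional as F

def clip_similarity_loss(emb_pred, emb_target):
    """1 - 余弦相似度:约束翻译结果与目标在 CLIP 空间语义一致。"""
    return 1.0 - F.cosine_similarity(emb_pred, emb_target, dim=-1).mean()

def combined_loss(diffusion_loss, emb_pred, emb_target, lpips_dist,
                  w_clip=0.5, w_lpips=0.5):   # 权重系数为假设值
    return (diffusion_loss
            + w_clip * clip_similarity_loss(emb_pred, emb_target)
            + w_lpips * lpips_dist)

# 玩具数值验证:emb_* 代表预训练 CLIP 编码器的输出(此处随机占位)
emb_pred, emb_target = torch.randn(4, 512), torch.randn(4, 512)
loss = combined_loss(torch.tensor(0.1), emb_pred, emb_target, torch.tensor(0.2))
print(loss.item())
```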

[CV-143] Domain Adaptive Skin Lesion Classification via Conformal Ensemble of Vision Transformers CCECE

【速读】:该论文旨在解决深度学习模型在领域偏移场景下的可信度问题,特别是在医学影像决策支持系统等关键领域中,传统模型在面对不同数据源变化时表现不稳定。其解决方案的关键在于提出一种名为Conformal Ensemble of Vision Transformers (CE-ViTs)的框架,通过引入视觉变换器(Vision Transformer)的集成学习方法,结合多源数据集(如HAM10000、Dermofit和Skin Cancer ISIC)进行训练,以增强模型的领域适应性和鲁棒性,同时结合置信区间预测技术提升不确定性估计的可靠性。

链接: https://arxiv.org/abs/2505.15997
作者: Mehran Zoravar,Shadi Alijani,Homayoun Najjaran
机构: University of Victoria (维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 5 pages, 4 figures, conference (ccece 2025)

点击查看摘要

Abstract:Exploring the trustworthiness of deep learning models is crucial, especially in critical domains such as medical imaging decision support systems. Conformal prediction has emerged as a rigorous means of providing deep learning models with reliable uncertainty estimates and safety guarantees. However, conformal prediction results face challenges due to the backbone model’s struggles in domain-shifted scenarios, such as variations in different sources. To address this challenge, this paper proposes a novel framework termed Conformal Ensemble of Vision Transformers (CE-ViTs) designed to enhance image classification performance by prioritizing domain adaptation and model robustness, while accounting for uncertainty. The proposed method leverages an ensemble of vision transformer models in the backbone, trained on diverse datasets including the HAM10000, Dermofit, and Skin Cancer ISIC datasets. This ensemble learning approach, calibrated on the combination of these datasets, aims to enhance domain adaptation through conformal learning. Experimental results underscore that the framework achieves a high coverage rate of 90.38%, representing an improvement of 9.95% compared to the HAM10000 model. This indicates a stronger likelihood that the prediction set includes the true label compared to the individual models. Ensemble learning in CE-ViTs significantly improves conformal prediction performance, increasing the average prediction set size for challenging misclassified samples from 1.86 to 3.075.
zh
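
作为“覆盖率”与“预测集大小”两项指标来历的演示,下面用 NumPy 给出标准的分裂式符合预测(split conformal)流程:校准集上取非一致性得分的分位数,测试时把得分不超过阈值的类全部放入预测集。数据随机生成,仅作示意。

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 3, 0.1        # alpha = 允许的失覆盖率

# 假设的校准集 softmax 概率与真实标签
probs = rng.dirichlet(np.ones(n_classes), size=n_cal)
labels = rng.integers(0, n_classes, size=n_cal)

# 非一致性得分:1 - 真实类别概率
scores = 1.0 - probs[np.arange(n_cal), labels]
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# 测试样本:所有满足 1 - p(y) <= q 的类别构成预测集
test_prob = rng.dirichlet(np.ones(n_classes))
pred_set = np.where(1.0 - test_prob <= q)[0]
print(q, pred_set)   # 约 90% 的概率下,真实标签落在预测集中
```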

[CV-144] Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders CVPR2025

【速读】:该论文试图解决深度视觉模型如何编码ImageNet层次结构的问题,即探究视觉模型内部表示是否与ImageNet的本体结构一致。解决方案的关键在于利用稀疏自编码器(Sparse Autoencoders, SAEs)对模型激活进行分析,以揭示其内部表示中的层次关系,并评估不同层间表示的一致性。通过这一方法,研究展示了SAEs在探测深度网络语义结构中的潜力。

链接: https://arxiv.org/abs/2505.15970
作者: Matthew Lyle Olson,Musashi Hinck,Neale Ratzlaff,Changbai Li,Phillip Howard,Vasudev Lal,Shao-Yen Tseng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: (Oral) CVPR 2025 Workshop on Mechanistic Interpretability for Vision. Authors 1 and 2 contributed equally

点击查看摘要

Abstract:The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.
zh

[CV-145] Super-Resolution with Structured Motion

【速读】:该论文试图解决在成像约束下超分辨率重建的极限问题,特别是针对传统基于重建的方法在分辨率提升方面受到理论和实践限制的问题。其解决方案的关键在于利用高精度运动信息、稀疏图像先验和凸优化方法,从而实现大幅的分辨率提升。研究还表明,运动模糊并非总是干扰因素,通过伪随机运动,可以仅使用一张低分辨率图像重建出高分辨率目标。

链接: https://arxiv.org/abs/2505.15961
作者: Gabby Litterio,Juan-David Lizarazo-Ferro,Pedro Felzenszwalb,Rashid Zia
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We consider the limits of super-resolution using imaging constraints. Due to various theoretical and practical limitations, reconstruction-based methods have been largely restricted to small increases in resolution. In addition, motion-blur is usually seen as a nuisance that impedes super-resolution. We show that by using high-precision motion information, sparse image priors, and convex optimization, it is possible to increase resolution by large factors. A key operation in super-resolution is deconvolution with a box. In general, convolution with a box is not invertible. However, we obtain perfect reconstructions of sparse signals using convex optimization. We also show that motion blur can be helpful for super-resolution. We demonstrate that using pseudo-random motion it is possible to reconstruct a high-resolution target using a single low-resolution image. We present numerical experiments with simulated data and results with real data captured by a camera mounted on a computer controlled stage.
zh
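
下面用 ISTA 给出“盒状核反卷积 + 稀疏先验”这一核心运算的极简 NumPy 草图,求解 min 0.5*||h*x - y||^2 + lam*||x||_1。步长与 lam 为假设值;论文实际使用更一般的凸优化求解。

```python
import numpy as np

def ista_deconv(y, h, lam=0.02, step=0.5, iters=300):
    """用 ISTA 从盒核卷积观测 y 中恢复稀疏信号 x。"""
    n = len(y) - len(h) + 1
    x = np.zeros(n)
    for _ in range(iters):
        r = np.convolve(x, h) - y                    # 残差(full 卷积)
        grad = np.correlate(r, h, mode='valid')      # 数据项梯度 = 残差与核的相关
        x = x - step * grad
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)  # 软阈值
    return x

x_true = np.zeros(50)
x_true[[10, 30]] = 1.0                               # 稀疏真值(两个脉冲)
h = np.ones(5) / 5                                   # 盒状(均值)卷积核,本身不可逆
y = np.convolve(x_true, h)
x_hat = ista_deconv(y, h)
print(np.round(x_hat[[10, 30]], 2))                  # 约 [0.9, 0.9],L1 正则带来轻微收缩
```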

[CV-146] VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

【速读】:该论文试图解决视频游戏质量保证(Quality Assurance, QA)流程中自动化程度低的问题,尤其是在当前游戏产业收入持续增长的背景下,优化开发工作流成为关键需求。现有基准测试不足以满足该领域的特定要求,因此亟需标准化的评估体系。论文提出的解决方案的关键在于引入VideoGameQA-Bench,这是一个全面的基准测试平台,涵盖了多种游戏QA活动,如视觉单元测试、视觉回归测试、大海捞针(needle-in-a-haystack)任务、故障(glitch)检测以及针对图像和视频的错误报告生成,从而为评估视觉-语言模型(Vision-Language Models, VLMs)在真实场景中的表现提供了可靠依据。

链接: https://arxiv.org/abs/2505.15952
作者: Mohammad Reza Taesiri,Abhijay Ghildyal,Saman Zadtootaghaj,Nabajeet Barman,Cor-Paul Bezemer
机构: University of Alberta (阿尔伯塔大学); Sony Interactive Entertainment (索尼互动娱乐)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website with code and data: this https URL

点击查看摘要

Abstract:With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector’s sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry’s most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: this https URL
zh

[CV-147] MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

【速读】:该论文旨在解决当前基于功能性磁共振成像(fMRI)的视觉重建方法中,过度追求重建保真度而忽视可解释性的问题,这一问题限制了神经科学洞见的获取。其解决方案的关键在于提出MoRE-Brain框架,该框架采用基于脑网络原理的分层专家混合(Mixture-of-Experts)架构,通过功能相关的体素组划分,使不同专家模拟专门化的脑网络进行处理,并结合一种新颖的双阶段路由机制,动态加权专家贡献以指导扩散模型生成图像,从而实现高保真、可适应且可解释的视觉重建。

链接: https://arxiv.org/abs/2505.15946
作者: Yuxiang Wei,Yanteng Zhang,Xi Xiao,Tianyang Wang,Xiao Wang,Vince D. Calhoun
机构: TreNDS (TreNDS); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Oak Ridge National Laboratory (橡树岭国家实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain’s high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: this https URL.
zh

[CV-148] VERDI: VLM-Embedded Reasoning for Autonomous Driving

【速读】:该论文旨在解决自动驾驶系统在部分可观测性和现实世界复杂性下决策能力不足的问题,其核心挑战在于如何将人类驾驶员的常识推理能力有效地融入自动驾驶堆栈中。现有方法虽然尝试利用微调的视觉-语言模型(VLM)在推理阶段进行轨迹规划以模仿人类行为,但存在部署成本高和安全分解困难等局限性。论文提出的解决方案关键在于VERDI框架,该框架通过在训练阶段将VLM的推理过程和常识知识蒸馏到自动驾驶系统中,使模块化端到端(e2e)自动驾驶模型在感知、预测和规划阶段的中间输出与VLM生成的解释性文本特征对齐,从而在不增加推理时计算开销的情况下,实现结构化推理能力的内化。

链接: https://arxiv.org/abs/2505.15925
作者: Bowen Feng,Zhiting Mei,Baiang Li,Julian Ost,Roger Girgis,Anirudha Majumdar,Felix Heide
机构: Princeton University (普林斯顿大学); Torc Robotics (托克机器人)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in L2 distance, while maintaining high inference speed.
zh
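
下面以 PyTorch 示意 VERDI 训练期对齐目标的简化形式:把感知/预测/规划三个阶段的中间特征分别投影后与对应的 VLM 文本特征做余弦对齐。投影层与文本嵌入均为占位假设,推理时无需调用 VLM。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageAligner(nn.Module):
    """把某一阶段(感知/预测/规划)的特征投影到文本嵌入空间并做余弦对齐。"""
    def __init__(self, feat_dim, text_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_dim)

    def forward(self, stage_feat, text_emb):
        z = self.proj(stage_feat)
        return 1.0 - F.cosine_similarity(z, text_emb, dim=-1).mean()

aligners = nn.ModuleList([StageAligner(256, 512) for _ in range(3)])  # 三个阶段
stage_feats = [torch.randn(4, 256) for _ in range(3)]   # 占位:各模块的中间输出
text_embs = [torch.randn(4, 512) for _ in range(3)]     # 占位:VLM 推理文本特征
align_loss = sum(a(f, t) for a, f, t in zip(aligners, stage_feats, text_embs))
print(align_loss.item())   # 训练时与原有规划损失相加,推理时不再依赖 VLM
```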

[CV-149] Challenger: Affordable Adversarial Driving Video Generation

【速读】:该论文旨在解决生成具有物理合理性和图像真实感的对抗性驾驶视频的问题,以更有效地测试自动驾驶(AD)系统。现有方法多集中于常规场景,而针对对抗性场景的生成则通常依赖抽象轨迹或鸟瞰图(BEV)表示,难以生成真实的传感器数据。论文提出的解决方案——Challenger框架,其关键在于通过两种技术实现:一是基于物理感知的多轮轨迹优化过程,用于缩小潜在对抗性操作的范围;二是定制化的轨迹评分函数,以鼓励生成既真实又具有对抗性的行为,同时兼容后续视频生成。

链接: https://arxiv.org/abs/2505.15880
作者: Zhiyuan Xu,Bohan Li,Huan-ang Gao,Mingju Gao,Yong Chen,Ming Liu,Chenxu Yan,Hang Zhao,Shuo Feng,Hao Zhao
机构: Tsinghua(清华大学); UCAS(中国科学院大学); SJTU(上海交通大学); Geely Auto(吉利汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios, including cut-ins, sudden lane changes, tailgating, and blind spot intrusions, and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.
zh

[CV-150] Decouple and Orthogonalize: A Data-Free Framework for LoRA Merging

【速读】:该论文旨在解决模型合并(model merging)在处理低秩适配(LoRA)方法时性能不佳的问题。现有合并方法主要针对全量微调(full fine-tuning)设计,但在面对LoRA模块时表现较差,原因在于LoRA参数的幅度方差远大于全量微调参数,而较大的幅度方差会导致合并后参数分布偏离,进而引发信息丢失和性能下降。论文提出的解决方案关键在于Decoupled and Orthogonal merging(DO-Merging),通过将参数分解为幅度和方向两部分并分别合并,减少幅度差异对方向对齐的影响,同时引入无数据、分层的梯度下降方法以约束方向合并过程,从而有效提升合并性能。

链接: https://arxiv.org/abs/2505.15875
作者: Shenghe Zheng,Hongzhi Wang,Chenyu Huang,Xiaohui Wang,Tao Chen,Jiayuan Fan,Shuyue Hu,Peng Ye
机构: Harbin Institute of Technology (哈尔滨工业大学); Shanghai AI Laboratory (上海人工智能实验室); Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:With more open-source models available for diverse tasks, model merging has gained attention by combining models into one, reducing training, storage, and inference costs. Current research mainly focuses on model merging for full fine-tuning, overlooking the popular LoRA. However, our empirical analysis reveals that: a) existing merging methods designed for full fine-tuning perform poorly on LoRA; b) LoRA modules show much larger parameter magnitude variance than full fine-tuned weights; c) greater parameter magnitude variance correlates with worse merging performance. Considering that large magnitude variances cause deviations in the distribution of the merged parameters, resulting in information loss and performance degradation, we propose a Decoupled and Orthogonal merging approach (DO-Merging). By separating parameters into magnitude and direction components and merging them independently, we reduce the impact of magnitude differences on the directional alignment of the merged models, thereby preserving task information. Furthermore, we introduce a data-free, layer-wise gradient descent method with orthogonal constraints to mitigate interference during the merging of direction components. We provide theoretical guarantees for both the decoupling and orthogonal components. Extensive experiments across vision, language, and multi-modal domains validate that the proposed DO-Merging achieves significantly higher performance than existing merging methods at minimal cost. Notably, each component can be flexibly integrated with existing methods, offering near free-lunch improvements across tasks.
zh
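
以下 PyTorch 草图示意 DO-Merging 的核心一步“幅度与方向解耦合并”:把每个参数分解为范数与单位方向,分别平均后重组。论文对方向分量还施加了逐层正交约束,此处从略,仅作示意。

```python
import torch

def decoupled_merge(param_list, eps=1e-8):
    """param_list: 来自多个 LoRA 模型的同名参数张量列表。"""
    mags = [p.norm() for p in param_list]                     # 幅度分量
    dirs = [p / (m + eps) for p, m in zip(param_list, mags)]  # 方向分量(单位化)
    merged_dir = torch.stack(dirs).mean(dim=0)
    merged_dir = merged_dir / (merged_dir.norm() + eps)       # 重新单位化
    merged_mag = torch.stack(mags).mean()                     # 幅度单独平均
    return merged_mag * merged_dir                            # 方向合并的正交约束此处省略

a = torch.tensor([[3.0, 0.0], [0.0, 4.0]])    # 范数 5
b = torch.tensor([[0.3, 0.0], [0.0, 0.4]])    # 与 a 同方向,范数 0.5
print(decoupled_merge([a, b]))                # 方向保持一致,幅度取平均 2.75
```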

[CV-151] Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

【速读】:该论文试图解决城市间通勤起讫点(Origin-Destination, OD)流数据难以获取的问题,该数据对于全球城市的可持续发展至关重要,但因交通调查成本高和隐私问题而难以获得。解决方案的关键在于利用全球可公开获取的卫星影像,通过视觉-语言地理基础模型提取与人类移动相关的城市语义信号,并结合人口数据生成区域级表征,再通过图扩散模型生成OD流数据,从而实现对全球城市高一致性的真实出行数据生成。

链接: https://arxiv.org/abs/2505.15870
作者: Can Rong,Xin Zhang,Yanxin Xi,Hongjie Sui,Jingtao Ding,Yong Li
机构: Tsinghua University (清华大学); University of Helsinki (赫尔辛基大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
备注: 26 pages, 8 figures

点击查看摘要

Abstract:Commuting Origin-Destination (OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain such data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, capturing over 98% of the expressiveness of traditional, hard-to-collect multi-source data on urban sociodemographics, economics, land use, and points of interest. This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any city of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen generalizes well across diverse urban environments and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool that seamlessly integrates data acquisition and curation, urban semantic feature extraction, and OD flow generation. It has been released at this https URL.
zh

[CV-152] SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval

【速读】:该论文试图解决图像检索中因依赖低级视觉特征(如颜色)而导致的偏差问题,以及现有基于场景图的检索方法对标注数据和不一致的基于文本的监督信号的依赖问题。其解决方案的关键在于提出一种基于图自编码器(Graph Autoencoder)的无监督检索框架SCENIR,该框架无需依赖标签训练数据,并引入图编辑距离(Graph Edit Distance, GED)作为确定性和鲁棒的场景图相似性度量,以替代传统不一致的基于文本的监督方式,从而提升检索的可靠性和性能。

链接: https://arxiv.org/abs/2505.15867
作者: Nikolaos Chaidos,Angeliki Dimitriou,Maria Lymperaiou,Giorgos Stamou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision, stemming from variable text encodings, undermines retrieval reliability. To address these issues, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
zh
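
作为以图编辑距离(GED)充当检索真值度量的演示,下面用 networkx 对两个玩具“场景图”计算精确 GED。注意精确 GED 求解是 NP-hard 的,只适合很小的图;节点标签等均为假设示例。

```python
import networkx as nx

def scene_graph(edges):
    """节点 = 物体(带 label 属性),边 = 关系。"""
    g = nx.Graph()
    for u, v in edges:
        g.add_node(u, label=u)
        g.add_node(v, label=v)
        g.add_edge(u, v)
    return g

g1 = scene_graph([("man", "horse"), ("horse", "field")])
g2 = scene_graph([("man", "dog"), ("dog", "field")])

# 精确 GED:节点/边的插入、删除、替换的最小编辑代价
dist = nx.graph_edit_distance(
    g1, g2,
    node_match=lambda a, b: a["label"] == b["label"],  # 标签相同的节点才视为匹配
)
print(dist)  # horse -> dog 一次节点替换,期望代价为 1
```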

[CV-153] How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

【速读】:该论文试图解决大型视觉语言模型(Large Vision Language Models, LVLMs)在可解释性方面存在的不足,特别是其如何定位和解析图像中的文本信息。解决方案的关键在于识别并分析负责从图像中识别文本的特定神经网络头,即光学字符识别头(OCR Head)。研究发现,这些OCR Head具有激活更密集、特性与常规检索头显著不同以及静态激活等特征,并通过实验验证了其在下游任务中的有效性,为理解LVLM处理图像中嵌入文本信息的内部机制提供了深入见解。

链接: https://arxiv.org/abs/2505.15865
作者: Ingeol Baek,Hwan Chang,Sunghyun Ryu,Hwanhee Lee
机构: Chung-Ang University (忠南大学); Sejong University (世宗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.
zh

[CV-154] Generative AI for Autonomous Driving: A Review

【速读】:该论文试图解决生成式 AI (Generative AI) 在自动驾驶 (Autonomous Driving, AD) 领域中的应用问题,旨在探索生成模型如何提升静态地图构建、动态场景生成、轨迹预测和车辆运动规划等任务。其解决方案的关键在于系统性地分析多种生成方法(如变分自编码器、生成对抗网络、可逆神经网络、生成变压器和扩散模型)在 AD 任务中的能力与局限性,并探讨传统方法与生成方法的混合策略,以增强系统的适应性和鲁棒性。此外,论文还针对安全、可解释性和实时性等核心挑战提出了建议。

链接: https://arxiv.org/abs/2505.15863
作者: Katharina Winter,Abhishek Vivekanandan,Rupert Polley,Yinzhe Shen,Christian Schlauch,Mohamed-Khalil Bouzidi,Bojan Derajic,Natalie Grabowsky,Annajoyce Mariani,Dennis Rochau,Giovanni Lucente,Harsh Yadav,Firas Mualla,Adam Molin,Sebastian Bernhard,Christian Wirth,Ömer Şahin Taş,Nadja Klein,Fabian B. Flohr,Hanno Gottschalk
机构: Munich University of Applied Sciences, Intelligent Vehicles Lab (IVL); FZI Research Center for Information Technology; Continental Automotive Technologies GmbH; Technical University of Berlin; German Aerospace Center (DLR); Aptiv Services Deutschland GmbH; AI Lab, ZF Friedrichshafen AG; DENSO AUTOMOTIVE Deutschland GmbH; Karlsruhe Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Generative AI (GenAI) is rapidly advancing the field of Autonomous Driving (AD), extending beyond traditional applications in text, image, and video generation. We explore how generative models can enhance automotive tasks, such as static map creation, dynamic scenario generation, trajectory forecasting, and vehicle motion planning. By examining multiple generative approaches ranging from Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) to Invertible Neural Networks (INNs), Generative Transformers (GTs), and Diffusion Models (DMs), we highlight and compare their capabilities and limitations for AD-specific applications. Additionally, we discuss hybrid methods integrating conventional techniques with generative approaches, and emphasize their improved adaptability and robustness. We also identify relevant datasets and outline open research questions to guide future developments in GenAI. Finally, we discuss three core challenges: safety, interpretability, and real-time capabilities, and present recommendations for image generation, dynamic scenario generation, and planning.
zh

[CV-155] TDFormer: A Top-Down Attention-Controlled Spiking Transformer

【速读】:该论文试图解决传统脉冲神经网络(Spiking Neural Networks, SNNs)中膜电位作为唯一信息传递机制的局限性,这种隐式表示方式难以有效表征时间信息,导致模型无法充分利用先前时间步的信息,从而限制了性能。解决方案的关键在于引入TDFormer,该模型采用自上而下的反馈结构,通过高层表示对低层信息处理进行调制,从两个方面提升模型性能:一方面在前向传播过程中增加时间步之间的互信息,增强时间信息的传递与整合;另一方面在反向传播过程中理论证明该反馈结构缓解了时间维度上的梯度消失问题。

链接: https://arxiv.org/abs/2505.15840
作者: Zizheng Zhu,Yingchao Yu,Zeqi Zheng,Zhaofei Yu,Yaochu Jin
机构: Westlake University (西湖大学); Zhejiang University (浙江大学); Peking University (北京大学); Donghua University (东华大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages

点击查看摘要

Abstract:Traditional spiking neural networks (SNNs) can be viewed as a combination of multiple subnetworks with each running for one time step, where the parameters are shared, and the membrane potential serves as the only information link between them. However, the implicit nature of the membrane potential limits its ability to effectively represent temporal information. As a result, each time step cannot fully leverage information from previous time steps, seriously limiting the model’s performance. Inspired by the top-down mechanism in the brain, we introduce TDFormer, a novel model with a top-down feedback structure that functions hierarchically and leverages high-order representations from earlier time steps to modulate the processing of low-order information at later stages. The feedback structure plays a role from two perspectives: 1) During forward propagation, our model increases the mutual information across time steps, indicating that richer temporal information is being transmitted and integrated in different time steps. 2) During backward propagation, we theoretically prove that the feedback structure alleviates the problem of vanishing gradients along the time dimension. We find that these mechanisms together significantly and consistently improve the model performance on multiple datasets. In particular, our model achieves state-of-the-art performance on ImageNet with an accuracy of 86.83%.
zh

[CV-156] Multilinear subspace learning for person re-identification based fusion of high order tensor features

【速读】:该论文旨在解决视频监控图像分析中的行人再识别(Person Re-Identification, PRe-ID)问题,即在摄像头网络中识别和跟踪已检测到的目标个体。解决方案的关键在于提出一种高维特征融合方法(High-Dimensional Feature Fusion, HDFF),通过引入新的张量融合方案,将卷积神经网络(Convolutional Neural Networks, CNN)和局部最大出现(Local Maximal Occurrence, LOMO)两种特征在不同维度下进行联合建模。此外,采用张量跨视图二次分析(Tensor Cross-View Quadratic Analysis, TXQDA)进行多线性子空间学习,并结合余弦相似度进行匹配,以提升系统的识别准确率。

链接: https://arxiv.org/abs/2505.15825
作者: Ammar Chouchane,Mohcene Bessaoudi,Hamza Kheddar,Abdelmalik Ouamane,Tiago Vieira,Mahmoud Hassaballah
机构: University of MEDEA(麦德亚大学); University of Biskra(比斯克拉大学); Federal University of Alagoas(阿拉戈斯联邦大学); Prince Sattam Bin Abdulaziz University(沙特阿卜杜勒阿齐兹王子大学); South Valley University(南谷大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Video surveillance image analysis and processing is a challenging field in computer vision, with one of its most difficult tasks being Person Re-Identification (PRe-ID). PRe-ID aims to identify and track target individuals who have already been detected in a network of cameras, using a robust description of their pedestrian images. The success of recent research in PRe-ID is largely due to effective feature extraction and representation, as well as the powerful learning of these features to reliably discriminate between pedestrian images. To this end, two powerful features, Convolutional Neural Networks (CNN) and Local Maximal Occurrence (LOMO), are modeled on multidimensional data using the proposed method, High-Dimensional Feature Fusion (HDFF). Specifically, a new tensor fusion scheme is introduced to leverage and combine these two types of features in a single tensor, even though their dimensions are not identical. To enhance the system’s accuracy, we employ Tensor Cross-View Quadratic Analysis (TXQDA) for multilinear subspace learning, followed by cosine similarity for matching. TXQDA efficiently facilitates learning while reducing the high dimensionality inherent in high-order tensor data. The effectiveness of our approach is verified through experiments on three widely-used PRe-ID datasets: VIPeR, GRID, and PRID450S. Extensive experiments demonstrate that our approach outperforms recent state-of-the-art methods.
zh

[CV-157] UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

【速读】:该论文试图解决语言引导的细粒度无人机轨迹控制问题,即在短距离内根据语言指令执行反应性飞行行为。解决方案的关键在于提出一种基于模仿学习的框架,通过模仿专家飞行员的轨迹与原子化语言指令来学习细粒度控制策略,并构建了UAV-Flow基准,支持真实场景下的直接部署与系统评估。

链接: https://arxiv.org/abs/2505.15725
作者: Xiangyu Wang,Donglin Yang,Yue Liao,Wenhao Zheng,Wenjun Wu,Bin Dai,Hongsheng Li,Si Liu
机构: Institute of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院); National University of Singapore (新加坡国立大学); MMLab, CUHK (香港中文大学多媒体实验室); Hangzhou International Innovation Institute of Beihang University (北京航空航天大学杭州国际创新研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior works have primarily focused on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without a sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results show that VLA models are superior to VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting.
zh

[CV-158] PCMamba: Physics-Informed Cross-Modal State Space Model for Dual-Camera Compressive Hyperspectral Imaging

【Quick Read】: This paper aims to resolve the bottleneck in snapshot hyperspectral imaging (HSI) caused by existing methods' explicit separation of spectral and spatial information, thereby improving the quality and efficiency of HSI reconstruction. The key to the solution is the Physics-Informed Cross-Modal State Space Model Network (PCMamba), which embeds the forward physical imaging process of HSI into Mamba's linear complexity to achieve lightweight, high-quality reconstruction. By analyzing how thermal radiation signals are imaged, the network disentangles three key physical properties (temperature, emissivity, and texture) and completes HSI reconstruction through a physics-driven synthesis process that exploits the latent information in 2D compressive measurements and panchromatic images.

Link: https://arxiv.org/abs/2505.16373
Authors: Ge Meng, Zhongnan Cai, Jingyan Tu, Yingying Wang, Chenxin Li, Yue Huang, Xinghao Ding
Affiliations: Xiamen University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Panchromatic (PAN) -assisted Dual-Camera Compressive Hyperspectral Imaging (DCCHI) is a key technology in snapshot hyperspectral imaging. Existing research primarily focuses on exploring spectral information from 2D compressive measurements and spatial information from PAN images in an explicit manner, leading to a bottleneck in HSI reconstruction. Various physical factors, such as temperature, emissivity, and multiple reflections between objects, play a critical role in the process of a sensor acquiring hyperspectral thermal signals. Inspired by this, we attempt to investigate the interrelationships between physical properties to provide deeper theoretical insights for HSI reconstruction. In this paper, we propose a Physics-Informed Cross-Modal State Space Model Network (PCMamba) for DCCHI, which incorporates the forward physical imaging process of HSI into the linear complexity of Mamba to facilitate lightweight and high-quality HSI reconstruction. Specifically, we analyze the imaging process of hyperspectral thermal signals to enable the network to disentangle the three key physical properties (temperature, emissivity, and texture). By fully exploiting the potential information embedded in 2D measurements and PAN images, the HSIs are reconstructed through a physics-driven synthesis process. Furthermore, we design a Cross-Modal Scanning Mamba Block (CSMB) that introduces inter-modal pixel-wise interaction with positional inductive bias by cross-scanning the backbone features and PAN features. Extensive experiments conducted on both real and simulated datasets demonstrate that our method significantly outperforms SOTA methods in both quantitative and qualitative metrics.
zh

[CV-159] Generative Latent Coding for Ultra-Low Bitrate Image and Video Compression

【Quick Read】: This paper targets image and video compression that achieves both high realism and high fidelity at ultra-low bitrates, where existing methods struggle because pixel-space distortion is misaligned with human perception. The key to the solution is the Generative Latent Coding (GLC) model, which performs transform coding in the latent space of a generative vector-quantized variational autoencoder (VQ-VAE). This latent space offers greater sparsity, richer semantics, and better alignment with human perception, and thereby improves compression performance.

Link: https://arxiv.org/abs/2505.16177
Authors: Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Microsoft Research Asia
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Most existing approaches for image and video compression perform transform coding in the pixel space to reduce redundancy. However, due to the misalignment between the pixel-space distortion and human perception, such schemes often face the difficulties in achieving both high-realism and high-fidelity at ultra-low bitrate. To solve this problem, we propose Generative Latent Coding (GLC) models for image and video compression, termed GLC-image and GLC-Video. The transform coding of GLC is conducted in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE). Compared to the pixel-space, such a latent space offers greater sparsity, richer semantics and better alignment with human perception, and show its advantages in achieving high-realism and high-fidelity compression. To further enhance performance, we improve the hyper prior by introducing a spatial categorical hyper module in GLC-image and a spatio-temporal categorical hyper module in GLC-video. Additionally, the code-prediction-based loss function is proposed to enhance the semantic consistency. Experiments demonstrate that our scheme shows high visual quality at ultra-low bitrate for both image and video compression. For image compression, GLC-image achieves an impressive bitrate of less than 0.04 bpp, achieving the same FID as previous SOTA model MS-ILLM while using 45% fewer bitrate on the CLIC 2020 test set. For video compression, GLC-video achieves 65.3% bitrate saving over PLVC in terms of DISTS.
zh

[CV-160] Compressing Human Body Video with Interactive Semantics: A Generative Approach

【Quick Read】: This paper addresses the lack of interactivity and controllability in conventional video coding for human body videos, aiming to achieve more efficient and manipulable compression by introducing semantic information. The key to the solution is using a 3D human model to disentangle the nonlinear dynamics and complex motion of the human body signal into a series of configurable embeddings, which can be controllably edited, compactly compressed, and efficiently transmitted, while the decoder reconstructs high-quality human body video from these semantics.

Link: https://arxiv.org/abs/2505.16152
Authors: Bolin Chen, Shanzhi Yin, Hanwei Zhu, Lingyu Zhu, Zihan Zhang, Jie Chen, Ru-Ling Liao, Shiqi Wang, Yan Ye
Affiliations: Alibaba DAMO Academy & Hupan Laboratory; Shenzhen Research Institute, City University of Hong Kong; City University of Hong Kong
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this paper, we propose to compress human body video with interactive semantics, which can facilitate video coding to be interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle nonlinear dynamics and complex motion of human body signal into a series of configurable embeddings, which are controllably edited, compactly compressed, and efficiently transmitted. Moreover, the proposed decoder can evolve the mesh-based motion fields from these decoded semantics to realize the high-quality human body video reconstruction. Experimental results illustrate that the proposed framework can achieve promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre-/post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.
zh

[CV-161] OSCAR: One-Step Diffusion Codec Across Multiple Bit-rates

【Quick Read】: This paper tackles two problems in diffusion-based image compression: heavy computational overhead and the need to train a separate model for each bitrate. The key to the solution is OSCAR, a one-step diffusion codec across multiple bitrates, which views compressed latents as noisy variants of the original latents and models them as intermediate states along a diffusion trajectory. By establishing a mapping from bitrate to a pseudo diffusion timestep, a single generative model supports reconstruction at multiple bitrates, and the conventional multi-step sampling process is replaced by a single denoising pass, significantly improving inference efficiency.

Link: https://arxiv.org/abs/2505.16091
Authors: Jinpei Guo, Yifei Ji, Zheng Chen, Kai Liu, Min Liu, Wang Rao, Wenbo Li, Yong Guo, Yulun Zhang
Affiliations: Carnegie Mellon University; Shanghai Jiao Tong University; Huawei Noah's Ark Lab; South China University of Technology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at this https URL.
zh
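To make the bitrate-to-timestep idea concrete, here is a minimal sketch in Python. The linear mapping, the bpp range, and the `denoiser(z, t)` interface are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def bitrate_to_timestep(bpp: float, bpp_min=0.01, bpp_max=0.25, T=1000) -> int:
    """Map a compression bit-rate to a pseudo diffusion timestep.

    Lower bit-rates imply stronger distortion of the latents, so they are
    assigned a larger (noisier) timestep. The linear map is only a toy choice.
    """
    bpp = min(max(bpp, bpp_min), bpp_max)
    frac = (bpp_max - bpp) / (bpp_max - bpp_min)  # 1.0 at the lowest bit-rate
    return int(frac * (T - 1))

@torch.no_grad()
def one_step_decode(denoiser, z_compressed, bpp):
    """Single denoising pass: treat the compressed latent as a noisy state at
    the pseudo timestep and predict the clean latent directly (no iteration)."""
    t = bitrate_to_timestep(bpp)
    t_batch = torch.full((z_compressed.shape[0],), t, device=z_compressed.device)
    return denoiser(z_compressed, t_batch)  # clean latent, decoded by the VAE afterwards
```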

[CV-162] Comprehensive Lung Disease Detection Using Deep Learning Models and Hybrid Chest X-ray Data with Explainable AI

【Quick Read】: This paper aims at accurate detection of lung diseases (COVID-19, pneumonia, lung opacity, and normal lung conditions) from chest X-ray images. The key to the solution is a hybrid dataset built by merging four individual datasets from Bangladesh and global sources, which markedly improves model accuracy and generalizability. Among a range of deep learning and transfer learning models, VGG16, Xception, ResNet50V2, and DenseNet121 each reach 99% accuracy on the hybrid dataset, demonstrating strong performance. In addition, explainable AI techniques such as LIME improve the interpretability of model predictions, supporting reliable and interpretable AI-driven solutions for medical imaging.

Link: https://arxiv.org/abs/2505.16028
Authors: Shuvashis Sarker, Shamim Rahim Refat, Faika Fairuj Preotee, Tanvir Rouf Shawon, Raihan Tanvir
Affiliations: Ahsanullah University of Science and Technology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in 2024 27th International Conference on Computer and Information Technology (ICCIT)

Click to view abstract

Abstract:Advanced diagnostic instruments are crucial for the accurate detection and treatment of lung diseases, which affect millions of individuals globally. This study examines the effectiveness of deep learning and transfer learning models using a hybrid dataset, created by merging four individual datasets from Bangladesh and global sources. The hybrid dataset significantly enhances model accuracy and generalizability, particularly in detecting COVID-19, pneumonia, lung opacity, and normal lung conditions from chest X-ray images. A range of models, including CNN, VGG16, VGG19, InceptionV3, Xception, ResNet50V2, InceptionResNetV2, MobileNetV2, and DenseNet121, were applied to both individual and hybrid datasets. The results showed superior performance on the hybrid dataset, with VGG16, Xception, ResNet50V2, and DenseNet121 each achieving an accuracy of 99%. This consistent performance across the hybrid dataset highlights the robustness of these models in handling diverse data while maintaining high accuracy. To understand the models’ implicit behavior, explainable AI techniques were employed to illuminate their black-box nature. Specifically, LIME was used to enhance the interpretability of model predictions, especially in cases of misclassification, contributing to the development of reliable and interpretable AI-driven solutions for medical imaging.
zh

[CV-163] Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets

【Quick Read】: This paper addresses the insufficient evaluation of foundation models' real-world performance in chest X-ray (CXR) diagnosis, particularly their generalization across diverse populations and diagnostic tasks. The key to the solution is benchmarking foundation models against traditional convolutional neural networks (CNNs) on multinational CXR datasets, and introducing structured supervision and knowledge-enhanced prompts to improve performance. Experiments show that MAVL, a model using structured supervision and prompt design, achieves the best performance on both public and private datasets, highlighting the value of structured supervision and prompt design in radiologic AI.

Link: https://arxiv.org/abs/2505.16027
Authors: Qinmei Xu, Yiheng Li, Xianghao Zhan, Ahmet Gorkem Er, Brittany Dashevsky, Chuanjun Xu, Mohammed Alawad, Mengya Yang, Liu Ya, Changsheng Zhou, Xiao Li, Haruka Itakura, Olivier Gevaert
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 78 pages, 7 figures, 2 tables

Click to view abstract

Abstract:Foundation models leveraging vision-language pretraining have shown promise in chest X-ray (CXR) interpretation, yet their real-world performance across diverse populations and diagnostic tasks remains insufficiently evaluated. This study benchmarks the diagnostic performance and generalizability of foundation models versus traditional convolutional neural networks (CNNs) on multinational CXR datasets. We evaluated eight CXR diagnostic models - five vision-language foundation models and three CNN-based architectures - across 37 standardized classification tasks using six public datasets from the USA, Spain, India, and Vietnam, and three private datasets from hospitals in China. Performance was assessed using AUROC, AUPRC, and other metrics across both shared and dataset-specific tasks. Foundation models outperformed CNNs in both accuracy and task coverage. MAVL, a model incorporating knowledge-enhanced prompts and structured supervision, achieved the highest performance on public (mean AUROC: 0.82; AUPRC: 0.32) and private (mean AUROC: 0.95; AUPRC: 0.89) datasets, ranking first in 14 of 37 public and 3 of 4 private tasks. All models showed reduced performance on pediatric cases, with average AUROC dropping from 0.88 +/- 0.18 in adults to 0.57 +/- 0.29 in children (p = 0.0202). These findings highlight the value of structured supervision and prompt design in radiologic AI and suggest future directions including geographic expansion and ensemble modeling for clinical deployment. Code for all evaluated models is available at this https URL
zh

[CV-164] P3Net: Progressive and Periodic Perturbation for Semi-Supervised Medical Image Segmentation

【Quick Read】: This paper addresses two challenges in semi-supervised medical image segmentation (SSMIS): how to use perturbation mechanisms so that labeled data guides the learning of unlabeled data, and how to ensure accurate predictions in boundary regions. The key to the solution is a progressive and periodic perturbation mechanism (P3M) together with a boundary-focused loss. P3M dynamically adjusts the perturbation strategy so the model gradually learns from unlabeled data, while the boundary-focused loss increases the model's sensitivity to fine details in boundary regions, improving segmentation accuracy.

Link: https://arxiv.org/abs/2505.15861
Authors: Zhenyan Yao, Miao Zhang, Lanhu Wu, Yongri Piao, Feng Tian, Weibing Sun, Huchuan Lu
Affiliations: Dalian University of Technology
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Perturbation with diverse unlabeled data has proven beneficial for semi-supervised medical image segmentation (SSMIS). While many works have successfully used various perturbation techniques, a deeper understanding of learning perturbations is needed. Excessive or inappropriate perturbation can have negative effects, so we aim to address two challenges: how to use perturbation mechanisms to guide the learning of unlabeled data through labeled data, and how to ensure accurate predictions in boundary regions. Inspired by human progressive and periodic learning, we propose a progressive and periodic perturbation mechanism (P3M) and a boundary-focused loss. P3M enables dynamic adjustment of perturbations, allowing the model to gradually learn them. Our boundary-focused loss encourages the model to concentrate on boundary regions, enhancing sensitivity to intricate details and ensuring accurate predictions. Experimental results demonstrate that our method achieves state-of-the-art performance on two 2D and 3D datasets. Moreover, P3M is extendable to other methods, and the proposed loss serves as a universal tool for improving existing methods, highlighting the scalability and applicability of our approach.
zh
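As a rough illustration of what a boundary-focused loss can look like, the sketch below reweights per-pixel cross-entropy by proximity to the mask boundary using a distance transform. The Gaussian weighting, the `sigma` value, and the binary-mask assumption are all ours; the paper's exact loss may differ:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_weight_map(mask: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Higher weights near the foreground/background boundary of a binary mask."""
    fg = distance_transform_edt(mask)        # distance inside the object
    bg = distance_transform_edt(1 - mask)    # distance outside the object
    dist = np.minimum(fg, bg)                # distance to the boundary
    return 1.0 + np.exp(-(dist ** 2) / (2 * sigma ** 2))  # peaks at the boundary

def boundary_focused_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-pixel cross-entropy reweighted toward boundary regions.

    logits: (B, C, H, W), target: (B, H, W) with class indices; the weight map
    here is computed from the foreground mask (target > 0), a simplification.
    """
    weights = torch.stack([
        torch.from_numpy(
            boundary_weight_map((t > 0).cpu().numpy().astype(np.uint8))
        ).float()
        for t in target
    ]).to(logits.device)
    ce = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    return (weights * ce).mean()
```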

[CV-165] MambaStyle: Efficient StyleGAN Inversion for Real Image Editing with State-Space Models

【Quick Read】: This paper addresses the challenge of GAN image inversion and editing, namely balancing high reconstruction quality, effective editability, and computational efficiency. The key to the solution is MambaStyle, an efficient single-stage encoder-based method built on vision state-space models (VSSMs); by integrating VSSMs into the architecture, it achieves high-quality inversion and flexible editing with significantly fewer parameters and lower computational complexity.

Link: https://arxiv.org/abs/2505.15822
Authors: Jhon Lopez, Carlos Hinojosa, Henry Arguello, Bernard Ghanem
Affiliations: Universidad Industrial de Santander; KAUST
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The task of inverting real images into StyleGAN’s latent space to manipulate their attributes has been extensively studied. However, existing GAN inversion methods struggle to balance high reconstruction quality, effective editability, and computational efficiency. In this paper, we introduce MambaStyle, an efficient single-stage encoder-based approach for GAN inversion and editing that leverages vision state-space models (VSSMs) to address these challenges. Specifically, our approach integrates VSSMs within the proposed architecture, enabling high-quality image inversion and flexible editing with significantly fewer parameters and reduced computational complexity compared to state-of-the-art methods. Extensive experiments show that MambaStyle achieves a superior balance among inversion accuracy, editing quality, and computational efficiency. Notably, our method achieves superior inversion and editing results with reduced model complexity and faster inference, making it suitable for real-time applications.
zh

Artificial Intelligence

[AI-0] Understanding Prompt Tuning and In-Context Learning via Meta-Learning

【Quick Read】: This paper asks how optimal prompting can be understood through a Bayesian lens, and what fundamental limitations prompting faces when adapting a model to target tasks. The key to the solution is viewing meta-trained neural networks as Bayesian predictors over the pretraining distribution, whose hallmark is rapid in-context adaptation; optimal prompting can then be formalized as conditioning these Bayesian predictors, yielding criteria for the target tasks where optimal prompting is possible and where it is not. The study further shows that tuning soft prefixes can be highly effective, going beyond conventional hard-token prompting.

Link: https://arxiv.org/abs/2505.17010
Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.
zh

[AI-1] Guided Diffusion Sampling on Function Spaces with Applications to PDEs

【Quick Read】: This paper targets conditional sampling in PDE-based inverse problems, in particular recovering whole solutions from extremely sparse or noisy measurements. The key to the solution is FunDPS, a general framework combining a function-space diffusion model with plug-and-play guidance for conditioning. It trains a discretization-agnostic unconditional denoising model, and at inference uses gradient-based guidance to make samples satisfy the sparse observations, accurately capturing posterior distributions. A rigorous mathematical analysis extends Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for the posterior sampling method.

Link: https://arxiv.org/abs/2505.17004
Authors: Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie’s formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL
zh
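For orientation, the classical finite-dimensional Tweedie formula and a generic posterior-sampling guidance step are sketched below; the paper's contribution is extending the former to infinite-dimensional Hilbert spaces, which this sketch does not capture. The forward operator A and step size η are placeholders:

```latex
% Tweedie's formula for x_t = x_0 + \sigma_t \epsilon, \epsilon \sim \mathcal{N}(0, I):
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \, \nabla_{x_t} \log p(x_t)

% Generic gradient-based guidance toward sparse observations y \approx A(x_0),
% using the model's denoised estimate \hat{x}_0(x_t):
x_t \leftarrow x_t - \eta \, \nabla_{x_t} \left\| y - A\!\left(\hat{x}_0(x_t)\right) \right\|^2
```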

[AI-2] Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine

【Quick Read】: This paper argues that current large language models (LLMs) lack true causal understanding in biomedicine, and the key to the solution is building causal LLM agents that integrate multimodal data and perform intervention-based causal reasoning. Realizing this vision requires overcoming several core challenges: designing safe, controllable agentic frameworks; developing rigorous causal evaluation benchmarks; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (knowledge graphs, KGs) and formal causal inference tools.

Link: https://arxiv.org/abs/2505.16982
Authors: Adib Bazgir, Amir Habibdoust Lafmajani, Yuwen Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
zh

[AI-3] Know the Ropes: A Heuristic Strategy for LLM -based Multi-Agent System Design

【Quick Read】: This paper targets the bottlenecks of single-agent LLMs (finite context, role overload, and brittle domain transfer) as well as the limitations of conventional multi-agent approaches (ill-posed decompositions, fuzzy contracts, and heavy verification overhead). The key to the solution is the Know-The-Ropes (KtR) framework, which converts domain priors into an algorithmic blueprint hierarchy: tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tuning, self-checking), enabling efficient task handling and performance gains.

Link: https://arxiv.org/abs/2505.16979
Authors: Zhenkun Li, Lingyao Li, Shuhang Lin, Yongfeng Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Single-agent LLMs hit hard limits–finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators–no ever-larger monoliths required.
zh

[AI-4] HyGenar: An LLM -Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation ACL2025

【Quick Read】: This paper addresses the weakness of large language models (LLMs) at few-shot grammar generation, i.e., inferring and generating Backus-Naur Form (BNF) grammars from small sets of positive and negative examples. The key to the solution is HyGenar, an LLM-driven hybrid genetic algorithm that optimizes the grammar generation process and substantially improves both the syntactic and semantic correctness of the generated grammars.

Link: https://arxiv.org/abs/2505.16978
Authors: Weizhi Tang, Yixuan Li, Chris Sypherd, Elizabeth Polgreen, Vaishak Belle
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: Accepted to ACL 2025 Findings. Code available at this https URL

Click to view abstract

Abstract:Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduced a novel dataset comprising 540 structured grammar generation challenges, devised 6 metrics, and evaluated 8 various LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
zh

[AI-5] Invisible Prompts Visible Threats: Malicious Font Injection in External Resources for Large Language Models

【Quick Read】: This paper studies hidden adversarial-prompt attacks on large language models (LLMs) that process external resources, in particular stealthy content deception via malicious font injection. The key of the solution lies in identifying and analyzing how attackers manipulate code-to-glyph mappings to inject invisible malicious content into external resources (e.g., webpages), bypassing LLM safety mechanisms and enabling malicious content relay and sensitive data leakage. Experiments validate the effectiveness of such attacks and underscore the need for stronger security measures when LLM deployments process external content.

Link: https://arxiv.org/abs/2505.16957
Authors: Junjie Xiong, Changjia Zhu, Shuhang Lin, Chong Zhang, Yongfeng Zhang, Yao Liu, Lingyao Li
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like Model Context Protocol (MCP). This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages, where attackers manipulate code-to-glyph mapping to inject deceptive content which are invisible to users. We evaluate two critical attack scenarios: (1) “malicious content relay” and (2) “sensitive data leakage” through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious font can bypass LLM safety mechanisms through external resources, achieving varying success rates based on data sensitivity and prompt design. Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content.
zh

[AI-6] Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning

【Quick Read】: This paper addresses the limited generalization of large language models beyond their training distribution, i.e., their tendency toward sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). The key to the solution starts from Information Bottleneck (IB) theory: the authors prove that decoder-only Transformers are inherently constrained in forming task-optimal sequence representations, and then show that periodically rewriting the KV cache globally is a necessary computational step for improving generalization on reasoning tasks. This rewriting reallocates the cache's capacity away from memorizing input prefixes and toward encoding features useful for predicting future tokens, yielding substantial gains on mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2505.16950
Authors: Adnan Oomerjee, Zafeirios Fountas, Zhongwei Yu, Haitham Bou-Ammar, Jun Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.
zh
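A control-flow sketch of periodic KV-cache rewriting during generation, assuming the legacy Hugging Face tuple cache format and a toy linear rewriter (both are assumptions; the paper's actual module and its placement are more sophisticated):

```python
import torch

class KVRewriter(torch.nn.Module):
    """Illustrative module that globally rewrites cached keys/values.
    Only the control flow of periodic cache transformation is shown."""
    def __init__(self, d_head: int):
        super().__init__()
        self.proj_k = torch.nn.Linear(d_head, d_head)
        self.proj_v = torch.nn.Linear(d_head, d_head)

    def forward(self, k, v):  # k, v: (batch, heads, seq, d_head)
        return self.proj_k(k), self.proj_v(v)

def generate(model, rewriter, input_ids, max_new_tokens=128, period=32):
    past, ids = None, input_ids
    for step in range(max_new_tokens):
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        if (step + 1) % period == 0:
            # periodic global rewrite of the KV cache (assumes tuple-of-(k, v) format)
            past = tuple(rewriter(k, v) for k, v in past)
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```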

[AI-7] MixAT: Combining Continuous and Discrete Adversarial Training for LLM s

【Quick Read】: This paper addresses the fact that frontier large language models (LLMs) can still be forced into harmful generations by adversarial attacks, as well as the tension between the effectiveness and computational cost of adversarial training for LLMs. The key to the solution is MixAT, a method that combines stronger discrete adversarial attacks with faster continuous attacks during training, improving robustness while keeping the computational overhead low.

Link: https://arxiv.org/abs/2505.16947
Authors: Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT’s discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at this https URL.
zh
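A hedged sketch of how discrete and continuous adversarial examples might be interleaved in a fine-tuning loop. The PGD-in-embedding-space attack, the 50/50 mixing, and the assumption that discrete adversarial prompts are precomputed by a separate attack are all illustrative simplifications, not MixAT's actual recipe:

```python
import torch

def embedding_attack(model, embeds, labels, eps=0.05, steps=3, lr=0.02):
    """Fast continuous attack: PGD in the token-embedding space."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + lr * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (embeds + delta).detach()

def mixed_adversarial_step(model, optimizer, batch, discrete_adv_batch, p_discrete=0.5):
    """One mixed update: sample either a (stronger, precomputed) discrete
    adversarial prompt or a (faster) continuous perturbation, then train on
    the safe target responses contained in the labels."""
    if torch.rand(()) < p_discrete:
        loss = model(**discrete_adv_batch).loss          # discrete-attack branch
    else:
        embeds = model.get_input_embeddings()(batch["input_ids"])
        adv_embeds = embedding_attack(model, embeds, batch["labels"])
        loss = model(inputs_embeds=adv_embeds, labels=batch["labels"]).loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```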

[AI-8] FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records

【Quick Read】: This paper addresses the lack of consensus on the clinical utility of foundation models in healthcare, mainly due to missing comprehensive, meaningful task definitions and insufficiently diverse evaluations that could demonstrate benefits over conventional supervised learning. The key to the solution is a suite of clinically meaningful tasks spanning patient outcomes and early prediction of acute and chronic conditions, together with desiderata for robust evaluation. State-of-the-art foundation models are evaluated on 14 clinically relevant tasks over an EHR dataset of 5 million patients from Columbia University Irving Medical Center (CUMC), and overall accuracy, calibration, and subpopulation performance surface tradeoffs arising from choices of pre-training, tokenization, and data representation.

Link: https://arxiv.org/abs/2505.16941
Authors: Chao Pang, Vincent Jeanselme, Young Sang Choi, Xinzhuo Jiang, Zilin Jing, Aparajita Kashyap, Yuta Kobayashi, Yanwei Li, Florent Pollet, Karthik Natarajan, Shalmali Joshi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models’ potential for clinical utility due to the lack of desiderata of comprehensive and meaningful tasks and sufficiently diverse evaluations to characterize the benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes, early prediction of acute and chronic conditions, including desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data consisting of 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.
zh

[AI-9] Beyond Needle(s) in the Embodied Haystack: Environment Architecture and Training Considerations for Long Context Reasoning

【Quick Read】: This paper targets long-context understanding for long-horizon tasks in embodied AI, i.e., the ability to reason and plan over extended interactions in complex environments. The key to the solution is the ∞-THOR framework, which comprises a generation pipeline for scalable, reproducible, and unlimited long-horizon trajectories; an embodied QA task, Needle(s) in the Embodied Haystack, built on clues scattered across extended trajectories; and a long-horizon dataset and benchmark suite whose tasks span hundreds of environment steps. Architectural adaptations, including interleaved Goal-State-Action modeling, context-extension techniques, and context parallelism, further strengthen LLM-based agents' reasoning and interaction under extreme long-context conditions.

Link: https://arxiv.org/abs/2505.16928
Authors: Bosung Kim, Prithviraj Ammanabrolu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:We introduce ∞-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. ∞-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents’ long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.
zh

[AI-10] Identifying Evaluating and Mitigating Risks of AI Thought Partnerships

【Quick Read】: This paper examines the multi-level risks that AI thought partners may pose in complex reasoning, including Real-time, Individual, and Societal risks arising from collaborative cognition (RISc). The key to the solution is a novel analytical framework that identifies risks at these levels of analysis and, on that basis, proposes concrete evaluation metrics and mitigation strategies to help developers and policymakers prevent major harms and ensure that humans actively benefit from thought partnerships with AI.

Link: https://arxiv.org/abs/2505.16899
Authors: Kerem Oktar, Katherine M. Collins, Jose Hernandez-Orallo, Diane Coyle, Stephen Cave, Adrian Weller, Ilia Sucholutsky
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Artificial Intelligence (AI) systems have historically been used as tools that execute narrowly defined tasks. Yet recent advances in AI have unlocked possibilities for a new class of models that genuinely collaborate with humans in complex reasoning, from conceptualizing problems to brainstorming solutions. Such AI thought partners enable novel forms of collaboration and extended cognition, yet they also pose major risks-including and beyond risks of typical AI tools and agents. In this commentary, we systematically identify risks of AI thought partners through a novel framework that identifies risks at multiple levels of analysis, including Real-time, Individual, and Societal risks arising from collaborative cognition (RISc). We leverage this framework to propose concrete metrics for risk evaluation, and finally suggest specific mitigation strategies for developers and policymakers. As AI thought partners continue to proliferate, these strategies can help prevent major harms and ensure that humans actively benefit from productive thought partnerships.
zh

[AI-11] Structure-Aligned Protein Language Model

【Quick Read】: This paper addresses the problem that protein language models (pLMs) lack the structural knowledge essential for many biological applications. The key to the solution is a latent-level contrastive learning task that integrates structural information from pre-trained protein graph neural networks (pGNNs) into pLMs, endowing them with inter-protein structural knowledge, together with a physical-level task that injects intra-protein structural knowledge by optimizing pLMs to predict structural tokens, thereby enriching pLMs with both kinds of structural knowledge.

Link: https://arxiv.org/abs/2505.16896
Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures, 7 tables

Click to view abstract

Abstract:Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
zh

[AI-12] Predicate-Conditional Conformalized Answer Sets for Knowledge Graph Embeddings ACL2025

【Quick Read】: This paper addresses uncertainty quantification for knowledge graph embedding (KGE) methods: existing approaches only provide marginal coverage guarantees averaged over a reference set of queries and answers, which falls short of the per-query conditional coverage guarantees required in high-stakes applications. The key to the solution is CondKGCP, which merges predicates with similar vector representations and augments calibration with rank information, thereby approximating predicate-conditional coverage guarantees while keeping prediction sets compact.

Link: https://arxiv.org/abs/2505.16877
Authors: Yuqicheng Zhu, Daniel Hernández, Yuan He, Zifeng Ding, Bo Xiong, Evgeny Kharlamov, Steffen Staab
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to the Findings of ACL 2025

Click to view abstract

Abstract:Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. A recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate empirical effectiveness of CondKGCP by comprehensive evaluations.
zh
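The conditional-calibration step can be illustrated with a small split-conformal sketch: thresholds are computed per merged-predicate group rather than over the whole calibration set. How predicates are merged (by embedding similarity) and how rank information augments the scores are omitted here and would follow the paper:

```python
import numpy as np
from collections import defaultdict

def calibrate_per_group(scores, groups, alpha=0.1):
    """Split-conformal thresholds computed separately per predicate group.

    scores: nonconformity score of the true answer for each calibration query
    groups: merged-predicate group id of each calibration query
    """
    by_group = defaultdict(list)
    for s, g in zip(scores, groups):
        by_group[g].append(s)
    thresholds = {}
    for g, s in by_group.items():
        n = len(s)
        q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
        thresholds[g] = np.quantile(s, q)
    return thresholds

def prediction_set(candidate_scores, group, thresholds):
    """All candidates whose nonconformity score is below the group-specific
    threshold form the (approximately conditional) prediction set."""
    t = thresholds[group]
    return [ans for ans, s in candidate_scores.items() if s <= t]
```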

[AI-13] GCAL: Adapting Graph Models to Evolving Domain Shifts ICML2025

【Quick Read】: This paper tackles graph domain adaptation on dynamically evolving, multiple out-of-distribution (OOD) graphs, where conventional graph domain adaptation methods are confined to single-step adaptation, struggle with continuous domain shifts, and are prone to catastrophic forgetting. The proposed solution is Graph Continual Adaptive Learning (GCAL), whose key is a bilevel optimization strategy: the "adapt" phase fine-tunes the model on new graph domains via information maximization and re-adapts past memories to mitigate forgetting, while the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, uses a variational memory graph generation module to condense original graphs into memories, improving the model's sustainability and adaptability across graph domains.

Link: https://arxiv.org/abs/2505.16860
Authors: Ziyue Qiao, Qianyi Cai, Hao Dong, Jiawei Gu, Pengyang Wang, Meng Xiao, Xiao Luo, Hui Xiong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2025

Click to view abstract

Abstract:This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The “adapt” phase uses an information maximization approach to fine-tune the model with new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the “generate memory” phase, guided by a theoretical lower bound derived from information bottleneck theory, involves a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
zh

[AI-14] Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only

【Quick Read】: This paper addresses the challenges of improving pre-trained policies via online reinforcement learning (RL): existing methods rely on offline pre-trained Q-functions whose conservatism underestimates state-action pairs outside the offline dataset, hindering exploration in the online phase, and they do not apply when only a pre-trained policy is available without a pre-trained Q-function (e.g., after imitation learning (IL) pre-training). The key to the solution is PORL (Policy-Only Reinforcement Learning Fine-Tuning), which fine-tunes online using only the offline pre-trained policy and rapidly initializes the Q-function from scratch in the online phase to avoid detrimental pessimism. PORL matches advanced offline-to-online RL algorithms and online RL methods that leverage prior data or policies, while opening a new path for directly fine-tuning behavior cloning (BC) policies.

Link: https://arxiv.org/abs/2505.16856
Authors: Wei Xiao, Jiacheng Liu, Zifeng Zhuang, Runze Suo, Shangke Lyu, Donglin Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Improving the performance of pre-trained policies through online reinforcement learning (RL) is a critical yet challenging topic. Existing online RL fine-tuning methods require continued training with offline pretrained Q-functions for stability and performance. However, these offline pretrained Q-functions commonly underestimate state-action pairs beyond the offline dataset due to the conservatism in most offline RL methods, which hinders further exploration when transitioning from the offline to the online setting. Additionally, this requirement limits their applicability in scenarios where only pre-trained policies are available but pre-trained Q-functions are absent, such as in imitation learning (IL) pre-training. To address these challenges, we propose a method for efficient online RL fine-tuning using solely the offline pre-trained policy, eliminating reliance on pre-trained Q-functions. We introduce PORL (Policy-Only Reinforcement Learning Fine-Tuning), which rapidly initializes the Q-function from scratch during the online phase to avoid detrimental pessimism. Our method not only achieves competitive performance with advanced offline-to-online RL algorithms and online RL approaches that leverage data or policies prior, but also pioneers a new path for directly fine-tuning behavior cloning (BC) policies.
zh
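A condensed actor-critic sketch of the policy-only idea: the actor is loaded from offline or imitation pre-training, while the critic is freshly initialized when online interaction begins. The `sample_batch` helper and the `actor.sample` interface are hypothetical, and details such as twin critics and entropy terms are omitted:

```python
import copy
import torch

def porl_finetune(pretrained_actor, make_critic, env, steps=100_000, gamma=0.99, tau=0.005):
    actor = pretrained_actor            # offline / BC pre-trained policy, reused as-is
    critic = make_critic()              # Q-function initialized from scratch online
    target_critic = copy.deepcopy(critic)
    a_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    c_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
    buffer, obs = [], env.reset()
    for _ in range(steps):
        with torch.no_grad():           # collect one transition
            act = actor.sample(torch.as_tensor(obs, dtype=torch.float32))
        nxt, rew, done, _ = env.step(act.numpy())
        buffer.append((obs, act, rew, nxt, done))
        obs = env.reset() if done else nxt
        o, a, r, n, d = sample_batch(buffer)   # hypothetical mini-batch helper
        with torch.no_grad():           # TD target from the fresh target critic
            target = r + gamma * (1 - d) * target_critic(n, actor.sample(n))
        c_loss = ((critic(o, a) - target) ** 2).mean()
        c_opt.zero_grad(); c_loss.backward(); c_opt.step()
        a_loss = -critic(o, actor.sample(o)).mean()   # policy improvement
        a_opt.zero_grad(); a_loss.backward(); a_opt.step()
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)  # Polyak averaging
    return actor
```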

[AI-15] GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent ACL2025

【Quick Read】: This paper targets key challenges for GUI automation in dynamic environments, in particular two core issues of multimodal large language models (MLLMs): misinterpreting UI components and outdated knowledge. The key to the solution is GUI-explorer, a training-free GUI agent with two core mechanisms: (1) autonomous exploration of function-aware trajectories, where a function-aware task goal generator automatically constructs exploration goals by analyzing GUI structural information, enabling systematic collection of diverse trajectories; and (2) unsupervised mining of transition-aware knowledge, where a transition-aware knowledge extractor performs unsupervised analysis of state transitions in structured interaction triples (observation, action, outcome) to extract effective screen-operation logic.

Link: https://arxiv.org/abs/2505.16827
Authors: Bin Xie, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Jie Liu, Min Zhang, Liqiang Nie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: ACL 2025. Github: this https URL

Click to view abstract

Abstract:GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis the state transition of structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at this https URL.
zh

[AI-16] Dynamic Reservoir Computing with Physical Neuromorphic Networks IJCNN2025

【Quick Read】: This paper investigates how physical nano-electronic networks with neuromorphic dynamics can serve as physical reservoirs in a dynamic reservoir computing (RC) framework, to improve performance on nonlinear temporal processing tasks. The key finding concerns the effect of network sparsity on dynamic RC: sparse networks generate more useful nonlinear temporal outputs than dense ones, exhibiting better dynamics and learning on a chaotic time-series prediction task.

Link: https://arxiv.org/abs/2505.16813
Authors: Yinhao Xu, Georg A. Gottwald, Zdenka Kuncic
Affiliations: Unknown
Subjects: Emerging Technologies (cs.ET); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures, IJCNN 2025, accepted

Click to view abstract

Abstract:Reservoir Computing (RC) with physical systems requires an understanding of the underlying structure and internal dynamics of the specific physical reservoir. In this study, physical nano-electronic networks with neuromorphic dynamics are investigated for their use as physical reservoirs in an RC framework. These neuromorphic networks operate as dynamic reservoirs, with node activities in general coupled to the edge dynamics through nonlinear nano-electronic circuit elements, and the reservoir outputs influenced by the underlying network connectivity structure. This study finds that networks with varying degrees of sparsity generate more useful nonlinear temporal outputs for dynamic RC compared to dense networks. Dynamic RC is also tested on an autonomous multivariate chaotic time series prediction task with networks of varying densities, which revealed the importance of network sparsity in maintaining network activity and overall dynamics, that in turn enabled the learning of the chaotic Lorenz63 system’s attractor behavior.
zh
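The RC training recipe referenced here (nonlinear reservoir states plus a linear readout) can be sketched with a simulated sparse echo-state reservoir standing in for the physical nano-electronic network; the density and spectral-radius values are arbitrary illustrations:

```python
import numpy as np

def run_reservoir(u, n_res=300, density=0.1, rho=0.9, seed=0):
    """Simulated sparse reservoir: collect nonlinear temporal states for input u (T, d)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < density)
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius
    W_in = rng.standard_normal((n_res, u.shape[1]))
    x, states = np.zeros(n_res), []
    for t in range(len(u)):
        x = np.tanh(W @ x + W_in @ u[t])
        states.append(x.copy())
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    """Linear readout by ridge regression, the standard RC training step."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)

# usage: one-step-ahead prediction of a scalar series
u = np.sin(np.linspace(0, 60, 3000))[:, None]
S = run_reservoir(u[:-1])
W_out = train_readout(S, u[1:])
pred = S @ W_out
```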

[AI-17] A modular framework for automated evaluation of procedural content generation in serious games with deep reinforcement learning agents

【Quick Read】: This paper addresses how to assess the impact of procedural content generation (PCG) techniques on player experience once they are integrated into serious games (SGs). The key to the solution is an automated evaluation methodology based on deep reinforcement learning (DRL) game-testing agents, validated by deploying an SG that includes three different PCG versions.

Link: https://arxiv.org/abs/2505.16801
Authors: Eleftherios Kalafatis, Konstantinos Mitsis, Konstantia Zarkogianni, Maria Athanasiou, Konstantina Nikita
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Serious Games (SGs) are nowadays shifting focus to include procedural content generation (PCG) in the development process as a means of offering personalized and enhanced player experience. However, the development of a framework to assess the impact of PCG techniques when integrated into SGs remains particularly challenging. This study proposes a methodology for automated evaluation of PCG integration in SGs, incorporating deep reinforcement learning (DRL) game testing agents. To validate the proposed framework, a previously introduced SG featuring card game mechanics and incorporating three different versions of PCG for nonplayer character (NPC) creation has been deployed. Version 1 features random NPC creation, while versions 2 and 3 utilize a genetic algorithm approach. These versions are used to test the impact of different dynamic SG environments on the proposed framework’s agents. The obtained results highlight the superiority of the DRL game testing agents trained on Versions 2 and 3 over those trained on Version 1 in terms of win rate (i.e., number of wins per game played) and training time. More specifically, within the execution of a test emulating regular gameplay, both Versions 2 and 3 peaked at a 97% win rate and achieved statistically significantly higher (p=0.009) win rates compared to those achieved in Version 1, which peaked at 94%. Overall, results advocate for the proposed framework’s capability to produce meaningful data for the evaluation of procedurally generated content in SGs.
zh

[AI-18] Cohort-Based Active Modality Acquisition

【Quick Read】: This paper asks which samples should be prioritized for additional modality acquisition in multimodal machine learning when resources are limited and not all samples have all modalities. The key to the solution is Cohort-based Active Modality Acquisition (CAMA), a test-time approach that combines generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities under common evaluation metrics, with upper-bound heuristics providing performance ceilings against which acquisition strategies can be benchmarked, thereby guiding modality acquisition for new samples more effectively.

Link: https://arxiv.org/abs/2505.16791
Authors: Tillmann Rheude, Roland Eils, Benjamin Wild
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored despite their importance in many real-world settings. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on common multimodal datasets demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of new samples in comparison to those relying solely on unimodal information, entropy guidance, and random selections. Our work provides an effective solution for optimizing modality acquisition at the cohort level, enabling better utilization of resources in constrained settings.
zh
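One plausible instantiation of imputation-based acquisition scoring is sketched below: for each sample, completions of the missing modality are drawn from a generative imputer, and the expected reduction in predictive entropy ranks the cohort. The `predict_proba_partial` and `sample_missing` APIs are hypothetical placeholders, not CAMA's actual interface:

```python
import numpy as np

def expected_gain(classifier, imputer, x_obs, n_draws=16):
    """Expected benefit of acquiring the missing modality for one sample,
    estimated as the average drop in predictive entropy over imputations."""
    p_before = classifier.predict_proba_partial(x_obs)        # hypothetical API
    h_before = -(p_before * np.log(p_before + 1e-12)).sum()
    h_after = 0.0
    for _ in range(n_draws):
        x_full = imputer.sample_missing(x_obs)                # generative imputation
        p = classifier.predict_proba(x_full)
        h_after += -(p * np.log(p + 1e-12)).sum() / n_draws
    return h_before - h_after

def select_cohort(samples, classifier, imputer, budget):
    """Rank the cohort by expected gain and acquire for the top-`budget` samples."""
    gains = [expected_gain(classifier, imputer, x) for x in samples]
    return np.argsort(gains)[::-1][:budget]
```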

[AI-19] Learning Flexible Forward Trajectories for Masked Molecular Diffusion

【Quick Read】: This paper addresses the severe performance degradation observed when standard masked diffusion models (MDMs) are applied directly to molecular generation. The critical cause is identified as a state-clashing problem in the forward diffusion: distinct molecules gradually collapse into a common state, producing a mixture of reconstruction targets that the usual unimodal reverse diffusion process cannot learn. To mitigate this, the authors propose Masked Element-wise Learnable Diffusion (MELD), which uses a parameterized noise scheduling network to assign distinct corruption rates to individual graph elements (atoms and bonds), orchestrating per-element corruption trajectories so that distinct molecular graphs do not collide.

Link: https://arxiv.org/abs/2505.16790
Authors: Hyunjin Seo, Taewon Kim, Sihyun Yu, SungSoo Ahn
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standard MDMs severely degrades the performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapse into a common state, resulting in a mixture of reconstruction targets that cannot be learned using a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD) that orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
zh

[AI-20] Gaze Into the Abyss – Planning to Seek Entropy When Reward is Scarce

【Quick Read】: This paper addresses the neglected optimization of world-model learning in model-based reinforcement learning (MBRL), aiming to improve sample efficiency and the performance of the downstream actor. The key to the solution is a novel approach that actively anticipates and seeks out high-entropy states using short-horizon latent predictions generated by the world model, offering a principled alternative to traditional curiosity-driven exploration, together with a hierarchical planner that dynamically decides when to replan, the planning-horizon length, and the weighting between reward and entropy, improving training efficiency.

Link: https://arxiv.org/abs/2505.16787
Authors: Ashish Sundar, Chunbo Luo, Xiaoyang Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages without appendix, 15 figures, preprint

Click to view abstract

Abstract:Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. MBRL methods have progressed by largely prioritising the actor; optimising the world model learning has been neglected meanwhile. Improving the fidelity of the world model and reducing its time to convergence can yield significant downstream benefits, one of which is improving the ensuing performance of any actor it may train. We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions generated by the world model, offering a principled alternative to traditional curiosity-driven methods that chase once-novel states well after they were stumbled into. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multi step plans after every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, planning horizon length, and the weighting between reward and entropy. While our method can theoretically be applied to any model that trains its own actors with solely model generated data, we have applied it to just Dreamer as a proof of concept. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
zh

[AI-21] CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models

【Quick Read】: This paper addresses attribution for abusive use of open-source large language models (LLMs), i.e., how to identify, both effectively and stealthily, the specific source LLM behind a suspect application. The key to the solution is CoTSRF, a novel LLM fingerprinting scheme that uses the Chain of Thought (CoT) as an LLM's fingerprint: it queries the source LLM with crafted CoT queries to collect responses, trains a CoT extractor via contrastive learning to obtain CoT features (i.e., fingerprints), and verifies a fingerprint by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold.

Link: https://arxiv.org/abs/2505.16785
Authors: Zhenzhen Ren, GuoBiao Li, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite providing superior performance, open-source large language models (LLMs) are vulnerable to abusive usage. To address this issue, recent works propose LLM fingerprinting methods to identify the specific source LLMs behind suspect applications. However, these methods fail to provide stealthy and robust fingerprint verification. In this paper, we propose a novel LLM fingerprinting scheme, namely CoTSRF, which utilizes the Chain of Thought (CoT) as the fingerprint of an LLM. CoTSRF first collects the responses from the source LLM by querying it with crafted CoT queries. Then, it applies contrastive learning to train a CoT extractor that extracts the CoT feature (i.e., fingerprint) from the responses. Finally, CoTSRF conducts fingerprint verification by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold. Various experiments have been conducted to demonstrate the advantage of our proposed CoTSRF for fingerprinting LLMs, particularly in stealthy and robust fingerprint verification.
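The verification step at the end of the pipeline reduces to a KL-divergence test between feature distributions. Here is a minimal sketch of just that step, assuming CoT features can be treated as softmax-normalized vectors; the extractor, the crafted queries, and the threshold value are out of scope or invented for the demo.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def verify_fingerprint(source_feats, suspect_feats, threshold=0.05):
    """Mean KL between paired CoT features; below threshold => same source."""
    kls = [kl_divergence(softmax(s), softmax(t))
           for s, t in zip(source_feats, suspect_feats)]
    return float(np.mean(kls)) < threshold, float(np.mean(kls))

rng = np.random.default_rng(1)
source = rng.normal(size=(8, 16))                          # toy CoT features
same = source + rng.normal(scale=0.05, size=source.shape)  # same model + noise
other = rng.normal(size=(8, 16))                           # unrelated model

print(verify_fingerprint(source, same))    # expected: (True, small KL)
print(verify_fingerprint(source, other))   # expected: (False, larger KL)
```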

[AI-22] Fuzzy Information Evolution with Three-Way Decision in Social Network Group Decision-Making

【Quick Read】: This paper addresses the challenges that uncertainty, dynamic social structures, and vague information pose to group decision-making (GDM). The key to its solution is a social network group decision-making (SNGDM) framework that integrates three-way decision (3WD) theory, dynamic network reconstruction, and linguistic opinion representation: the 3WD mechanism models hesitation and ambiguity in agent judgments, a connection-adjustment rule based on opinion similarity lets agents adapt their communication links to evolving social relationships, and linguistic terms describe agent opinions so that subjective, vague, or incomplete information is handled more effectively.

Link: https://arxiv.org/abs/2505.16781
Authors: Qianlei Jia, Xinliang Zhou, Ondrej Krejcar, Enrique Herrera-Viedma
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In group decision-making (GDM) scenarios, uncertainty, dynamic social structures, and vague information present major challenges for traditional opinion dynamics models. To address these issues, this study proposes a novel social network group decision-making (SNGDM) framework that integrates three-way decision (3WD) theory, dynamic network reconstruction, and linguistic opinion representation. First, the 3WD mechanism is introduced to explicitly model hesitation and ambiguity in agent judgments, thereby preventing irrational decisions. Second, a connection adjustment rule based on opinion similarity is developed, enabling agents to adaptively update their communication links and better reflect the evolving nature of social relationships. Third, linguistic terms are used to describe agent opinions, allowing the model to handle subjective, vague, or incomplete information more effectively. Finally, an integrated multi-agent decision-making framework is constructed, which simultaneously considers individual uncertainty, opinion evolution, and network dynamics. The proposed model is applied to a multi-UAV cooperative decision-making scenario, where simulation results and consensus analysis demonstrate its effectiveness. Experimental comparisons further verify the advantages of the algorithm in enhancing system stability and representing realistic decision-making behaviors.
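The three-way decision and the similarity-based link adjustment suggest a small simulation. The sketch below is a hypothetical toy, with scalar opinions standing in for linguistic terms and invented accept/reject thresholds: neighbors within the accept band pull an agent's opinion, neighbors beyond the reject band lose their link, and the boundary region defers.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
opinions = rng.uniform(0, 1, n)          # scalar stand-in for linguistic terms
adjacency = rng.random((n, n)) < 0.4
np.fill_diagonal(adjacency, False)

ACCEPT, REJECT = 0.15, 0.45              # illustrative 3WD thresholds

for step in range(30):
    new_opinions = opinions.copy()
    for i in range(n):
        accepted = []
        for j in np.where(adjacency[i])[0]:
            d = abs(opinions[i] - opinions[j])
            if d <= ACCEPT:              # positive region: adopt the influence
                accepted.append(j)
            elif d >= REJECT:            # negative region: sever the link
                adjacency[i, j] = False
            # boundary region: defer -- keep the link, no influence this round
        if accepted:
            new_opinions[i] = 0.7 * opinions[i] + 0.3 * opinions[accepted].mean()
    opinions = new_opinions

print(np.round(np.sort(opinions), 3))    # opinion clusters after evolution
```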

[AI-23] Data-Driven Breakthroughs and Future Directions in AI Infrastructure: A Comprehensive Review

【Quick Read】: This paper synthesizes the key developments in artificial intelligence (AI) over the past fifteen years and their role in paradigm shifts, integrating historical, theoretical, and technological perspectives to identify the major inflection points in AI's evolution. Its key move is to frame breakthroughs such as GPU-accelerated model training, the data-centric shift triggered by ImageNet, the architectural simplification of the Transformer, and the expanded modeling capability of the GPT series as indicators of deeper paradigm shifts, using concepts from statistical learning theory such as sample complexity and data efficiency to explain how these breakthroughs translated into scalable solutions. It also evaluates emerging responses to privacy concerns and tightening regulation, including federated learning and privacy-enhancing technologies (PETs), and assesses the utility and limits of mock and synthetic data generation when real-world data is unavailable, offering strategic guidance for future AI research and policy.

Link: https://arxiv.org/abs/2505.16771
Authors: Beyazit Bestami Yuksel, Ayse Yilmazer Metin
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, 3 tables

Abstract:This paper presents a comprehensive synthesis of major breakthroughs in artificial intelligence (AI) over the past fifteen years, integrating historical, theoretical, and technological perspectives. It identifies key inflection points in AI’ s evolution by tracing the convergence of computational resources, data access, and algorithmic innovation. The analysis highlights how researchers enabled GPU based model training, triggered a data centric shift with ImageNet, simplified architectures through the Transformer, and expanded modeling capabilities with the GPT series. Rather than treating these advances as isolated milestones, the paper frames them as indicators of deeper paradigm shifts. By applying concepts from statistical learning theory such as sample complexity and data efficiency, the paper explains how researchers translated breakthroughs into scalable solutions and why the field must now embrace data centric approaches. In response to rising privacy concerns and tightening regulations, the paper evaluates emerging solutions like federated learning, privacy enhancing technologies (PETs), and the data site paradigm, which reframe data access and security. In cases where real world data remains inaccessible, the paper also assesses the utility and constraints of mock and synthetic data generation. By aligning technical insights with evolving data infrastructure, this study offers strategic guidance for future AI research and policy development.

[AI-24] When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques

【Quick Read】: This paper addresses jailbreak attacks on large language models (LLMs), which bypass built-in safety mechanisms and elicit harmful outputs. Existing attacks struggle to achieve toxic stealth (concealing harmful content) and linguistic stealth (maintaining natural, fluent text) at the same time. The key to the proposed solution, StegoAttack, is to use steganography to hide the harmful query inside benign, semantically coherent text and then prompt the LLM to extract the hidden query and respond in encrypted form, hiding the malicious intent while preserving naturalness and thereby evading both built-in and external safety mechanisms.

Link: https://arxiv.org/abs/2505.16765
Authors: Jianing Geng, Biao Yi, Zekun Fei, Tongxi Wu, Lihai Nie, Zheli Liu
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the optimal comprehensive scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth capabilities. The code is available at this https URL

[AI-25] Action is All You Need: Dual-Flow Generative Ranking Network for Recommendation

【Quick Read】: This paper addresses the training instability in generative recommendation caused by mapping heterogeneous information volumes into identical vector spaces, a key limitation observed in Meta's HSTU generative recommendation approach. The key to its solution, the Dual-Flow Generative Ranking Network (DFGR), is an innovative interaction pattern between real and fake flows within the QKV modules of the self-attention mechanism, which improves both training and inference efficiency. DFGR relies only on user history behavior sequences and minimal attribute information, eliminating complex manual feature engineering and yielding a more efficient and effective generative ranking paradigm.

Link: https://arxiv.org/abs/2505.16752
Authors: Hao Guo, Erpeng Xue, Lei Huang, Shichao Wang, Xiaolei Wang, Lei Wang, Jinpeng Wang, Sheng Chen
Institution: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract: We introduce the Dual-Flow Generative Ranking Network (DFGR), a two-stream architecture designed for recommendation systems. DFGR integrates innovative interaction patterns between real and fake flows within the QKV modules of the self-attention mechanism, enhancing both training and inference efficiency. This approach effectively addresses a key limitation observed in Meta's proposed HSTU generative recommendation approach, where heterogeneous information volumes are mapped into identical vector spaces, leading to training instability. Unlike traditional recommendation models, DFGR relies only on user history behavior sequences and minimal attribute information, eliminating the need for extensive manual feature engineering. Comprehensive evaluations on open-source and industrial datasets reveal DFGR's superior performance compared to established baselines such as DIN, DCN, DIEN, and DeepFM. We also investigate optimal parameter allocation strategies under computational constraints, establishing DFGR as an efficient and effective next-generation generative ranking paradigm.

[AI-26] Sequential Monte Carlo for Policy Optimization in Continuous POMDPs

【Quick Read】: This paper addresses how an agent in a partially observable Markov decision process (POMDP) can effectively balance reducing uncertainty (exploration) against pursuing immediate objectives (exploitation). The key to its solution is to cast policy learning as probabilistic inference in a non-Markovian Feynman-Kac model that inherently captures the value of information gathering by anticipating future observations, with no need for extrinsic exploration bonuses or handcrafted heuristics. To optimize policies under this model, the authors develop a nested sequential Monte Carlo (SMC) algorithm that efficiently estimates history-dependent policy gradients.

Link: https://arxiv.org/abs/2505.16732
Authors: Hany Abdulsamad, Sahel Iqbal, Simo Särkkä
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract: Optimal decision-making under partial observability requires agents to balance reducing uncertainty (exploration) against pursuing immediate objectives (exploitation). In this paper, we introduce a novel policy optimization framework for continuous partially observable Markov decision processes (POMDPs) that explicitly addresses this challenge. Our method casts policy learning as probabilistic inference in a non-Markovian Feynman–Kac model that inherently captures the value of information gathering by anticipating future observations, without requiring extrinsic exploration bonuses or handcrafted heuristics. To optimize policies under this model, we develop a nested sequential Monte Carlo (SMC) algorithm that efficiently estimates a history-dependent policy gradient under samples from the optimal trajectory distribution induced by the POMDP. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty.
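For readers unfamiliar with the SMC machinery the paper builds on, the following is a minimal bootstrap particle filter for a toy state-space model; it illustrates only the propagate-reweight-resample loop, not the paper's nested SMC policy-gradient estimator. The model and noise levels are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def transition(x, a):
    """Latent dynamics: x' = x + a + process noise."""
    return x + a + rng.normal(0, 0.2, size=x.shape)

def obs_loglik(y, x, sigma=0.5):
    """Log-likelihood of observation y given latent x (Gaussian, up to const)."""
    return -0.5 * ((y - x) / sigma) ** 2

def smc_step(particles, weights, action, observation):
    """One bootstrap-SMC step: propagate, reweight, and (maybe) resample."""
    particles = transition(particles, action)
    logw = np.log(weights + 1e-300) + obs_loglik(observation, particles)
    weights = np.exp(logw - logw.max())
    weights /= weights.sum()
    # Multinomial resampling when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < len(particles) / 2:
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

particles = rng.normal(0, 1, 500)        # prior belief over the hidden state
weights = np.full(500, 1 / 500)
true_x = 0.0
for t in range(10):
    a = 0.1                              # fixed action for the demo
    true_x = true_x + a + rng.normal(0, 0.2)
    y = true_x + rng.normal(0, 0.5)
    particles, weights = smc_step(particles, weights, a, y)
    print(f"t={t} belief mean={np.sum(weights * particles):+.3f} true={true_x:+.3f}")
```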

[AI-27] Advancing Brainwave Modeling with a Codebook-Based Foundation Model

【Quick Read】: This paper addresses the inability of existing large-scale pre-trained electroencephalogram (EEG) models to fully capture the rich information content of neural oscillations, a limitation that constrains their performance and generalizability across Brain-Computer Interface (BCI) tasks. The key to its solution is LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) grounded in robust signal-processing foundations, whose improved architectural design strengthens representational capacity and delivers substantial gains across a variety of tasks.

Link: https://arxiv.org/abs/2505.16724
Authors: Konstantinos Barmpas, Na Lee, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Recent advances in large-scale pre-trained Electroencephalogram (EEG) models have shown great promise, driving progress in Brain-Computer Interfaces (BCIs) and healthcare applications. However, despite their success, many existing pre-trained models have struggled to fully capture the rich information content of neural oscillations, a limitation that fundamentally constrains their performance and generalizability across diverse BCI tasks. This limitation is frequently rooted in suboptimal architectural design choices which constrain their representational capacity. In this work, we introduce LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) that incorporates principled improvements grounded in robust signal processing foundations. LaBraM++ demonstrates substantial gains across a variety of tasks, consistently outperforming its originally-based architecture and achieving competitive results when compared to other open-source LBMs. Its superior performance and training efficiency highlight its potential as a strong foundation for future advancements in LBMs.

[AI-28] Training Long-Context LLMs Efficiently via Chunk-wise Optimization

【Quick Read】: This paper addresses the prohibitive training cost of long-context large language models (LLMs), which hinders customized applications. Its core solution is Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions long inputs into manageable chunks, each of which independently builds its own computational graph and performs localized backpropagation, so that only one chunk's forward activations need to be stored. Building on SeCO, Sparse Chunk-wise Optimization (SpaCO) further reduces computational overhead by selectively propagating gradients to specific chunks with a compensation factor for unbiased gradient estimation, decoupling the cost of backpropagation from context length so that training time gradually converges to inference time as sequences grow longer.

Link: https://arxiv.org/abs/2505.16710
Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands the maximum sequence length from 1K to 16K tokens, while SpaCO demonstrates accelerated training speed, achieving up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at this https URL.
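The chunk-wise idea is easy to demonstrate in PyTorch. The sketch below is not SeCO itself: it calls backward() once per chunk so that only one chunk's activation graph is alive at a time, but it detaches the recurrent state across chunk boundaries (a truncated-BPTT simplification, whereas SeCO preserves exact gradients). The model, sizes, and data are toy stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, 32)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)

seq = torch.randn(1, 4096, 32)        # one long input sequence
target = torch.randn(1, 4096, 32)
chunk_len = 512

opt.zero_grad()
h = None
for start in range(0, seq.size(1), chunk_len):
    x = seq[:, start:start + chunk_len]
    y = target[:, start:start + chunk_len]
    out, h = model(x, h)
    loss = nn.functional.mse_loss(head(out), y)
    loss.backward()                   # frees this chunk's graph immediately
    h = h.detach()                    # cut the graph across chunk boundaries
opt.step()                            # gradients accumulated over all chunks
```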

[AI-29] An Analysis of Concept Bottleneck Models: Measuring Understanding and Mitigating the Impact of Noisy Annotations

【Quick Read】: This paper addresses the degradation in prediction performance, interpretability, and intervention effectiveness caused by noisy concept annotations when training concept bottleneck models (CBMs). The key to its solution is a two-stage framework: during training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts; during inference, concepts are ranked by predictive entropy and only the most uncertain ones are corrected, using uncertainty as a proxy for susceptibility. Theoretical analysis and experiments confirm the method's effectiveness and robustness, preserving both interpretability and stability under noise.

Link: https://arxiv.org/abs/2505.16705
Authors: Seonghwan Park, Jueun Mun, Donghyun Oh, Namhoon Lee
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Concept bottleneck models (CBMs) ensure interpretability by decomposing predictions into human interpretable concepts. Yet the annotations used for training CBMs that enable this transparency are often noisy, and the impact of such corruption is not well understood. In this study, we present the first systematic study of noise in CBMs and show that even moderate corruption simultaneously impairs prediction performance, interpretability, and the intervention effectiveness. Our analysis identifies a susceptible subset of concepts whose accuracy declines far more than the average gap between noisy and clean supervision and whose corruption accounts for most performance loss. To mitigate this vulnerability we propose a two-stage framework. During training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts. During inference, where clean labels are unavailable, we rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility. Theoretical analysis and extensive ablations elucidate why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts, providing a principled basis that preserves both interpretability and resilience in the presence of noise.
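The inference-time intervention, ranking concepts by predictive entropy and correcting only the most uncertain, is simple enough to sketch directly. The probabilities, labels, and budget below are invented for the demo.

```python
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Entropy of Bernoulli(p), elementwise."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def intervene_most_uncertain(concept_probs, oracle_labels, budget=2):
    """Replace the `budget` highest-entropy concept predictions with labels."""
    entropy = bernoulli_entropy(concept_probs)
    corrected = concept_probs.copy()
    for idx in np.argsort(-entropy)[:budget]:
        corrected[idx] = oracle_labels[idx]   # e.g., a human-provided label
    return corrected

probs = np.array([0.97, 0.52, 0.08, 0.46, 0.88])  # predicted concept activations
labels = np.array([1.0, 0.0, 0.0, 1.0, 1.0])      # ground truth (for the demo)
print(intervene_most_uncertain(probs, labels))    # fixes the 0.52 and 0.46 entries
```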

[AI-30] MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

【Quick Read】: This paper addresses the inadequacy of current methods for evaluating the tool-use capabilities of large language models (LLMs) under the Model Context Protocol (MCP) framework: existing evaluations fail to measure tool utilization in the emerging paradigm of active reasoning agents. The proposed solution is MCP-RADAR, the first comprehensive benchmark designed specifically for LLM performance in the MCP framework. Its key is a five-dimensional evaluation covering answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed, measured with objective, quantifiable metrics across multiple task domains to reflect LLM behavior in tool interaction more fully.

Link: https://arxiv.org/abs/2505.16700
Authors: Xuanqi Gao, Siyi Xie, Juan Zhai, Shqing Ma, Chao Shen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of tool interaction, the Model Context Protocol (MCP) has emerged as a standardized framework for dynamic tool discovery and orchestration. Despite widespread industry adoption, existing evaluation methodologies fail to adequately assess tool utilization capabilities within this new paradigm. This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance in the MCP framework through a novel five-dimensional approach measuring: answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains including software engineering, mathematical reasoning, and general problem-solving. Our evaluations of leading commercial and open-source LLMs reveal distinctive capability profiles with significant trade-offs between accuracy, efficiency, and speed, challenging traditional single-metric performance rankings. Besides, we provide valuable guidance for developers to optimize their tools for maximum model compatibility and effectiveness. While focused on MCP due to its standardized approach, our methodology remains applicable across all LLM agent tool integration frameworks, providing valuable insights for both LLM developers and tool creators to optimize the entire LLM-tool interaction ecosystem. The implementation, configurations, and datasets used in our evaluation are publicly available at this https URL.
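One plausible way to aggregate the five dimensions into a single comparable score is a weighted composite over normalized metrics, sketched below; the weights, normalization, and field names are hypothetical and not the benchmark's actual formula.

```python
from dataclasses import dataclass

@dataclass
class RadarScores:
    answer_accuracy: float       # fraction of tasks answered correctly
    tool_selection: float        # fraction of correct tool choices
    resource_efficiency: float   # normalized: 1 = cheapest observed run
    parameter_accuracy: float    # fraction of well-formed tool calls
    execution_speed: float       # normalized: 1 = fastest observed run

def composite(s: RadarScores, weights=(0.35, 0.2, 0.15, 0.2, 0.1)) -> float:
    """Weighted composite of the five (already normalized) dimensions."""
    dims = (s.answer_accuracy, s.tool_selection, s.resource_efficiency,
            s.parameter_accuracy, s.execution_speed)
    assert all(0.0 <= d <= 1.0 for d in dims), "scores must be normalized"
    return sum(w * d for w, d in zip(weights, dims))

model_a = RadarScores(0.82, 0.74, 0.60, 0.91, 0.55)
model_b = RadarScores(0.78, 0.81, 0.85, 0.83, 0.70)
print(f"A={composite(model_a):.3f}  B={composite(model_b):.3f}")
```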

[AI-31] EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion EMNLP2025

【Quick Read】: This paper addresses the limited performance of voice conversion in zero-shot cross-lingual settings, where existing architectures generalize poorly to speakers of unseen languages and accents. The key to its solution is to combine discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer-based conditional flow-matching speech decoder, allowing a voice-conversion model to be trained in a purely textless, self-supervised fashion without multiple encoders for disentangling speech features.

Link: https://arxiv.org/abs/2505.16691
Authors: Advait Joglekar, Divyanshu Singh, Rooshil Rohit Bhatia, S. Umesh
Institution: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Submitted to EMNLP 2025, 7 pages, 2 figures, 5 tables

Abstract:Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages.

[AI-32] Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

【Quick Read】: This paper addresses the over-confidence of post-trained language models (PoLMs), which assign high confidence to both correct and incorrect outputs and thus undermine reliability in critical applications. The central obstacle is the scarcity of labeled data for downstream tasks, which rules out conventional supervised calibration. The key to the proposed solution, Disagreement-Aware Confidence Alignment (DACA), is to calibrate using only examples on which the PLM and PoLM predictions agree, avoiding the overly large temperature parameter τ caused by disagreement examples and thereby improving confidence calibration.

Link: https://arxiv.org/abs/2505.16690
Authors: Beier Luo, Shuoyuan Wang, Yixuan Li, Hongxin Wei
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature τ) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates the PoLM's prediction accuracy on disagreement examples, causing a larger τ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large τ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g., GPT-4o) by up to 15.08% on common benchmarks.
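The core mechanic, fitting a temperature on agreement examples only, can be sketched in a few lines. This is a simplified, hypothetical rendering of the idea, not the paper's exact objective: agreement examples are selected by matching argmax predictions, and tau is grid-searched to align the PoLM's max-probability confidence with the PLM's.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
plm_logits = rng.normal(size=(200, 10))
# Toy PoLM: same decisions as the PLM but sharper logits => over-confident.
polm_logits = 3.0 * plm_logits + rng.normal(scale=0.5, size=(200, 10))

# 1) Keep only agreement examples, decoupling the disagreement effect.
agree = plm_logits.argmax(1) == polm_logits.argmax(1)
plm_conf = softmax(plm_logits[agree]).max(1)

# 2) Grid-search tau so PoLM confidence matches PLM confidence on those examples.
taus = np.linspace(0.5, 5.0, 46)
gaps = [np.mean(np.abs(softmax(polm_logits[agree], t).max(1) - plm_conf))
        for t in taus]
best_tau = taus[int(np.argmin(gaps))]
print(f"agreement examples: {agree.sum()}/200, calibrated tau ~ {best_tau:.2f}")
```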

[AI-33] BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models

【Quick Read】: This paper addresses the vulnerability of large language models (LLMs) to inference cost attacks, which induce a victim model to generate the longest possible output and inflate computation and resource consumption. Existing methods are self-targeting (the attacker is also the user, can only attack through the inputs, and is directly affected and charged for the generated content), so they can hardly produce large-scale malicious effects. The key to the proposed solution is a new class of inference cost attacks, the "bit-flip inference cost attack", which targets the victim model's own parameters rather than its inputs. Concretely, the simple yet effective method "BitHydra" flips critical bits of model parameters, guided by a loss function that suppresses the EOS token's probability together with an efficient critical-bit search algorithm, which makes the attack objective explicit and the optimization effective. Experiments show that with few samples and bit flips, most test prompts are forced to the maximum generation length, demonstrating efficiency, scalability, and strong transferability across unseen inputs.

Link: https://arxiv.org/abs/2505.16670
Authors: Xiaobei Yan, Yiming Li, Zhaoxin Fan, Han Qiu, Tianwei Zhang
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have shown impressive capabilities across a wide range of applications, but their ever-increasing size and resource demands make them vulnerable to inference cost attacks, where attackers induce victim LLMs to generate the longest possible output content. In this paper, we revisit existing inference cost attacks and reveal that these methods can hardly produce large-scale malicious effects since they are self-targeting, where attackers are also the users and therefore have to execute attacks solely through the inputs, whose generated content will be charged by LLMs and can only directly influence themselves. Motivated by these findings, this paper introduces a new type of inference cost attacks (dubbed ‘bit-flip inference cost attack’) that target the victim model itself rather than its inputs. Specifically, we design a simple yet effective method (dubbed ‘BitHydra’) to effectively flip critical bits of model parameters. This process is guided by a loss function designed to suppress EOS token’s probability with an efficient critical bit search algorithm, thus explicitly defining the attack objective and enabling effective optimization. We evaluate our method on 11 LLMs ranging from 1.5B to 14B parameters under both int8 and float16 settings. Experimental results demonstrate that with just 4 search samples and as few as 3 bit flips, BitHydra can force 100% of test prompts to reach the maximum generation length (e.g., 2048 tokens) on representative LLMs such as LLaMA3, highlighting its efficiency, scalability, and strong transferability across unseen inputs.

[AI-34] ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming ACL2025

【Quick Read】: This paper addresses the fragmented state of research on human-LLM collaboration in competitive programming, where the use of diverse, application-specific human feedback makes systematic evaluation difficult and a unified understanding elusive. The key to its solution is threefold: a comprehensive taxonomy of human feedback consolidating the entire programming process, ELABORATIONSET, a programming dataset purpose-built for human-LLM collaboration, and ELABORATION, a new benchmark enabling thorough assessment of human-LLM competitive programming.

Link: https://arxiv.org/abs/2505.16667
Authors: Xinwei Yang, Zhaofeng Liu, Chen Huang, Jiashuai Zhang, Tong Zhang, Yifan Zhang, Wenqiang Lei
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ACL 2025 Main. Our code and dataset are available at this https URL

Abstract: While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at this https URL

[AI-35] End-to-End Framework for Predicting the Remaining Useful Life of Lithium-Ion Batteries

【Quick Read】: This paper addresses accurate prediction of the remaining useful life (RUL) of lithium-ion batteries, which is essential for timely maintenance and the operational efficiency of the electric applications that depend on them. The key to its solution is a method combining a novel signal-processing pipeline with a deep learning prediction model: the pipeline computes a derived capacity feature from the current and capacity signals and denoises and enhances the raw capacity, voltage, and current features using statistical metrics and a delta-based method that captures differences between the current and previous cycles, while the predictor is a hybrid architecture of 1D convolutional neural network (CNN), attentional long short-term memory (A-LSTM), and ordinary-differential-equation-based LSTM (ODE-LSTM) modules that captures local signal characteristics, long-range temporal dependencies, and the continuous-time dynamics of battery degradation.

Link: https://arxiv.org/abs/2505.16664
Authors: Khoa Tran, Tri Le, Bao Huynh, Hung-Cuong Trinh, Vy-Rin Nguyen
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes a RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal processing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed based on current and capacity signals. Alongside original capacity, voltage and current, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Networks (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) modules. This architecture is designed to capture both local signal characteristics and long-range temporal dependencies while modeling the continuous-time dynamics of battery degradation. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available large-scale datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its strong potential for real-world RUL prediction applications.
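The delta-based preprocessing is straightforward to illustrate. In the hypothetical sketch below, the derived-capacity definition, column names, and smoothing window are invented stand-ins for the paper's pipeline: a derived feature is computed from capacity and current, denoised with a rolling median, and differenced across cycles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
cycles = pd.DataFrame({
    "cycle": np.arange(1, 101),
    "capacity": 2.0 - 0.004 * np.arange(100) + rng.normal(0, 0.005, 100),
    "current": 1.5 + rng.normal(0, 0.02, 100),
})

# Derived capacity feature (illustrative definition): capacity per unit current.
cycles["derived_capacity"] = cycles["capacity"] / cycles["current"]

# Denoise with a rolling median, then add delta features between cycles.
for col in ["capacity", "derived_capacity"]:
    cycles[f"{col}_smooth"] = cycles[col].rolling(5, min_periods=1).median()
    cycles[f"{col}_delta"] = cycles[f"{col}_smooth"].diff().fillna(0.0)

print(cycles[["cycle", "capacity_smooth", "capacity_delta"]].tail())
```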

[AI-36] SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving

【Quick Read】: This paper asks whether the success of large language models (LLMs) on mathematical tasks reflects genuine mathematical reasoning or mere surface pattern recognition. Existing metrics such as final-answer accuracy cannot disentangle the underlying competencies and offer little diagnostic value. The proposed solution is SMART (a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework), whose key is to decompose mathematical problem solving into four independent dimensions (understanding, reasoning, arithmetic, and reflection/refinement), evaluate each with tailored tasks for interpretable, fine-grained analysis of LLM behavior, and integrate an automated self-generating and self-validating mechanism that keeps the benchmark data scalable and reliable.

Link: https://arxiv.org/abs/2505.16646
Authors: Yujie Hou, Ting Zhang, Mei Wang, Xuetao Ma, Hu Huang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection/refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final answer accuracy as a sole metric and motivate a new holistic metric to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.

[AI-37] BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization

【Quick Read】: This paper addresses the backdoor vulnerabilities of Vision-Language-Action (VLA) models, which remain largely unexplored, especially under the Training-as-a-Service (TaaS) paradigm, where stealthy, persistent, and practically significant backdoor attacks pose a threat beyond traditional adversarial perturbations. The key to the proposed solution, BadVLA, is a two-stage method based on objective-decoupled optimization: first, explicit feature-space separation isolates trigger representations from benign inputs; second, conditional control deviations activate only when the trigger is present while clean-task performance is preserved. The method exposes VLA backdoor vulnerabilities for the first time and achieves near-100% attack success rates across multiple VLA benchmarks with minimal impact on clean-task accuracy.

Link: https://arxiv.org/abs/2505.16640
Authors: Xueyang Zhou, Guiyao Tie, Guowen Zhang, Hechang Wang, Pan Zhou, Lichao Sun
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 19 pages, 12 figures, 6 tables

Abstract:Vision-Language-Action (VLA) models have advanced robotic control by enabling end-to-end decision-making directly from multimodal inputs. However, their tightly coupled architectures expose novel security vulnerabilities. Unlike traditional adversarial perturbations, backdoor attacks represent a stealthier, persistent, and practically significant threat-particularly under the emerging Training-as-a-Service paradigm-but remain largely unexplored in the context of VLA models. To address this gap, we propose BadVLA, a backdoor attack method based on Objective-Decoupled Optimization, which for the first time exposes the backdoor vulnerabilities of VLA models. Specifically, it consists of a two-stage process: (1) explicit feature-space separation to isolate trigger representations from benign inputs, and (2) conditional control deviations that activate only in the presence of the trigger, while preserving clean-task performance. Empirical results on multiple VLA benchmarks demonstrate that BadVLA consistently achieves near-100% attack success rates with minimal impact on clean task accuracy. Further analyses confirm its robustness against common input perturbations, task transfers, and model fine-tuning, underscoring critical security vulnerabilities in current VLA deployments. Our work offers the first systematic investigation of backdoor vulnerabilities in VLA models, highlighting an urgent need for secure and trustworthy embodied model design practices. We have released the project page at this https URL.

[AI-38] Open and Sustainable AI: challenges, opportunities and the road ahead in the life sciences

【Quick Read】: This paper addresses the erosion of research trust, poor reusability and reproducibility, and constrained environmental sustainability brought about by the rapid adoption of artificial intelligence (AI) in the life sciences. The key to its solution is a practical set of Open and Sustainable AI (OSAI) recommendations directly mapped to over 300 components of the AI ecosystem, designed to promote reusable, transparent, and sustainable AI research and to guide future policy and structured implementation pathways.

Link: https://arxiv.org/abs/2505.16619
Authors: Gavin Farrell (Department of Biomedical Sciences, University of Padova, Padova, Italy), Eleni Adamidi (Athena Research and Innovation Center, Marousi, Greece), Rafael Andrade Buono (VIB.AI Center for AI and Computational Biology, Ghent, Belgium), Mihail Anton (ELIXIR Europe Hub, EMBL-EBI, Hinxton, United Kingdom), Omar Abdelghani Attafi (Department of Biomedical Sciences, University of Padova, Padova, Italy), Salvador Capella Gutierrez (Barcelona Supercomputing Center (BSC), Barcelona, Spain), Emidio Capriotti (Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy and Computational Genomics Platform, IRCCS University Hospital of Bologna, Bologna, Italy), Leyla Jael Castro (ZB MED Information Centre for Life Sciences, Cologne, Germany), Davide Cirillo (Barcelona Supercomputing Center (BSC), Barcelona, Spain), Lisa Crossman (SequenceAnalysis.co.uk, United Kingdom and University of East Anglia, Norwich, United Kingdom), Christophe Dessimoz (Department of Computational Biology, University of Lausanne, Lausanne, Switzerland and Swiss Institute of Bioinformatics, Lausanne, Switzerland), Alexandros Dimopoulos (Institute for Fundamental Biomedical Science, Biomedical Sciences Research Center Alexander Fleming, Vari, Greece and Department of Informatics & Telematics, School of Digital Technology, Harokopio University, Athens, Greece), Raul Fernandez-Diaz (School of Medicine, University College Dublin, Dublin, Ireland and Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland and IBM Research Dublin, Dublin, Ireland), Styliani-Christina Fragkouli (Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece and Department of Biology, National & Kapodistrian University of Athens, Athens, Greece), Carole Goble (Department of Computer Science, University of Manchester, Manchester, United Kingdom), Wei Gu (Luxembourg National Data Service, Esch-sur-Alzette, Luxembourg), John M. Hancock (Institute of Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia), Alireza Khanteymoori (Department of Psychology, University of Freiburg, Freiburg, Germany), Tom Lenaerts (Machine Learning Group, Universite Libre de Bruxelles, Brussels, Belgium and Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium and Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium and FARI, AI for the common good institute, ULB-VUB, Brussels, Belgium and Center for Human-Compatible AI, UC Berkeley, Berkeley, CA, USA), Fabio G. Liberante (ELIXIR Europe Hub, EMBL-EBI, Hinxton, United Kingdom), Peter Maccallum (ELIXIR Europe Hub, EMBL-EBI, Hinxton, United Kingdom), Alexander Miguel Monzon (Department of Biomedical Sciences, University of Padova, Padova, Italy), Magnus Palmblad (Leiden University Medical Center, Leiden, Netherlands), Lucy Poveda (Swiss Institute of Bioinformatics, Lausanne, Switzerland), Ovidiu Radulescu (LPHI, University of Montpellier, CNRS, INSERM, Montpellier, France), Denis C. Shields (School of Medicine, University College Dublin, Dublin, Ireland and Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland), Shoaib Sufi (Department of Computer Science, University of Manchester, Manchester, United Kingdom), Thanasis Vergoulis (Athena Research and Innovation Center, Marousi, Greece), Fotis Psomopoulos (Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece), Silvio C.E. Tosatto (Department of Biomedical Sciences, University of Padova, Padova, Italy and Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari, Italy)
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: 1 PDF, 24 pages, 2 figures within. Co-corresponding authors: Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece and Department of Biomedical Sciences, University of Padova, Padova, Italy. E-mails: fpsom@certh.gr, this http URL @unipd.it

Abstract:Artificial intelligence (AI) has recently seen transformative breakthroughs in the life sciences, expanding possibilities for researchers to interpret biological information at an unprecedented capacity, with novel applications and advances being made almost daily. In order to maximise return on the growing investments in AI-based life science research and accelerate this progress, it has become urgent to address the exacerbation of long-standing research challenges arising from the rapid adoption of AI methods. We review the increased erosion of trust in AI research outputs, driven by the issues of poor reusability and reproducibility, and highlight their consequent impact on environmental sustainability. Furthermore, we discuss the fragmented components of the AI ecosystem and lack of guiding pathways to best support Open and Sustainable AI (OSAI) model development. In response, this perspective introduces a practical set of OSAI recommendations directly mapped to over 300 components of the AI ecosystem. Our work connects researchers with relevant AI resources, facilitating the implementation of sustainable, reusable and transparent AI. Built upon life science community consensus and aligned to existing efforts, the outputs of this perspective are designed to aid the future development of policy and structured pathways for guiding AI implementation.

[AI-39] Safe Uncertainty-Aware Learning of Robotic Suturing

【Quick Read】: This paper addresses the challenge of automating control in robot-assisted minimally invasive surgery while guaranteeing safety and improving adaptability and robustness. The key to its solution is a safe, uncertainty-aware learning framework: an Ensemble Model of Diffusion Policies is trained from expert demonstrations of needle insertion and used to quantify the policy's epistemic uncertainty, which identifies Out-Of-Distribution scenarios so that control can be released back to the surgeon in unsafe situations. In addition, a model-free Control Barrier Function places formal safety constraints on the predicted action, keeping the system inside a specified safety set when predictions are unsafe.

Link: https://arxiv.org/abs/2505.16596
Authors: Wilbert Peter Empleo, Yitaek Kim, Hansoul Kim, Thiusius Rajeeth Savarimuthu, Iñigo Iturrate
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Robot-Assisted Minimally Invasive Surgery is currently fully manually controlled by a trained surgeon. Automating this has great potential for alleviating issues, e.g., physical strain, highly repetitive tasks, and shortages of trained surgeons. For these reasons, recent works have utilized Artificial Intelligence methods, which show promising adaptability. Despite these advances, there is skepticism of these methods because they lack explainability and robust safety guarantees. This paper presents a framework for a safe, uncertainty-aware learning method. We train an Ensemble Model of Diffusion Policies using expert demonstrations of needle insertion. Using an Ensemble model, we can quantify the policy’s epistemic uncertainty, which is used to determine Out-Of-Distribution scenarios. This allows the system to release control back to the surgeon in the event of an unsafe scenario. Additionally, we implement a model-free Control Barrier Function to place formal safety guarantees on the predicted action. We experimentally evaluate our proposed framework using a state-of-the-art robotic suturing simulator. We evaluate multiple scenarios, such as dropping the needle, moving the camera, and moving the phantom. The learned policy is robust to these perturbations, showing corrective behaviors and generalization, and it is possible to detect Out-Of-Distribution scenarios. We further demonstrate that the Control Barrier Function successfully limits the action to remain within our specified safety set in the case of unsafe predictions.
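The safety logic, quantifying ensemble disagreement, handing control back when it is too high, and constraining the action otherwise, can be outlined compactly. In the sketch below the diffusion policies are replaced by arbitrary callables and the Control Barrier Function by a simple box projection; both substitutions, and the threshold, are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-ins for an ensemble of trained diffusion policies (toy linear maps).
ensemble = [lambda obs, w=w, b=b: np.tanh(obs @ w + b)
            for w, b in zip(rng.normal(size=(5, 4, 3)), rng.normal(size=5))]

SAFE_LOW, SAFE_HIGH = -0.8, 0.8      # box safety set on each action dimension
UNCERTAINTY_LIMIT = 0.15             # epistemic-uncertainty handover threshold

def act(obs):
    """Return a safe action, or None to signal handover to the surgeon."""
    actions = np.stack([policy(obs) for policy in ensemble])
    mean, epistemic = actions.mean(0), float(actions.std(0).mean())
    if epistemic > UNCERTAINTY_LIMIT:
        return None, epistemic       # Out-Of-Distribution: defer to the surgeon
    # Box projection standing in for a proper control-barrier-function filter.
    return np.clip(mean, SAFE_LOW, SAFE_HIGH), epistemic

obs = rng.normal(size=4)
action, u = act(obs)
print("handover to surgeon" if action is None else f"action={action}", f"u={u:.3f}")
```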

[AI-40] How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning

【Quick Read】: This paper studies how policy distillation after training can improve an agent's generalization to unseen test environments in the zero-shot policy transfer setting of reinforcement learning. Its key contribution is a generalization bound, proven under certain assumptions, for post-training policy distillation, which yields two practical insights: train an ensemble of distilled policies, and distil on as much data from the training environments as possible. Experiments verify that these insights hold in more general settings where the theoretical assumptions fail, and that an ensemble of policies distilled on a diverse dataset can generalize significantly better than the original agent.

Link: https://arxiv.org/abs/2505.16581
Authors: Max Weltevrede, Moritz A. Zanger, Matthijs T.J. Spaan, Wendelin Böhmer
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
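The two insights translate directly into a distillation recipe. Below is a minimal, hypothetical PyTorch sketch: several students are distilled from a frozen teacher by KL divergence on states from the training environments, and their action distributions are averaged at test time. The networks and data are toy stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
obs_dim, n_actions, n_students = 8, 4, 3

teacher = nn.Linear(obs_dim, n_actions)            # frozen trained policy (toy)
students = [nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(),
                          nn.Linear(32, n_actions)) for _ in range(n_students)]

states = torch.randn(4096, obs_dim)    # insight 2: as much training-env data as possible
with torch.no_grad():
    teacher_logp = F.log_softmax(teacher(states), dim=-1)

for student in students:               # insight 1: distil an ensemble of policies
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(200):
        idx = torch.randint(0, len(states), (256,))
        student_logp = F.log_softmax(student(states[idx]), dim=-1)
        loss = F.kl_div(student_logp, teacher_logp[idx],
                        log_target=True, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

def ensemble_policy(obs):
    """Average the distilled students' action probabilities at test time."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(s(obs), dim=-1) for s in students]).mean(0)
    return probs.argmax(-1)

print(ensemble_policy(torch.randn(5, obs_dim)))
```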

[AI-41] From Local Patterns to Global Understanding: Cross-Stock Trend Integration for Enhanced Predictive Modeling

【Quick Read】: This paper addresses the inability of traditional single-stock learning methods to exploit potential correlations among stock trends in price prediction, which limits a comprehensive understanding of multi-stock price dynamics. The key to its solution, Cross-Stock Trend Integration (CSTI), is a cross-stock pattern-integration approach inspired by Federated Learning (FL): models are trained on individual stock data and iteratively merged into a global model that aggregates local patterns while preserving data privacy, and the global model is then fine-tuned on specific stock data to retain local relevance.

Link: https://arxiv.org/abs/2505.16573
Authors: Yi Hu, Hanchi Ren, Jingjing Deng, Xianghua Xie
Institution: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Stock price prediction is a critical area of financial forecasting, traditionally approached by training models using the historical price data of individual stocks. While these models effectively capture single-stock patterns, they fail to leverage potential correlations among stock trends, which could improve predictive performance. Current single-stock learning methods are thus limited in their ability to provide a broader understanding of price dynamics across multiple stocks. To address this, we propose a novel method that merges local patterns into a global understanding through cross-stock pattern integration. Our strategy is inspired by Federated Learning (FL), a paradigm designed for decentralized model training. FL enables collaborative learning across distributed datasets without sharing raw data, facilitating the aggregation of global insights while preserving data privacy. In our adaptation, we train models on individual stock data and iteratively merge them to create a unified global model. This global model is subsequently fine-tuned on specific stock data to retain local relevance. The proposed strategy enables parallel training of individual stock models, facilitating efficient utilization of computational resources and reducing overall training time. We conducted extensive experiments to evaluate the proposed method, demonstrating that it outperforms benchmark models and enhances the predictive capabilities of state-of-the-art approaches. Our results highlight the efficacy of Cross-Stock Trend Integration (CSTI) in advancing stock price prediction, offering a robust alternative to traditional single-stock learning methodologies.
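The FL-inspired merge is essentially federated averaging of per-stock model weights followed by local fine-tuning. A minimal, hypothetical PyTorch sketch of one merge round follows (the paper's scheme is iterative, and the models and data here are toys):

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

def train_locally(model, x, y, steps=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# One model per stock, trained on that stock's history only (parallelizable).
stock_data = [(torch.randn(256, 10), torch.randn(256, 1)) for _ in range(5)]
local_models = [train_locally(make_model(), x, y) for x, y in stock_data]

# FedAvg-style merge: average the state dicts into a global model.
global_model = make_model()
avg_state = {
    k: torch.stack([m.state_dict()[k].float() for m in local_models]).mean(0)
    for k in global_model.state_dict()
}
global_model.load_state_dict(avg_state)

# Fine-tune the global model on a specific stock to retain local relevance.
target_x, target_y = stock_data[0]
finetuned = train_locally(copy.deepcopy(global_model), target_x, target_y, steps=50)
```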

[AI-42] Finetuning-Activated Backdoors in LLMs

【Quick Read】: This paper exposes a security problem in finetuning large language models (LLMs): an adversary can plant a backdoor so that a model appears benign at first but exhibits malicious behaviors once finetuned by downstream users. The key to the proposed attack, FAB (Finetuning-Activated Backdoor), is to poison the LLM with meta-learning techniques that simulate downstream finetuning and explicitly optimize for the emergence of malicious behaviors in the finetuned model, while regularizing the poisoned model to retain general capabilities and show no malicious behavior before finetuning.

Link: https://arxiv.org/abs/2505.16567
Authors: Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors. In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models. At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. Additionally, we show that FAB-backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler). Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.

[AI-43] Find the Fruit: Designing a Zero-Shot Sim2Real Deep RL Planner for Occlusion Aware Plant Manipulation

【Quick Read】: This paper addresses how a robot in cluttered plant environments can interact with deformable plants to reveal occluded objects of interest such as fruits. The key to its solution is an end-to-end deep reinforcement learning (DRL) framework that uses multimodal observations and decouples the kinematic planning problem from robot control, simplifying zero-shot sim2real transfer of the trained policy.

Link: https://arxiv.org/abs/2505.16547
Authors: Nitesh Subedi, Hsin-Jung Yang, Devesh K. Jha, Soumik Sarkar
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 18 pages, 15 figures, 5 tables

Abstract:This paper presents an end-to-end deep reinforcement learning (RL) framework for occlusion-aware robotic manipulation in cluttered plant environments. Our approach enables a robot to interact with a deformable plant to reveal hidden objects of interest, such as fruits, using multimodal observations. We decouple the kinematic planning problem from robot control to simplify zero-shot sim2real transfer for the trained policy. Our results demonstrate that the trained policy, deployed using our framework, achieves up to 86.7% success in real-world trials across diverse initial conditions. Our findings pave the way toward autonomous, perception-driven agricultural robots that intelligently interact with complex foliage plants to “find the fruit” in challenging occluded scenarios, without the need for explicitly designed geometric and dynamic models of every plant scenario.

[AI-44] Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods

【Quick Read】: This paper addresses the interpretability problem posed by the black-box nature of kernel methods in machine learning, in particular how to compute Shapley-value-based feature attributions efficiently. The key to its solution is PKeX-Shapley, an algorithm that exploits the multiplicative structure of product kernels to compute exact Shapley values in polynomial time, using a functional decomposition and a recursive formulation that improve both computational efficiency and the interpretability of kernel-based learning.

Link: https://arxiv.org/abs/2505.16516
Authors: Majid Mohammadi, Siu Lun Chau, Krikamol Muandet
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Kernel methods are widely used in machine learning due to their flexibility and expressive power. However, their black-box nature poses significant challenges to interpretability, limiting their adoption in high-stakes applications. Shapley value-based feature attribution techniques, such as SHAP and kernel-specific variants like RKHS-SHAP, offer a promising path toward explainability. Yet, computing exact Shapley values remains computationally intractable in general, motivating the development of various approximation schemes. In this work, we introduce PKeX-Shapley, a novel algorithm that utilizes the multiplicative structure of product kernels to enable the exact computation of Shapley values in polynomial time. We show that product-kernel models admit a functional decomposition that allows for a recursive formulation of Shapley values. This decomposition not only yields computational efficiency but also enhances interpretability in kernel-based learning. We also demonstrate how our framework can be generalized to explain kernel-based statistical discrepancies such as the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC), thus offering new tools for interpretable statistical inference.
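To make the attribution target concrete, the sketch below computes exact Shapley values by brute-force enumeration for a product-kernel coalition value, where the value of a coalition S is the product of per-feature RBF kernels over S. The paper's contribution is a recursion that avoids this exponential enumeration; the brute force version is shown only for clarity, and the kernel and inputs are invented.

```python
import math
from itertools import combinations

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def coalition_value(S, x, x_ref):
    """Product kernel restricted to features in S (empty product = 1)."""
    v = 1.0
    for j in S:
        v *= rbf(x[j], x_ref[j])
    return v

def exact_shapley(x, x_ref):
    """Brute-force Shapley values: exponential in the number of features."""
    n = len(x)
    phi = [0.0] * n
    for j in range(n):
        rest = [f for f in range(n) if f != j]
        for size in range(n):
            for S in combinations(rest, size):
                weight = (math.factorial(size) * math.factorial(n - size - 1)
                          / math.factorial(n))
                marginal = (coalition_value(S + (j,), x, x_ref)
                            - coalition_value(S, x, x_ref))
                phi[j] += weight * marginal
    return phi

x, x_ref = [0.1, 2.0, 0.0], [0.0, 0.0, 0.0]
print([round(p, 4) for p in exact_shapley(x, x_ref)])
# Feature 1 (far from the reference) receives the largest-magnitude attribution.
```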

[AI-45] Edge-First Language Model Inference: Models Metrics and Tradeoffs

【Quick Read】: This paper addresses the challenges of efficiently deploying language models (LMs) across the computing continuum from cloud to edge, in particular achieving low-latency, low-cost, and reliable inference on resource-constrained edge platforms. The key to its solution is to leverage Small Language Models (SLMs), enabled by model compression, for on-device inference, and to analyze the interplay between edge and cloud deployments in depth, from detailed benchmarking on single edge devices to distributed edge clusters, deriving design insights for efficient, adaptive LM inference systems in heterogeneous environments.

Link: https://arxiv.org/abs/2505.16508
Authors: SiYoung Jang, Roberto Morabito
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Comments: This paper has been accepted for publication and presentation at the 45th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS 2025). The copyright will be transferred to IEEE upon publication in the conference proceedings

Abstract:The widespread adoption of Language Models (LMs) across industries is driving interest in deploying these services across the computing continuum, from the cloud to the network edge. This shift aims to reduce costs, lower latency, and improve reliability and privacy. Small Language Models (SLMs), enabled by advances in model compression, are central to this shift, offering a path to on-device inference on resource-constrained edge platforms. This work examines the interplay between edge and cloud deployments, starting from detailed benchmarking of SLM capabilities on single edge devices, and extending to distributed edge clusters. We identify scenarios where edge inference offers comparable performance with lower costs, and others where cloud fallback becomes essential due to limits in scalability or model capacity. Rather than proposing a one-size-fits-all solution, we present platform-level comparisons and design insights for building efficient, adaptive LM inference systems across heterogeneous environments.

[AI-46] Relevance for Stability of Verification Status of a Set of Arguments in Incomplete Argumentation Frameworks (with Proofs)

【Quick Read】: This paper studies how to ensure that the verification status of a given set of arguments is stable across all completions of an incomplete argumentation framework (IAF). The key to its solution is the notion of relevance, which describes the uncertainties that must be resolved in some situations so that answering whether the set is an extension yields the same result in every completion; the paper further introduces strong relevance to capture the necessity of resolution in all situations reaching stability. A complexity analysis shows that detecting (strong) relevance can be accomplished in polynomial time under most of the semantics discussed, while relevance detection under grounded semantics remains computationally difficult.

Link: https://arxiv.org/abs/2505.16507
Authors: Anshu Xiong, Songmao Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This is a version of the paper 'Relevance for Stability of Verification Status of a Set of Arguments in Incomplete Argumentation Frameworks' extended with proofs of the results in the paper

Abstract: The notion of relevance was proposed for stability of the justification status of a single argument in incomplete argumentation frameworks (IAFs) in 2024 by Odekerken et al. To extend the notion, we study the relevance for stability of the verification status of a set of arguments in this paper, i.e., the uncertainties in an IAF that have to be resolved in some situations so that answering whether a given set of arguments is an extension obtains the same result in every completion of the IAF. Further, we propose the notion of strong relevance for describing the necessity of resolution in all situations reaching stability. An analysis of complexity reveals that detecting the (strong) relevance for stability of sets of arguments can be accomplished in polynomial time under most of the semantics discussed in the paper. We also discuss the difficulty in finding tractable methods for relevance detection under grounded semantics.

[AI-47] Smaller Smarter Closer: The Edge of Collaborative Generative AI

【Quick Read】: This paper addresses the latency, cost, and privacy problems of cloud-centric deployments of generative AI (GenAI), together with the capability gap of Small Language Models (SLMs) in resource-constrained edge environments. The key to its solution is collaborative inference systems that combine edge and cloud resources, presenting distinct cooperation strategies alongside practical design principles and experimental insights for deploying GenAI effectively across the computing continuum.

Link: https://arxiv.org/abs/2505.16499
Authors: Roberto Morabito, SiYoung Jang
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: This paper is currently under review for publication in an IEEE magazine. If accepted, the copyright will be transferred to IEEE

Abstract:The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including latency, cost, and privacy concerns. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores the potential of collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.

[AI-48] Human-like Semantic Navigation for Autonomous Driving using Knowledge Representation and Large Language Models

【Quick Read】: This paper addresses the challenges autonomous vehicles face in achieving full automation in dynamic urban environments, in particular coping with unpredictable changes in road layouts, spontaneous detours, or missing map data, where existing systems over-rely on predefined cartographic information and fail to produce workable navigation plans. The key to its solution is to use Large Language Models (LLMs) to translate informal navigation instructions into structured Answer Set Programming (ASP) rules, so that ASP's non-monotonic reasoning lets the driving system adapt to evolving scenarios without relying on predefined maps.

Link: https://arxiv.org/abs/2505.16498
Authors: Augusto Luis Ballardini, Miguel Ángel Sotelo
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures, submitted for IEEE conference

Abstract:Achieving full automation in self-driving vehicles remains a challenge, especially in dynamic urban environments where navigation requires real-time adaptability. Existing systems struggle to handle navigation plans when faced with unpredictable changes in road layouts, spontaneous detours, or missing map data, due to their heavy reliance on predefined cartographic information. In this work, we explore the use of Large Language Models to generate Answer Set Programming rules by translating informal navigation instructions into structured, logic-based reasoning. ASP provides non-monotonic reasoning, allowing autonomous vehicles to adapt to evolving scenarios without relying on predefined maps. We present an experimental evaluation in which LLMs generate ASP constraints that encode real-world urban driving logic into a formal knowledge representation. By automating the translation of informal navigation instructions into logical rules, our method improves adaptability and explainability in autonomous navigation. Results show that LLM-driven ASP rule generation supports semantic-based decision-making, offering an explainable framework for dynamic navigation planning that aligns closely with how humans communicate navigational intent.

[AI-49] Minimizing the energy depletion in wireless rechargeable sensor networks using bi-level metaheuristic charging schemes

【Quick Read】: This paper addresses energy depletion in Wireless Rechargeable Sensor Networks (WRSNs) caused by ineffective charging strategies. Existing studies mostly optimize the charging path with a fully charging approach, which can let sensors die from excessive charging latency. The paper proposes a novel partial charging approach whose key is a bi-level optimization scheme that jointly optimizes two factors, the charging path and the charging time, to minimize energy depletion. After formulating a mathematical model, two approximate algorithms are designed: one combines multi-start local search with a genetic algorithm, and the other adopts a nested approach using multitasking and covariance matrix adaptation evolution strategies. Experiments on various network scenarios show the proposed algorithms outperform existing work.

Link: https://arxiv.org/abs/2505.16482
Authors: Huynh Thi Thanh Binh, Le Van Cuong, Dang Hai Dang, Le Trong Vinh
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract: Recently, Wireless Rechargeable Sensor Networks (WRSNs), which leverage the advantages of wireless energy transfer technology, have opened a promising opportunity for solving the limited-energy issue. However, an ineffective charging strategy may reduce the charging performance. Although many practical charging algorithms have been introduced, these studies mainly focus on optimizing the charging path with a fully charging approach. This approach may lead to the death of a series of sensors due to their extended charging latency. This paper introduces a novel partial charging approach that follows a bi-level optimized scheme to minimize energy depletion in WRSNs. We aim to optimize two factors simultaneously: the charging path and the charging time. To accomplish this, we first formulate a mathematical model of the investigated problem. We then propose two approximate algorithms in which the optimization of the charging path and the charging time are considered as the upper and lower levels, respectively. The first algorithm combines a multi-start local search method and a genetic algorithm to find a solution. The second algorithm adopts a nested approach that utilizes the advantages of the multitasking and covariance matrix adaptation evolutionary strategies. Experimental validations on various network scenarios demonstrate that our proposed algorithms outperform the existing works.

[AI-50] Advancing the Scientific Method with Large Language Models : From Hypothesis to Discovery

【Quick Read】: This paper examines how to integrate generative AI effectively into scientific workflows to improve research productivity and reshape the scientific method. Its key argument is that Large Language Models (LLMs) should be deeply integrated into every stage of the scientific cycle, including hypothesis testing and discovery, in collaboration and alignment with human scientific goals and with clear evaluation metrics, so that they can serve as effective creative engines and productivity enhancers. The paper also stresses ethical questions raised by AI-driven science, such as creativity, oversight, and responsibility, and argues that with careful guidance LLMs could responsibly drive transformative breakthroughs across scientific disciplines.

Link: https://arxiv.org/abs/2505.16477
Authors: Yanbo Zhang, Sumeer A. Khan, Adnan Mahmud, Huck Yang, Alexander Lavin, Michael Levin, Jeremy Frey, Jared Dunnmon, James Evans, Alan Bundy, Saso Dzeroski, Jesper Tegner, Hector Zenil
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 45 pages

Abstract:With recent Nobel Prizes recognising AI contributions to science, Large Language Models (LLMs) are transforming scientific research by enhancing productivity and reshaping the scientific method. LLMs are now involved in experimental design, data analysis, and workflows, particularly in chemistry and biology. However, challenges such as hallucinations and reliability persist. In this contribution, we review how LLMs are redefining the scientific method and explore their potential applications across different stages of the scientific cycle, from hypothesis testing to discovery. We conclude that, for LLMs to serve as relevant and effective creative engines and productivity enhancers, their deep integration into all steps of the scientific process should be pursued in collaboration and alignment with human scientific goals, with clear evaluation metrics. The transition to AI-driven science raises ethical questions about creativity, oversight, and responsibility. With careful guidance, LLMs could evolve into creative engines, driving transformative breakthroughs across scientific disciplines responsibly and effectively. However, the scientific community must also decide how much it leaves to LLMs to drive science, even when associations with ‘reasoning’, mostly currently undeserved, are made in exchange for the potential to explore hypothesis and solution regions that might otherwise remain unexplored by human exploration alone.
zh

[AI-51] ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection

【速读】:该论文试图解决如何通过自我反思学习(reflection learning)提升小语言模型(SLMs)的推理能力问题。其解决方案的关键在于构建一个名为ReflectEvo的新型流水线,该流水线通过迭代生成自我反思内容进行自训练,从而实现持续且自我演进的过程。该方法无需依赖优越模型的蒸馏或细粒度的人工标注,仅依靠自动生成的高质量反思数据,显著提升了SLMs的推理性能。

链接: https://arxiv.org/abs/2505.16475
作者: Jiaqi Li,Xinyi Dong,Yang Liu,Zhizhuo Yang,Quansen Wang,Xiaobo Wang,SongChun Zhu,Zixia Jia,Zilong Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs’ reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.
zh

[AI-52] Conf-GNNRec: Quantifying and Calibrating the Prediction Confidence for GNN-based Recommendation Methods

【速读】:该论文旨在解决基于图神经网络(Graph Neural Network, GNN)的推荐系统在噪声环境下预测结果不可靠的问题,特别是在用户误用和恶意广告等噪声通过消息传播机制积累的情况下,低权重的噪声邻居可能被误认为有效信息,导致预测结果不可信。解决方案的关键在于提出一种量化和校准GNN推荐预测置信度的方法(Conf-GNNRec),其核心包括一种基于用户个性化的评分校准方法,用于动态调整过高的评分以缓解过度自信问题,以及设计一种置信度损失函数,以降低负样本的过度自信并提升推荐性能。

链接: https://arxiv.org/abs/2505.16466
作者: Meng Yan,Cai Xu,Xujing Wang,Ziyu Guan,Wei Zhao,Yuhang Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recommender systems based on graph neural networks perform well in tasks such as rating and ranking. However, in real-world recommendation scenarios, noise such as user misuse and malicious advertisement gradually accumulates through the message propagation mechanism. Even if existing studies mitigate their effects by reducing the noise propagation weights, the severe sparsity of the recommender system still leads to the low-weighted noisy neighbors being mistaken as meaningful information, and the prediction result obtained based on the polluted nodes is not entirely trustworthy. Therefore, it is crucial to measure the confidence of the prediction results in this highly noisy framework. Furthermore, our evaluation of the existing representative GNN-based recommendation shows that it suffers from overconfidence. Based on the above considerations, we propose a new method to quantify and calibrate the prediction confidence of GNN-based recommendations (Conf-GNNRec). Specifically, we propose a rating calibration method that dynamically adjusts excessive ratings to mitigate overconfidence based on user personalization. We also design a confidence loss function to reduce the overconfidence of negative samples and effectively improve recommendation performance. Experiments on public datasets demonstrate the validity of Conf-GNNRec in prediction confidence and recommendation performance.
zh

[AI-53] MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks

【速读】:该论文试图解决多模态大语言模型(Multi-Modal Large Language Models, MLLMs)在推理能力方面缺乏标准化评估基准的问题,特别是那些引入了中间思维路径(MLLMs-T)的模型。现有研究主要关注感知或最终答案的正确性,而忽视了模型在不同模态下的推理过程和失败机制。解决方案的关键在于提出一个新的基准测试框架——MMMR,其核心包括一个高难度的多模态推理数据集和一个模块化的Reasoning Trace Evaluation Pipeline (RTEP),用于从相关性、一致性及结构化错误标注等维度评估推理质量,从而弥补准确率与推理质量之间的差距。

链接: https://arxiv.org/abs/2505.16459
作者: Guiyao Tie,Xueyang Zhou,Tianhe Gu,Ruihang Zhang,Chaoran Hu,Sizhe Zhang,Mengqu Sun,Yan Zhang,Pan Zhou,Lichao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 39 pages, 28 figures, 4 tables

点击查看摘要

Abstract:Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
zh

[AI-54] Psychology-driven LLM Agents for Explainable Panic Prediction on Social Media during Sudden Disaster Events

【速读】:该论文旨在解决突发事件中社交媒体上公众恐慌情绪预测的准确性问题,其主要挑战包括缺乏精细标注数据、风险感知建模不足以及恐慌形成机制的可解释性欠缺。解决方案的关键在于提出一种基于情感唤醒理论的心理驱动生成代理框架(PsychoAgent),通过构建细粒度开放恐慌情绪数据集(COPE)以减少语义偏差,并整合跨领域异构数据来建模风险感知和情绪生成中的认知差异。此外,设计基于大型语言模型的角色扮演代理以增强对恐慌形成过程的可解释性,从而实现从“数据驱动拟合”到“基于角色的模拟与机制解释”的范式转变。

链接: https://arxiv.org/abs/2505.16455
作者: Mengzhu Liu,Zhengqiu Zhu,Chuan Ai,Chen Gao,Xinghong Li,Lingnan He,Kaisheng Lai,Yingfeng Chen,Xin Lu,Yong Li,Quanjun Yin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:During sudden disaster events, accurately predicting public panic sentiment on social media is crucial for proactive governance and crisis management. Current efforts on this problem face three main challenges: lack of finely annotated data hinders emotion prediction studies, unmodeled risk perception causes prediction inaccuracies, and insufficient interpretability of panic formation mechanisms. We address these issues by proposing a Psychology-driven generative Agent framework (PsychoAgent) for explainable panic prediction based on emotion arousal theory. Specifically, we first construct a fine-grained open panic emotion dataset (namely COPE) via human-large language models (LLMs) collaboration to mitigate semantic bias. Then, we develop a framework integrating cross-domain heterogeneous data grounded in psychological mechanisms to model risk perception and cognitive differences in emotion generation. To enhance interpretability, we design an LLM-based role-playing agent that simulates individual psychological chains through dedicatedly designed prompts. Experimental results on our annotated dataset show that PsychoAgent improves panic emotion prediction performance by 12.6% to 21.7% compared to baseline models. Furthermore, the explainability and generalization of our approach are validated. Crucially, this represents a paradigm shift from opaque “data-driven fitting” to transparent “role-based simulation with mechanistic interpretation” for panic emotion prediction during emergencies. Our implementation is publicly available at: this https URL.
zh

[AI-55] Internal Bias in Reasoning Models leads to Overthinking

【速读】:该论文试图解决当前推理模型中存在的“过度思考”问题(overthinking),即模型在处理推理任务时因冗余和不必要的反思而浪费计算资源。其解决方案的关键在于识别并缓解模型内部对输入文本的偏见(internal bias),该偏见导致模型在初步猜测与实际推理结果冲突时产生不必要的反思。通过遮蔽原始输入部分,可以有效减少内部偏见的影响,从而缩短推理长度并提升准确性。

链接: https://arxiv.org/abs/2505.16448
作者: Renfei Dang,Shujian Huang,Jiajun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While current reasoning models possess strong exploratory capabilities, they are often criticized for overthinking due to redundant and unnecessary reflections. In this work, we reveal for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. Upon encountering a reasoning problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it is not derived through actual reasoning. When this guess conflicts with its reasoning result, the model tends to engage in reflection, leading to the waste of computational resources. Through further interpretability experiments, we find that this behavior is largely driven by the model’s excessive attention to the input section, which amplifies the influence of internal bias on its decision-making process. Additionally, by masking out the original input section, the effect of internal bias can be effectively alleviated and the reasoning length can be reduced by 31%-53% across different complex reasoning tasks. Notably, in most cases, this approach also leads to improvements in accuracy. These findings demonstrate a causal relationship between internal bias and overthinking.
zh

[AI-56] AutoMCQ – Automatically Generate Code Comprehension Questions using GenAI

【速读】:该论文试图解决学生对其所编写代码理解不足的问题,这一问题在教育后期才可能被发现,导致纠正错误知识或误解的难度增加。同时,在学生可以访问生成式人工智能(Generative AI)工具的背景下,理解代码的能力变得愈发重要。论文提出的解决方案是利用代码理解问题,通过评估学生对提交代码的理解来检测抄袭行为,但该方法耗时且难以扩展。其关键在于使用GenAI自动生成多项选择题,从而实现代码理解问题的自动化生成,该方法已集成到CodeRunner自动化评估平台中。

链接: https://arxiv.org/abs/2505.16430
作者: Martin Goodfellow,Robbie Booth,Andrew Fagan,Alasdair Lambert
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Students often do not fully understand the code they have written. This sometimes does not become evident until later in their education, which can mean it is harder to fix their incorrect knowledge or misunderstandings. In addition, being able to fully understand code is increasingly important in a world where students have access to generative artificial intelligence (GenAI) tools, such as GitHub Copilot. One effective solution is to utilise code comprehension questions, where a marker asks questions about a submission to gauge understanding; this can also have the side effect of helping to detect plagiarism. However, this approach is time consuming and can be difficult and/or expensive to scale. This paper introduces AutoMCQ, which uses GenAI for the automatic generation of multiple-choice code comprehension questions. This is integrated with the CodeRunner automated assessment platform.
zh
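下面是一个示意性的 Python 草图,展示用 GenAI 接口从学生提交的代码自动生成选择题的基本流程;其中 call_llm 为假设的占位接口,提示词与 JSON 输出约定也仅为示例,并非 AutoMCQ 的官方实现。

```python
import json

def call_llm(prompt: str) -> str:
    # 占位实现:返回一个固定示例;实际使用时替换为任意 GenAI 服务的调用
    return json.dumps({
        "question": "What does the submission print?",
        "options": ["0", "5", "an error", "nothing"],
        "answer_index": 1,
    })

MCQ_PROMPT = """You are a programming tutor. Given the student submission below,
write ONE multiple-choice comprehension question about what the code does.
Return JSON with keys: question, options (list of 4), answer_index.

Submission:
{code}
"""

def generate_mcq(student_code: str) -> dict:
    raw = call_llm(MCQ_PROMPT.format(code=student_code))
    mcq = json.loads(raw)               # 期望模型按 JSON 约定返回
    assert len(mcq["options"]) == 4
    return mcq

print(generate_mcq("x = 2 + 3\nprint(x)"))
```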

[AI-57] FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS

【速读】:该论文旨在解决现有检索增强推理方法中检索与生成模型分离所带来的问题,包括硬件和操作成本增加以及因表示瓶颈导致的检索错误。其解决方案的关键在于提出一种名为FREESON的新型框架,该框架使大型推理模型(LRM)能够自主进行相关知识检索,通过将LRM同时作为生成器和检索器,从而消除对独立检索模型的依赖。为此,研究者引入了一种针对检索任务优化的蒙特卡洛树搜索变体——CT-MCTS(Corpus-Traversing Monte Carlo Tree Search),使LRM能够在语料库中向包含答案的区域进行遍历。

链接: https://arxiv.org/abs/2505.16409
作者: Chaeeun Kim,Seungone Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Work In Progress

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in multi-step reasoning and calling search engines at appropriate steps. However, existing retrieval-augmented reasoning approaches rely on separate retrieval models, limiting the LRM’s role in retrieval to deciding when to retrieve and how to query. This separation not only increases hardware and operational costs but also leads to errors in the retrieval process due to the representation bottleneck, a phenomenon where the retriever’s embedding space is not expressive enough to meet the generator’s requirements. To address this, we shift our perspective from sequence-to-sequence matching to locating the answer-containing paths within the corpus, and propose a novel framework called FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables LRMs to retrieve relevant knowledge on their own by acting as both a generator and retriever. To achieve this, we introduce a variant of the MCTS algorithm specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus toward answer-containing regions. Our results on five open-domain QA benchmarks, including single-hop and multi-hop questions, show that FREESON achieves an average improvement of 14.4% in EM and F1 over four multi-step reasoning models with a separate retriever, and it also performs comparably to the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
zh
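为说明“在语料库上做树搜索”的基本骨架,下面给出一个标准 MCTS/UCT 的极简 Python 草图;节点结构与奖励函数均为占位假设,CT-MCTS 的具体设计以论文为准。

```python
import math, random

class Node:
    # 语料库中的一个“区域”节点(示意):children 为可继续深入的相邻文段
    def __init__(self, passage, children=()):
        self.passage, self.children = passage, list(children)
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # 标准 UCT 选择;CT-MCTS 的具体改动以论文为准
    return max(node.children,
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def rollout_reward(node, question):
    # 假设的奖励:论文中由 LRM 判断该文段是否朝向含答案区域,这里用随机数占位
    return random.random()

def ct_mcts(root, question, n_iter=200):
    for _ in range(n_iter):
        node, path = root, [root]
        while node.children:              # 选择:沿 UCT 向语料深处走
            node = uct_select(node)
            path.append(node)
        reward = rollout_reward(node, question)
        for n in path:                    # 回传访问数与价值
            n.visits += 1
            n.value += reward
    return max(root.children, key=lambda ch: ch.visits)

root = Node("corpus", [Node("doc A"), Node("doc B", [Node("para B1"), Node("para B2")])])
best = ct_mcts(root, "who proposed FREESON?")
print(best.passage, best.visits)
```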

[AI-58] Serious Games: Human-AI Interaction Evolution and Coevolution

【速读】:该论文试图解决人类与人工智能(Artificial Intelligence, AI)之间进化动态及共同进化的机制问题,旨在通过演化博弈论(Evolutionary Game Theory, EGT)模型分析其潜在的平衡策略与互动模式。解决方案的关键在于利用三种经典EGT模型——斗鸡博弈(Hawk-Dove Game)、重复囚徒困境(Iterated Prisoner’s Dilemma)和消耗战(War of Attrition)——来探讨人类与AI在竞争与合作中的可能演化路径,揭示其协同进化潜力及资源分配策略。

链接: https://arxiv.org/abs/2505.16388
作者: Nandini Doreswamy(1 and 2),Louise Horstmanshof(1) ((1) Southern Cross University, Lismore, New South Wales, Australia, (2) National Coalition of Independent Scholars)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 8 pages, 1 table

点击查看摘要

Abstract:The serious games between humans and AI have only just begun. Evolutionary Game Theory (EGT) models the competitive and cooperative strategies of biological entities. EGT could help predict the potential evolutionary equilibrium of humans and AI. The objective of this work was to examine some of the EGT models relevant to human-AI interaction, evolution, and coevolution. Of thirteen EGT models considered, three were examined: the Hawk-Dove Game, Iterated Prisoner’s Dilemma, and the War of Attrition. This selection was based on the widespread acceptance and clear relevance of these models to potential human-AI evolutionary dynamics and coevolutionary trajectories. The Hawk-Dove Game predicts balanced mixed-strategy equilibria based on the costs of conflict. It also shows the potential for balanced coevolution rather than dominance. Iterated Prisoner’s Dilemma suggests that repeated interaction may lead to cognitive coevolution. It demonstrates how memory and reciprocity can lead to cooperation. The War of Attrition suggests that competition for resources may result in strategic coevolution, asymmetric equilibria, and conventions on sharing resources. Therefore, EGT may provide a suitable framework to understand and predict the human-AI evolutionary dynamic. However, future research could extend beyond EGT and explore additional frameworks, empirical validation methods, and interdisciplinary perspectives. AI is being shaped by human input and is evolving in response to it. So too, neuroplasticity allows the human brain to grow and evolve in response to stimuli. If humans and AI converge in future, what might be the result of human neuroplasticity combined with an ever-evolving AI? Future research should be mindful of the ethical and cognitive implications of human-AI interaction, evolution, and coevolution.
zh
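鹰鸽博弈的混合策略均衡有封闭解:设资源价值为 V、冲突代价为 C,当 C > V 时,均衡中采取“鹰”策略的比例为 V/C。下面用几行 Python 验证这一教科书结论(非论文新结果)。

```python
def hawk_dove_equilibrium(V: float, C: float) -> float:
    """鹰鸽博弈的混合策略均衡:收益 V、冲突代价 C。
    当 C > V 时,均衡中“鹰”策略的比例为 V/C。"""
    if C <= V:
        return 1.0          # 冲突代价不高于收益时,纯“鹰”策略占优
    return V / C

# 例:资源价值 2、冲突代价 6 时,均衡鹰比例为 1/3
print(hawk_dove_equilibrium(2, 6))   # 0.333...
```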

[AI-59] VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving

【速读】:该论文旨在解决基于强化学习(Reinforcement Learning, RL)的自动驾驶策略学习中面临的关键问题,包括样本效率低、泛化能力差以及在安全关键场景中对在线交互和试错学习的依赖所带来的安全隐患。现有方法在复杂驾驶情境中难以准确捕捉“安全”的真实语义,导致驾驶行为过于保守或违反约束。其解决方案的关键在于提出VL-SAFE框架,该框架采用基于世界模型(World Model)的安全强化学习方法,并引入视觉-语言模型(Vision-Language Model, VLM)作为安全引导机制,通过离线数据集进行安全策略学习,从而实现无需与真实环境交互的安全规划与策略优化。

链接: https://arxiv.org/abs/2505.16377
作者: Yansong Qu,Zilin Huang,Zihao Sheng,Jiancong Chen,Sikai Chen,Samuel Labi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL)-based autonomous driving policy learning faces critical limitations such as low sample efficiency and poor generalization; its reliance on online interactions and trial-and-error learning is especially unacceptable in safety-critical scenarios. Existing methods including safe RL often fail to capture the true semantic meaning of “safety” in complex driving contexts, leading to either overly conservative driving behavior or constraint violations. To address these challenges, we propose VL-SAFE, a world model-based safe RL framework with Vision-Language model (VLM)-as-safety-guidance paradigm, designed for offline safe policy learning. Specifically, we construct offline datasets containing data collected by expert agents and labeled with safety scores derived from VLMs. A world model is trained to generate imagined rollouts together with safety estimations, allowing the agent to perform safe planning without interacting with the real environment. Based on these imagined trajectories and safety evaluations, actor-critic learning is conducted under VLM-based safety guidance to optimize the driving policy more safely and efficiently. Extensive evaluations demonstrate that VL-SAFE achieves superior sample efficiency, generalization, safety, and overall performance compared to existing baselines. To the best of our knowledge, this is the first work that introduces a VLM-guided world model-based approach for safe autonomous driving. The demo video and code can be accessed at: this https URL
zh

[AI-60] SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning

【速读】:该论文旨在解决如何设计有效的强化学习(Reinforcement Learning, RL)任务,以充分激发大语言模型(Large Language Models, LLMs)的推理能力这一开放性问题。现有RL任务在可扩展性、可验证性和可控难度方面存在显著局限。为应对这些挑战,本文提出Saturn,一个基于布尔可满足性(SAT)问题的强化学习框架,其关键在于通过SAT问题构建可扩展的任务、实现基于规则的验证以及精确控制难度。Saturn设计了课程学习流水线,通过逐步增加难度的SAT任务训练LLMs,从而系统提升其推理能力,并引入一种机制以确保训练过程的稳定性。

链接: https://arxiv.org/abs/2505.16368
作者: Huanyu Liu,Jia Li,Hao Zhu,Kechi Zhang,Yihong Dong,Ge Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs’ outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs’ reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
zh
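SAT 任务之所以适合充当 RL 训练信号,在于其输出可以规则化地可靠验证、难度可由变量数/子句数精确控制。下面的 Python 草图示意这两点(实例生成方式为简化假设,非论文的构造流程)。

```python
import random

def random_sat_instance(n_vars, n_clauses, k=3, seed=0):
    # 通过变量数与子句数控制难度(示意):子句/变量比越高通常越难
    rng = random.Random(seed)
    return [[rng.choice([1, -1]) * rng.randint(1, n_vars) for _ in range(k)]
            for _ in range(n_clauses)]

def verify_assignment(clauses, assignment):
    # 基于规则的自动验证:LLM 输出的赋值可被逐子句可靠检查
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in clauses)

clauses = random_sat_instance(n_vars=5, n_clauses=10)
assignment = {v: random.random() < 0.5 for v in range(1, 6)}
print(verify_assignment(clauses, assignment))
```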

[AI-61] A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules

【速读】:该论文旨在解决分子空间探索中生成新分子的难题,尤其是在化学有效性保障和计算效率方面的挑战。其解决方案的关键在于提出CoCoGraph,一种具有约束条件的协同图扩散模型,通过内置的约束机制和协同策略,在保证生成分子化学有效性的同时,显著减少了模型参数数量,并在标准基准测试中优于现有方法。

链接: https://arxiv.org/abs/2505.16365
作者: Manuel Ruiz-Botella,Marta Sales-Pardo,Roger Guimerà
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Quantitative Methods (q-bio.QM)
备注: 28 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model’s efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, as well as the potential biases and limitations of CoCoGraph.
zh

[AI-62] AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)预训练和后训练过程中优化器效率与性能之间的平衡问题。现有优化器如Adam在计算第二阶矩估计时存在较高的内存和计算开销,而SGD with momentum虽然高效但优化性能有限。解决方案的关键在于提出AdamS,其通过引入一种新的分母结构——即动量与当前梯度平方加权和的平方根,从而避免了对第二阶矩的估计,使得AdamS在保持与SGD with momentum相当的内存和计算开销的同时,实现了更优的优化性能。

链接: https://arxiv.org/abs/2505.16363
作者: Huishuai Zhang,Bohan Wang,Luoxin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed (L_0, L_1) smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
zh
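按摘要描述,AdamS 的分母是动量与当前梯度平方加权和的平方根,从而省去二阶矩估计。下面给出单步更新的 numpy 草图;其中加权系数沿用 beta2 风格的组合方式,属合理假设,具体以论文为准。

```python
import numpy as np

def adams_step(param, grad, m, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """AdamS 单步更新的示意实现(加权方式为按摘要的合理假设):
    分母为动量与当前梯度平方加权和的平方根,无需二阶矩估计。"""
    m = beta1 * m + (1 - beta1) * grad                   # 一阶动量
    denom = np.sqrt(beta2 * m**2 + (1 - beta2) * grad**2) + eps
    param = param - lr * (m / denom + weight_decay * param)
    return param, m

# 用法示意:在一个二次目标 w^2 上迭代
w, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    g = 2 * w                                            # d/dw (w^2)
    w, m = adams_step(w, g, m, lr=0.1)
print(w)   # 应收敛到 0 附近
```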

[AI-63] Neuromorphic-based metaheuristics: A new generation of low power low latency and small footprint optimization algorithms

【速读】:该论文试图解决传统冯·诺依曼架构在优化问题求解中的局限性,提出将神经形态计算(Neuromorphic Computing, NC)范式应用于优化算法,特别是元启发式算法的建模与实现,以实现低功耗、低延迟和小体积的优化计算。其解决方案的关键在于利用神经形态系统特有的类脑神经动力学特性,通过脉冲神经网络(Spiking Neural Networks, SNNs)结构来重构元启发式算法,从而突破传统计算架构的性能瓶颈,并为优化问题提供新的高效求解途径。

链接: https://arxiv.org/abs/2505.16362
作者: El-ghazali Talbi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neuromorphic computing (NC) introduces a novel algorithmic paradigm representing a major shift from traditional digital computing of Von Neumann architectures. NC emulates or simulates the neural dynamics of brains in the form of Spiking Neural Networks (SNNs). Much of the research in NC has concentrated on machine learning applications and neuroscience simulations. This paper investigates the modelling and implementation of optimization algorithms and particularly metaheuristics using the NC paradigm as an alternative to Von Neumann architectures, leading to breakthroughs in solving optimization problems. Neuromorphic-based metaheuristics (Nheuristics) are supposed to be characterized by low power, low latency and small footprint. Since NC systems are fundamentally different from conventional Von Neumann computers, several challenges are posed to the design and implementation of Nheuristics. A guideline based on a classification and critical analysis is conducted on the different families of metaheuristics and optimization problems they address. We also discuss future directions that need to be addressed to expand both the development and application of Nheuristics.
zh

[AI-64] EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在进行复杂推理时因冗余探索语义等价步骤而导致的大量token消耗问题。现有语义相似性方法在特定领域如数学推理中难以准确识别语义等价性。解决方案的关键在于提出EquivPruner,这是一种简单而有效的策略,能够在LLM推理搜索过程中识别并剪枝语义等价的操作,从而减少token消耗并提升搜索效率和推理准确性。

链接: https://arxiv.org/abs/2505.16312
作者: Jiawei Liu,Qisi Chen,Jianshu Zhang,Quan Liu,Defu Lian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) excel at complex reasoning through search algorithms, yet current strategies often suffer from massive token consumption due to redundant exploration of semantically equivalent steps. Existing semantic similarity methods struggle to accurately identify such equivalence in domain-specific contexts like mathematical reasoning. To address this, we propose EquivPruner, a simple yet effective approach that identifies and prunes semantically equivalent actions during LLM reasoning search. We also introduce MathEquiv, the first dataset we created for mathematical statement equivalence, which enables the training of a lightweight equivalence detector. Extensive experiments across various models and tasks demonstrate that EquivPruner significantly reduces token consumption, improving searching efficiency and often bolstering reasoning accuracy. For instance, when applied to Qwen2.5-Math-7B-Instruct on GSM8K, EquivPruner reduced token consumption by 48.1% while also improving accuracy. Our code is available at this https URL.
zh
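EquivPruner 的核心操作可以概括为:在扩展候选步骤之前先做等价去重。下面的 Python 草图用一个占位的等价检测器示意这一流程(论文中检测器由在 MathEquiv 上训练的轻量模型担任,此处以归一化字符串比较代替)。

```python
def is_equivalent(step_a: str, step_b: str) -> bool:
    # 占位的等价检测器:实际应替换为在 MathEquiv 上训练的轻量模型
    norm = lambda s: s.replace(" ", "").lower()
    return norm(step_a) == norm(step_b)

def prune_equivalent(candidates):
    """在搜索扩展前剪除语义等价的候选步骤,减少重复探索的 token 消耗。"""
    kept = []
    for c in candidates:
        if not any(is_equivalent(c, k) for k in kept):
            kept.append(c)
    return kept

steps = ["x = 2 + 3", "x=2+3", "x = 5"]
print(prune_equivalent(steps))   # ['x = 2 + 3', 'x = 5']
```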

[AI-65] Layer-wise Investigation of Large-Scale Self-Supervised Music Representation Models

【速读】:该论文旨在解决自监督学习(SSL)模型在音乐信息检索中的编码信息语义及其适用性研究不足的问题,通过分析MusicFM和MuQ两种模型,揭示其在不同下游任务中的优势、层间信息的专门化特性以及特定层选择对性能的影响。解决方案的关键在于系统性地评估SSL模型在多任务场景下的表现,并深入探讨其内部结构与功能特性,从而为后续任务提供更有效的应用指导。

链接: https://arxiv.org/abs/2505.16306
作者: Yizhi Zhou,Haina Zhu,Hangting Chen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, pre-trained models for music information retrieval based on self-supervised learning (SSL) are becoming popular, showing success in various downstream tasks. However, there is limited research on the specific meanings of the encoded information and their applicability. Exploring these aspects can help us better understand their capabilities and limitations, leading to more effective use in downstream tasks. In this study, we analyze the advanced music representation model MusicFM and the newly emerged SSL model MuQ. We focus on three main aspects: (i) validating the advantages of SSL models across multiple downstream tasks, (ii) exploring the specialization of layer-wise information for different tasks, and (iii) comparing performance differences when selecting specific layers. Through this analysis, we reveal insights into the structure and potential applications of SSL models in music information retrieval.
zh
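逐层信息分析的常见做法是对每一层的表示训练一个线性探针并比较下游性能。下面的 sklearn 草图示意该流程;逐层特征此处用随机数占位,实际应替换为 MusicFM / MuQ 各层的输出。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 示意:layer_feats[l] 为第 l 层对 N 个样本提取的表示(此处随机占位),
# labels 为下游任务标签(如流派二分类)
rng = np.random.default_rng(0)
n_layers, n_samples, dim = 12, 200, 64
layer_feats = [rng.normal(size=(n_samples, dim)) for _ in range(n_layers)]
labels = rng.integers(0, 2, size=n_samples)

# 对每一层训练一个线性探针,比较各层表示对该任务的可用性
for l, feats in enumerate(layer_feats):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          feats, labels, cv=3).mean()
    print(f"layer {l:2d}: probe acc = {acc:.3f}")
```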

[AI-66] Multimodal Generative AI for Story Point Estimation in Software Development

【速读】:该论文试图解决敏捷软件开发中故事点估算的准确性问题,传统单模态估算方法存在局限性。其解决方案的关键在于利用多模态生成式人工智能(Multimodal Generative AI),通过集成文本、图像和分类数据,结合BERT、CNN和XGBoost等先进模型,提升估算的精度与适应性。

链接: https://arxiv.org/abs/2505.16290
作者: Mohammad Rubyet Islam,Peter Sandborn
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This research explores the application of Multimodal Generative AI to enhance story point estimation in Agile software development. By integrating text, image, and categorical data using advanced models like BERT, CNN, and XGBoost, our approach surpasses the limitations of traditional single-modal estimation methods. The results demonstrate strong accuracy for simpler story points, while also highlighting challenges in more complex categories due to data imbalance. This study further explores the impact of categorical data, particularly severity, on the estimation process, emphasizing its influence on model performance. Our findings emphasize the transformative potential of multimodal data integration in refining AI-driven project management, paving the way for more precise, adaptable, and domain-specific AI capabilities. Additionally, this work outlines future directions for addressing data variability and enhancing the robustness of AI in Agile methodologies.
zh

[AI-67] No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery

【速读】:该论文试图解决深度学习模型在电子健康记录(Electronic Health Records, EHR)数据上训练后缺乏可解释性和交互性的问题,这两点是临床医生高度关注的特征。现有模型的“黑箱”特性使得临床医生难以理解其预测逻辑,从而限制了其在实际医疗决策中的应用。此外,缺乏交互机制也阻碍了临床医生将自身知识和经验融入决策过程。论文提出的解决方案是II-KEA,其关键在于引入一种基于知识增强的代理驱动因果发现框架,通过整合个性化知识数据库和代理型大语言模型(LLMs),提升模型的可解释性并通过定制化知识库和提示实现与临床医生的交互。

链接: https://arxiv.org/abs/2505.16288
作者: Xiaoxue Han,Pengfei Hu,Jun-En Ding,Chang Lu,Feng Liu,Yue Ning
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning models trained on extensive Electronic Health Records (EHR) data have achieved high accuracy in diagnosis prediction, offering the potential to assist clinicians in decision-making and treatment planning. However, these models lack two crucial features that clinicians highly value: interpretability and interactivity. The "black-box" nature of these models makes it difficult for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. Additionally, the absence of interactive mechanisms prevents clinicians from incorporating their own knowledge and experience into the decision-making process. To address these limitations, we propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework that integrates personalized knowledge databases and agentic LLMs. II-KEA enhances interpretability through explicit reasoning and causal analysis, while also improving interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts. II-KEA is evaluated on both MIMIC-III and MIMIC-IV, demonstrating superior performance along with enhanced interpretability and interactivity, as evidenced by its strong results from extensive case studies.
zh

[AI-68] Dialogue in Resonance: An Interactive Music Piece for Piano and Real-Time Automatic Transcription System

【速读】:该论文试图解决人机之间在音乐表演中如何实现既有结构性又具备动态互动性的对话问题,传统方法多聚焦于即兴交互,而本文提出了一种平衡的框架。解决方案的关键在于将实时自动音乐转录(real-time automatic music transcription)作为核心机制,使计算机能够实时解析并响应人类演奏者的行为,从而在保持作曲意图的同时,实现具有不可预测性的音乐对话。

链接: https://arxiv.org/abs/2505.16259
作者: Hayeon Bang,Taegyun Kwon,Juhan Nam
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents Dialogue in Resonance, an interactive music piece for a human pianist and a computer-controlled piano that integrates real-time automatic music transcription into a score-driven framework. Unlike previous approaches that primarily focus on improvisation-based interactions, our work establishes a balanced framework that combines composed structure with dynamic interaction. Through real-time automatic transcription as its core mechanism, the computer interprets and responds to the human performer’s input in real time, creating a musical dialogue that balances compositional intent with live interaction while incorporating elements of unpredictability. In this paper, we present the development process from composition to premiere performance, including technical implementation, rehearsal process, and performance considerations.
zh

[AI-69] Manipulating Elasto-Plastic Objects With 3D Occupancy and Learning-Based Predictive Control

【速读】:该论文旨在解决在操作弹性塑性物体时面临的严重自遮挡、表示困难和复杂动力学等问题。其解决方案的关键在于引入一种基于准静态假设的框架,利用3D占用(3D occupancy)来表征弹性塑性物体,并结合通过3D占用训练的动力学模型以及基于学习的预测控制算法。此外,还构建了新的数据采集平台以生成3D占用数据集,并设计了一个融合3D卷积神经网络和图神经网络的深度神经网络,用于预测复杂的形变,从而有效提升操作精度与效率。

链接: https://arxiv.org/abs/2505.16249
作者: Zhen Zhang,Xiangyu Chu,Yunxi Tang,Lulu Zhao,Jing Huang,Zhongliang Jiang,K. W. Samuel Au
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 Pages, 5 figures, accepted for publication in IEEE Robotics and Automation Letters (RA-L)

点击查看摘要

Abstract:Manipulating elasto-plastic objects remains a significant challenge due to severe self-occlusion, difficulties of representation, and complicated dynamics. This work proposes a novel framework for elasto-plastic object manipulation with a quasi-static assumption for motions, leveraging 3D occupancy to represent such objects, a learned dynamics model trained with 3D occupancy, and a learning-based predictive control algorithm to address these challenges effectively. We build a novel data collection platform to collect full spatial information and propose a pipeline for generating a 3D occupancy dataset. To infer the 3D occupancy during manipulation, an occupancy prediction network is trained with multiple RGB images supervised by the generated dataset. We design a deep neural network empowered by a 3D convolution neural network (CNN) and a graph neural network (GNN) to predict the complex deformation with the inferred 3D occupancy results. A learning-based predictive control algorithm is introduced to plan the robot actions, incorporating a novel shape-based action initialization module specifically designed to improve the planner efficiency. The proposed framework in this paper can successfully shape the elasto-plastic objects into a given goal shape and has been verified in various experiments both in simulation and the real world.
zh

[AI-70] MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning

【速读】:该论文试图解决在In-Context Learning (ICL)中,由于需要大量标注数据而导致的高成本问题。解决方案的关键在于提出一种基于影响的多示例ICL框架——Many-Shot Adaptive Pseudo-LabEling (MAPLE),该框架通过查询大语言模型(LLMs)对具有影响力的未标注样本进行伪标签生成,并将这些伪标签样本自适应地调整为每个测试查询的输入,从而在不显著增加标注成本的情况下提升多示例ICL的性能。

链接: https://arxiv.org/abs/2505.16225
作者: Zihan Chen,Song Wang,Zhen Tan,Jundong Li,Cong Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL, without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data.
zh
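MAPLE 的主干流程可以概括为:选出有影响力的未标注样本 → 查询 LLM 得到伪标签 → 组装 many-shot 提示。下面的 Python 草图示意这一流程;影响力打分(此处按文本长度占位)与 call_llm 接口均为示例假设,非论文实现。

```python
def call_llm(prompt: str) -> str:
    # 占位:实际替换为 LLM 接口调用
    return "positive"

def pseudo_label(unlabeled, top_k=2):
    # 示意:论文按 influence 挑选有影响力的样本,这里用文本长度占位
    impactful = sorted(unlabeled, key=len, reverse=True)[:top_k]
    return [(x, call_llm(f"Label the following input:\n{x}\nLabel:"))
            for x in impactful]

def build_many_shot_prompt(demos, query):
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{shots}\n\nInput: {query}\nLabel:"

demos = pseudo_label(["great movie", "terrible", "a so-so film experience"])
print(build_many_shot_prompt(demos, "not bad at all"))
```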

[AI-71] MADCluster: Model-agnostic Anomaly Detection with Self-supervised Clustering Network

【速读】:该论文试图解决深度学习方法在异常检测中普遍存在的“超球体坍缩”(hypersphere collapse)问题,即正常模式数据在特征空间中聚集过于紧密,导致模型难以有效区分正常与异常样本。解决方案的关键在于提出MADCluster框架,其核心思想是将正常模式数据聚类为一个“单一簇”,同时学习簇中心并映射接近该中心的数据。此外,通过引入一种新的“单向自适应损失”(One-directed Adaptive loss),提升了模型的表达能力和单簇聚类的有效性,该损失函数的优化过程已通过数学证明。

链接: https://arxiv.org/abs/2505.16223
作者: Sangyong Lee,Subo Hwang,Dohoon Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 9 figures

点击查看摘要

Abstract:In this paper, we propose MADCluster, a novel model-agnostic anomaly detection framework utilizing self-supervised clustering. MADCluster is applicable to various deep learning architectures and addresses the ‘hypersphere collapse’ problem inherent in existing deep learning-based anomaly detection methods. The core idea is to cluster normal pattern data into a ‘single cluster’ while simultaneously learning the cluster center and mapping data close to this center. Also, to improve expressiveness and enable effective single clustering, we propose a new ‘One-directed Adaptive loss’. The optimization of this loss is mathematically proven. MADCluster consists of three main components: Base Embedder capturing high-dimensional temporal dynamics, Cluster Distance Mapping, and Sequence-wise Clustering for continuous center updates. Its model-agnostic characteristics are achieved by applying various architectures to the Base Embedder. Experiments on four time series benchmark datasets demonstrate that applying MADCluster improves the overall performance of comparative models. In conclusion, the compatibility of MADCluster shows potential for enhancing model performance across various architectures.
zh
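“单一簇”思想与 Deep SVDD 相近:学习一个簇中心并把正常样本映射到其附近,距中心的距离即异常分数。下面的 numpy 草图按此思路示意(论文的 One-directed Adaptive loss 与 Sequence-wise 中心更新在此基础上另有设计)。

```python
import numpy as np

def single_cluster_loss(embeddings, center):
    """把正常样本拉向单一簇中心的基本目标(Deep SVDD 式示意)。"""
    return np.mean(np.sum((embeddings - center) ** 2, axis=1))

def update_center(embeddings, center, momentum=0.9):
    # 中心的连续更新(示意):滑动平均
    return momentum * center + (1 - momentum) * embeddings.mean(axis=0)

def anomaly_score(embedding, center):
    return float(np.sum((embedding - center) ** 2))   # 距中心越远越异常

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(100, 8))
center = normal.mean(axis=0)
print(anomaly_score(rng.normal(0, 1, 8), center),
      anomaly_score(rng.normal(5, 1, 8), center))     # 后者明显更大
```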

[AI-72] LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead

【速读】:该论文旨在解决在众多大语言模型(Large Language Models, LLMs)中选择最适合特定任务的模型所面临的挑战,这些模型在成本、性能和计算需求方面存在显著差异。论文提出的解决方案是LightRouter框架,其关键在于通过自适应选择机制识别仅需少量引导标记(boot tokens)的模型以降低费用,并采用有效的集成策略融合其输出,从而在保持或提升任务性能的同时实现成本效率的最大化。

链接: https://arxiv.org/abs/2505.16221
作者: Yifan Zhang,Xinkui Zhao,Zuxin Wang,Guanjie Cheng,Yueshen Xu,Shuiguang Deng,Jianwei Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models has unlocked remarkable capabilities across a diverse array of natural language processing tasks. However, the considerable differences among available LLMs-in terms of cost, performance, and computational demands-pose significant challenges for users aiming to identify the most suitable model for specific tasks. In this work, we present LightRouter, a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool, with the objective of jointly optimizing both task performance and cost efficiency. LightRouter leverages an adaptive selection mechanism to identify models that require only a minimal number of boot tokens, thereby reducing costs, and further employs an effective integration strategy to combine their outputs. Extensive experiments across multiple benchmarks demonstrate that LightRouter matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy. Compared with leading high-performing models, LightRouter achieves comparable performance while reducing inference costs by up to 27%. Importantly, our framework operates without any prior knowledge of individual models and relies exclusively on inexpensive, lightweight models. This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.
zh

[AI-73] Velocity Completion Task and Method for Event-based Player Positional Data in Soccer

【速读】:该论文试图解决在团队运动中,基于事件驱动的位置数据缺乏连续时间信息的问题,这限制了对个体代理行为和团队策略的深入动态分析。解决方案的关键在于提出一种新方法,仅利用事件驱动的位置数据同时补全所有代理的速度信息,并基于此验证现有团队运动分析与评估方法的适用性。该方法通过神经网络模型考虑了球员之间或球员与球之间的潜在时间依赖性和图结构,从而提高了速度补全的准确性。

链接: https://arxiv.org/abs/2505.16199
作者: Rikuhei Umemoto,Keisuke Fujii
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures

点击查看摘要

Abstract:In many real-world complex systems, the behavior can be observed as a collection of discrete events generated by multiple interacting agents. Analyzing the dynamics of these multi-agent systems, especially team sports, often relies on understanding the movement and interactions of individual agents. However, while providing valuable snapshots, event-based positional data typically lacks the continuous temporal information needed to directly calculate crucial properties such as velocity. This absence severely limits the depth of dynamic analysis, preventing a comprehensive understanding of individual agent behaviors and emergent team strategies. To address this challenge, we propose a new method to simultaneously complete the velocity of all agents using only the event-based positional data from team sports. Based on this completed velocity information, we investigate the applicability of existing team sports analysis and evaluation methods. Experiments using soccer event data demonstrate that neural network-based approaches outperformed rule-based methods regarding velocity completion error, considering the underlying temporal dependencies and graph structure of player-to-player or player-to-ball interaction. Moreover, the space evaluation results obtained using the completed velocity are closer to those derived from complete tracking data, highlighting our method’s potential for enhanced team sports system analysis.
zh
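理解该任务的一个直观基线是对事件位置做有限差分来近似速度;由于事件间隔不均匀,该基线误差较大,这正是论文用神经网络建模时间依赖与球员-球交互图的动机。下面给出该基线的 numpy 草图(数据为假设示例)。

```python
import numpy as np

def finite_difference_velocity(positions, timestamps):
    """规则基线:用相邻事件位置的差分近似速度。
    事件间隔不均匀时误差较大,仅作对照参考。"""
    pos = np.asarray(positions, dtype=float)      # 形如 (T, 2) 的 x, y 坐标
    t = np.asarray(timestamps, dtype=float)
    dt = np.diff(t)[:, None]
    v = np.diff(pos, axis=0) / np.maximum(dt, 1e-6)
    return np.vstack([v, v[-1]])                  # 末帧速度沿用前一帧

positions = [(0, 0), (2, 1), (5, 1), (6, 3)]
timestamps = [0.0, 1.0, 3.0, 3.5]
print(finite_difference_velocity(positions, timestamps))
```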

[AI-74] SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

【速读】:该论文旨在解决ControlNet-based与from-scratch foley合成模型之间的性能差距问题,具体是提升基于ControlNet的音频生成模型在视频同步音效合成任务中的表现。其解决方案的关键在于提出SpecMaskFoley方法,通过将预训练的SpecMaskGIT模型与ControlNet结合,并引入一种频率感知的时间特征对齐器,以解决视频时间特征与音频时间-频率特性之间的不匹配问题,从而避免了传统复杂条件机制的依赖,显著提升了模型性能。

链接: https://arxiv.org/abs/2505.16195
作者: Zhi Zhong,Akira Takahashi,Shuyang Cui,Keisuke Toyama,Shusuke Takahashi,Yuki Mitsufuji
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: 4 pages, 2 figures, 2 tables. Demo page: this https URL

点击查看摘要

Abstract:Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for complicated conditioning mechanisms widely used in prior arts. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley could even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: this https URL
zh

[AI-75] EasyInsert: A Data-Efficient and Generalizable Insertion Policy

【速读】:该论文旨在解决在杂乱环境中进行插接任务的挑战性问题,现有方法在泛化能力、复杂场景适应性和对先验信息(如CAD模型或数字孪生)的依赖方面存在明显不足。解决方案的关键在于提出EasyInsert框架,其核心思想是利用插头与插座之间的相对位姿(delta pose)作为成功插接的充分条件,并通过高效且自动化的现实世界数据收集方式训练一个可泛化的相对位姿预测模型,从而实现无需人工干预的高成功率插接操作。

链接: https://arxiv.org/abs/2505.16187
作者: Guanghe Li,Junming Zhao,Shengjie Wang,Yang Gao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Insertion task is highly challenging that requires robots to operate with exceptional precision in cluttered environments. Existing methods often have poor generalization capabilities. They typically function in restricted and structured environments, and frequently fail when the plug and socket are far apart, when the scene is densely cluttered, or when handling novel objects. They also rely on strong assumptions such as access to CAD models or a digital twin in simulation. To address this, we propose EasyInsert, a framework which leverages the human intuition that relative pose (delta pose) between plug and socket is sufficient for successful insertion, and employs efficient and automated real-world data collection with minimal human labor to train a generalizable model for relative pose prediction. During execution, EasyInsert follows a coarse-to-fine execution procedure based on predicted delta pose, and successfully performs various insertion tasks. EasyInsert demonstrates strong zero-shot generalization capability for unseen objects in cluttered environments, handling cases with significant initial pose deviations while maintaining high sample efficiency and requiring little human effort. In real-world experiments, with just 5 hours of training data, EasyInsert achieves over 90% success in zero-shot insertion for 13 out of 15 unseen novel objects, including challenging objects like Type-C cables, HDMI cables, and Ethernet cables. Furthermore, with only one human demonstration and 4 minutes of automatically collected data for fine-tuning, it reaches over 90% success rate for all 15 objects.
zh

[AI-76] Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value

【速读】:该论文试图解决大规模模型背景下高效数据估值问题,即如何量化个体数据提供者的贡献。传统方法如基于博弈论的Shapley值和基于影响函数的技术面临计算成本高或需访问完整数据和模型训练细节的问题,难以实现部分数据估值。解决方案的关键在于提出一种名为Unlearning Shapley的新框架,该框架利用机器遗忘技术,通过从预训练模型中移除目标数据并测量在可访问测试集上的性能变化,结合蒙特卡洛采样计算Shapley值,从而避免重新训练并消除对完整数据的依赖。

链接: https://arxiv.org/abs/2505.16147
作者: Le Ma,Shirao Yang,Zihao Wang,Yinggui Wang,Lei Wang,Tao Wei,Kejun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of large models has intensified the need for efficient data valuation methods to quantify the contribution of individual data providers. Traditional approaches, such as game-theory-based Shapley value and influence-function-based techniques, face prohibitive computational costs or require access to full data and model training details, making it hard for them to achieve partial data valuation. To address this, we propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently. By unlearning target data from a pretrained model and measuring performance shifts on a reachable test set, our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data. Crucially, Unlearning Shapley supports both full and partial data valuation, making it scalable for large models (e.g., LLMs) and practical for data markets. Experiments on benchmark datasets and large-scale text corpora demonstrate that our approach matches the accuracy of state-of-the-art methods while reducing computational overhead by orders of magnitude. Further analysis confirms a strong correlation between estimated values and the true impact of data subsets, validating its reliability in real-world scenarios. This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.
zh
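Shapley 值的蒙特卡洛估计与具体效用函数无关:对随机排列累加每个参与者的边际贡献即可。下面的 Python 草图以一个可加的占位效用函数示意(论文中效用由“遗忘目标数据后在测试集上的性能变化”给出)。

```python
import random

def utility(coalition):
    # 占位效用:论文中为遗忘后模型在测试集上的性能;此处用可加函数示意
    return sum(coalition)

def monte_carlo_shapley(players, n_perm=2000, seed=0):
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_perm):
        perm = players[:]
        rng.shuffle(perm)
        coalition, prev = [], utility([])
        for p in perm:
            coalition.append(p)
            cur = utility(coalition)
            phi[p] += cur - prev      # p 在该排列下的边际贡献
            prev = cur
    return {p: v / n_perm for p, v in phi.items()}

print(monte_carlo_shapley([1, 2, 3]))   # 可加效用下应接近 {1: 1, 2: 2, 3: 3}
```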

[AI-77] Sudoku-Bench: Evaluating creative reasoning with Sudoku variants

【速读】:该论文试图解决现有大型语言模型(Large Language Models, LLMs)推理基准无法有效捕捉真实创造力的问题,这些问题通常倾向于奖励对已有模式的记忆而非创造性、多步骤的逻辑推理。解决方案的关键在于提出Sudoku-Bench,一个精心设计的基准测试集,包含具有挑战性和非传统规则的数独变体,旨在评估模型的创造性多步骤逻辑推理能力。这些数独变体通过引入独特或相互作用的约束条件,使记忆策略失效,并要求求解器发现新的逻辑突破点,同时保持统一且简洁的结构以实现清晰一致的评估。

链接: https://arxiv.org/abs/2505.16135
作者: Jeffrey Seely,Yuki Imajuku,Tianyu Zhao,Edoardo Cetin,Llion Jones
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles – making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
zh
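作为参照,经典数独的规则验证只需检查行、列、宫三类约束;基准中的变体在此之上叠加各自的特殊规则,使记忆化策略失效。下面是一个经典规则检查器的 Python 草图(0 表示空格)。

```python
def valid_sudoku(grid):
    """经典 9x9 数独的规则检查:行、列、宫内 1-9 不重复(0 表示空格)。
    基准中的变体会在此之上叠加各自的特殊约束。"""
    def ok(cells):
        vals = [v for v in cells if v != 0]
        return len(vals) == len(set(vals))
    rows = [list(r) for r in grid]
    cols = [list(c) for c in zip(*grid)]
    boxes = [[grid[r + i][c + j] for i in range(3) for j in range(3)]
             for r in (0, 3, 6) for c in (0, 3, 6)]
    return all(ok(u) for u in rows + cols + boxes)

empty = [[0] * 9 for _ in range(9)]
print(valid_sudoku(empty))  # True:空盘不违反任何约束
```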

[AI-78] Scalable Graph Generative Modeling via Substructure Sequences

【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在可扩展性方面的局限性,包括表达能力受限、过平滑、过压缩以及难以建模长程依赖等问题。其解决方案的关键在于提出一种超越传统消息传递机制的生成式图模式机器(Generative Graph Pattern Machine, G²PM),该框架将图实例表示为子结构序列,并通过生成式预训练学习可迁移的表示,从而实现更强的可扩展性和泛化能力。

链接: https://arxiv.org/abs/2505.16130
作者: Zehong Wang,Zheyuan Zhang,Tianyi Ma,Chuxu Zhang,Yanfang Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Graph neural network (GNN) research has been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite its success, message-passing suffers from fundamental limitations – including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G²PM), a generative Transformer pre-training framework for graphs. G²PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G²PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks – including node classification, graph classification, and transfer learning – G²PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at this https URL.
zh

[AI-79] LLM-Powered AI Agent Systems and Their Applications in Industry

【速读】:该论文试图解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统在实际应用中面临的一系列挑战,包括高推理延迟、输出不确定性、缺乏评估指标以及安全漏洞等问题。其解决方案的关键在于针对上述问题提出潜在的缓解策略,以提升LLM-powered agents的可靠性、效率和安全性,从而推动其在多个领域中的更广泛应用。

链接: https://arxiv.org/abs/2505.16120
作者: Guannan Liang,Qianqian Tong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the author’s accepted version of the paper accepted to appear at IEEE AIIoT 2025. The final version will be available via IEEE Xplore. ©2025 IEEE. Personal use of this material is permitted

点击查看摘要

Abstract:The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.
zh

[AI-80] Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language

【Quick Read】: This paper tackles the long-standing AI challenge of solving puzzles posed in natural language, particularly complex puzzles that demand precise reasoning and exhaustive search, where large language models (LLMs) still fall short despite their strong performance on many tasks. The key to the solution is Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming: the LLM translates puzzle rules and states into answer set programs (ASPs), whose solutions are then inferred accurately and efficiently by an ASP interpreter, uniting the natural-language understanding of LLMs with the precise reasoning of logic programs.

Link: https://arxiv.org/abs/2505.16114
Authors: Naiqi Li, Peiyuan Liu, Zheng Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Solving puzzles in natural language poses a long-standing challenge in AI. While large language models (LLMs) have recently shown impressive capabilities in a variety of tasks, they continue to struggle with complex puzzles that demand precise reasoning and exhaustive search. In this paper, we propose Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming to address this problem. Our method leverages LLMs to translate puzzle rules and states into answer set programs (ASPs), the solution of which are then accurately and efficiently inferred by an ASP interpreter. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning capabilities of logic programs. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks. Our code and data are available at: this https URL.
zh
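
To make the LLM-to-ASP pipeline concrete, here is a minimal sketch in Python. It assumes the `clingo` ASP solver and its Python package are available, and it stubs out the LLM step by hard-coding the kind of ASP program an LLM might emit for a tiny 3x3 Latin-square puzzle; the paper's own prompts and encodings are not reproduced here.

```python
# LLM step stubbed out: this is the ASP program an LLM might emit for a
# 3x3 Latin square with one clue translated from the puzzle text.
import clingo

ASP_PROGRAM = """
row(1..3). col(1..3). val(1..3).
1 { cell(R,C,V) : val(V) } 1 :- row(R), col(C).   % exactly one value per cell
:- cell(R,C1,V), cell(R,C2,V), C1 < C2.           % no repeats within a row
:- cell(R1,C,V), cell(R2,C,V), R1 < R2.           % no repeats within a column
cell(1,1,1).                                      % clue: top-left cell is 1
"""

def solve(program: str) -> list[str]:
    """Run the ASP interpreter and return the atoms of the first answer set."""
    ctl = clingo.Control(["1"])          # ask for one model
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    answer: list[str] = []
    def on_model(model: clingo.Model) -> None:
        answer.extend(str(atom) for atom in model.symbols(shown=True))
    ctl.solve(on_model=on_model)
    return answer

print(sorted(solve(ASP_PROGRAM)))
```

The division of labor mirrors the paper's hybrid design: natural-language understanding produces the program, while the exhaustive, sound search is delegated entirely to the solver.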

[AI-81] Towards Trustworthy Keylogger Detection: A Comprehensive Analysis of Ensemble Techniques and Feature Selections through Explainable AI

【Quick Read】: This paper targets keylogger detection by analyzing system behavior and network-traffic patterns to identify anomalous activity. The key to the solution is classifying with traditional machine learning models (SVC, Random Forest, Decision Tree, XGBoost, AdaBoost, Logistic Regression, and Naive Bayes) as well as advanced ensemble methods (stacking, blending, and voting), combined with feature-selection methods (Information Gain, Lasso L1, and Fisher Score) to improve predictive performance and reduce computational complexity. In addition, the study applies explainable AI (XAI) techniques such as SHAP (global) and LIME (local) to provide detailed explanations of model decisions.

Link: https://arxiv.org/abs/2505.16103
Authors: Monirul Islam Mahmud
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Keylogger detection involves monitoring for unusual system behaviors such as delays between typing and character display, analyzing network traffic patterns for data exfiltration. In this study, we provide a comprehensive analysis for keylogger detection with traditional machine learning models - SVC, Random Forest, Decision Tree, XGBoost, AdaBoost, Logistic Regression and Naive Bayes and advanced ensemble methods including Stacking, Blending and Voting. Moreover, feature selection approaches such as Information gain, Lasso L1 and Fisher Score are thoroughly assessed to improve predictive performance and lower computational complexity. The Keylogger Detection dataset from publicly available Kaggle website is used in this project. In addition to accuracy-based classification, this study implements the approach for model interpretation using Explainable AI (XAI) techniques namely SHAP (Global) and LIME (Local) to deliver finer explanations for how much each feature contributes in assisting or hindering the detection process. To evaluate the models result, we have used AUC score, sensitivity, Specificity, Accuracy and F1 score. The best performance was achieved by AdaBoost with 99.76% accuracy, F1 score of 0.99, 100% precision, 98.6% recall, 1.0 specificity and 0.99 of AUC that is near-perfect classification with Fisher Score.
zh
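
As a rough illustration of this kind of pipeline, the sketch below pairs a feature-selection step with AdaBoost in scikit-learn. It uses synthetic data as a stand-in for the Kaggle keylogger dataset, and mutual information (information gain) stands in for the selection step; Fisher Score would need an extra dependency such as `skfeature`, and the hyperparameters here are illustrative, not the paper's.

```python
# Feature selection + AdaBoost, evaluated with AUC and F1 on a held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the keylogger detection dataset.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=12,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=15)),  # information gain
    ("ada", AdaBoostClassifier(n_estimators=200, random_state=0)),
])
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
preds = clf.predict(X_te)
print("AUC:", round(roc_auc_score(y_te, proba), 4))
print("F1 :", round(f1_score(y_te, preds), 4))
```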

[AI-82] TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials

【Quick Read】: This paper addresses the lack of a high-quality data foundation for vertical-domain artificial intelligence (AI) in clinical trials, particularly for tasks such as trial planning, design, and summarization. The key to the solution is TrialPanorama, a large-scale, structured database that aggregates 1,657,476 clinical trial records from 15 global sources and links them to biomedical ontologies such as DrugBank and MedDRA, providing a unified and extensible data resource for clinical-trial tasks. Building on TrialPanorama, the study also designs a benchmark of eight tasks spanning systematic review and trial design to evaluate AI models on high-stakes clinical-trial workflows.

Link: https://arxiv.org/abs/2505.16097
Authors: Zifeng Wang, Qiao Jin, Jiacheng Lin, Junyi Gao, Jathurshan Pradeepkumar, Pengcheng Jiang, Benjamin Danek, Zhiyong Lu, Jimeng Sun
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured and ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). The experiments using five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release TrialPanorama database and the benchmark to facilitate further research on AI for clinical trials.
zh

[AI-83] SynEVO: A neuro-inspired spatiotemporal evolutional framework for cross-domain adaptation

【Quick Read】: This paper addresses the limited cross-domain knowledge transfer of spatiotemporal systems: current spatiotemporal learners are typically trained independently on specific source data, so transferability across sources is poor and even related tasks require redesign and retraining. The key to the solution is enabling cross-domain knowledge sharing and aggregation through collective intelligence and model evolution. Inspired by neuroscience theories, the paper proposes the Synaptic EVOlutional spatiotemporal network (SynEVO), whose core is breaking model independence: it re-orders sample groups, devises complementary learners, and employs an adaptive dynamic coupler to achieve model evolution and to disentangle task-wise commonality from personality.

Link: https://arxiv.org/abs/2505.16080
Authors: Jiayue Liu, Zhongchao Yi, Zhengyang Zhou, Qihe Huang, Kuo Yang, Xu Wang, Yang Wang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures

Click to view abstract

Abstract:Discovering regularities from spatiotemporal systems can benefit various scientific and social planning. Current spatiotemporal learners usually train an independent model from a specific source data that leads to limited transferability among sources, where even correlated tasks require new design and training. The key towards increasing cross-domain knowledge is to enable collective intelligence and model evolution. In this paper, inspired by neuroscience theories, we theoretically derive the increased information boundary via learning cross-domain collective intelligence and propose a Synaptic EVOlutional spatiotemporal network, SynEVO, where SynEVO breaks the model independence and enables cross-domain knowledge to be shared and aggregated. Specifically, we first re-order the sample groups to imitate the human curriculum learning, and devise two complementary learners, elastic common container and task-independent extractor to allow model growth and task-wise commonality and personality disentanglement. Then an adaptive dynamic coupler with a new difference metric determines whether the new sample group should be incorporated into common container to achieve model evolution under various domains. Experiments show that SynEVO improves the generalization capacity by at most 42% under cross-domain scenarios and SynEVO provides a paradigm of NeuroAI for knowledge transfer and adaptation.
zh

[AI-84] Bidirectional Variational Autoencoders

【Quick Read】: This paper addresses the large parameter count and limited efficiency of the conventional encoder-decoder structure in variational autoencoders (VAEs). The key to the solution is a novel bidirectional variational autoencoder (BVAE) architecture that uses a single neural network for both encoding and decoding: the same synaptic web encodes in the forward direction and decodes in the backward direction, cutting the parameter count by almost 50% while still slightly outperforming unidirectional VAEs.

Link: https://arxiv.org/abs/2505.16074
Authors: Bart Kosko, Olaoluwa Adigun
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages, 6 figures

Click to view abstract

Abstract:We present the new bidirectional variational autoencoder (BVAE) network architecture. The BVAE uses a single neural network both to encode and decode instead of an encoder-decoder network pair. The network encodes in the forward direction and decodes in the backward direction through the same synaptic web. Simulations compared BVAEs and ordinary VAEs on the four image tasks of image reconstruction, classification, interpolation, and generation. The image datasets included MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and CelebA-64 face images. The bidirectional structure of BVAEs cut the parameter count by almost 50% and still slightly outperformed the unidirectional VAEs.
zh
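
The sketch below shows the bidirectional idea in PyTorch under a simplifying assumption (the paper's exact layer structure is not specified in the abstract): one weight matrix W encodes in the forward direction, and decoding runs backward through its transpose, so encoder and decoder share a single "synaptic web" and the parameter count is roughly halved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBVAE(nn.Module):
    """One weight matrix W encodes forward; its transpose decodes backward."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.z_dim = z_dim
        self.W = nn.Parameter(torch.randn(2 * z_dim, x_dim) * 0.01)  # shared web
        self.b_enc = nn.Parameter(torch.zeros(2 * z_dim))
        self.b_dec = nn.Parameter(torch.zeros(x_dim))

    def encode(self, x):
        mu, logvar = F.linear(x, self.W, self.b_enc).chunk(2, dim=-1)
        return mu, logvar

    def decode(self, z):
        W_mu = self.W[: self.z_dim]  # reuse the rows that produced mu going in
        return torch.sigmoid(F.linear(z, W_mu.t(), self.b_dec))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.decode(z), mu, logvar

model = TinyBVAE()
recon, mu, logvar = model(torch.rand(8, 784))
print(recon.shape)  # torch.Size([8, 784])
```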

[AI-85] How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

【Quick Read】: This paper investigates how memory-management choices in large language model (LLM)-based agents affect long-term performance. It analyzes how the two fundamental memory operations, addition and deletion, shape agent behavior, especially under challenging conditions such as task-distribution shifts and constrained memory resources. The key to the proposed solution is combining selective addition and deletion strategies to mitigate error propagation and misaligned experience replay, yielding an average absolute performance gain of 10% over naive memory growth in the experiments.

Link: https://arxiv.org/abs/2505.16067
Authors: Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact the LLM agents’ behavior, especially their long-term performance. Specifically, we focus on two fundamental memory operations that are widely used by many agent frameworks: addition, which incorporates new experiences into the memory base, and deletion, which selectively removes past experiences. We systematically study their impact on the agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences negatively influence current tasks. Through controlled experiments, we show that combining selective addition and deletion strategies can help mitigate these negative effects, yielding an average absolute performance gain of 10% compared to naive memory growth. Furthermore, we highlight how memory management choices affect agents’ behavior under challenging conditions such as task distribution shifts and constrained memory resources. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance. We also release our code to facilitate further study.
zh
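
A minimal sketch of selective addition and deletion is shown below, under simplifying assumptions: embeddings are plain vectors and "quality" is a scalar success score attached to each record. The paper's concrete criteria are richer than this; only the add/retrieve/prune pattern is the point.

```python
import numpy as np

class MemoryBase:
    def __init__(self, add_threshold=0.9, min_quality=0.5):
        self.records = []                 # (embedding, experience, quality)
        self.add_threshold = add_threshold
        self.min_quality = min_quality

    def _sim(self, a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def retrieve(self, query_emb):
        """Return the most similar stored record (experience-following)."""
        if not self.records:
            return None
        return max(self.records, key=lambda r: self._sim(r[0], query_emb))

    def add(self, emb, experience, quality):
        """Selective addition: skip near-duplicates of existing records."""
        nearest = self.retrieve(emb)
        if nearest is not None and self._sim(nearest[0], emb) > self.add_threshold:
            return False
        self.records.append((emb, experience, quality))
        return True

    def prune(self):
        """Selective deletion: drop low-quality records to curb error propagation."""
        self.records = [r for r in self.records if r[2] >= self.min_quality]

mem = MemoryBase()
mem.add(np.random.rand(16), "plan A worked", quality=0.8)
mem.prune()
```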

[AI-86] Mesh-free sparse identification of nonlinear dynamics

【Quick Read】: This paper addresses the problem of identifying the governing equations of a dynamical system from arbitrary sensor placements and non-uniform temporal sampling, whereas traditional methods typically require high-quality spatio-temporal data on structured grids. The key to the solution is mesh-free SINDy, a novel algorithm that leverages neural-network approximation and automatic differentiation; it remains computationally efficient under high noise and limited data, and its training procedure is straightforward and nearly free of hyperparameter tuning.

Link: https://arxiv.org/abs/2505.16058
Authors: Mars Liyao Gao, J. Nathan Kutz, Bernat Font
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 17 pages, 13 figures, 14 tables

Click to view abstract

Abstract:Identifying the governing equations of a dynamical system is one of the most important tasks for scientific modeling. However, this procedure often requires high-quality spatio-temporal data uniformly sampled on structured grids. In this paper, we propose mesh-free SINDy, a novel algorithm which leverages the power of neural network approximation as well as auto-differentiation to identify governing equations from arbitrary sensor placements and non-uniform temporal data sampling. We show that mesh-free SINDy is robust to high noise levels and limited data while remaining computationally efficient. In our implementation, the training procedure is straight-forward and nearly free of hyperparameter tuning, making mesh-free SINDy widely applicable to many scientific and engineering problems. In the experiments, we demonstrate its effectiveness on a series of PDEs including the Burgers’ equation, the heat equation, the Korteweg-De Vries equation and the 2D advection-diffusion equation. We conduct detailed numerical experiments on all datasets, varying the noise levels and number of samples, and we also compare our approach to previous state-of-the-art methods. It is noteworthy that, even in high-noise and low-data scenarios, mesh-free SINDy demonstrates robust PDE discovery, achieving successful identification with up to 75% noise for the Burgers’ equation using 5,000 samples and with as few as 100 samples and 1% noise. All of this is achieved within a training time of under one minute.
zh
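
A highly simplified sketch of the mesh-free SINDy idea follows: a small MLP fits a field u(x, t) at scattered sample points, autograd supplies derivatives at those points, and a thresholded least-squares pass (STLSQ) picks sparse terms. The toy field, candidate library, and training budget are illustrative assumptions, not the paper's setup.

```python
import torch

torch.manual_seed(0)
xt = torch.rand(2000, 2, requires_grad=True)       # scattered (x, t) samples
u_data = (torch.sin(torch.pi * xt[:, :1]) * torch.exp(-xt[:, 1:])).detach()

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):                              # fit the mesh-free surrogate
    opt.zero_grad()
    loss = torch.mean((net(xt) - u_data) ** 2)
    loss.backward()
    opt.step()

u = net(xt)                                        # derivatives via autograd
grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
u_x, u_t = grads[:, :1], grads[:, 1:]
u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, :1]

# Candidate library Theta and sequentially thresholded least squares (STLSQ).
Theta = torch.cat([u, u * u_x, u_xx], dim=1).detach()
target = u_t.detach()
xi = torch.linalg.lstsq(Theta, target).solution
for _ in range(5):
    small = xi.abs() < 0.05
    xi[small] = 0.0
    keep = (~small).squeeze(-1)
    if keep.any():
        xi[keep] = torch.linalg.lstsq(Theta[:, keep], target).solution
print("coefficients for [u, u*u_x, u_xx]:", xi.squeeze(-1).tolist())
```

Because derivatives come from the network rather than finite differences, nothing in this loop requires the samples to sit on a grid, which is the point of the mesh-free formulation.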

[AI-87] Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals

【Quick Read】: This paper examines the challenges users face in identifying AI-generated (AIG) content, especially since indicators that rely solely on visual cues may not serve users with different sensory abilities. The key to the solution is semi-structured interviews (N=28) studying how sighted and blind or low-vision (BLV) users interact with self-disclosed AI indicators, revealing the distinct strengths and weaknesses of content-based indicators (e.g., titles, descriptions) and menu-aided indicators (e.g., AI labels), and offering design recommendations for improving the accessibility, consistency, and informational clarity of AIG indicators.

Link: https://arxiv.org/abs/2505.16057
Authors: Ayae Ide, Tory Park, Jaron Mink, Tanusree Sharma
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:AI-Generated (AIG) content has become increasingly widespread by recent advances in generative models and the easy-to-use tools that have significantly lowered the technical barriers for producing highly realistic audio, images, and videos through simple natural language prompts. In response, platforms are adopting provable provenance with platforms recommending AIG to be self-disclosed and signaled to users. However, these indicators may be often missed, especially when they rely solely on visual cues and make them ineffective to users with different sensory abilities. To address the gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 BLV participants to examine their interaction with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices, highlighting different strengths and weaknesses of content-based (e.g., title, description) and menu-aided (e.g., AI labels) indicators. While sighted participants leveraged visual and audio cues, BLV participants primarily relied on audio and existing assistive tools, limiting their ability to identify AIG. Across both groups, they frequently overlooked menu-aided indicators deployed by platforms and rather interacted with content-based indicators such as title and comments. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload. These issues were especially critical for BLV individuals due to the insufficient accessibility of interface elements. We provide practical recommendations and design implications for future AIG indicators across several dimensions.
zh

[AI-88] Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

【Quick Read】: This paper addresses the efficient deployment of large Mixture-of-Experts (MoE) models on memory-constrained devices, where the core challenge is optimizing expert caching to improve memory efficiency without sacrificing inference speed. The key to the solution is two metrics for measuring local routing consistency: Segment Routing Best Performance (SRP) and Segment Cache Best Hit Rate (SCH). By analyzing the routing behavior of diverse MoE models, the study reveals how model architecture, expert sharing, and domain specialization affect routing consistency, providing theoretical grounding and practical guidance for memory-efficient MoE design and deployment.

Link: https://arxiv.org/abs/2505.16056
Authors: Jingcong Liang, Siyuan Wang, Miren Tian, Yitong Li, Duyu Tang, Zhongyu Wei
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this local routing consistency varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) Segment Routing Best Performance (SRP), which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) Segment Cache Best Hit Rate (SCH), which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at this https URL .
zh
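
As a rough sketch of the SRP idea, assume routing is given as the set of experts each token activates. For this simplified coverage definition, the best fixed expert group of a given size is exactly the top-k most frequent experts in the segment; the paper's precise definition may differ in details.

```python
from collections import Counter

def srp(token_expert_sets, cache_size):
    """Fraction of expert activations in a segment covered by the best fixed
    group of `cache_size` experts (top-k by frequency is optimal here)."""
    counts = Counter(e for s in token_expert_sets for e in s)
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(cache_size))
    return covered / total if total else 1.0

# One 6-token segment with 2 active experts per token and a cache of 3 experts.
segment = [{0, 1}, {0, 2}, {1, 2}, {0, 1}, {3, 4}, {0, 2}]
print(srp(segment, cache_size=3))  # 10 of 12 activations covered by {0, 1, 2}
```

High SRP on short segments means a small, slowly changing expert cache can serve most tokens, which is exactly the property the paper ties to memory-efficient offloading.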

[AI-89] SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution

【Quick Read】: This paper addresses the evaluation of large language models' (LLMs) physical and spatial reasoning abilities in 2D settings. The key to the solution is a novel dataset based on topology optimization: given 2D boundary conditions, applied forces, and supports, LLMs must infer the optimal material distribution. The dataset spans tasks ranging from filling in masked regions of partial structures to predicting complete material distributions. Solving them requires understanding force flow and the material distribution demanded by the given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization.

Link: https://arxiv.org/abs/2505.16048
Authors: Philipp D. Siedler
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLM) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as 2D boundary, applied forces and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.
zh

[AI-90] Equivariant Eikonal Neural Networks: Grid-Free Scalable Travel-Time Prediction on Homogeneous Spaces

【Quick Read】: This paper addresses efficient, steerable Eikonal-equation solving that generalizes across diverse Riemannian manifolds. The key to the solution is Equivariant Neural Eikonal Solvers, a framework that integrates Equivariant Neural Fields (ENFs) with Neural Eikonal Solvers: a shared neural backbone is conditioned on signal-specific latent variables, represented as point clouds in a Lie group, to model diverse Eikonal solutions. The ENF integration guarantees an equivariant mapping from these latent representations to the solution field, improving solution steerability, geometric robustness, and representation efficiency.

Link: https://arxiv.org/abs/2505.16035
Authors: Alejandro García-Castellanos, David R. Wessels, Nicky J. van den Berg, Remco Duits, Daniël M. Pelt, Erik J. Bekkers
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce Equivariant Neural Eikonal Solvers, a novel framework that integrates Equivariant Neural Fields (ENFs) with Neural Eikonal Solvers. Our approach employs a single neural field where a unified shared backbone is conditioned on signal-specific latent variables - represented as point clouds in a Lie group - to model diverse Eikonal solutions. The ENF integration ensures equivariant mapping from these latent representations to the solution field, delivering three key benefits: enhanced representation efficiency through weight-sharing, robust geometric grounding, and solution steerability. This steerability allows transformations applied to the latent point cloud to induce predictable, geometrically meaningful modifications in the resulting Eikonal solution. By coupling these steerable representations with Physics-Informed Neural Networks (PINNs), our framework accurately models Eikonal travel-time solutions while generalizing to arbitrary Riemannian manifolds with regular group actions. This includes homogeneous spaces such as Euclidean, position-orientation, spherical, and hyperbolic manifolds. We validate our approach through applications in seismic travel-time modeling of 2D and 3D benchmark datasets. Experimental results demonstrate superior performance, scalability, adaptability, and user controllability compared to existing Neural Operator-based Eikonal solver methods.
zh

[AI-91] Children's Mental Models of AI Reasoning: Implications for AI Literacy Education

【Quick Read】: This paper investigates how children conceptualize the reasoning processes of artificial intelligence (AI), particularly with respect to building AI literacy. The key to the solution is a two-phase approach, a co-design session with 8 children followed by a field study with 106 students in grades 3-8, which identifies three models of AI reasoning: Deductive, Inductive, and Inherent. The study reveals age-related differences in how children understand AI reasoning mechanisms and draws implications for AI curriculum design and the development of explainable AI tools.

Link: https://arxiv.org/abs/2505.16031
Authors: Aayushi Dangol, Robert Wolfe, Runhua Zhao, JaeWon Kim, Trushaa Ramanan, Katie Davis, Julie A. Kientz
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:As artificial intelligence (AI) advances in reasoning capabilities, most recently with the emergence of Large Reasoning Models (LRMs), understanding how children conceptualize AI’s reasoning processes becomes critical for fostering AI literacy. While one of the “Five Big Ideas” in AI education highlights reasoning algorithms as central to AI decision-making, less is known about children’s mental models in this area. Through a two-phase approach, consisting of a co-design session with 8 children followed by a field study with 106 children (grades 3-8), we identified three models of AI reasoning: Deductive, Inductive, and Inherent. Our findings reveal that younger children (grades 3-5) often attribute AI’s reasoning to inherent intelligence, while older children (grades 6-8) recognize AI as a pattern recognizer. We highlight three tensions that surfaced in children’s understanding of AI reasoning and conclude with implications for scaffolding AI curricula and designing explainable AI tools.
zh

[AI-92] Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging

【Quick Read】: This paper addresses the limited theoretical understanding of diffusion trajectory distillation, where the trade-off between different distillation strategies and generative quality remains unclear, complicating optimization and selection. The key to the solution is reinterpreting trajectory distillation as an operator-merging problem in the linear regime, where each step of the teacher model is a linear operator acting on noisy data, with a clear geometric interpretation as projections and rescalings tied to the noise schedule. On this basis, the paper proposes a dynamic-programming algorithm that computes the optimal merging strategy to maximally preserve signal fidelity, and reveals a sharp phase transition in the optimal strategy governed by data covariance structures.

Link: https://arxiv.org/abs/2505.16024
Authors: Weiguo Gao, Ming Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 19 figures

Click to view abstract

Abstract:Diffusion trajectory distillation methods aim to accelerate sampling in diffusion models, which produce high-quality outputs but suffer from slow sampling speeds. These methods train a student model to approximate the multi-step denoising process of a pretrained teacher model in a single step, enabling one-shot generation. However, theoretical insights into the trade-off between different distillation strategies and generative quality remain limited, complicating their optimization and selection. In this work, we take a first step toward addressing this gap. Specifically, we reinterpret trajectory distillation as an operator merging problem in the linear regime, where each step of the teacher model is represented as a linear operator acting on noisy data. These operators admit a clear geometric interpretation as projections and rescalings corresponding to the noise schedule. During merging, signal shrinkage occurs as a convex combination of operators, arising from both discretization and limited optimization time of the student model. We propose a dynamic programming algorithm to compute the optimal merging strategy that maximally preserves signal fidelity. Additionally, we demonstrate the existence of a sharp phase transition in the optimal strategy, governed by data covariance structures. Our findings enhance the theoretical understanding of diffusion trajectory distillation and offer practical insights for improving distillation strategies.
zh
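
A toy illustration of the operator-merging view is sketched below. Each teacher step is reduced to a scalar rescaling; merging a segment yields a convex combination (here the plain average, modeling shrinkage), while the ideal single-step operator is the segment's composed product. The quadratic loss and this exact objective are simplified stand-ins for the paper's formulation; only the partition dynamic program is the point.

```python
from functools import lru_cache

teacher_ops = (0.98, 0.95, 0.92, 0.85, 0.75, 0.60)  # per-step signal scalings
K = 3                                               # student steps allowed

def segment_loss(i, j):
    """Loss of merging teacher steps i..j-1 into one student step."""
    ops = teacher_ops[i:j]
    merged = sum(ops) / len(ops)          # convex combination (shrinkage)
    ideal = 1.0
    for a in ops:
        ideal *= a                        # exact composed operator
    return (merged - ideal) ** 2

@lru_cache(maxsize=None)
def best(i, k):
    """Min total loss splitting steps i..end into k merged student steps."""
    n = len(teacher_ops)
    if k == 1:
        return segment_loss(i, n), (n,)
    options = []
    for j in range(i + 1, n - k + 2):     # leave room for k-1 more segments
        tail_loss, tail_cuts = best(j, k - 1)
        options.append((segment_loss(i, j) + tail_loss, (j,) + tail_cuts))
    return min(options)

loss, cuts = best(0, K)
print("optimal cut points:", cuts, "total loss:", round(loss, 6))
```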

[AI-93] Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

【Quick Read】: This paper aims to automate the discovery of system-level dynamics in Flow-Lenia, a continuous cellular automaton (CA) with mass conservation and parameter localization, in order to uncover processes leading to the self-organization of evolutionary and ecosystem dynamics. The key to the solution is a curiosity-driven AI scientist built on Intrinsically Motivated Goal Exploration Processes (IMGEPs), which drives exploration of diverse Flow-Lenia environments using simulation-wide metrics such as evolutionary activity, compression-based complexity, and multi-scale entropy. Experiments show that the method uncovers richer dynamics than random search, and an interactive exploration tool enables a human-AI collaborative workflow for scientific investigation.

Link: https://arxiv.org/abs/2505.15998
Authors: Thomas Michel, Marko Cvjetko, Gautier Hamon, Pierre-Yves Oudeyer, Clément Moulin-Frier
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures, submitted to ALIFE 2025 Conference

Click to view abstract

Abstract:We present a method for the automated discovery of system-level dynamics in Flow-Lenia - a continuous cellular automaton (CA) with mass conservation and parameter localization - using a curiosity-driven AI scientist. This method aims to uncover processes leading to self-organization of evolutionary and ecosystemic dynamics in CAs. We build on previous work which uses diversity search algorithms in Lenia to find self-organized individual patterns, and extend it to large environments that support distinct interacting patterns. We adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to drive exploration of diverse Flow-Lenia environments using simulation-wide metrics, such as evolutionary activity, compression-based complexity, and multi-scale entropy. We test our method in two experiments, showcasing its ability to illuminate significantly more diverse dynamics compared to random search. We show qualitative results illustrating how ecosystemic simulations enable self-organization of complex collective behaviors not captured by previous individual pattern search and analysis. We complement automated discovery with an interactive exploration tool, creating an effective human-AI collaborative workflow for scientific investigation. Though demonstrated specifically with Flow-Lenia, this methodology provides a framework potentially applicable to other parameterizable complex systems where understanding emergent collective properties is of interest.
zh

[AI-94] PhyX: Does Your Model Have the “Wits” for Physical Reasoning?

【Quick Read】: This paper addresses the failure of existing benchmarks to capture a core aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and an understanding of real-world constraints in visual scenarios. The key to the solution is PhyX, the first large-scale benchmark designed to assess physics-grounded reasoning, comprising 3K carefully curated multimodal questions spanning 25 sub-domains and 6 core physics domains, with multiple evaluation paradigms for in-depth analysis of models' physical reasoning abilities.

Link: https://arxiv.org/abs/2505.15929
Authors: Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models’ capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and waves/acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively-performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.
zh

[AI-95] Last Layer Empirical Bayes ICLR2025

【Quick Read】: This paper tackles the problem of quantifying the inherent uncertainty in neural-network predictions (uncertainty quantification), a key challenge in artificial intelligence. The key to the solution is last layer empirical Bayes (LLEB), which instantiates a learnable prior as a normalizing flow and trains it to maximize the evidence lower bound; to retain tractability, the flow is applied only to the network's last layer. The approach interpolates between standard Bayesian neural networks (BNNs) and deep ensembles, highlighting the promise of empirical Bayes for uncertainty quantification.

Link: https://arxiv.org/abs/2505.15888
Authors: Valentin Villecroze, Yixin Wang, Gabriel Loaiza-Ganem
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at the ICBINB Workshop at ICLR 2025

Click to view abstract

Abstract:The task of quantifying the inherent uncertainty associated with neural network predictions is a key challenge in artificial intelligence. Bayesian neural networks (BNNs) and deep ensembles are among the most prominent approaches to tackle this task. Both approaches produce predictions by computing an expectation of neural network outputs over some distribution on the corresponding weights; this distribution is given by the posterior in the case of BNNs, and by a mixture of point masses for ensembles. Inspired by recent work showing that the distribution used by ensembles can be understood as a posterior corresponding to a learned data-dependent prior, we propose last layer empirical Bayes (LLEB). LLEB instantiates a learnable prior as a normalizing flow, which is then trained to maximize the evidence lower bound; to retain tractability we use the flow only on the last layer. We show why LLEB is well motivated, and how it interpolates between standard BNNs and ensembles in terms of the strength of the prior that they use. LLEB performs on par with existing approaches, highlighting that empirical Bayes is a promising direction for future research in uncertainty quantification.
zh

[AI-96] Bandit-based Dynamic Candidate Edge Selection in Solving Traveling Salesman Problems

【Quick Read】: This paper addresses the tendency of traditional routing algorithms (such as the classical Lin-Kernighan-Helsgaun algorithm, LKH, for the Traveling Salesman Problem) to get trapped in local optima because they rely on static, predetermined candidate edges throughout local search. The key to the solution is incorporating multi-armed bandit models to dynamically select the most suitable candidate edges in each iteration, expanding the candidate sets and enabling LKH to make smarter choices and search more efficiently.

Link: https://arxiv.org/abs/2505.15862
Authors: Long Wang, Jiongzhi Zheng, Zhengda Xiong, ChuMin Li, Kun He
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Algorithms designed for routing problems typically rely on high-quality candidate edges to guide their search, aiming to reduce the search space and enhance the search efficiency. However, many existing algorithms, like the classical Lin-Kernighan-Helsgaun (LKH) algorithm for the Traveling Salesman Problem (TSP), often use predetermined candidate edges that remain static throughout local searches. This rigidity could cause the algorithm to get trapped in local optima, limiting its potential to find better solutions. To address this issue, we propose expanding the candidate sets to include other promising edges, providing them an opportunity for selection. Specifically, we incorporate multi-armed bandit models to dynamically select the most suitable candidate edges in each iteration, enabling LKH to make smarter choices and lead to improved solutions. Extensive experiments on multiple TSP benchmarks show the excellent performance of our method. Moreover, we employ this bandit-based method to LKH-3, an extension of LKH tailored for solving various TSP variant problems, and our method also significantly enhances LKH-3’s performance across typical TSP variants.
zh
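
A minimal sketch of bandit-based candidate-edge selection follows, assuming each candidate edge of a city is an arm and the reward is whether using that edge in a local-search move improved the tour. Both the reward signal and the integration with LKH are simplified stand-ins; only the UCB1 selection pattern is shown.

```python
import math
import random

class EdgeBandit:
    """UCB1 over one city's candidate edges."""
    def __init__(self, n_edges):
        self.counts = [0] * n_edges
        self.values = [0.0] * n_edges
        self.t = 0

    def select(self):
        self.t += 1
        for arm, c in enumerate(self.counts):  # play each arm once first
            if c == 0:
                return arm
        ucb = [v + math.sqrt(2 * math.log(self.t) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # running mean

bandit = EdgeBandit(n_edges=8)
for _ in range(100):                           # simulated local-search rounds
    arm = bandit.select()
    reward = 1.0 if random.random() < 0.3 + 0.05 * arm else 0.0
    bandit.update(arm, reward)
print("estimated edge values:", [round(v, 2) for v in bandit.values])
```

The exploration bonus is what keeps promising but rarely tried edges from being permanently excluded, which is precisely the failure mode of static candidate sets.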

[AI-97] AutoData: A Multi-Agent System for Open Web Data Collection

【Quick Read】: This paper addresses the acquisition of high-quality web-sourced datasets amid the rapid growth of data-driven systems and AI technologies. Conventional web data collection faces major limitations in human effort and scalability, and existing solutions either rely on wrapper-based methods with poor adaptability and reproducibility, or on large language model (LLM)-based approaches that incur substantial computational and financial costs. The key to the proposed solution, AutoData, is a multi-agent system that requires little to no human intervention: the desired dataset is specified with a natural-language instruction, and the system combines a novel oriented message hypergraph coordinated by a central task manager with a novel hypergraph cache system to improve multi-agent collaboration efficiency and reduce token costs.

Link: https://arxiv.org/abs/2505.15859
Authors: Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData’s superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at this https URL.
zh

[AI-98] DisastIR: A Comprehensive Information Retrieval Benchmark for Disaster Management

【Quick Read】: This paper addresses the poor suitability of existing information retrieval (IR) benchmarks for disaster-management scenarios: existing benchmarks focus on general or specialized domains such as medicine or finance, overlooking the unique linguistic complexity and diverse information needs of disaster management. The key to the solution is DisastIR, the first comprehensive IR evaluation benchmark tailored to disaster management, comprising 9,600 diverse user queries and more than 1.3 million labeled query-passage pairs covering 48 distinct retrieval tasks derived from six search intents and eight general disaster categories.

Link: https://arxiv.org/abs/2505.15856
Authors: Kai Yin, Xiangjue Dong, Chengkai Liu, Lipai Huang, Yiming Xiao, Zhewei Liu, Ali Mostafavi, James Caverlee
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Effective disaster management requires timely access to accurate and contextually relevant information. Existing Information Retrieval (IR) benchmarks, however, focus primarily on general or specialized domains, such as medicine or finance, neglecting the unique linguistic complexity and diverse information needs encountered in disaster management scenarios. To bridge this gap, we introduce DisastIR, the first comprehensive IR evaluation benchmark specifically tailored for disaster management. DisastIR comprises 9,600 diverse user queries and more than 1.3 million labeled query-passage pairs, covering 48 distinct retrieval tasks derived from six search intents and eight general disaster categories that include 301 specific event types. Our evaluations of 30 state-of-the-art retrieval models demonstrate significant performance variances across tasks, with no single model excelling universally. Furthermore, comparative analyses reveal significant performance gaps between general-domain and disaster management-specific tasks, highlighting the necessity of disaster management-specific benchmarks for guiding IR model selection to support effective decision-making in disaster management scenarios. All source codes and DisastIR are available at this https URL.
zh

[AI-99] Integration of TinyML and LargeML: A Survey of 6G and Beyond

【Quick Read】: This paper addresses the challenges raised by integrating large-scale machine learning (LargeML) with tiny machine learning (TinyML) in future 6G networks, toward efficient deployment and resource management for intelligent services and applications. The key to the solution is exploring how to effectively combine TinyML's lightweight, resource-efficient strengths with LargeML's powerful modeling capacity to support extensive and complex IoT services and ML-generated content applications, while tackling core issues of performance optimization, practical deployment strategies, resource management, and security.

Link: https://arxiv.org/abs/2505.15854
Authors: Thai-Hoc Vu, Ngo Hoang Tu, Thien Huynh-The, Kyungchun Lee, Sunghwan Kim, Miroslav Voznak, Quoc-Viet Pham
Institution: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: This work was submitted to IEEE Communications Surveys & Tutorials

Click to view abstract

Abstract:The transition from 5G networks to 6G highlights a significant demand for machine learning (ML). Deep learning models, in particular, have seen wide application in mobile networking and communications to support advanced services in emerging wireless environments, such as smart healthcare, smart grids, autonomous vehicles, aerial platforms, digital twins, and the metaverse. The rapid expansion of Internet-of-Things (IoT) devices, many with limited computational capabilities, has accelerated the development of tiny machine learning (TinyML) and resource-efficient ML approaches for cost-effective services. However, the deployment of large-scale machine learning (LargeML) solutions require major computing resources and complex management strategies to support extensive IoT services and ML-generated content applications. Consequently, the integration of TinyML and LargeML is projected as a promising approach for future seamless connectivity and efficient resource management. Although the integration of TinyML and LargeML shows abundant potential, several challenges persist, including performance optimization, practical deployment strategies, effective resource management, and security considerations. In this survey, we review and analyze the latest research aimed at enabling the integration of TinyML and LargeML models for the realization of smart services and applications in future 6G networks and beyond. The paper concludes by outlining critical challenges and identifying future research directions for the holistic integration of TinyML and LargeML in next-generation wireless networks.
zh

[AI-100] Exploring Moral Exercises for Human Oversight of AI systems: Insights from Three Pilot Studies

【Quick Read】: This paper explores how moral exercises can help AI actors cultivate the virtues that enable effective human oversight of AI systems. The key to the solution is a methodology built on three core pillars, eliciting an engaged personal disposition, fostering relational understanding, and cultivating technomoral wisdom, all of which are essential to the activities and competencies required for human oversight of AI systems.

Link: https://arxiv.org/abs/2505.15851
Authors: Silvia Crafa, Teresa Scantamburlo
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper elaborates on the concept of moral exercises as a means to help AI actors cultivate virtues that enable effective human oversight of AI systems. We explore the conceptual framework and significance of moral exercises, situating them within the contexts of philosophical discourse, ancient practices, and contemporary AI ethics scholarship. We outline the core pillars of the moral exercises methodology - eliciting an engaged personal disposition, fostering relational understanding, and cultivating technomoral wisdom - and emphasize their relevance to key activities and competencies essential for human oversight of AI systems. Our argument is supported by findings from three pilot studies involving a company, a multidisciplinary team of AI researchers, and higher education students. These studies allow us to explore both the potential and the limitations of moral exercises. Based on the collected data, we offer insights into how moral exercises can foster a responsible AI culture within organizations, and suggest directions for future research.
zh

[AI-101] Quantum-Evolutionary Neural Networks for Multi-Agent Federated Learning

【Quick Read】: This paper aims to build scalable, adaptive, and privacy-preserving decision-making systems for complex, decentralized environments. The key to the solution is the Quantum-Evolutionary Neural Network (QE-NN), a novel framework combining quantum-inspired neural networks with evolutionary algorithms: it leverages quantum-computing principles such as superposition and entanglement to improve learning speed and decision accuracy, applies evolutionary optimization to continually refine agent behavior in dynamic, uncertain environments, and adopts federated learning for privacy preservation so that decentralized agents can collaborate without sharing sensitive data.

Link: https://arxiv.org/abs/2505.15836
Authors: Aarav Lala, Kalyan Cherukuri
Institution: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As artificial intelligence continues to drive innovation in complex, decentralized environments, the need for scalable, adaptive, and privacy-preserving decision-making systems has become critical. This paper introduces a novel framework combining quantum-inspired neural networks with evolutionary algorithms to optimize real-time decision-making in multi-agent systems (MAS). The proposed Quantum-Evolutionary Neural Network (QE-NN) leverages quantum computing principles – such as quantum superposition and entanglement – to enhance learning speed and decision accuracy, while integrating evolutionary optimization to continually refine agent behaviors in dynamic, uncertain environments. By utilizing federated learning, QE-NN ensures privacy preservation, enabling decentralized agents to collaborate without sharing sensitive data. The framework is designed to allow agents to adapt in real-time to their environments, optimizing decision-making processes for applications in areas such as autonomous systems, smart cities, and healthcare. This research represents a breakthrough in merging quantum computing, evolutionary optimization, and privacy-preserving techniques to solve complex problems in multi-agent decision-making systems, pushing the boundaries of AI in real-world, privacy-sensitive applications.
zh

[AI-102] Transforming Decoder-Only Transformers for Accurate WiFi-Telemetry Based Indoor Localization

【Quick Read】: This paper addresses performance degradation in WiFi indoor localization caused by environmental changes, such as signal fading, multipath effects, and interference, as well as the heterogeneity of telemetry data across WiFi device vendors, which differ in features and formats, and widely varying use-case requirements. The key to the solution is WiFiGPT, a system based on a Generative Pretrained Transformer (GPT) that handles these variations while achieving high localization accuracy; it exploits the ability of large language models (LLMs) to capture subtle spatial patterns in noisy wireless telemetry, without handcrafted signal processing or calibration.

Link: https://arxiv.org/abs/2505.15835
Authors: Nayan Sanjay Bhatia, Katia Obraczka
Institution: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures, In Submission

Abstract:Wireless Fidelity (WiFi) based indoor positioning is a widely researched area for determining the position of devices within a wireless network. Accurate indoor location has numerous applications, such as asset tracking and indoor navigation. Despite advances in WiFi localization techniques – in particular approaches that leverage WiFi telemetry – their adoption in practice remains limited due to several factors including environmental changes that cause signal fading, multipath effects, interference, which, in turn, impact positioning accuracy. In addition, telemetry data differs depending on the WiFi device vendor, offering distinct features and formats; use case requirements can also vary widely. Currently, there is no unified model to handle all these variations effectively. In this paper, we present WiFiGPT, a Generative Pretrained Transformer (GPT) based system that is able to handle these variations while achieving high localization accuracy. Our experiments with WiFiGPT demonstrate that GPTs, in particular Large Language Models (LLMs), can effectively capture subtle spatial patterns in noisy wireless telemetry, making them reliable regressors. Compared to existing state-of-the-art methods, our method matches and often surpasses conventional approaches for multiple types of telemetry. Achieving sub-meter accuracy for RSSI and FTM and centimeter-level precision for CSI demonstrates the potential of LLM-based localisation to outperform specialized techniques, all without handcrafted signal processing or calibration.
zh

[AI-103] MPPFND: A Dataset and Analysis of Detecting Fake News with Multi-Platform Propagation

【Quick Read】: This paper addresses cross-platform fake news detection: existing detection algorithms are typically trained for specific platforms, ignoring differences in propagation characteristics across platforms. The key to the solution is a multi-platform fake news detection model (APSL) that uses graph neural networks to extract social-context features from multiple platforms and improves detection performance by accounting for cross-platform propagation differences.

Link: https://arxiv.org/abs/2505.15834
Authors: Congyuan Zhao, Lingwei Wei, Ziming Qin, Wei Zhou, Yunya Song, Songlin Hu
Institution: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: Cogsci 2025

Click to view abstract

Abstract:Fake news spreads widely on social media, leading to numerous negative effects. Most existing detection algorithms focus on analyzing news content and social context to detect fake news. However, these approaches typically detect fake news based on specific platforms, ignoring differences in propagation characteristics across platforms. In this paper, we introduce the MPPFND dataset, which captures propagation structures across multiple platforms. We also describe the commenting and propagation characteristics of different platforms to show that their social contexts have distinct features. We propose a multi-platform fake news detection model (APSL) that uses graph neural networks to extract social context features from various platforms. Experiments show that accounting for cross-platform propagation differences improves fake news detection performance.
zh

[AI-104] From Hand-Crafted Metrics to Evolved Training-Free Performance Predictors for Neural Architecture Search via Genetic Programming

【Quick Read】: This paper addresses the inconsistency of zero-cost (ZC) proxies used in local-search-free evaluation for Neural Architecture Search (NAS) and the manual, expertise-heavy process their design requires. The key to the solution is a symbolic-regression framework based on genetic programming that automates the design of ZC metrics, producing ZC proxies with strong positive rank correlation to true network performance and better generalization across diverse NAS search spaces and tasks.

Link: https://arxiv.org/abs/2505.15832
Authors: Quan Minh Phan, Ngoc Hoang Luong
Institution: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Estimating the network performance using zero-cost (ZC) metrics has proven both its efficiency and efficacy in Neural Architecture Search (NAS). However, a notable limitation of most ZC proxies is their inconsistency, as reflected by the substantial variation in their performance across different problems. Furthermore, the design of existing ZC metrics is manual, involving a time-consuming trial-and-error process that requires substantial domain expertise. These challenges raise two critical questions: (1) Can we automate the design of ZC metrics? and (2) Can we utilize the existing hand-crafted ZC metrics to synthesize a more generalizable one? In this study, we propose a framework based on Symbolic Regression via Genetic Programming to automate the design of ZC metrics. Our framework is not only highly extensible but also capable of quickly producing a ZC metric with a strong positive rank correlation to true network performance across diverse NAS search spaces and tasks. Extensive experiments on 13 problems from NAS-Bench-Suite-Zero demonstrate that our automatically generated proxies consistently outperform hand-crafted alternatives. Using our evolved proxy metric as the search objective in an evolutionary algorithm, we could identify network architectures with competitive performance within 15 minutes using a single consumer GPU.
zh
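
A minimal sketch of the idea, evolving a combined zero-cost proxy with genetic programming: candidate programs are expression trees over existing hand-crafted ZC scores, and fitness is Spearman rank correlation with true accuracy. The operator set, population settings, and toy data below are illustrative assumptions, not the paper's configuration.

```python
import random
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
random.seed(0)
n_archs = 200
zc = {  # toy stand-ins for hand-crafted ZC scores of 200 architectures
    "synflow": rng.normal(size=n_archs),
    "snip": rng.normal(size=n_archs),
    "grad_norm": rng.normal(size=n_archs),
}
true_acc = (0.6 * zc["synflow"] + 0.3 * np.abs(zc["snip"])
            + rng.normal(0.0, 0.5, n_archs))

OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply, "max": np.maximum}

def random_tree(depth=2):
    """Grow a random expression tree over the ZC features."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(list(zc))
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree):
    if isinstance(tree, str):
        return zc[tree]
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

def fitness(tree):
    """Rank correlation between the evolved proxy and true accuracy."""
    rho, _ = spearmanr(evaluate(tree), true_acc)
    return 0.0 if np.isnan(rho) else rho

def mutate(tree):
    """Replace a random subtree with a fresh one."""
    if isinstance(tree, str) or random.random() < 0.3:
        return random_tree()
    op, left, right = tree
    return (op, mutate(left), mutate(right)) if random.random() < 0.5 else tree

pop = [random_tree() for _ in range(50)]
for _ in range(20):                      # simple (mu + lambda) generational loop
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    pop = parents + [mutate(random.choice(parents)) for _ in range(40)]
pop.sort(key=fitness, reverse=True)
print("best proxy:", pop[0], "| spearman:", round(fitness(pop[0]), 3))
```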

[AI-105] Generative AI-Aided QoE Maximization for RIS-Assisted Digital Twin Interaction

【Quick Read】: This paper studies a quality of experience (QoE)-aware resource-allocation problem for reconfigurable intelligent surface (RIS)-assisted digital twin (DT) interaction under uncertain DT-model evolution. The core challenge is that the DT model's uncertain evolution spawns multiple scene-specific problems that must be re-solved whenever the model evolves. The key to the solution is combining the dynamic optimization capability of decision transformers with the generalization strengths of generative artificial intelligence (GAI), yielding a novel method called the Prompt-Guided Decision Transformer Integrated with Zero-Forcing Optimization (PG-ZFO).

Link: https://arxiv.org/abs/2505.15828
Authors: Jiayuan Chen, Yuxiang Li, Changyan Yi, Shimin Gong
Institution: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this paper, we investigate a quality of experience (QoE)-aware resource allocation problem for reconfigurable intelligent surface (RIS)-assisted digital twin (DT) interaction with uncertain evolution. In the considered system, mobile users are expected to interact with a DT model maintained on a DT server that is deployed on a base station, via effective uplink and downlink channels assisted by an RIS. Our goal is to maximize the sum of all mobile users’ joint subjective and objective QoE in DT interactions across various DT scenes, by jointly optimizing phase shift matrix, receive/transmit beamforming matrix, rendering resolution configuration and computing resource allocation. Solving this problem is challenging mainly due to the uncertain evolution of the DT model, which leads to multiple scene-specific problems and requires us to constantly re-solve each of them whenever the DT model evolves. To this end, leveraging the dynamic optimization capabilities of decision transformers and the generalization strengths of generative artificial intelligence (GAI), we propose a novel GAI-aided approach, called the prompt-guided decision transformer integrated with zero-forcing optimization (PG-ZFO). Simulations are conducted to evaluate the proposed PG-ZFO, demonstrating its effectiveness and superiority over counterparts.
zh

[AI-106] A Novel Compound AI Model for 6G Networks in 3D Continuum

【Quick Read】: This paper addresses the complexity 6G networks face in the 3D continuum, where current monolithic AI approaches to network management fail to capture cross-domain interactions, lack adaptability, and demand prohibitive computational resources. The key to the solution is a formal model of Compound AI systems together with a tripartite framework that decomposes complex tasks into specialized, interoperable modules, enabling coordinated yet distributed intelligence across heterogeneous components. This modular architecture addresses the unique challenges of 6G networks in the 3D continuum, while introducing a fundamental trade-off between model and system performance that must be carefully balanced in the design.

Link: https://arxiv.org/abs/2505.15821
Authors: Milos Gravara, Andrija Stanisic, Stefan Nastic
Institution: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures

Click to view abstract

Abstract:The 3D continuum presents a complex environment that spans the terrestrial, aerial and space domains, with 6G networks serving as a key enabling technology. Current AI approaches for network management rely on monolithic models that fail to capture cross-domain interactions, lack adaptability, and demand prohibitive computational resources. This paper presents a formal model of Compound AI systems, introducing a novel tripartite framework that decomposes complex tasks into specialized, interoperable modules. The proposed modular architecture provides essential capabilities to address the unique challenges of 6G networks in the 3D continuum, where heterogeneous components require coordinated, yet distributed, intelligence. This approach introduces a fundamental trade-off between model and system performance, which must be carefully addressed. Furthermore, we identify key challenges faced by Compound AI systems within 6G networks operating in the 3D continuum, including cross-domain resource orchestration, adaptation to dynamic topologies, and the maintenance of consistent AI service quality across heterogeneous environments.
zh

[AI-107] Common Data Format (CDF): A Standardized Format for Match-Data in Football (Soccer)

【Quick Read】: This paper addresses the heterogeneity of football match data collected by multiple parties, which differ in what they collect, the specifications they use, how they represent the data, and how they deliver it, creating significant barriers to analysis. The key to the solution is a uniform, standardized format, the Common Data Format (CDF), which specifies a minimal schema for five types of match data: match sheet data, video footage, event data, tracking data, and match metadata, aiming to ensure the data is clear, sufficiently contextualized, and complete enough to support common downstream analysis tasks.

Link: https://arxiv.org/abs/2505.15820
Authors: Gabriel Anzer, Kilian Arnsmeyer, Pascal Bauer, Joris Bekkers, Ulf Brefeld, Jesse Davis, Nicolas Evans, Matthias Kempe, Samuel J Robertson, Joshua Wyatt Smith, Jan Van Haaren
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:During football matches, a variety of different parties (e.g., companies) each collect (possibly overlapping) data about the match ranging from basic information (e.g., starting players) to detailed positional data. This data is provided to clubs, federations, and other organizations who are increasingly interested in leveraging this data to inform their decision making. Unfortunately, analyzing such data pose significant barriers because each provider may (1) collect different data, (2) use different specifications even within the same category of data, (3) represent the data differently, and (4) delivers the data in a different manner (e.g., file format, protocol). Consequently, working with these data requires a significant investment of time and money. The goal of this work is to propose a uniform and standardized format for football data called the Common Data Format (CDF). The CDF specifies a minimal schema for five types of match data: match sheet data, video footage, event data, tracking data, and match meta data. It aims to ensure that the provided data is clear, sufficiently contextualized (e.g., its provenance is clear), and complete such that it enables common downstream analysis tasks. Concretely, this paper will detail the technical specifications of the CDF, the representational choices that were made to help ensure the clarity of the provided data, and a concrete approach for delivering data in the CDF.
zh
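
As a purely illustrative sketch of what a CDF-style record might look like as a typed schema, consider the event-data example below. All field names are hypothetical stand-ins: the paper's concrete technical specification is not reproduced in the abstract, so this only conveys the flavor of a minimal, provenance-aware schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CDFEvent:
    match_id: str
    event_id: str
    timestamp_s: float          # seconds since kick-off
    period: int                 # 1 = first half, 2 = second half, ...
    event_type: str             # e.g. "pass", "shot", "foul"
    player_id: Optional[str]    # None for team-level events
    team_id: str
    x: Optional[float]          # pitch coordinates in a declared frame
    y: Optional[float]
    provider: str               # provenance: who collected this record

event = CDFEvent(match_id="m001", event_id="e0001", timestamp_s=12.4,
                 period=1, event_type="pass", player_id="p10",
                 team_id="home", x=34.2, y=18.7, provider="vendorA")
print(event)
```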

[AI-108] Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

【Quick Read】: This paper addresses the underuse of fixed training datasets in long chain-of-thought (long CoT) models, whose rollouts are computationally expensive. Existing methods either discard negative samples entirely (RFT) or penalize all tokens equally (RL), failing to exploit the valuable learning signals, such as self-reflection and error-correction steps, contained in negative samples. The key to the proposed solution, Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), is a three-stage pipeline: sample segmentation, consensus-based step-correctness assessment, and policy optimization with negative sample augmentation, which mines positive steps within negative samples at a finer granularity and improves both model performance and sample efficiency.

Link: https://arxiv.org/abs/2505.14403
Authors: Zhaohui Yang, Shilei Jiang, Chen Hu, Linjing Li, Shihong Deng, Daxin Jiang
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT pattern. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet primary existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
zh

[AI-109] X-KAN: Optimizing Local Kolmogorov-Arnold Networks via Evolutionary Rule-Based Machine Learning IJCAI2025

【Quick Read】: This paper addresses the poor performance of conventional neural-network approaches on locally complex or discontinuous functions, which stems from relying on a single global model covering the entire problem space. The key to the solution is X-KAN, which combines the high expressiveness of Kolmogorov-Arnold Networks (KANs) with the adaptive partitioning of XCSF, an evolutionary rule-based machine learning framework, by implementing local KAN models as rule consequents and defining local regions via rule antecedents, thereby approximating complex functions effectively.

Link: https://arxiv.org/abs/2505.14273
Authors: Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
Comments: Accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

Click to view abstract

Abstract:Function approximation is a critical task in various fields. However, existing neural network approaches struggle with locally complex or discontinuous functions due to their reliance on a single global model covering the entire problem space. We propose X-KAN, a novel method that optimizes multiple local Kolmogorov-Arnold Networks (KANs) through an evolutionary rule-based machine learning framework called XCSF. X-KAN combines KAN’s high expressiveness with XCSF’s adaptive partitioning capability by implementing local KAN models as rule consequents and defining local regions via rule antecedents. Our experimental results on artificial test functions and real-world datasets demonstrate that X-KAN significantly outperforms conventional methods, including XCSF, Multi-Layer Perceptron, and KAN, in terms of approximation accuracy. Notably, X-KAN effectively handles functions with locally complex or discontinuous structures that are challenging for conventional KAN, using a compact set of rules (average 7.2 ± 2.3 rules). These results validate the effectiveness of using KAN as a local model in XCSF, which evaluates the rule fitness based on both accuracy and generality. Our X-KAN implementation is available at this https URL.
zh

[AI-110] Active Speech Enhancement: Active Speech Denoising, Declipping and Dereverberation

【Quick Read】: This paper tackles improving speech intelligibility and perceptual quality in complex acoustic environments, where traditional approaches such as Active Noise Cancellation (ANC) only suppress external interference and do not actively optimize the speech signal itself. The key to the solution is a novel Transformer-Mamba-based architecture combined with a task-specific loss function that jointly optimizes interference suppression and signal enrichment, enabling active, targeted shaping of the speech signal.

Link: https://arxiv.org/abs/2505.16911
Authors: Ofir Yaish, Yehuda Mishaly, Eliya Nachmani
Institution: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce a new paradigm for active sound modification: Active Speech Enhancement (ASE). While Active Noise Cancellation (ANC) algorithms focus on suppressing external interference, ASE goes further by actively shaping the speech signal – both attenuating unwanted noise components and amplifying speech-relevant frequencies – to improve intelligibility and perceptual quality. To enable this, we propose a novel Transformer-Mamba-based architecture, along with a task-specific loss function designed to jointly optimize interference suppression and signal enrichment. Our method outperforms existing baselines across multiple speech processing tasks – including denoising, dereverberation, and declipping – demonstrating the effectiveness of active, targeted modulation in challenging acoustic environments.
zh

[AI-111] Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate INTERSPEECH2025

【速读】:该论文试图解决传统神经语音编解码器在固定帧率(Constant Frame Rate, CFR)下无法适应语音信号随时间变化的信息密度,从而导致比特率和令牌序列长度效率不高的问题。解决方案的关键在于提出一种时序灵活编码(Temporally Flexible Coding, TFC)技术,首次将可变帧率(Variable Frame Rate, VFR)引入神经语音编解码器中,通过动态分配帧率以适应时间熵的变化,实现可调的平均帧率和更高的编码灵活性。

链接: https://arxiv.org/abs/2505.16845
作者: Hanglei Zhang,Yiwei Guo,Zhihan Li,Xiang Hao,Xie Chen,Kai Yu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is promising for the integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
zh
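
下面是一个基于时间熵动态分配帧率的最小示意(假设性实现,并非论文的编解码器):对短时谱熵较低的片段(如静音)分配较低帧率,对信息密集片段分配较高帧率,其中窗长、帧率取值与阈值规则均为示例假设。

```python
import numpy as np

# A minimal sketch of entropy-driven frame-rate allocation (an assumption-level
# illustration of the TFC idea, not the paper's codec): segments with low
# short-time spectral entropy (e.g., silence) get a coarser frame rate,
# information-dense segments get a finer one.

def spectral_entropy(frame):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    p = spec / (spec.sum() + 1e-12)
    return -(p * np.log(p + 1e-12)).sum()

def allocate_frame_rates(signal, win=400, low_fps=25, high_fps=75):
    ent = np.array([spectral_entropy(signal[i:i + win])
                    for i in range(0, len(signal) - win, win)])
    thresh = np.median(ent)  # hypothetical decision rule
    return np.where(ent > thresh, high_fps, low_fps)

sr = 16000
t = np.arange(sr) / sr
sig = np.concatenate([np.zeros(sr // 2),                       # silence
                      np.sin(2 * np.pi * 220 * t[: sr // 2])])  # voiced-like
rates = allocate_frame_rates(sig)
print("per-segment frame rates:", rates)
```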

[AI-112] SEED: Speaker Embedding Enhancement Diffusion Model INTERSPEECH2025

【速读】:该论文旨在解决实际应用场景中说话人识别系统因环境不匹配导致的性能下降问题(environmental mismatch)。其解决方案的关键在于提出一种基于扩散模型(diffusion-based method)的方法,通过从预训练的说话人识别模型中提取说话人嵌入(speaker embeddings),并生成优化后的嵌入。该方法在训练阶段通过前向过程向干净和噪声语音中提取的说话人嵌入分别添加高斯噪声,随后在反向过程中重建为干净嵌入,而在推理阶段则通过扩散过程重新生成所有嵌入,从而提升识别准确率。

链接: https://arxiv.org/abs/2505.16798
作者: KiHyun Nam,Jungwoo Heo,Jee-weon Jung,Gangin Park,Chaeyoung Jung,Ha-Jin Yu,Joon Son Chung
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2025. The official code can be found at this https URL

点击查看摘要

Abstract:A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings extracted from clean and noisy speech, respectively, via the forward process of a diffusion model, and then reconstructs them into clean embeddings in the reverse process. During inference, all embeddings are regenerated via the diffusion process. Our method needs neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environment mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. We publish our code at this https URL
zh
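
以下草图示意 SEED 的训练思路:按 DDPM 式前向过程向说话人嵌入逐步加入高斯噪声,并训练一个小型 MLP 还原干净嵌入。其中嵌入维度、噪声调度与网络结构均为示例假设,仅用于说明机制。

```python
import torch
import torch.nn as nn

# A minimal sketch of the SEED training idea under simplifying assumptions:
# Gaussian noise is progressively added to speaker embeddings via a DDPM-style
# forward process, and a small MLP learns to recover the clean embedding.
# Dimensions, schedule, and the MLP are illustrative choices.

T, dim = 100, 192                        # diffusion steps, embedding size
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(),
                         nn.Linear(256, dim))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def train_step(clean_emb):
    t = torch.randint(0, T, (clean_emb.size(0),))
    a = alphas_bar[t].unsqueeze(1)
    noise = torch.randn_like(clean_emb)
    x_t = a.sqrt() * clean_emb + (1 - a).sqrt() * noise   # forward process
    t_feat = (t.float() / T).unsqueeze(1)                 # crude time embedding
    pred = denoiser(torch.cat([x_t, t_feat], dim=1))      # predict clean emb
    loss = ((pred - clean_emb) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

emb = torch.randn(32, dim)  # stand-in for embeddings from a speaker encoder
print("loss:", train_step(emb))
```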

[AI-113] Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting INTERSPEECH2025

【速读】:该论文旨在解决基于文本注册的开放词汇关键词检测(open-vocabulary keyword spotting, KWS)中音频与文本模态之间嵌入表示异质性带来的挑战。为了解决这一问题,其关键在于通过深度度量学习(deep metric learning, DML)优化音频和文本编码器,使多模态嵌入能够在共享的嵌入空间中进行直接比较。同时,引入模态对抗学习(Modality Adversarial Learning, MAL),通过对抗训练模态分类器,促使编码器生成具有模态不变性的嵌入表示,从而减少异构模态间的领域差异。

链接: https://arxiv.org/abs/2505.16735
作者: Youngmoon Jung,Yong-Hyeok Lee,Myunghun Jung,Jaeyoung Roh,Chang Woo Han,Hoon-Young Cho
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figures, Accepted at Interspeech 2025

点击查看摘要

Abstract:For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct comprehensive comparisons across various DML objectives. Experiments on the Wall Street Journal (WSJ) and LibriPhrase datasets demonstrate the effectiveness of the proposed approach.
zh
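
模态对抗学习通常借助梯度反转层(Gradient Reversal Layer)实现:模态分类器学习区分音频/文本嵌入,而反转后的梯度促使两个编码器生成模态不变的表示。以下为基于该常见做法的示意草图,各模块维度均为假设,并非论文官方代码。

```python
import torch
import torch.nn as nn

# A sketch of the core adversarial trick, assuming the usual gradient-reversal
# formulation: the modality classifier is trained to tell audio from text
# embeddings, while reversed gradients push both encoders toward
# modality-invariant representations. All module sizes are illustrative.

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # flip the sign toward the encoders

dim = 128
audio_enc = nn.Linear(40, dim)    # stand-ins for the real encoders
text_enc = nn.Linear(300, dim)
mod_clf = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 2))
params = list(audio_enc.parameters()) + list(text_enc.parameters()) \
       + list(mod_clf.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

audio = torch.randn(16, 40)
text = torch.randn(16, 300)
z = torch.cat([audio_enc(audio), text_enc(text)], dim=0)
labels = torch.cat([torch.zeros(16), torch.ones(16)]).long()

logits = mod_clf(GradReverse.apply(z, 1.0))
adv_loss = nn.functional.cross_entropy(logits, labels)
opt.zero_grad(); adv_loss.backward(); opt.step()
print("adversarial modality loss:", adv_loss.item())
```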

[AI-114] Materials Generation in the Era of Artificial Intelligence: A Comprehensive Survey

【速读】:该论文试图解决当前材料发现过程中对高效、精准设计具有特定性能的新材料的需求,以及现有研究中缺乏系统性综述的问题。其解决方案的关键在于利用数据驱动的生成式人工智能(Generative AI)模型,通过直接生成满足预设性能要求的新材料,从而加速材料的发现与设计过程。

链接: https://arxiv.org/abs/2505.16379
作者: Zhixun Li,Bin Cao,Rui Jiao,Liang Wang,Ding Wang,Yang Liu,Dingshuo Chen,Jia Li,Qiang Liu,Yu Rong,Liang Wang,Tong-yi Zhang,Jeffrey Xu Yu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:Materials are the foundation of modern society, underpinning advancements in energy, electronics, healthcare, transportation, and infrastructure. The ability to discover and design new materials with tailored properties is critical to solving some of the most pressing global challenges. In recent years, the growing availability of high-quality materials data combined with rapid advances in Artificial Intelligence (AI) has opened new opportunities for accelerating materials discovery. Data-driven generative models provide a powerful tool for materials design by directly creating novel materials that satisfy predefined property requirements. Despite the proliferation of related work, there remains a notable lack of up-to-date and systematic surveys in this area. To fill this gap, this paper provides a comprehensive overview of recent progress in AI-driven materials generation. We first organize various types of materials and illustrate multiple representations of crystalline materials. We then provide a detailed summary and taxonomy of current AI-driven materials generation approaches. Furthermore, we discuss the common evaluation metrics and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future directions and challenges in this fast-growing field. The related sources can be found at this https URL.
zh

[AI-115] Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

【速读】:该论文旨在解决语音流畅性障碍(speech dysfluency)自动检测的问题,传统方法在分类任务中表现有限,且文本无关模型在上下文依赖的病例中容易出现误判。其解决方案的关键在于提出Dysfluent-WFST,这是一种零样本解码器,能够同时进行音素转录和流畅性检测,通过引入上游编码器如WavLM而无需额外训练,在音素错误率和流畅性检测方面实现了最先进性能,表明在解码过程中显式建模发音行为比复杂架构更为关键。

链接: https://arxiv.org/abs/2505.16351
作者: Chenxu Guo,Jiachen Lian,Xuanru Zhou,Jinming Zhang,Shuhe Li,Zongli Ye,Hwi Joo Park,Anaisha Das,Zoe Ezzes,Jet Vonk,Brittany Morin,Rian Bogley,Lisa Wauters,Zachary Miller,Maria Gorno-Tempini,Gopala Anumanchipalli
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
zh

[AI-116] Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNN)在大规模模型优化中的挑战,特别是针对模型压缩中的细粒度剪枝-量化问题。解决方案的关键在于将模型压缩问题重新表述为二次无约束二进制优化(Quadratic Unconstrained Binary Optimization, QUBO)问题,并利用绝热量子计算(Adiabatic Quantum Computing, AQC)技术进行求解,从而实现对实际DNN模型的有效压缩。

链接: https://arxiv.org/abs/2505.16332
作者: Zhehui Wang,Benjamin Chen Ming Choong,Tian Huang,Daniel Gerlinghoff,Rick Siow Mong Goh,Cheng Liu,Tao Luo
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantum optimization is the most mature quantum computing technology to date, providing a promising approach towards efficiently solving complex combinatorial problems. Methods such as adiabatic quantum computing (AQC) have been employed in recent years on important optimization problems across various domains. In deep learning, deep neural networks (DNN) have reached immense sizes to support new predictive capabilities. Optimization of large-scale models is critical for sustainable deployment, but becomes increasingly challenging with ever-growing model sizes and complexity. While quantum optimization is suitable for solving complex problems, its application to DNN optimization is not straightforward, requiring thorough reformulation for compatibility with commercially available quantum devices. In this work, we explore the potential of adopting AQC for fine-grained pruning-quantization of convolutional neural networks. We rework established heuristics to formulate model compression as a quadratic unconstrained binary optimization (QUBO) problem, and assess the solution space offered by commercial quantum annealing devices. Through our exploratory efforts of reformulation, we demonstrate that AQC can achieve effective compression of practical DNN models. Experiments demonstrate that adiabatic quantum computing (AQC) not only outperforms classical algorithms like genetic algorithms and reinforcement learning in terms of time efficiency but also excels at identifying global optima.
zh
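
将剪枝-量化表述为 QUBO 的核心是:用二进制变量表示“保留/丢弃”某个权重,把重构误差与稀疏惩罚写入二次目标。下面是一个玩具级示意(系数为假设值),用穷举代替量子退火器求解,仅用于说明这种表述方式。

```python
import itertools
import numpy as np

# An illustrative QUBO for pruning, solved by brute force rather than a
# quantum annealer (the paper targets commercial AQC hardware). Binary
# variable b_i = 1 keeps weight i; the objective trades reconstruction error
# against a sparsity penalty. All coefficients here are toy assumptions.

w = np.array([0.9, -0.05, 0.6, 0.01])   # weights of a tiny layer
sparsity_penalty = 0.1

n = len(w)
Q = np.zeros((n, n))
for i in range(n):
    # Dropping weight i costs w_i^2 (error); keeping it costs the penalty.
    Q[i, i] = sparsity_penalty - w[i] ** 2   # minimize b^T Q b (+ const)

best = min(itertools.product([0, 1], repeat=n),
           key=lambda b: np.array(b) @ Q @ np.array(b))
print("keep mask:", best)   # large-magnitude weights survive
```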

[AI-117] Artificial Intelligence for Direct Prediction of Molecular Dynamics Across Chemical Space

【速读】:该论文旨在解决传统分子动力学(Molecular Dynamics, MD)模拟中由于依赖顺序数值积分而导致的模拟效率低下问题。其解决方案的关键在于提出MDtrajNet-1,一个基于等变神经网络与Transformer架构的生成式AI模型,该模型能够直接生成跨越化学空间的MD轨迹,绕过力场计算和积分过程,从而将模拟速度提升两个数量级。

链接: https://arxiv.org/abs/2505.16301
作者: Fuchun Ge,Pavlo O. Dral
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Molecular dynamics (MD) is a powerful tool for exploring the behavior of atomistic systems, but its reliance on sequential numerical integration limits simulation efficiency. We present MDtrajNet-1, a foundational AI model that directly generates MD trajectories across chemical space, bypassing force calculations and integration. This approach accelerates simulations by up to two orders of magnitude compared to traditional MD, even those enhanced by machine-learning interatomic potentials. MDtrajNet-1 combines equivariant neural networks with a Transformer-based architecture to achieve strong accuracy and transferability in predicting long-time trajectories for both known and unseen systems. Remarkably, the errors of the trajectories generated by MDtrajNet-1 for various molecular systems are close to those of the conventional ab initio MD. The model’s flexible design supports diverse application scenarios, including different statistical ensembles, boundary conditions, and interaction types. By overcoming the intrinsic speed barrier of conventional MD, MDtrajNet-1 opens new frontiers in efficient and scalable atomistic simulations.
zh

[AI-118] Using Echo-State Networks to Reproduce Rare Events in Chaotic Systems

【速读】:该论文试图解决在混沌状态下预测竞争性Lotka-Volterra模型的时间序列和统计特性的问题,其解决方案的关键在于应用Echo-State Networks(ESN)来学习该模型的混沌吸引子,并准确再现依赖变量的直方图,包括尾部和罕见事件,同时采用广义极值分布(Generalized Extreme Value distribution)来量化尾部行为。

链接: https://arxiv.org/abs/2505.16208
作者: Anton Erofeev,Balasubramanya T. Nadiga,Ilya Timofeyev
机构: 未知
类目: Chaotic Dynamics (nlin.CD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:We apply the Echo-State Networks to predict the time series and statistical properties of the competitive Lotka-Volterra model in the chaotic regime. In particular, we demonstrate that Echo-State Networks successfully learn the chaotic attractor of the competitive Lotka-Volterra model and reproduce histograms of dependent variables, including tails and rare events. We use the Generalized Extreme Value distribution to quantify the tail behavior.
zh
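
下面给出一个最小的 Echo-State Network 草图:随机生成回声状态库(reservoir)、缩放谱半径、用岭回归学习输出权重。为保持自包含,这里以混沌区的 logistic 映射代替论文中的竞争性 Lotka-Volterra 系统,库规模与正则系数均为示例假设。

```python
import numpy as np

# A minimal echo-state network fitted to a chaotic scalar series. Reservoir
# size, spectral radius, and ridge penalty are illustrative; the paper applies
# ESNs to the competitive Lotka-Volterra system instead of this toy map.

rng = np.random.default_rng(0)
N, rho = 200, 0.9
W = rng.normal(size=(N, N))
W *= rho / max(abs(np.linalg.eigvals(W)))     # set spectral radius
W_in = rng.normal(size=(N, 1))

def run_reservoir(u):
    states = np.zeros((len(u), N))
    x = np.zeros(N)
    for t, ut in enumerate(u):
        x = np.tanh(W @ x + W_in[:, 0] * ut)  # leak-free reservoir update
        states[t] = x
    return states

# Toy series: the logistic map in its chaotic regime.
u = np.empty(2000); u[0] = 0.4
for t in range(1999):
    u[t + 1] = 3.9 * u[t] * (1 - u[t])

X = run_reservoir(u[:-1])
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(N), X.T @ u[1:])  # readout
pred = X @ W_out
print("one-step MSE:", np.mean((pred - u[1:]) ** 2))
```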

[AI-119] Interpretable Machine Learning for Macro Alpha: A News Sentiment Case Study ALT

【速读】:该论文试图解决如何从全球新闻情感中提取宏观经济阿尔法(macroeconomic alpha)的问题。其解决方案的关键在于构建一个可解释的机器学习(interpretable machine learning, ML)框架,利用FinBERT模型对GDELT项目的全球新闻流进行处理,生成包含均值语气、波动性和事件影响的日度情感指数,并通过XGBoost分类器预测主要外汇和债券期货的次日收益率,从而实现高性能的交易策略。

链接: https://arxiv.org/abs/2505.16136
作者: Yuke Zhang
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
备注: 18 pages (including references), 1 figure, 1 table. Code available at \url{ this https URL }. Keywords: Macro Sentiment, News Sentiment, Algorithmic Trading, GDELT, FinBERT, NLP, Alternative Data, Foreign Exchange, Treasury Futures, Quantitative Finance, Machine Learning, SHAP, Interpretability

点击查看摘要

Abstract:This study introduces an interpretable machine learning (ML) framework to extract macroeconomic alpha from global news sentiment. We process the Global Database of Events, Language, and Tone (GDELT) Project’s worldwide news feed using FinBERT – a Bidirectional Encoder Representations from Transformers (BERT) based model pretrained on finance-specific language – to construct daily sentiment indices incorporating mean tone, dispersion, and event impact. These indices drive an XGBoost classifier, benchmarked against logistic regression, to predict next-day returns for EUR/USD, USD/JPY, and 10-year U.S. Treasury futures (ZN). Rigorous out-of-sample (OOS) backtesting (5-fold expanding-window cross-validation, OOS period: c. 2017-April 2025) demonstrates exceptional, cost-adjusted performance for the XGBoost strategy: Sharpe ratios achieve 5.87 (EUR/USD), 4.65 (USD/JPY), and 4.65 (Treasuries), with respective compound annual growth rates (CAGRs) exceeding 50% in Foreign Exchange (FX) and 22% in bonds. Shapley Additive Explanations (SHAP) affirm that sentiment dispersion and article impact are key predictive features. Our findings establish that integrating domain-specific Natural Language Processing (NLP) with interpretable ML offers a potent and explainable source of macro alpha.
zh
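
论文的建模环节可以示意为“日度情感特征 → 次日涨跌分类”。下面的草图在合成数据上演示这一流程,并以 scikit-learn 的梯度提升代替 XGBoost 以减少依赖(特征构造与数据均为假设,FinBERT 情感抽取环节此处从略)。

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# A schematic of the study's modelling step on synthetic data: daily sentiment
# features (mean tone, dispersion, event impact) predict next-day up/down
# moves, with logistic regression as the benchmark. We use scikit-learn's
# gradient boosting in place of XGBoost to keep the sketch dependency-light.

rng = np.random.default_rng(1)
n = 1500
tone = rng.normal(size=n)
dispersion = rng.gamma(2.0, 1.0, size=n)
impact = rng.normal(size=n)
X = np.column_stack([tone, dispersion, impact])
# Synthetic next-day direction, loosely driven by tone and impact.
y = (0.8 * tone + 0.3 * impact + rng.normal(scale=1.0, size=n) > 0).astype(int)

split = 1000  # walk-forward style split: train on the past, test on the future
gb = GradientBoostingClassifier().fit(X[:split], y[:split])
lr = LogisticRegression().fit(X[:split], y[:split])
print("boosting acc:", gb.score(X[split:], y[split:]))
print("logistic acc:", lr.score(X[split:], y[split:]))
```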

[AI-120] An Inclusive Foundation Model for Generalizable Cytogenetics in Precision Oncology

【速读】:该论文旨在解决染色体分析中因染色体异常的复杂性和多样性导致的AI模型开发困难问题,以及自动化方法因缺乏涵盖多种资源条件的全面数据集而表现出的任务特异性与泛化能力不足的问题。其解决方案的关键在于提出CHROMA,一个用于细胞遗传学的预训练基础模型,通过自监督学习在超过84,000个样本(约400万张染色体图像)上进行预训练,从而学习到可泛化的染色体异常表征,能够在标注数据较少和数据不平衡的情况下仍优于其他方法,为临床分析提供可扩展、通用的解决方案。

链接: https://arxiv.org/abs/2505.15868
作者: Changchun Yang(1,2,3,4),Weiqian Dai(1),Yilan Zhang(2,3,4),Siyuan Chen(2,3,4),Jingdong Hu(5),Junkai Su(5),Yuxuan Chen(5),Ao Xu(5),Na Li(5),Xin Gao(2,3,4),Yongguo Yu(1) ((1) Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine (2) Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia (3) Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia (4) Center of Excellence on Generative AI, King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia (5) Smiltec (Suzhou) Co., Ltd)
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: These authors contributed equally to this work: Changchun Yang, Weiqian Dai, Yilan Zhang

点击查看摘要

Abstract:Chromosome analysis is vital for diagnosing genetic disorders and guiding cancer therapy decisions through the identification of somatic clonal aberrations. However, developing an AI model is hindered by the overwhelming complexity and diversity of chromosomal abnormalities, requiring extensive annotation efforts, while automated methods remain task-specific and lack generalizability due to the scarcity of comprehensive datasets spanning diverse resource conditions. Here, we introduce CHROMA, a foundation model for cytogenomics, designed to overcome these challenges by learning generalizable representations of chromosomal abnormalities. Pre-trained on over 84,000 specimens (~4 million chromosomal images) via self-supervised learning, CHROMA outperforms other methods across all types of abnormalities, even when trained on fewer labelled data and more imbalanced datasets. By facilitating comprehensive mapping of instability and clonal lesions across various aberration types, CHROMA offers a scalable and generalizable solution for reliable and automated clinical analysis, reducing the annotation workload for experts and advancing precision oncology through the early detection of rare genomic abnormalities, enabling broad clinical AI applications and making advanced genomic analysis more accessible.
zh

[AI-121] What Lives? A meta-analysis of diverse opinions on the definition of life

【速读】:该论文试图解决“生命”定义的跨学科分歧问题,旨在通过计算方法揭示不同领域对生命概念的理解差异与共性。其解决方案的关键在于利用生成式 AI (Generative AI) 和大型语言模型(LLMs)对跨学科专家提供的生命定义进行语义分析,通过配对相关性分析、凝聚聚类、簇内语义分析及t-SNE投影,将定义映射到特征向量空间,从而揭示生命概念的连续性景观,提出一种连接还原论与整体论方法的新路径。

链接: https://arxiv.org/abs/2505.15849
作者: Reed Bender,Karina Kofman,Blaise Agüera y Arcas,Michael Levin
机构: 未知
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Biomolecules (q-bio.BM); Cell Behavior (q-bio.CB); Subcellular Processes (q-bio.SC); Applications (stat.AP)
备注: 54 pages, 4 figures, 2 tables, 11 supplemental figures, 3 supplemental tables

点击查看摘要

Abstract:The question of “what is life?” has challenged scientists and philosophers for centuries, producing an array of definitions that reflect both the mystery of its emergence and the diversity of disciplinary perspectives brought to bear on the question. Despite significant progress in our understanding of biological systems, psychology, computation, and information theory, no single definition for life has yet achieved universal acceptance. This challenge becomes increasingly urgent as advances in synthetic biology, artificial intelligence, and astrobiology challenge our traditional conceptions of what it means to be alive. We undertook a methodological approach that leverages large language models (LLMs) to analyze a set of definitions of life provided by a curated set of cross-disciplinary experts. We used a novel pairwise correlation analysis to map the definitions into distinct feature vectors, followed by agglomerative clustering, intra-cluster semantic analysis, and t-SNE projection to reveal underlying conceptual archetypes. This methodology revealed a continuous landscape of the themes relating to the definition of life, suggesting that what has historically been approached as a binary taxonomic problem should be instead conceived as differentiated perspectives within a unified conceptual latent space. We offer a new methodological bridge between reductionist and holistic approaches to fundamental questions in science and philosophy, demonstrating how computational semantic analysis can reveal conceptual patterns across disciplinary boundaries, and opening similar pathways for addressing other contested definitional territories across the sciences.
zh

机器学习

[LG-0] A Unified Framework for Simultaneous Parameter and Function Discovery in Differential Equations

链接: https://arxiv.org/abs/2505.16996
作者: Shalev Manor,Mohammad Kohandel
类目: Machine Learning (cs.LG)
*备注: 13 pages, 8 figures

点击查看摘要

Abstract:Inverse problems involving differential equations often require identifying unknown parameters or functions from data. Existing approaches, such as Physics-Informed Neural Networks (PINNs), Universal Differential Equations (UDEs) and Universal Physics-Informed Neural Networks (UPINNs), are effective at isolating either parameters or functions but can face challenges when applied simultaneously due to solution non-uniqueness. In this work, we introduce a framework that addresses these limitations by establishing conditions under which unique solutions can be guaranteed. To illustrate, we apply it to examples from biological systems and ecological dynamics, demonstrating accurate and interpretable results. Our approach significantly enhances the potential of machine learning techniques in modeling complex systems in science and engineering.

[LG-1] PICT – A Differentiable GPU-Accelerated Multi-Block PISO Solver for Simulation-Coupled Learning Tasks in Fluid Dynamics

链接: https://arxiv.org/abs/2505.16992
作者: Aleksandra Franz,Hao Wei,Luca Guastoni,Nils Thuerey
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Source code at this https URL

点击查看摘要

Abstract:Despite decades of advancements, the simulation of fluids remains one of the most challenging areas in scientific computing. Supported by the necessity of gradient information in deep learning, differentiable simulators have emerged as an effective tool for optimization and learning in physics simulations. In this work, we present our fluid simulator PICT, a differentiable pressure-implicit solver coded in PyTorch with Graphics-processing-unit (GPU) support. We first verify the accuracy of both the forward simulation and our derived gradients in various established benchmarks like lid-driven cavities and turbulent channel flows before we show that the gradients provided by our solver can be used to learn complicated turbulence models in 2D and 3D. We apply both supervised and unsupervised training regimes using physical priors to match flow statistics. In particular, we learn a stable sub-grid scale (SGS) model for a 3D turbulent channel flow purely based on reference statistics. The low-resolution corrector trained with our solver runs substantially faster than the highly resolved references, while keeping or even surpassing their accuracy. Finally, we give additional insights into the physical interpretation of different solver gradients, and motivate a physically informed regularization technique. To ensure that the full potential of PICT can be leveraged, it is published as open source: this https URL.

[LG-2] Bigger Isnt Always Memorizing: Early Stopping Overparameterized Diffusion Models

链接: https://arxiv.org/abs/2505.16959
作者: Alessandro Favero,Antonio Sclocchi,Matthieu Wyart
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models were perfectly minimizing their training loss, they would just generate data belonging to their training set, i.e., memorize, as empirically found in the overparameterized regime. We revisit this view by showing that, in highly overparameterized diffusion models, generalization in natural data domains is progressively achieved during training before the onset of memorization. Our results, ranging from image to language diffusion models, systematically support the empirical law that memorization time is proportional to the dataset size. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules, where generalization corresponds to the hierarchical acquisition of deeper grammar rules as training time grows, and the generalization cost of early stopping can be characterized. We summarize these results in a phase diagram. Overall, our results support that a principled early-stopping criterion - scaling with dataset size - can effectively optimize generalization while avoiding memorization, with direct implications for hyperparameter transfer and privacy-sensitive applications.

[LG-3] ICYM2I: The illusion of multimodal informativeness under missingness

链接: https://arxiv.org/abs/2505.16953
作者: Young Sang Choi,Vincent Jeanselme,Pierre Elias,Shalmali Joshi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multimodal learning is of continued interest in artificial intelligence-based applications, motivated by the potential information gain from combining different types of data. However, modalities collected and curated during development may differ from the modalities available at deployment due to multiple factors including cost, hardware failure, or – as we argue in this work – the perceived informativeness of a given modality. Naïve estimation of the information gain associated with including an additional modality without accounting for missingness may result in improper estimates of that modality’s value in downstream tasks. Our work formalizes the problem of missingness in multimodal learning and demonstrates the biases resulting from ignoring this process. To address this issue, we introduce ICYM2I (In Case You Multimodal Missed It), a framework for the evaluation of predictive performance and information gain under missingness through inverse probability weighting-based correction. We demonstrate the importance of the proposed adjustment to estimate information gain under missingness on synthetic, semi-synthetic, and real-world medical datasets.
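
ICYM2I 的核心修正是逆概率加权(IPW)。下面的小例子演示:当第二模态的缺失概率依赖于第一模态时,仅在完整样本上评估会产生偏差,而按观测倾向性加权可以纠正该偏差(数据与倾向性模型均为合成假设,并非论文的完整框架)。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A small illustration of inverse probability weighting (IPW) for modality
# missingness, the correction at the heart of ICYM2I. A second modality x2 is
# observed with probability depending on x1 (missing at random given x1);
# naive evaluation on complete cases is biased, IPW reweights it.

rng = np.random.default_rng(0)
n = 20000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
y = (x1 + x2 + rng.normal(size=n) > 0).astype(int)

p_obs = 1 / (1 + np.exp(-2 * x1))        # observation propensity (known here)
observed = rng.random(n) < p_obs

model = LogisticRegression().fit(np.c_[x1, x2][observed], y[observed])
correct = model.predict(np.c_[x1, x2][observed]) == y[observed]

naive = correct.mean()                    # complete-case accuracy (biased)
w = 1.0 / p_obs[observed]                 # inverse probability weights
ipw = np.average(correct, weights=w)      # corrected estimate
full = (model.predict(np.c_[x1, x2]) == y).mean()  # ground truth on everyone
print(f"naive={naive:.3f}  ipw={ipw:.3f}  full={full:.3f}")
```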

[LG-4] A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization

链接: https://arxiv.org/abs/2505.16952
作者: Shengyu Feng,Weiwei Sun,Shanda Li,Ameet Talwalkar,Yiming Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) has demonstrated considerable potential in supporting model design and optimization for combinatorial optimization (CO) problems. However, much of the progress to date has been evaluated on small-scale, synthetic datasets, raising concerns about the practical effectiveness of ML-based solvers in real-world, large-scale CO scenarios. Additionally, many existing CO benchmarks lack sufficient training data, limiting their utility for evaluating data-driven approaches. To address these limitations, we introduce FrontierCO, a comprehensive benchmark that covers eight canonical CO problem types and evaluates 16 representative ML-based solvers–including graph neural networks and large language model (LLM) agents. FrontierCO features challenging instances drawn from industrial applications and frontier CO research, offering both realistic problem difficulty and abundant training data. Our empirical results provide critical insights into the strengths and limitations of current ML methods, helping to guide more robust and practically relevant advances at the intersection of machine learning and combinatorial optimization. Our data is available at this https URL.

[LG-5] NY Real Estate Racial Equity Analysis via Applied Machine Learning

链接: https://arxiv.org/abs/2505.16946
作者: Sanjana Chalavadi,Andrei Pastor,Terry Leitch
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study analyzes tract-level real estate ownership patterns in New York State (NYS) and New York City (NYC) to uncover racial disparities. We use an advanced race/ethnicity imputation model (LSTM+Geo with XGBoost filtering, validated at 89.2% accuracy) to compare the predicted racial composition of property owners to the resident population from census data. We examine both a Full Model (statewide) and a Name-Only LSTM Model (NYC) to assess how incorporating geospatial context affects our predictions and disparity estimates. The results reveal significant inequities: White individuals hold a disproportionate share of properties and property value relative to their population, while Black, Hispanic, and Asian communities are underrepresented as property owners. These disparities are most pronounced in minority-majority neighborhoods, where ownership is predominantly White despite a predominantly non-White population. Corporate ownership (LLCs, trusts, etc.) exacerbates these gaps by reducing owner-occupied opportunities in urban minority communities. We provide a breakdown of ownership vs. population by race for majority-White, -Black, -Hispanic, and -Asian tracts, identify those with extreme ownership disparities, and compare patterns in urban, suburban, and rural contexts. The findings underscore persistent racial inequity in property ownership, reflecting broader historical and socio-economic forces, and highlight the importance of data-driven approaches to address these issues.

[LG-6] SPAR: Self-supervised Placement-Aware Representation Learning for Multi-Node IoT Systems

链接: https://arxiv.org/abs/2505.16936
作者: Yizhuo Chen,Tianchen Wang,You Lyu,Yanlan Hu,Jinyang Li,Tomoyoshi Kimura,Hongjue Zhao,Yigong Hu,Denizhan Kara,Tarek Abdelzaher
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work develops the underpinnings of self-supervised placement-aware representation learning given spatially-distributed (multi-view and multimodal) sensor observations, motivated by the need to represent external environmental state in multi-sensor IoT systems in a manner that correctly distills spatial phenomena from the distributed multi-vantage observations. The objective of sensing in IoT systems is, in general, to collectively represent an externally observed environment given multiple vantage points from which sensory observations occur. Pretraining of models that help interpret sensor data must therefore encode the relation between signals observed by sensors and the observers’ vantage points in order to attain a representation that encodes the observed spatial phenomena in a manner informed by the specific placement of the measuring instruments, while allowing arbitrary placement. The work significantly advances self-supervised model pretraining from IoT signals beyond current solutions that often overlook the distinctive spatial nature of IoT data. Our framework explicitly learns the dependencies between measurements and geometric observer layouts and structural characteristics, guided by a core design principle: the duality between signals and observer positions. We further provide theoretical analyses from the perspectives of information theory and occlusion-invariant representation learning to offer insight into the rationale behind our design. Experiments on three real-world datasets–covering vehicle monitoring, human activity recognition, and earthquake localization–demonstrate the superior generalizability and robustness of our method across diverse modalities, sensor placements, application-level inference tasks, and spatial scales.

[LG-7] Risk-Averse Reinforcement Learning with Itakura-Saito Loss

链接: https://arxiv.org/abs/2505.16925
作者: Igor Udovichenko,Olivier Croissant,Anita Toleutaeva,Evgeny Burnaev,Alexander Korotin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Risk-averse reinforcement learning finds application in various high-stakes fields. Unlike classical reinforcement learning, which aims to maximize expected returns, risk-averse agents choose policies that minimize risk, occasionally sacrificing expected value. These preferences can be framed through utility theory. We focus on the specific case of the exponential utility function, where we can derive the Bellman equations and employ various reinforcement learning algorithms with few modifications. However, these methods suffer from numerical instability due to the need for exponent computation throughout the process. To address this, we introduce a numerically stable and mathematically sound loss function based on the Itakura-Saito divergence for learning state-value and action-value functions. We evaluate our proposed loss function against established alternatives, both theoretically and empirically. In the experimental section, we explore multiple financial scenarios, some with known analytical solutions, and show that our loss function outperforms the alternatives.
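
Itakura-Saito 散度 d(x, y) = x/y - log(x/y) - 1 只涉及比值与对数,避免了指数效用学习中的数值不稳定。下面给出一个基于该散度的回归损失草图,并在玩具数据上拟合一个正值模型(模型与数据均为示例假设,并非论文完整的 RL 算法)。

```python
import torch

# A sketch of an Itakura-Saito-based regression loss for positive targets,
# the kind of objective the paper proposes for learning exponential-utility
# value functions without numerically unstable exponentiation. The toy fit
# below is illustrative only.

def itakura_saito_loss(pred, target, eps=1e-8):
    """IS divergence d(target, pred) = t/p - log(t/p) - 1, averaged."""
    ratio = (target + eps) / (pred + eps)
    return (ratio - torch.log(ratio) - 1.0).mean()

# Fit a positive-valued model under the IS loss.
torch.manual_seed(0)
x = torch.rand(256, 1)
target = 2.0 * x.squeeze() + 0.5            # positive targets
w = torch.randn(1, requires_grad=True)
b = torch.ones(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.05)
for _ in range(300):
    pred = torch.nn.functional.softplus(w * x.squeeze() + b)  # keep positive
    loss = itakura_saito_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
print("final IS loss:", loss.item())
```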

[LG-8] Scalable and Interpretable Contextual Bandits: A Literature Review and Retail Offer Prototype

链接: https://arxiv.org/abs/2505.16918
作者: Nikola Tankovic,Robert Sajina
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a concise review of Contextual Multi-Armed Bandit (CMAB) methods and introduces an experimental framework for scalable, interpretable offer selection, addressing the challenge of fast-changing offers. The approach models context at the product category level, allowing offers to span multiple categories and enabling knowledge transfer across similar offers. This improves learning efficiency and generalization in dynamic environments. The framework extends standard CMAB methodology to support multi-category contexts, and achieves scalability through efficient feature engineering and modular design. Advanced features such as MPG (Member Purchase Gap) and MF (Matrix Factorization) capture nuanced user-offer interactions, with implementation in Python for practical deployment. A key contribution is interpretability at scale: logistic regression models yield transparent weight vectors, accessible via a large language model (LLM) interface for real-time, user-level tracking and explanation of evolving preferences. This enables the generation of detailed member profiles and identification of behavioral patterns, supporting personalized offer optimization and enhancing trust in automated decisions. By situating our prototype alongside established paradigms like Generalized Linear Models and Thompson Sampling, we demonstrate its value for both research and real-world CMAB applications.

[LG-9] Unsupervised Prompting for Graph Neural Networks

链接: https://arxiv.org/abs/2505.16903
作者: Peyman Baghershahi,Sourav Medya
类目: Machine Learning (cs.LG)
*备注: 25 pages, 5 figures, 14 tables

点击查看摘要

Abstract:Prompt tuning methods for Graph Neural Networks (GNNs) have become popular to address the semantic gap between pre-training and fine-tuning steps. However, existing GNN prompting methods rely on labeled data and involve lightweight fine-tuning for downstream tasks. Meanwhile, in-context learning methods for Large Language Models (LLMs) have shown promising performance with no parameter updating and no or minimal labeled data. Inspired by these approaches, in this work, we first introduce a challenging problem setup to evaluate GNN prompting methods. This setup encourages a prompting function to enhance a pre-trained GNN’s generalization to a target dataset under covariate shift without updating the GNN’s parameters and with no labeled data. Next, we propose a fully unsupervised prompting method based on consistency regularization through pseudo-labeling. We use two regularization techniques to align the prompted graphs’ distribution with the original data and reduce biased predictions. Through extensive experiments under our problem setting, we demonstrate that our unsupervised approach outperforms the state-of-the-art prompting methods that have access to labels.

[LG-10] Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks

链接: https://arxiv.org/abs/2505.16901
作者: Hongyuan Tao,Ying Zhang,Zhenhao Tang,Hongen Peng,Xukun Zhu,Bingchang Liu,Yingguang Yang,Ziyin Zhang,Zhaogui Xu,Haipeng Zhang,Linchao Zhu,Rui Wang,Hang Yu,Jianguo Li,Peng Di
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 31 pages, 9 figures

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM’s attention mechanism and map node attributes to the LLM’s input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model. This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%.

[LG-11] A Multi-Step Comparative Framework for Anomaly Detection in IoT Data Streams

链接: https://arxiv.org/abs/2505.16872
作者: Mohammed Al-Qudah,Fadi AlMahamid
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid expansion of Internet of Things (IoT) devices has introduced critical security challenges, underscoring the need for accurate anomaly detection. Although numerous studies have proposed machine learning (ML) methods for this purpose, limited research systematically examines how different preprocessing steps–normalization, transformation, and feature selection–interact with distinct model architectures. To address this gap, this paper presents a multi-step evaluation framework assessing the combined impact of preprocessing choices on three ML algorithms: RNN-LSTM, autoencoder neural networks (ANN), and Gradient Boosting (GBoosting). Experiments on the IoTID20 dataset show that GBoosting consistently delivers superior accuracy across preprocessing configurations, while RNN-LSTM shows notable gains with z-score normalization and autoencoders excel in recall, making them well-suited for unsupervised scenarios. By offering a structured analysis of preprocessing decisions and their interplay with various ML techniques, the proposed framework provides actionable guidance to enhance anomaly detection performance in IoT environments.

[LG-12] Redefining Clustered Federated Learning for System Identification: The Path of ClusterCraft

链接: https://arxiv.org/abs/2505.16857
作者: Ertuğrul Keçeci,Müjde Güzelkaya,Tufan Kumbasar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the System Identification (SYSID) problem within the framework of federated learning. We introduce a novel algorithm, Incremental Clustering-based federated learning method for SYSID (IC-SYSID), designed to tackle SYSID challenges across multiple data sources without prior knowledge. IC-SYSID utilizes an incremental clustering method, ClusterCraft (CC), to eliminate the dependency on the prior knowledge of the dataset. CC starts with a single cluster model and assigns similar local workers to the same clusters by dynamically increasing the number of clusters. To reduce the number of clusters generated by CC, we introduce ClusterMerge, where similar cluster models are merged. We also introduce enhanced ClusterCraft to reduce the generation of similar cluster models during the training. Moreover, IC-SYSID addresses cluster model instability by integrating a regularization term into the loss function and initializing cluster models with scaled Glorot initialization. It also utilizes a mini-batch deep learning approach to manage large SYSID datasets during local training. Through experiments conducted on a real-world SYSID problem, where a fleet of vehicles collaboratively learns vehicle dynamics, we show that IC-SYSID achieves high SYSID performance while preventing the learning of unstable clusters.

[LG-13] Strategically Linked Decisions in Long-Term Planning and Reinforcement Learning

链接: https://arxiv.org/abs/2505.16833
作者: Alihan Hüyük,Finale Doshi-Velez
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-term planning, as in reinforcement learning (RL), involves finding strategies: actions that collectively work toward a goal rather than individually optimizing their immediate outcomes. As part of a strategy, some actions are taken at the expense of short-term benefit to enable future actions with even greater returns. These actions are only advantageous if followed up by the actions they facilitate, consequently, they would not have been taken if those follow-ups were not available. In this paper, we quantify such dependencies between planned actions with strategic link scores: the drop in the likelihood of one decision under the constraint that a follow-up decision is no longer available. We demonstrate the utility of strategic link scores through three practical applications: (i) explaining black-box RL agents by identifying strategically linked pairs among decisions they make, (ii) improving the worst-case performance of decision support systems by distinguishing whether recommended actions can be adopted as standalone improvements or whether they are strategically linked hence requiring a commitment to a broader strategy to be effective, and (iii) characterizing the planning processes of non-RL agents purely through interventions aimed at measuring strategic link scores - as an example, we consider a realistic traffic simulator and analyze through road closures the effective planning horizon of the emergent routing behavior of many drivers.
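
策略关联分数(strategic link score)定义为:当某个后续动作不再可用时,当前决策概率的下降量。下面用一个两步确定性决策树与 softmax 策略给出玩具级计算示意(环境与策略形式均为假设,并非论文的具体设定)。

```python
import numpy as np

# A toy computation of a "strategic link score": how much the probability of
# taking action a0 at the root drops once a particular follow-up action is
# removed from the next state. The two-step deterministic tree and softmax
# policy are illustrative assumptions.

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

# Step-2 values: state s0 (reached by a0) has a lucrative follow-up (index 0).
q_next = {"s0": np.array([10.0, 1.0]),   # follow-ups after root action a0
          "s1": np.array([4.0, 3.0])}    # follow-ups after root action a1
r_root = np.array([0.0, 2.0])            # a0 sacrifices immediate reward

def root_policy(available):
    # Q(root, a) = immediate reward + best reachable follow-up value.
    q = np.array([r_root[0] + q_next["s0"][available].max(),
                  r_root[1] + q_next["s1"].max()])
    return softmax(q)

p_full = root_policy(available=[0, 1])[0]   # a0 prob, all follow-ups open
p_cut = root_policy(available=[1])[0]       # a0 prob, follow-up 0 removed
print(f"strategic link score of (a0, follow-up 0): {p_full - p_cut:.3f}")
```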

[LG-14] Contextual Learning for Stochastic Optimization

链接: https://arxiv.org/abs/2505.16829
作者: Anna Heuser,Thomas Kesselheim
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT)
*备注: Full version of EC’25 paper

点击查看摘要

Abstract:Motivated by stochastic optimization, we introduce the problem of learning from samples of contextual value distributions. A contextual value distribution can be understood as a family of real-valued distributions, where each sample consists of a context x and a random variable drawn from the corresponding real-valued distribution D_x . By minimizing a convex surrogate loss, we learn an empirical distribution D’_x for each context, ensuring a small Lévy distance to D_x . We apply this result to obtain the sample complexity bounds for the learning of an \epsilon -optimal policy for stochastic optimization problems defined on an unknown contextual value distribution. The sample complexity is shown to be polynomial for the general case of strongly monotone and stable optimization problems, including Single-item Revenue Maximization, Pandora’s Box and Optimal Stopping.

[LG-15] LLM -Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols

链接: https://arxiv.org/abs/2505.16821
作者: Ziming liu,Bryan Liu,Alvaro Valcarce,Xiaoli Chu
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This work has been submitted to the IEEE for possible publication. Focuses on applying LLMs to 5G RRC protocol generation; primary: cs.NI; cross-list: eess.SP, cs.LG

点击查看摘要

Abstract:Integrating large AI models (LAMs) into 6G mobile networks promises to redefine protocol design and control-plane intelligence by enabling autonomous, cognitive network operations. While industry concepts, such as ETSI’s Experiential Networked Intelligence (ENI), envision LAM-driven agents for adaptive network slicing and intent-based management, practical implementations still face challenges in protocol literacy and real-world deployment. This paper presents an end-to-end demonstration of a LAM that generates standards-compliant, ASN.1-encoded Radio Resource Control (RRC) messages as part of control-plane procedures inside a gNB. We treat RRC messaging as a domain-specific language and fine-tune a decoder-only transformer model (LLaMA class) using parameter-efficient Low-Rank Adaptation (LoRA) on RRC messages linearized to retain their ASN.1 syntactic structure before standard byte-pair encoding tokenization. This enables combinatorial generalization over RRC protocol states while minimizing training overhead. On 30k field-test request-response pairs, our 8 B model achieves a median cosine similarity of 0.97 with ground-truth messages on an edge GPU – a 61 % relative gain over a zero-shot LLaMA-3 8B baseline – indicating substantially improved structural and semantic RRC fidelity. Overall, our results show that LAMs, when augmented with Radio Access Network (RAN)-specific reasoning, can directly orchestrate control-plane procedures, representing a stepping stone toward the AI-native air-interface paradigm. Beyond RRC emulation, this work lays the groundwork for future AI-native wireless standards.

[LG-16] FlowMixer: A Constrained Neural Architecture for Interpretable Spatiotemporal Forecasting

链接: https://arxiv.org/abs/2505.16786
作者: Fares B. Mehouachi,Saif Eddin Jabari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce FlowMixer, a neural architecture that leverages constrained matrix operations to model structured spatiotemporal patterns. At its core, FlowMixer incorporates non-negative matrix mixing layers within a reversible mapping framework-applying transforms before mixing and their inverses afterward. This shape-preserving design enables a Kronecker-Koopman eigenmode framework that bridges statistical learning with dynamical systems theory, providing interpretable spatiotemporal patterns and facilitating direct algebraic manipulation of prediction horizons without retraining. Extensive experiments across diverse domains demonstrate FlowMixer’s robust long-horizon forecasting capabilities while effectively modeling physical phenomena such as chaotic attractors and turbulent flows. These results suggest that architectural constraints can simultaneously enhance predictive performance and mathematical interpretability in neural forecasting systems.
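
FlowMixer 的形状保持设计可以概括为“可逆变换 → 非负混合 → 逆变换”。下面是一个按此思路写的最小层草图:行随机(softmax)矩阵保证非负混合,带符号对数作为可逆的逐元素变换(变换与维度的选取均为示例假设,并非论文的完整架构)。

```python
import torch
import torch.nn as nn

# A minimal shape-preserving mixing layer in the spirit of FlowMixer: an
# invertible elementwise transform, a non-negative (row-stochastic) mixing
# matrix, then the inverse transform. Sizes and the choice of transform are
# illustrative assumptions.

class NonNegMix(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n, n))

    def forward(self, x):                         # x: (batch, n)
        mix = torch.softmax(self.logits, dim=-1)  # non-negative rows, sum to 1
        z = torch.log1p(x.abs()) * x.sign()       # invertible signed-log map
        z = z @ mix.T                             # constrained mixing
        return torch.expm1(z.abs()) * z.sign()    # inverse of the signed log

layer = NonNegMix(8)
x = torch.randn(4, 8)
print(layer(x).shape)   # torch.Size([4, 8])
```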

[LG-17] Multi-Output Gaussian Processes for Graph-Structured Data

链接: https://arxiv.org/abs/2505.16755
作者: Ayano Nakai-Kasai,Tadashi Wadayama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph-structured data is data associated with a graph structure in which vertices and edges describe some form of correlation within the data. This paper proposes a regression method for graph-structured data, based on multi-output Gaussian processes (MOGP), to capture both the correlation between vertices and the correlation between the associated data. The proposed formulation is built on the definition of MOGP, which allows it to be applied to a wide range of data configurations and scenarios. Moreover, it has high expressive capability due to its flexibility in kernel design. It includes existing Gaussian-process methods for graph-structured data as special cases and makes it possible to remove the restrictions on data configurations, model selection, and inference scenarios found in those methods. The performance of extensions achievable by the proposed formulation is evaluated through computer experiments with synthetic and real data.

[LG-18] PyTupli: A Scalable Infrastructure for Collaborative Offline Reinforcement Learning Projects

链接: https://arxiv.org/abs/2505.16754
作者: Hannah Markgraf,Michael Eichelbeck,Daria Cappey,Selin Demirtürk,Yara Schattschneider,Matthias Althoff
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) has gained traction as a powerful paradigm for learning control policies from pre-collected data, eliminating the need for costly or risky online interactions. While many open-source libraries offer robust implementations of offline RL algorithms, they all rely on datasets composed of experience tuples consisting of state, action, next state, and reward. Managing, curating, and distributing such datasets requires suitable infrastructure. Although static datasets exist for established benchmark problems, no standardized or scalable solution supports developing and sharing datasets for novel or user-defined benchmarks. To address this gap, we introduce PyTupli, a Python-based tool to streamline the creation, storage, and dissemination of benchmark environments and their corresponding tuple datasets. PyTupli includes a lightweight client library with defined interfaces for uploading and retrieving benchmarks and data. It supports fine-grained filtering at both the episode and tuple level, allowing researchers to curate high-quality, task-specific datasets. A containerized server component enables production-ready deployment with authentication, access control, and automated certificate provisioning for secure use. By addressing key barriers in dataset infrastructure, PyTupli facilitates more collaborative, reproducible, and scalable offline RL research.

[LG-19] Revenue Optimization with Price-Sensitive and Interdependent Demand DATE

链接: https://arxiv.org/abs/2505.16748
作者: Julien Laasri,Marc Revol
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 21 pages, 17 figures, dated 2018, in French

点击查看摘要

Abstract:As Kalyan T. Talluri and Garrett J. Van Ryzin describe in their work [3], Revenue Management aims to maximize an organization’s revenue by considering three types of decision categories: structural, pricing, and quantity. In this document, our primary focus will be on decisions related to pricing and quantity for the sale of airline tickets on a direct flight over a certain number of time periods. More specifically, we will only focus on the optimization aspect of this problem. We will assume the demand data to be given, since Air France estimates it beforehand using real data. Similarly, we assume all price options to be predetermined by Air France’s algorithms and verified by their analysts. Our objective will be to maximize the revenue of a direct flight by choosing the prices for each product from the predefined set of options.

[LG-20] Meta-reinforcement learning with minimum attention

链接: https://arxiv.org/abs/2505.16741
作者: Pilhwa Lee,Shashank Gupta
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Minimum attention, first proposed by Brockett, applies the least action principle to the changes of control with respect to state and time. The involved regularization is highly relevant in emulating biological control, such as motor learning. We apply minimum attention in reinforcement learning (RL) as part of the rewards and investigate its connection to meta-learning and stabilization. Specifically, model-based meta-learning with minimum attention is explored in high-dimensional nonlinear dynamics. Ensemble-based model learning and gradient-based meta-policy learning are alternately performed. Empirically, we show that minimum attention outperforms state-of-the-art model-free and model-based RL algorithms, achieving fast adaptation in few shots and variance reduction under perturbations of the model and environment. Furthermore, minimum attention demonstrates improved energy efficiency.

[LG-21] Backward Oversmoothing: why is it hard to train deep Graph Neural Networks?

链接: https://arxiv.org/abs/2505.16736
作者: Nicolas Keriven
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Oversmoothing has long been identified as a major limitation of Graph Neural Networks (GNNs): input node features are smoothed at each layer and converge to a non-informative representation, if the weights of the GNN are sufficiently bounded. This assumption is crucial: if, on the contrary, the weights are sufficiently large, then oversmoothing may not happen. Theoretically, GNNs could thus learn not to oversmooth. However, this does not really happen in practice, which prompts us to examine oversmoothing from an optimization point of view. In this paper, we analyze backward oversmoothing, that is, the notion that backpropagated errors used to compute gradients are also subject to oversmoothing from output to input. With non-linear activation functions, we outline the key role of the interaction between forward and backward smoothing. Moreover, we show that, due to backward oversmoothing, GNNs provably exhibit many spurious stationary points: as soon as the last layer is trained, the whole GNN is at a stationary point. As a result, we can exhibit regions where gradients are near-zero while the loss stays high. The proof relies on the fact that, unlike forward oversmoothing, backward errors are subject to linear oversmoothing even in the presence of non-linear activation functions, such that the average of the output error plays a key role. Additionally, we show that this phenomenon is specific to deep GNNs, and exhibit a Multi-Layer Perceptron counter-example. This paper is a step toward a more complete comprehension of the optimization landscape specific to GNNs.

[LG-22] Maximum Total Correlation Reinforcement Learning ICML2025

链接: https://arxiv.org/abs/2505.16734
作者: Bang You,Puze Liu,Huaping Liu,Jan Peters,Oleg Arenz
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Simplicity is a powerful inductive bias. In reinforcement learning, regularization is used for simpler policies, data augmentation for simpler representations, and sparse reward functions for simpler objectives, all that, with the underlying motivation to increase generalizability and robustness by focusing on the essentials. Supplementary to these techniques, we investigate how to promote simple behavior throughout the episode. To that end, we introduce a modification of the reinforcement learning problem that additionally maximizes the total correlation within the induced trajectories. We propose a practical algorithm that optimizes all models, including policy and state representation, based on a lower-bound approximation. In simulated robot environments, our method naturally generates policies that induce periodic and compressible trajectories, and that exhibit superior robustness to noise and changes in dynamics compared to baseline methods, while also improving performance in the original tasks.

[LG-23] Forward-only Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2505.16733
作者: Ziwei Luo,Fredrik K. Gustafsson,Jens Sjölund,Thomas B. Schön
类目: Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent linear stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves competitive performance on various image-conditioned (e.g., image restoration) and unconditional generation tasks, demonstrating its effectiveness in generative modelling. Our code is available at this https URL.
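
As a concrete intuition for the forward-only scheme, here is a toy Euler-Maruyama simulation of an SDE with a mean-reverting term in both the drift and the diffusion, as the abstract describes. The coefficients and the exact parameterization are illustrative assumptions, not FoD's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 2.0                        # "clean" data value the process reverts to
x = rng.standard_normal()      # start from a noise sample
theta, sigma, dt = 1.5, 0.8, 0.01

for _ in range(1000):
    dw = np.sqrt(dt) * rng.standard_normal()
    # mean reversion appears in BOTH drift and diffusion, so noise vanishes at y
    x += theta * (y - x) * dt + sigma * (y - x) * dw
print(f"x after the forward pass: {x:.4f} (target y = {y})")
# Because both terms shrink as x approaches y, trajectories converge to clean
# data, which is the convergence property the abstract highlights.
```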

[LG-24] Robust LLM Fingerprinting via Domain-Specific Watermarks

链接: https://arxiv.org/abs/2505.16723
作者: Thibaud Gloaguen,Robin Staab,Nikola Jovanović,Martin Vechev
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As open-source language models (OSMs) grow more capable and are widely shared and finetuned, ensuring model provenance, i.e., identifying the origin of a given model instance, has become an increasingly important issue. At the same time, existing backdoor-based model fingerprinting techniques often fall short of achieving key requirements of real-world model ownership detection. In this work, we build on the observation that while current open-source model watermarks fail to achieve reliable content traceability, they can be effectively adapted to address the challenge of model provenance. To this end, we introduce the concept of domain-specific watermarking for model fingerprinting. Rather than watermarking all generated content, we train the model to embed watermarks only within specified subdomains (e.g., particular languages or topics). This targeted approach ensures detection reliability, while improving watermark durability and quality under a range of real-world deployment settings. Our evaluations show that domain-specific watermarking enables model fingerprinting with strong statistical guarantees, controllable false positive rates, high detection power, and preserved generation quality. Moreover, we find that our fingerprints are inherently stealthy and naturally robust to real-world variability across deployment scenarios.
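
For readers unfamiliar with how such watermarks are verified, the sketch below shows the standard green-list z-test that schemes of this family build on (a generic Kirchenbauer-style test with made-up numbers, not the paper's exact protocol; in the domain-specific variant the test would only be run on text from the watermarked subdomain):

```python
import math, random

random.seed(0)
V, gamma = 50_000, 0.25                        # vocab size, green-list fraction
green = set(random.sample(range(V), int(gamma * V)))
green_list = sorted(green)

def z_score(tokens):
    g = sum(t in green for t in tokens)        # observed green-token count
    T = len(tokens)
    return (g - gamma * T) / math.sqrt(T * gamma * (1 - gamma))

plain = [random.randrange(V) for _ in range(400)]
# A watermarked model over-samples green tokens inside the target subdomain:
marked = [random.choice(green_list) if random.random() < 0.5 else random.randrange(V)
          for _ in range(400)]
print(f"z plain: {z_score(plain):.2f}   z watermarked: {z_score(marked):.2f}")
# Plain text scores near 0; watermarked text scores far above any FPR threshold.
```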

[LG-25] The Computational Complexity of Counting Linear Regions in ReLU Neural Networks

链接: https://arxiv.org/abs/2505.16716
作者: Moritz Stargalla,Christoph Hertrich,Daniel Reichman
类目: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Combinatorics (math.CO)
*备注: 25 pages

点击查看摘要

Abstract:An established measure of the expressive power of a given ReLU neural network is the number of linear regions into which it partitions the input space. There exist many different, non-equivalent definitions of what a linear region actually is. We systematically assess which papers use which definitions and discuss how they relate to each other. We then analyze the computational complexity of counting the number of such regions for the various definitions. Generally, this turns out to be an intractable problem. We prove NP- and #P-hardness results already for networks with one hidden layer and strong hardness of approximation results for two or more hidden layers. Finally, on the algorithmic side, we demonstrate that counting linear regions can at least be achieved in polynomial space for some common definitions.
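
Under one common definition mentioned above (one region per distinct ReLU activation pattern realized on the input domain), a brute-force count is easy to sketch by sampling; the network size and domain below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((8, 2)), rng.standard_normal(8)  # 8 hidden units, 2D input

pts = rng.uniform(-5, 5, size=(200_000, 2))                 # dense sample of the domain
patterns = {tuple(row) for row in (pts @ W.T + b > 0)}      # sign pattern per point
print(f"distinct activation patterns found: {len(patterns)}")
# For 8 hyperplanes in generic position in the plane, Zaslavsky's formula gives
# at most 1 + 8 + C(8,2) = 37 regions; dense sampling typically recovers them all.
# The paper's hardness results explain why no such cheap trick scales with depth.
```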

[LG-26] Learning Genomic Structure from k-mers

链接: https://arxiv.org/abs/2505.16680
作者: Filip Thor,Carl Nettelblad
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Sequencing a genome to determine an individual’s DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of k-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the E. coli genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we illustrate the potential of using this type of model for metagenomic species identification. We show how incorporating a domain-specific noise model can enhance embedding robustness, and how a supervised contrastive learning setting can be adopted when a linear reference genome is available, by introducing a distance thresholding parameter Γ. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly using specialized algorithms. Small prediction heads based on a pre-trained embedding are shown to perform on par with BWA-aln, the current gold standard approach for aDNA mapping, in terms of accuracy and runtime for short genomes. Given the method’s favorable scaling properties with respect to total genome size, inference using our approach is highly promising for metagenomic applications and for mapping to genomes comparable in size to the human genome.

[LG-27] On the Out-of-Distribution Generalization of Self-Supervised Learning

链接: https://arxiv.org/abs/2505.16675
作者: Wenwen Qiang,Jingyao Wang,Zeen Song,Jiangmeng Li,Changwen Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during the SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the spurious variable and label variable are mutually independent. In addition, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we demonstrate the identifiability of the latent variable model and validate the effectiveness of the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate the effectiveness of the proposed sampling strategy.

[LG-28] Quantum Feature Optimization for Enhanced Clustering of Blockchain Transaction Data

链接: https://arxiv.org/abs/2505.16672
作者: Yun-Cheng Tsai,Samuel Yen-Chi Chen
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, 1 table

点击查看摘要

Abstract:Blockchain transaction data exhibits high dimensionality, noise, and intricate feature entanglement, presenting significant challenges for traditional clustering algorithms. In this study, we conduct a comparative analysis of three clustering approaches: (1) Classical K-Means Clustering, applied to pre-processed feature representations; (2) Hybrid Clustering, wherein classical features are enhanced with quantum random features extracted using randomly initialized quantum neural networks (QNNs); and (3) Fully Quantum Clustering, where a QNN is trained in a self-supervised manner leveraging a SwAV-based loss function to optimize the feature space for clustering directly. The proposed experimental framework systematically investigates the impact of quantum circuit depth and the number of learned prototypes, demonstrating that even shallow quantum circuits can effectively extract meaningful non-linear representations, significantly improving clustering performance.

[LG-29] Stochastic Forward-Forward Learning through Representational Dimensionality Compression

链接: https://arxiv.org/abs/2505.16649
作者: Zhichao Zhu,Yang Qi,Hengyuan Ma,Wenlian Lu,Jianfeng Feng
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 14 pages, 9 figures, 2 tables

点击查看摘要

Abstract:The Forward-Forward (FF) algorithm provides a bottom-up alternative to backpropagation (BP) for training neural networks, relying on a layer-wise “goodness” function to guide learning. Existing goodness functions, inspired by energy-based learning (EBL), are typically defined as the sum of squared post-synaptic activations, neglecting the correlations between neurons. In this work, we propose a novel goodness function termed dimensionality compression that uses the effective dimensionality (ED) of fluctuating neural responses to incorporate second-order statistical structure. Our objective minimizes ED for clamped inputs when noise is considered while maximizing it across the sample distribution, promoting structured representations without the need to prepare negative samples. We demonstrate that this formulation achieves competitive performance compared to other non-BP methods. Moreover, we show that noise plays a constructive role that can enhance generalization and improve inference when predictions are derived from the mean of squared outputs, which is equivalent to making predictions based on the energy term. Our findings contribute to the development of more biologically plausible learning algorithms and suggest a natural fit for neuromorphic computing, where stochasticity is a computational resource rather than a nuisance. The code is available at this https URL
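
The effective dimensionality (ED) at the heart of this goodness function is commonly measured as the participation ratio of the covariance spectrum. Below is a minimal sketch of that quantity; the participation-ratio definition is standard, but treating it as a drop-in for the paper's exact objective and loss weighting is an assumption:

```python
import numpy as np

def effective_dim(responses):
    # responses: (samples, neurons) fluctuating activity for one input
    cov = np.cov(responses, rowvar=False)
    lam = np.clip(np.linalg.eigvalsh(cov), 0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()     # (sum λ)^2 / sum (λ^2)

rng = np.random.default_rng(0)
isotropic = rng.standard_normal((1000, 20))                 # spread over many dims
compressed = rng.standard_normal((1000, 1)) * np.ones(20)   # rank-1 fluctuations
print(effective_dim(isotropic))    # close to 20
print(effective_dim(compressed))   # close to 1: "dimensionality compression"
```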

[LG-30] Reconsidering Fairness Through Unawareness from the Perspective of Model Multiplicity

链接: https://arxiv.org/abs/2505.16638
作者: Benedikt Höltgen,Nuria Oliver
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Fairness through Unawareness (FtU) describes the idea that discrimination against demographic groups can be avoided by not considering group membership in the decisions or predictions. This idea has long been criticized in the machine learning literature as not being sufficient to ensure fairness. In addition, the use of additional features is typically thought to increase the accuracy of the predictions for all groups, so that FtU is sometimes thought to be detrimental to all groups. In this paper, we show both theoretically and empirically that FtU can reduce algorithmic discrimination without necessarily reducing accuracy. We connect this insight with the literature on Model Multiplicity, to which we contribute with novel theoretical and empirical results. Furthermore, we illustrate how, in a real-life application, FtU can contribute to the deployment of more equitable policies without losing efficacy. Our findings suggest that FtU is worth considering in practical applications, particularly in high-risk scenarios, and that the use of protected attributes such as gender in predictive models should be accompanied by a clear and well-founded justification.

[LG-31] Multivariate Latent Recalibration for Conditional Normalizing Flows

链接: https://arxiv.org/abs/2505.16636
作者: Victor Dheur,Souhaib Ben Taieb
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliably characterizing the full conditional distribution of a multivariate response variable given a set of covariates is crucial for trustworthy decision-making. However, misspecified or miscalibrated multivariate models may yield a poor approximation of the joint distribution of the response variables, leading to unreliable predictions and suboptimal decisions. Furthermore, standard recalibration methods are primarily limited to univariate settings, while conformal prediction techniques, despite generating multivariate prediction regions with coverage guarantees, do not provide a full probability density function. We address this gap by first introducing a novel notion of latent calibration, which assesses probabilistic calibration in the latent space of a conditional normalizing flow. Second, we propose latent recalibration (LR), a novel post-hoc model recalibration method that learns a transformation of the latent space with finite-sample bounds on latent calibration. Unlike existing methods, LR produces a recalibrated distribution with an explicit multivariate density function while remaining computationally efficient. Extensive experiments on both tabular and image datasets show that LR consistently improves latent calibration error and the negative log-likelihood of the recalibrated models.

[LG-32] WikiDBGraph: Large-Scale Database Graph of Wikidata for Collaborative Learning

链接: https://arxiv.org/abs/2505.16635
作者: Zhaomin Wu,Ziyang Wang,Bingsheng He
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data, ubiquitous and rich in informational value, is an increasing focus for deep representation learning, yet progress is hindered by studies centered on single tables or isolated databases, which limits model capabilities due to data scale. While collaborative learning approaches such as federated learning, transfer learning, split learning, and tabular foundation models aim to learn from multiple correlated databases, they are challenged by a scarcity of real-world interconnected tabular resources. Current data lakes and corpora largely consist of isolated databases lacking defined inter-database correlations. To overcome this, we introduce WikiDBGraph, a large-scale graph of 100,000 real-world tabular databases from WikiData, interconnected by 17 million edges and characterized by 13 node and 12 edge properties derived from its database schema and data distribution. WikiDBGraph’s weighted edges identify both instance- and feature-overlapped databases. Experiments on these newly identified databases confirm that collaborative learning yields superior performance, thereby offering considerable promise for structured foundation model training while also exposing key challenges and future directions for learning from interconnected tabular data.

[LG-33] CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models

链接: https://arxiv.org/abs/2505.16620
作者: Benjamin Herdeanu,Juan Nathaniel,Carla Roesch,Jatan Buch,Gregor Ramien,Johannes Haux,Pierre Gentine
类目: Machine Learning (cs.LG)
*备注: 16+19 pages, 5+8 figures

点击查看摘要

Abstract:Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on this https URL.

[LG-34] Training on Plausible Counterfactuals Removes Spurious Correlations

链接: https://arxiv.org/abs/2505.16583
作者: Shpresim Sadiku,Kartikeya Chitranshi,Hiroshi Kera,Sebastian Pokutta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced incorrect target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers achieve not only high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.

[LG-35] Large Language Model-Empowered Interactive Load Forecasting

链接: https://arxiv.org/abs/2505.16577
作者: Yu Zuo,Dalin Qin,Yi Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing complexity of power systems has made accurate load forecasting more important than ever. An increasing number of advanced load forecasting methods have been developed. However, the static design of current methods offers no mechanism for human-model interaction. As the primary users of forecasting models, system operators often find it difficult to understand and apply these advanced models, which typically requires expertise in artificial intelligence (AI). This also prevents them from incorporating their experience and real-world contextual understanding into the forecasting process. Recent breakthroughs in large language models (LLMs) offer a new opportunity to address this issue. By leveraging their natural language understanding and reasoning capabilities, we propose an LLM-based multi-agent collaboration framework to bridge the gap between human operators and forecasting models. A set of specialized agents is designed to perform different tasks in the forecasting workflow and collaborate via a dedicated communication mechanism. This framework embeds interactive mechanisms throughout the load forecasting pipeline, reducing the technical threshold for non-expert users and enabling the integration of human experience. Our experiments demonstrate that the interactive load forecasting accuracy can be significantly improved when users provide proper insight in key stages. Our cost analysis shows that the framework remains affordable, making it practical for real-world deployment.

[LG-36] A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices

链接: https://arxiv.org/abs/2505.16563
作者: Chen Gong,Rui Xing,Zhenzhe Zheng,Fan Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by the under-utilization of on-device data, due to low training throughput, limited storage and diverse data importance. To improve data resource utilization, we propose a two-stage data selection framework Titan to select the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, Titan filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy to identify the data batch with the highest model performance improvement to the current training round. To further enhance time-and-resource efficiency, Titan leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate Titan on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that Titan achieves up to 43% reduction in training time and 6.2% increase in final accuracy with minor system overhead, such as data processing delay, memory footprint and energy consumption.

[LG-37] Towards Coordinate- and Dimension-Agnostic Machine Learning for Partial Differential Equations

链接: https://arxiv.org/abs/2505.16549
作者: Trung V. Phan,George A. Kevrekidis,Soledad Villar,Yannis G. Kevrekidis,Juan M. Bello-Rivas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The machine learning methods for data-driven identification of partial differential equations (PDEs) are typically defined for a given number of spatial dimensions and a choice of coordinates the data have been collected in. This dependence prevents the learned evolution equation from generalizing to other spaces. In this work, we reformulate the problem in terms of coordinate- and dimension-independent representations, paving the way toward what we call "spatially liberated" PDE learning. To this end, we employ a machine learning approach to predict the evolution of scalar field systems expressed in the formalism of exterior calculus, which is coordinate-free and immediately generalizes to arbitrary dimensions by construction. We demonstrate the performance of this approach in the FitzHugh-Nagumo and Barkley reaction-diffusion models, as well as the Patlak-Keller-Segel model informed by in-situ chemotactic bacteria observations. We provide extensive numerical experiments that demonstrate that our approach allows for seamless transitions across various spatial contexts. We show that the field dynamics learned in one space can be used to make accurate predictions in other spaces with different dimensions, coordinate systems, boundary conditions, and curvatures.

[LG-38] Incremental Sequence Classification with Temporal Consistency

链接: https://arxiv.org/abs/2505.16548
作者: Lucas Maystre,Gabriel Barello,Tudor Berariu,Aleix Cambray,Rares Dolga,Alvaro Ortega Gonzalez,Andrei Nica,David Barber
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We address the problem of incremental sequence classification, where predictions are updated as new elements in the sequence are revealed. Drawing on temporal-difference learning from reinforcement learning, we identify a temporal-consistency condition that successive predictions should satisfy. We leverage this condition to develop a novel loss function for training incremental sequence classifiers. Through a concrete example, we demonstrate that optimizing this loss can offer substantial gains in data efficiency. We apply our method to text classification tasks and show that it improves predictive accuracy over competing approaches on several benchmark datasets. We further evaluate our approach on the task of verifying large language model generations for correctness in grade-school math problems. Our results show that models trained with our method are better able to distinguish promising generations from unpromising ones after observing only a few tokens.
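
To make the idea concrete, here is a minimal PyTorch sketch of an incremental classifier loss with a temporal-consistency penalty between successive predictions. The squared-difference form and the weighting are illustrative stand-ins; the paper derives its loss from temporal-difference learning:

```python
import torch

T, C = 6, 3
# logits for the prediction made after each of T revealed sequence elements
logits = torch.randn(T, C, requires_grad=True)

def incremental_loss(logits, label, alpha=0.1):
    probs = torch.softmax(logits, dim=-1)                   # (T, C) incremental predictions
    ce = torch.nn.functional.cross_entropy(logits[-1:], label.view(1))
    # penalize successive predictions that disagree with each other
    consistency = ((probs[1:] - probs[:-1]) ** 2).sum(dim=-1).mean()
    return ce + alpha * consistency

loss = incremental_loss(logits, torch.tensor(2))
loss.backward()                # gradients reach EVERY incremental prediction,
print(logits.grad.shape)       # not just the final one: (6, 3)
```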

[LG-39] HOFT: Householder Orthogonal Fine-tuning

链接: https://arxiv.org/abs/2505.16531
作者: Alejandro Moreno Arcas,Albert Sanchis,Jorge Civera,Alfons Juan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptation of foundation models using low-rank methods is a widespread approach. Another way to adapt these models is to employ orthogonal fine-tuning methods, which are less time and memory efficient despite their good generalization properties. In this work, we propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to alleviate time and space complexity. Moreover, some theoretical properties of the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are evaluated in downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.
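
The Householder construction that gives the method its name is easy to sketch: a product of reflections is exactly orthogonal, so it can rotate a frozen weight without distorting norms. The shapes and the number of reflections below are arbitrary, and HOFT/SHOFT's actual parameterization and scaling are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4
V = rng.standard_normal((k, d))          # k trainable Householder vectors

def householder_product(V):
    Q = np.eye(V.shape[1])
    for v in V:
        v = v / np.linalg.norm(v)
        Q = Q @ (np.eye(len(v)) - 2.0 * np.outer(v, v))   # one reflection
    return Q

Q = householder_product(V)
W = rng.standard_normal((d, d))          # stand-in for a frozen pretrained weight
W_adapted = Q @ W                        # orthogonal adaptation of the weight
print(np.allclose(Q @ Q.T, np.eye(d)))   # True: Q is orthogonal by construction
```

Each reflection costs O(d) parameters, which is where the time and memory savings over dense orthogonal parameterizations come from.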

[LG-40] Joint Relational Database Generation via Graph-Conditional Diffusion Models

链接: https://arxiv.org/abs/2505.16527
作者: Mohamed Amine Ketata,David Lüdke,Leo Schwinn,Stephan Günnemann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Building generative models for relational databases (RDBs) is important for applications like privacy-preserving data release and augmenting real datasets. However, most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. This approach limits parallelism, restricts flexibility in downstream applications like missing value imputation, and compounds errors due to commonly made conditional independence assumptions. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order. By using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM). GRDM leverages a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs demonstrate that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.

[LG-41] Accuracy vs. Accuracy: Computational Tradeoffs Between Classification Rates and Utility

链接: https://arxiv.org/abs/2505.16494
作者: Noga Amit,Omer Reingold,Guy N. Rothblum
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit the foundations of fairness and its interplay with utility and efficiency in settings where the training data contain richer labels, such as individual types, rankings, or risk estimates, rather than just binary outcomes. In this context, we propose algorithms that achieve stronger notions of evidence-based fairness than are possible in standard supervised learning. Our methods support classification and ranking techniques that preserve accurate subpopulation classification rates, as suggested by the underlying data distributions, across a broad class of classification rules and downstream applications. Furthermore, our predictors enable loss minimization, whether aimed at maximizing utility or in the service of fair treatment. Complementing our algorithmic contributions, we present impossibility results demonstrating that simultaneously achieving accurate classification rates and optimal loss minimization is, in some cases, computationally infeasible. Unlike prior impossibility results, our notions are not inherently in conflict and are simultaneously satisfied by the Bayes-optimal predictor. Furthermore, we show that each notion can be satisfied individually via efficient learning. Our separation thus stems from the computational hardness of learning a sufficiently good approximation of the Bayes-optimal predictor. These computational impossibilities present a choice between two natural and attainable notions of accuracy that could both be motivated by fairness.

[LG-42] Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics

链接: https://arxiv.org/abs/2505.16493
作者: Seyedeh Fatemeh Ebrahimi,Jaakko Peltonen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. While some existing methods can incorporate domain knowledge, such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify their division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity, normalized mutual information, and also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
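
As a rough sketch of seed-guided NMF, the snippet below runs plain Lee-Seung multiplicative updates with a hard seed-word mask. This is a simplification: the paper's prevalence constraints and KKT-derived updates are richer, and the seed indices here are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 30))                      # documents x vocabulary
k, seed_topic, seed_words = 5, 0, [0, 1, 2]    # illustrative seed-word indices

W, H = rng.random((100, k)) + 0.1, rng.random((k, 30)) + 0.1
mask = np.ones_like(H)
mask[1:, seed_words] = 0.0                     # keep seed words out of other topics

eps = 1e-9
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)       # multiplicative update for topics
    H *= mask                                  # re-impose the seed constraint
    W *= (X @ H.T) / (W @ H @ H.T + eps)       # multiplicative update for weights
print("seed-topic weight on seed words:", H[seed_topic, seed_words].round(3))
```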

[LG-43] Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling

链接: https://arxiv.org/abs/2505.16481
作者: Xinxing Shi,Xiaoyu Jiang,Mauricio A. Álvarez
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. However, performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. In this work, we propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, our method preserves essential latent dependencies, allowing more flexible kernel choices and mitigating the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we demonstrate that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.

[LG-44] Graph-Supported Dynamic Algorithm Configuration for Multi-Objective Combinatorial Optimization

链接: https://arxiv.org/abs/2505.16471
作者: Robbert Reijnen,Yaoxin Wu,Zaharah Bukhsh,Yingqian Zhang
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has been widely used for dynamic algorithm configuration, particularly in evolutionary computation, which benefits from the adaptive update of parameters during the algorithmic execution. However, applying DRL to algorithm configuration for multi-objective combinatorial optimization (MOCO) problems remains relatively unexplored. This paper presents a novel graph neural network (GNN) based DRL to configure multi-objective evolutionary algorithms. We model the dynamic algorithm configuration as a Markov decision process, representing the convergence of solutions in the objective space by a graph, with their embeddings learned by a GNN to enhance the state representation. Experiments on diverse MOCO challenges indicate that our method outperforms traditional and DRL-based algorithm configuration methods in terms of efficacy and adaptability. It also exhibits advantageous generalizability across objective types and problem sizes, and applicability to different evolutionary computation methods.

[LG-45] Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

链接: https://arxiv.org/abs/2505.16446
作者: Zhaoxin Wang,Handing Wang,Cong Tian,Yaochu Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) enable powerful cross-modal reasoning capabilities. However, the expanded input space introduces new attack surfaces. Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision. As MLLMs increasingly incorporate cross-modal consistency and alignment mechanisms, such explicit attacks become easier to detect and block. In this work, we propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via least significant bit steganography and couples them with seemingly benign, image-related textual prompts. To further enhance attack effectiveness across diverse MLLMs, we incorporate adversarial suffixes generated by a surrogate model and introduce a template optimization module that iteratively refines both the prompt and embedding based on model feedback. On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
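
The least-significant-bit embedding primitive mentioned in the abstract is simple to illustrate. The payload, image, and extraction convention below are toy assumptions; the actual attack couples this primitive with benign prompts, adversarial suffixes, and template optimization:

```python
import numpy as np

def lsb_embed(img, message):
    bits = np.unpackbits(np.frombuffer(message.encode(), dtype=np.uint8))
    flat = img.flatten()
    flat[: len(bits)] = (flat[: len(bits)] & 0xFE) | bits   # overwrite the LSBs
    return flat.reshape(img.shape)

def lsb_extract(img, n_chars):
    bits = img.flatten()[: n_chars * 8] & 1
    return np.packbits(bits).tobytes().decode()

img = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
stego = lsb_embed(img, "hidden instruction")
print(lsb_extract(stego, len("hidden instruction")))   # -> "hidden instruction"
# The pixel values change by at most 1, which is visually imperceptible.
```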

[LG-46] Performance Guaranteed Poisoning Attacks in Federated Learning: A Sliding Mode Approach

链接: https://arxiv.org/abs/2505.16403
作者: Huazi Pan,Yanjun Zhang,Leo Yu Zhang,Scott Adams,Abbas Kouzani,Suiyang Khoo
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Manipulation of local training data and local updates, i.e., the poisoning attack, is the main threat arising from the collaborative nature of the federated learning (FL) paradigm. Most existing poisoning attacks aim to manipulate local data/models in a way that causes denial-of-service (DoS) issues. In this paper, we introduce a novel attack method, named the Federated Learning Sliding Attack (FedSA) scheme, aimed at introducing a precisely controlled extent of poisoning in a subtle manner. It operates with a predefined objective, such as reducing the global model’s prediction accuracy by 10%. FedSA integrates robust nonlinear control theory, specifically Sliding Mode Control (SMC), with model poisoning attacks. It can manipulate the updates from malicious clients to drive the global model towards a compromised state, achieving this at a controlled and inconspicuous rate. Additionally, leveraging the robust control properties of FedSA allows precise control over the convergence bounds, enabling the attacker to set the global accuracy of the poisoned model to any desired level. Experimental results demonstrate that FedSA can accurately achieve a predefined global accuracy with fewer malicious clients while maintaining a high level of stealth and adjustable learning rates.

[LG-47] Divide-Fuse-Conquer: Eliciting “Aha Moments” in Multi-Scenario Games

链接: https://arxiv.org/abs/2505.16401
作者: Xiaoqing Zhang,Huabin Zheng,Ang Lv,Yuhan Liu,Zirui Song,Flood Sung,Xiuying Chen,Rui Yan
类目: Machine Learning (cs.LG)
*备注: 25 pages, 13 figures, and 8 tables

点击查看摘要

Abstract:Large language models (LLMs) have been observed to suddenly exhibit advanced reasoning abilities during reinforcement learning (RL), resembling an "aha moment" triggered by simple outcome-based rewards. While RL has proven effective in eliciting such breakthroughs in tasks involving mathematics, coding, and vision, it faces significant challenges in multi-scenario games. The diversity of game rules, interaction modes, and environmental complexities often leads to policies that perform well in one scenario but fail to generalize to others. Simply combining multiple scenarios during training introduces additional challenges, such as training instability and poor performance. To overcome these challenges, we propose Divide-Fuse-Conquer, a framework designed to enhance generalization in multi-scenario RL. This approach starts by heuristically grouping games based on characteristics such as rules and difficulties. Specialized models are then trained for each group to excel at the games in that group, which is what we refer to as the divide step. Next, we fuse model parameters from different groups into a new model, and continue training it for multiple groups, until the scenarios in all groups are conquered. Experiments across 18 TextArena games show that Qwen2.5-32B-Align trained with the Divide-Fuse-Conquer strategy reaches a performance level comparable to Claude3.5, achieving 7 wins and 4 draws. We hope our approach can inspire future research on using reinforcement learning to improve the generalization of LLMs.
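
The "fuse" step can be pictured as parameter averaging across the group specialists before continued training; uniform averaging is an assumption here, since the abstract does not spell out the fusion rule:

```python
import torch

def fuse(state_dicts):
    """Average matching parameters across several specialist checkpoints."""
    return {name: torch.stack([sd[name].float() for sd in state_dicts]).mean(0)
            for name in state_dicts[0]}

# toy demo with two tiny "specialists" standing in for group-trained models
m1, m2 = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
m_fused = torch.nn.Linear(4, 2)
m_fused.load_state_dict(fuse([m1.state_dict(), m2.state_dict()]))
# training would then continue from m_fused across all groups ("conquer")
```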

[LG-48] Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space

链接: https://arxiv.org/abs/2505.16386
作者: Ahmed K. Kadhim,Lei Jiao,Rishad Shafik,Ole-Christoffer Granmo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing complexity of large-scale language models has amplified concerns regarding their interpretability and reusability. While traditional embedding models like Word2Vec and GloVe offer scalability, they lack transparency and often behave as black boxes. Conversely, interpretable models such as the Tsetlin Machine (TM) have shown promise in constructing explainable learning systems, though they previously faced limitations in scalability and reusability. In this paper, we introduce Omni Tsetlin Machine AutoEncoder (Omni TM-AE), a novel embedding model that fully exploits the information contained in the TM’s state matrix, including literals previously excluded from clause formation. This method enables the construction of reusable, interpretable embeddings through a single training phase. Extensive experiments across semantic similarity, sentiment classification, and document clustering tasks show that Omni TM-AE performs competitively with and often surpasses mainstream embedding models. These results demonstrate that it is possible to balance performance, scalability, and interpretability in modern Natural Language Processing (NLP) systems without resorting to opaque architectures.

[LG-49] Arrival Control in Quasi-Reversible Queueing Systems: Optimization and Reinforcement Learning

链接: https://arxiv.org/abs/2505.16353
作者: Céline Comte(CNRS, LAAS-SARA, LAAS-RISC),Pascal Moyal(IECL)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a versatile scheme for optimizing the arrival rates of quasi-reversible queueing systems. We first propose an alternative definition of quasi-reversibility that encompasses reversibility and highlights the importance of the definition of customer classes. Second, we introduce balanced arrival control policies, which generalize the notion of balanced arrival rates introduced in the context of Whittle networks, to the much broader class of quasi-reversible queueing systems. We prove that supplementing a quasi-reversible queueing system with a balanced arrival-control policy preserves the quasi-reversibility, and we specify the form of the stationary measures. We revisit two canonical examples of quasi-reversible queueing systems, Whittle networks and order-independent queues. Lastly, we focus on the problem of admission control and leverage our results in the frameworks of optimization and reinforcement learning.

[LG-50] Graph Attention Network for Optimal User Association in Wireless Networks

链接: https://arxiv.org/abs/2505.16347
作者: Javad Mirzaei,Jeebak Mitra,Gwenael Poitau
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 7 figures

点击查看摘要

Abstract:With increased 5G deployments, network densification is higher than ever to support the exponentially high throughput requirements. However, this has meant a significant increase in energy consumption, leading to higher operational expenditure (OpEx) for network operators, creating an acute need for improvements in network energy savings (NES). A key determinant of operational efficacy in cellular networks is the user association (UA) policy, as it affects critical aspects like spectral efficiency, load balancing etc. and therefore impacts the overall energy consumption of the network directly. Furthermore, with cellular network topologies lending themselves well to graphical abstractions, use of graphs in network optimization has gained significant prominence. In this work, we propose and analyze a graphical abstraction based optimization for UA in cellular networks to improve NES by determining when energy saving features like cell switch off can be activated. A comparison with legacy approaches establishes the superiority of the proposed approach.

[LG-51] A Square Peg in a Square Hole: Meta-Expert for Long-Tailed Semi-Supervised Learning ICML2025

链接: https://arxiv.org/abs/2505.16341
作者: Yaxin Hou,Yuheng Jia
类目: Machine Learning (cs.LG)
*备注: The paper is accepted by ICML 2025

点击查看摘要

Abstract:This paper studies the long-tailed semi-supervised learning (LTSSL) with distribution mismatch, where the class distribution of the labeled training data follows a long-tailed distribution and mismatches with that of the unlabeled training data. Most existing methods introduce auxiliary classifiers (experts) to model various unlabeled data distributions and produce pseudo-labels, but the expertises of various experts are not fully utilized. We observe that different experts are good at predicting different intervals of samples, e.g., long-tailed expert is skilled in samples located in the head interval and uniform expert excels in samples located in the medium interval. Therefore, we propose a dynamic expert assignment module that can estimate the class membership (i.e., head, medium, or tail class) of samples, and dynamically assigns suitable expert to each sample based on the estimated membership to produce high-quality pseudo-label in the training phase and produce prediction in the testing phase. We also theoretically reveal that integrating different experts’ strengths will lead to a smaller generalization error bound. Moreover, we find that the deeper features are more biased toward the head class but with more discriminative ability, while the shallower features are less biased but also with less discriminative ability. We, therefore, propose a multi-depth feature fusion module to utilize different depth features to mitigate the model bias. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR-10-LT, STL-10-LT, and SVHN-LT datasets across various settings. The code is available at this https URL.

[LG-52] Improving Chemical Understanding of LLMs via SMILES Parsing

链接: https://arxiv.org/abs/2505.16340
作者: Yunhui Jang,Jaehyung Kim,Sungsoo Ahn
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, even failing to carry out basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks span from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best or competes with the baseline on the Mol-Instructions benchmark.
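
Deterministic supervision of the kind described (e.g., ring counting) can be generated mechanically from SMILES. The sketch below uses RDKit for illustration; choosing RDKit is an assumption about tooling rather than a description of CLEANMOL's actual pipeline:

```python
from rdkit import Chem

def ring_count(smiles: str) -> int:
    """Deterministic label for a SMILES-parsing task: number of SSSR rings."""
    mol = Chem.MolFromSmiles(smiles)
    return mol.GetRingInfo().NumRings() if mol else -1

print(ring_count("c1ccccc1"))        # benzene -> 1
print(ring_count("C1CC2CCC1CC2"))    # bicyclic scaffold -> 2
# Pairing such (SMILES, answer) examples gives clean, verifiable training
# signals aligned with graph-level molecular structure.
```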

[LG-53] Understanding Differential Transformer Unchains Pretrained Self-Attentions

链接: https://arxiv.org/abs/2505.16333
作者: Chaerin Kong,Jiho Jang,Nojun Kwak
类目: Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering utilization of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX effectively incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves the pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%).
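
For reference, the differential-attention computation that DEX adapts can be sketched in a few lines of numpy: two softmax maps are subtracted, which allows negative attention weights. Shapes and lambda are illustrative, and DEX itself reuses the pretrained softmax scores rather than computing two maps from scratch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 8
Q1, K1, Q2, K2, V = (rng.standard_normal((T, d)) for _ in range(5))
lam = 0.5

# two attention maps; their difference cancels common-mode "noise"
A = softmax(Q1 @ K1.T / np.sqrt(d)) - lam * softmax(Q2 @ K2.T / np.sqrt(d))
out = A @ V          # rows of A can be negative: the "negative attention"
print(out.shape)     # (5, 8)
```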

[LG-54] ChemMLLM: Chemical Multimodal Large Language Model

链接: https://arxiv.org/abs/2505.16326
作者: Qian Tan,Dongzhan Zhou,Peng Xia,Wanhao Liu,Wanli Ouyang,Lei Bai,Yuqiang Li,Tianfan Fu
类目: Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, in this paper, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. Also, we design five multimodal tasks across text, molecular SMILES strings, and image, and curate the datasets. We benchmark ChemMLLM against a range of general leading MLLMs and Chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 118.9% (4.27 vs 1.95 property improvement). The code is publicly available at this https URL.

[LG-55] FreshRetailNet-50K: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail

链接: https://arxiv.org/abs/2505.16319
作者: Yangyang Wang,Jiawei Gu,Li Long,Xin Li,Li Shen,Zhouyu Fu,Xiangjun Zhou,Xu Jiang
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (this https URL) and code (this https URL) are openly released.

[LG-56] CAIFormer: A Causal Informed Transformer for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2505.16308
作者: Xingyu Zhang,Wenwen Qiang,Siyu Zhao,Huijie Guo,Jiangmeng Li,Chuxiong Sun,Changwen Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most existing multivariate time series forecasting methods adopt an all-to-all paradigm that feeds all variable histories into a unified model to predict their future values without distinguishing their individual roles. However, this undifferentiated paradigm makes it difficult to identify variable-specific causal influences and often entangles causally relevant information with spurious correlations. To address this limitation, we propose an all-to-one forecasting paradigm that predicts each target variable separately. Specifically, we first construct a Structural Causal Model from observational data and then, for each target variable, we partition the historical sequence into four sub-segments according to the inferred causal structure: endogenous, direct causal, collider causal, and spurious correlation. The prediction relies solely on the first three causally relevant sub-segments, while the spurious correlation sub-segment is excluded. Furthermore, we propose Causal Informed Transformer (CAIFormer), a novel forecasting model comprising three components: Endogenous Sub-segment Prediction Block, Direct Causal Sub-segment Prediction Block, and Collider Causal Sub-segment Prediction Block, which process the endogenous, direct causal, and collider causal sub-segments, respectively. Their outputs are then combined to produce the final prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the CAIFormer.

[LG-57] Large-Scale Bayesian Tensor Reconstruction: An Approximate Message Passing Solution

链接: https://arxiv.org/abs/2505.16305
作者: Bingyang Cheng,Zhongtao Chen,Yichen Jin,Hao Zhang,Chen Zhang,Edmund Y. Lam,Yik-Chung Wu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Tensor CANDECOMP/PARAFAC decomposition (CPD) is a fundamental model for tensor reconstruction. Although the Bayesian framework allows for principled uncertainty quantification and automatic hyperparameter learning, existing methods do not scale well for large tensors because of high-dimensional matrix inversions. To this end, we introduce CP-GAMP, a scalable Bayesian CPD algorithm. This algorithm leverages generalized approximate message passing (GAMP) to avoid matrix inversions and incorporates an expectation-maximization routine to jointly infer the tensor rank and noise power. In experiments on synthetic 100x100x100 rank-20 tensors with only 20% of elements observed, the proposed algorithm reduces runtime by 82.7% compared to the state-of-the-art variational Bayesian CPD method, while maintaining comparable reconstruction accuracy.

[LG-58] Fairness under Competition

链接: https://arxiv.org/abs/2505.16291
作者: Ronen Gradwohl,Eilam Shapira,Moshe Tennenholtz
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Algorithmic fairness has emerged as a central issue in ML, and it has become standard practice to adjust ML algorithms so that they will satisfy fairness requirements such as Equal Opportunity. In this paper we consider the effects of adopting such fair classifiers on the overall level of ecosystem fairness. Specifically, we introduce the study of fairness with competing firms, and demonstrate the failure of fair classifiers in yielding fair ecosystems. Our results quantify the loss of fairness in systems, under a variety of conditions, based on classifiers’ correlation and the level of their data overlap. We show that even if competing classifiers are individually fair, the ecosystem’s outcome may be unfair; and that adjusting biased algorithms to improve their individual fairness may lead to an overall decline in ecosystem fairness. In addition to these theoretical results, we also provide supporting experimental evidence. Together, our model and results provide a novel and essential call for action.

[LG-59] Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse

链接: https://arxiv.org/abs/2505.16284
作者: Josh Alman,Zhao Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Attention mechanisms lie at the heart of modern large language models (LLMs). Straightforward algorithms for forward and backward (gradient) computation take quadratic time, and a line of work initiated by [Alman and Song NeurIPS 2023] and [Alman and Song NeurIPS 2024] has shown that quadratic time is necessary unless the model weights are small, in which case almost linear time algorithms are possible. In this paper, we show that large weights are necessary to avoid a strong preclusion to representational strength we call layer collapse, which means that the entire network can be approximated well by a network with only a single layer. Thus, the quadratic running time of attention is unavoidable for expressive transformers. The notion of layer collapse that we introduce is a variant on the notion of rank collapse from the work of [Dong, Cordonnier, and Loukas ICML 2021]. They showed that in Self Attention Networks with small weights and with skip connections, rank collapse must occur. This is typically interpreted as justifying the necessity of skip connections in expressive networks. However, our result shows that even with skip connections, if the weights are small, then layer collapse still occurs. Thus, only large weights, and not skip connections, can prevent these representational weaknesses.

[LG-60] Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

链接: https://arxiv.org/abs/2505.16265
作者: Ilgee Hong,Changlong Yu,Liang Qiu,Weixiang Yan,Zhenghao Xu,Haoming Jiang,Qingru Zhang,Qin Lu,Xin Liu,Chao Zhang,Tuo Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model’s long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches.

[LG-61] Small-to-Large Generalization: Data Influences Models Consistently Across Scale

链接: https://arxiv.org/abs/2505.16260
作者: Alaa Khaddaj,Logan Engstrom,Aleksander Madry
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions generally correlate highly across choices of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy-model applications: data attribution and dataset selection.
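
The core measurement behind this finding is easy to reproduce: score the same candidate data choices with a cheap proxy model and with a large model, then check how well the two rankings agree. Below is a minimal, hypothetical sketch; the score arrays are placeholders, not the authors' data or code.

```python
import numpy as np

def rank_agreement(proxy_scores: np.ndarray, large_scores: np.ndarray) -> float:
    """Spearman rank correlation between proxy- and large-scale evaluations.

    Each entry is the score (e.g., held-out loss) a model assigns to one
    candidate training-data choice; high agreement means the cheap proxy
    ranks data choices the same way the large model would.
    """
    def ranks(x):
        r = np.empty_like(x, dtype=float)
        r[np.argsort(x)] = np.arange(len(x))
        return r
    return float(np.corrcoef(ranks(proxy_scores), ranks(large_scores))[0, 1])

# Hypothetical scores for 6 candidate data mixtures.
proxy = np.array([0.91, 0.72, 0.85, 0.60, 0.78, 0.95])
large = np.array([0.88, 0.70, 0.80, 0.65, 0.75, 0.97])
print(f"rank agreement: {rank_agreement(proxy, large):.3f}")
```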

[LG-62] Graph Neural Network-Based Collaborative Perception for Adaptive Scheduling in Distributed Systems

链接: https://arxiv.org/abs/2505.16248
作者: Wenxuan Zhu,Qiyuan Wu,Tengda Tang,Renzi Meng,Sheng Chai,Xuehui Quan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses the limitations of multi-node perception and delayed scheduling response in distributed systems by proposing a GNN-based multi-node collaborative perception mechanism. The system is modeled as a graph structure. Message-passing and state-update modules are introduced. A multi-layer graph neural network is constructed to enable efficient information aggregation and dynamic state inference among nodes. In addition, a perception representation method is designed by fusing local states with global features. This improves each node’s ability to perceive the overall system status. The proposed method is evaluated within a customized experimental framework. A dataset featuring heterogeneous task loads and dynamic communication topologies is used. Performance is measured in terms of task completion rate, average latency, load balancing, and transmission efficiency. Experimental results show that the proposed method outperforms mainstream algorithms under various conditions, including limited bandwidth and dynamic structural changes. It demonstrates superior perception capabilities and cooperative scheduling performance. The model achieves rapid convergence and efficient responses to complex system states.

[LG-63] Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies

链接: https://arxiv.org/abs/2505.16242
作者: Runze Yan,Xun Shen,Akifumi Wachi,Sebastien Gros,Anni Zhao,Xiao Hu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:When applying offline reinforcement learning (RL) in healthcare scenarios, the out-of-distribution (OOD) issues pose significant risks, as inappropriate generalization beyond clinical expertise can result in potentially harmful recommendations. While existing methods like conservative Q-learning (CQL) attempt to address the OOD issue, their effectiveness is limited by only constraining action selection by suppressing uncertain actions. This action-only regularization imitates clinician actions that prioritize short-term rewards, but it fails to regulate downstream state trajectories, thereby limiting the discovery of improved long-term treatment strategies. To safely improve policy beyond clinician recommendations while ensuring that state-action trajectories remain in-distribution, we propose Offline Guarded Safe Reinforcement Learning (OGSRL), a theoretically grounded model-based offline RL framework. OGSRL introduces a novel dual constraint mechanism for improving policy with reliability and safety. First, the OOD guardian is established to specify clinically validated regions for safe policy exploration. By constraining optimization within these regions, it enables the reliable exploration of treatment strategies that outperform clinician behavior by leveraging the full patient state history, without drifting into unsupported state-action trajectories. Second, we introduce a safety cost constraint that encodes medical knowledge about physiological safety boundaries, providing domain-specific safeguards even in areas where training data might contain potentially unsafe interventions. Notably, we provide theoretical guarantees on safety and near-optimality: policies that satisfy these constraints remain in safe and reliable regions and achieve performance close to the best possible policy supported by the data.

[LG-64] Realistic Evaluation of TabPFN v2 in Open Environments

链接: https://arxiv.org/abs/2505.16226
作者: Zi-Jian Cheng,Zi-Yi Jia,Zhi Zhou,Yu-Feng Li,Lan-Zhe Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data, owing to its ubiquitous presence in real-world domains, has garnered significant attention in machine learning research. While tree-based models have long dominated tabular machine learning tasks, the recently proposed deep learning model TabPFN v2 has emerged, demonstrating unparalleled performance and scalability potential. Although extensive research has been conducted on TabPFN v2 to further improve performance, the majority of this research remains confined to closed environments, neglecting the challenges that frequently arise in open environments. This raises the question: can TabPFN v2 maintain good performance in open environments? To this end, we conduct the first comprehensive evaluation of TabPFN v2’s adaptability in open environments. We construct a unified evaluation framework covering various real-world challenges and assess the robustness of TabPFN v2 under open-environment scenarios using this framework. Empirical results demonstrate that TabPFN v2 shows significant limitations in open environments but is suitable for small-scale, covariate-shifted, and class-balanced tasks. Tree-based models remain the optimal choice for general tabular tasks in open environments. To facilitate future research on open-environment challenges, we advocate for open-environment tabular benchmarks, multi-metric evaluation, and universal modules to strengthen model robustness. We publicly release our evaluation framework at this https URL.

[LG-65] Reward-Aware Proto-Representations in Reinforcement Learning

链接: https://arxiv.org/abs/2505.16217
作者: Hon Tik Tse,Siddarth Chandrasekar,Marlos C. Machado
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
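
For context, the reward-agnostic SR that the DR builds on can be learned with a simple tabular TD rule. The sketch below shows that standard baseline update only; the DR analogue replaces the state indicator with reward-dependent terms, which we do not reproduce here.

```python
import numpy as np

def sr_td_update(M, s, s_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the tabular successor representation.

    M[s, j] estimates the expected discounted number of future visits to
    state j when starting from state s under the current policy.
    """
    n = M.shape[0]
    indicator = np.eye(n)[s]                   # one-hot for the current state
    td_target = indicator + gamma * M[s_next]  # bootstrap from the next state's row
    M[s] += alpha * (td_target - M[s])
    return M

# Tiny 4-state cycle: 0 -> 1 -> 2 -> 3 -> 0 -> ...
M = np.zeros((4, 4))
state = 0
for _ in range(5000):
    nxt = (state + 1) % 4
    M = sr_td_update(M, state, nxt)
    state = nxt
print(np.round(M, 2))  # rows approximate discounted future occupancies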

[LG-66] A Scalable Hierarchical Intrusion Detection System for Internet of Vehicles

链接: https://arxiv.org/abs/2505.16215
作者: Md Ashraf Uddin,Nam H. Chu,Reza Rafeh,Mutaz Barika
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to its dynamic, mobile nature and wireless data transfer, the Internet of Vehicles (IoV) is prone to various cyber threats, ranging from spoofing and Distributed Denial of Service (DDoS) attacks to malware. To safeguard the IoV ecosystem from intrusions, malicious activities, and policy violations, intrusion detection systems (IDS) play a critical role by continuously monitoring and analyzing network traffic to identify and mitigate potential threats in real time. However, most existing research has focused on developing centralized, machine learning-based IDS systems for IoV without accounting for its inherently distributed nature. Due to intensive computing requirements, these centralized systems often rely on the cloud to detect cyber threats, increasing the delay of system responses. On the other hand, edge nodes typically lack the necessary resources to train and deploy complex machine learning algorithms. To address this issue, this paper proposes an effective hierarchical classification framework tailored for IoV networks. Hierarchical classification allows classifiers to be trained and tested at different levels, enabling edge nodes to detect specific types of attacks independently. With this approach, edge nodes can conduct targeted attack detection while leveraging cloud nodes for comprehensive threat analysis and support. Given the resource constraints of edge nodes, we have employed the Boruta feature selection method to reduce data dimensionality, optimizing processing efficiency. To evaluate our proposed framework, we utilize the latest IoV security dataset CIC-IoV2024, achieving promising results that demonstrate the feasibility and effectiveness of our models in securing IoV networks.
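
The Boruta step is a standard, off-the-shelf technique; here is a sketch of how it might be applied before training the edge-level classifiers. It assumes the commonly used boruta Python package, and the data below is synthetic, not CIC-IoV2024.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy  # pip install Boruta

# Synthetic stand-in for IoV traffic features (rows: flows, cols: features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # only features 0 and 3 matter

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", random_state=1)
selector.fit(X, y)  # Boruta expects plain numpy arrays

print("selected features:", np.where(selector.support_)[0])
X_reduced = selector.transform(X)  # keep only confirmed features for edge nodes
```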

[LG-67] Directional Convergence and Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks

链接: https://arxiv.org/abs/2505.16204
作者: Ichiro Hashimoto
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 34 pages

点击查看摘要

Abstract:In this paper, we prove directional convergence of network parameters of fixed width leaky ReLU two-layer neural networks optimized by gradient descent with exponential loss, which was previously only known for gradient flow. By a careful analysis of the convergent direction, we establish sufficient conditions of benign overfitting and discover a new phase transition in the test error bound. All of these results hold beyond the nearly orthogonal data setting which was studied in prior works. As an application, we demonstrate that benign overfitting occurs with high probability in sub-Gaussian mixture models.

[LG-68] Enhancing Federated Survival Analysis through Peer-Driven Client Reputation in Healthcare

链接: https://arxiv.org/abs/2505.16190
作者: Navid Seidi,Satyaki Roy,Sajal Das
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) holds great promise for digital health by enabling collaborative model training without compromising patient data privacy. However, heterogeneity across institutions, lack of sustained reputation, and unreliable contributions remain major challenges. In this paper, we propose a robust, peer-driven reputation mechanism for federated healthcare that employs a hybrid communication model to integrate decentralized peer feedback with clustering-based noise handling to enhance model aggregation. Crucially, our approach decouples the federated aggregation and reputation mechanisms by applying differential privacy to client-side model updates before sharing them for peer evaluation. This ensures sensitive information remains protected during reputation computation, while unaltered updates are sent to the server for global model training. Using the Cox Proportional Hazards model for survival analysis across multiple federated nodes, our framework addresses both data heterogeneity and reputation deficit by dynamically adjusting trust scores based on local performance improvements measured via the concordance index. Experimental evaluations on both synthetic datasets and the SEER dataset demonstrate that our method consistently achieves high and stable C-index values, effectively down-weighting noisy client updates and outperforming FL methods that lack a reputation system.

[LG-69] Why Can Accurate Models Be Learned from Inaccurate Annotations?

链接: https://arxiv.org/abs/2505.16159
作者: Chongjie Si,Yidan Cui,Fuchao Yang,Xiaokang Yang,Wei Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning from inaccurate annotations has gained significant attention due to the high cost of precise labeling. However, despite the presence of erroneous labels, models trained on noisy data often retain the ability to make accurate predictions. This intriguing phenomenon raises a fundamental yet largely unexplored question: why can models still extract correct label information from inaccurate annotations? In this paper, we conduct a comprehensive investigation into this issue. By analyzing weight matrices from both empirical and theoretical perspectives, we find that label inaccuracy primarily accumulates noise in lower singular components and subtly perturbs the principal subspace. Within a certain range, the principal subspaces of weights trained on inaccurate labels remain largely aligned with those learned from clean labels, preserving essential task-relevant information. We formally prove that the angles of principal subspaces exhibit minimal deviation under moderate label inaccuracy, explaining why models can still generalize effectively. Building on these insights, we propose LIP, a lightweight plug-in designed to help classifiers retain principal subspace information while mitigating noise induced by label inaccuracy. Extensive experiments on tasks with various inaccuracy conditions demonstrate that LIP consistently enhances the performance of existing algorithms. We hope our findings can offer valuable theoretical and practical insights into model robustness under inaccurate supervision.
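
The key quantity in this analysis, the angle between principal subspaces of two weight matrices, can be checked directly. A minimal sketch (the rank cutoff and the matrices are illustrative, not the paper's experiments):

```python
import numpy as np
from scipy.linalg import subspace_angles

def principal_subspace(W: np.ndarray, r: int) -> np.ndarray:
    """Orthonormal basis of the top-r left singular subspace of W."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :r]

rng = np.random.default_rng(0)
W_clean = rng.normal(size=(64, 32))
W_noisy = W_clean + 0.1 * rng.normal(size=(64, 32))  # mimics label-noise perturbation

r = 5
angles = subspace_angles(principal_subspace(W_clean, r),
                         principal_subspace(W_noisy, r))
print("largest principal angle (rad):", angles.max())
```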

[LG-70] Multimodal Online Federated Learning with Modality Missing in Internet of Things

链接: https://arxiv.org/abs/2505.16138
作者: Heqiang Wang,Xiang Liu,Xiaoxiong Zhong,Lixing Chen,Fangming Liu,Weizhe Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The Internet of Things (IoT) ecosystem generates vast amounts of multimodal data from heterogeneous sources such as sensors, cameras, and microphones. As edge intelligence continues to evolve, IoT devices have progressed from simple data collection units to nodes capable of executing complex computational tasks. This evolution necessitates the adoption of distributed learning strategies to effectively handle multimodal data in an IoT environment. Furthermore, the real-time nature of data collection and limited local storage on edge devices in IoT call for an online learning paradigm. To address these challenges, we introduce the concept of Multimodal Online Federated Learning (MMO-FL), a novel framework designed for dynamic and decentralized multimodal learning in IoT environments. Building on this framework, we further account for the inherent instability of edge devices, which frequently results in missing modalities during the learning process. We conduct a comprehensive theoretical analysis under both complete and missing modality scenarios, providing insights into the performance degradation caused by missing modalities. To mitigate the impact of modality missing, we propose the Prototypical Modality Mitigation (PMM) algorithm, which leverages prototype learning to effectively compensate for missing modalities. Experimental results on two multimodal datasets further demonstrate the superior performance of PMM compared to benchmarks.
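
Prototype-based compensation of the kind PMM uses can be illustrated in a few lines: keep a running class prototype per modality and substitute it whenever that modality is missing. This is a hypothetical sketch of the general idea, not the paper's algorithm.

```python
import numpy as np

class ModalityPrototypes:
    """Per-class running mean embeddings for one modality."""

    def __init__(self, num_classes: int, dim: int):
        self.sums = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes)

    def update(self, embedding: np.ndarray, label: int):
        self.sums[label] += embedding
        self.counts[label] += 1

    def impute(self, label: int) -> np.ndarray:
        """Stand-in embedding when this modality is missing."""
        c = max(self.counts[label], 1)
        return self.sums[label] / c

protos = ModalityPrototypes(num_classes=3, dim=8)
rng = np.random.default_rng(0)
for _ in range(100):                       # simulate observed audio embeddings
    lbl = int(rng.integers(3))
    protos.update(rng.normal(loc=lbl, size=8), lbl)

# A sample arrives with the audio modality missing but a (pseudo-)label of 2:
audio_stub = protos.impute(2)
print(audio_stub.round(2))
```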

[LG-71] Robust Invariant Representation Learning by Distribution Extrapolation

链接: https://arxiv.org/abs/2505.16126
作者: Kotaro Yoshida,Slavakis Konstantinos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Invariant risk minimization (IRM) aims to enable out-of-distribution (OOD) generalization in deep learning by learning invariant representations. As IRM poses an inherently challenging bi-level optimization problem, most existing approaches – including IRMv1 – adopt penalty-based single-level approximations. However, empirical studies consistently show that these methods often fail to outperform well-tuned empirical risk minimization (ERM), highlighting the need for more robust IRM implementations. This work theoretically identifies a key limitation common to many IRM variants: their penalty terms are highly sensitive to limited environment diversity and over-parameterization, resulting in performance degradation. To address this issue, a novel extrapolation-based framework is proposed that enhances environmental diversity by augmenting the IRM penalty through synthetic distributional shifts. Extensive experiments – ranging from synthetic setups to realistic, over-parameterized scenarios – demonstrate that the proposed method consistently outperforms state-of-the-art IRM variants, validating its effectiveness and robustness.
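
The IRMv1 penalty that the paper identifies as fragile is compact enough to state in code. The standard form, as popularized by the original IRM paper, penalizes the gradient of each environment's risk with respect to a dummy classifier scale fixed at 1.0; the toy data below is illustrative only.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Squared gradient of the environment risk w.r.t. a dummy scale at 1.0."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    (grad,) = torch.autograd.grad(loss, [scale], create_graph=True)
    return (grad ** 2).sum()

# Toy usage: penalty summed over environments, added to the pooled ERM loss.
torch.manual_seed(0)
envs = [(torch.randn(32, 1), torch.randint(0, 2, (32, 1)).float()) for _ in range(2)]
w = torch.randn(1, 1, requires_grad=True)
erm, pen = 0.0, 0.0
for x, y in envs:
    logits = x @ w
    erm = erm + F.binary_cross_entropy_with_logits(logits, y)
    pen = pen + irmv1_penalty(logits, y)
total = erm + 10.0 * pen  # the penalty weight is exactly the knob the paper scrutinizes
total.backward()
```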

[LG-72] Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

链接: https://arxiv.org/abs/2505.16122
作者: Junhong Lin,Xinyue Zeng,Jie Zhu,Song Wang,Julian Shun,Jun Wu,Dawei Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets, however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E^3 metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in E^3 . Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B)-demonstrating Plan-and-Budget’s ability to close performance gaps without retraining. Our code is available at this http URL.
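
The budget-allocation idea is straightforward to illustrate: split the query into sub-questions, estimate each one's difficulty, and hand out the token budget proportionally. A hypothetical sketch follows; the actual estimator in the paper is derived from BBAM, which this does not reproduce.

```python
def allocate_budget(subquestions, complexity_scores, total_tokens, floor=32):
    """Distribute a total token budget across sub-questions.

    Each sub-question gets a minimum floor, and the remainder is split in
    proportion to its estimated complexity.
    """
    assert len(subquestions) == len(complexity_scores)
    spare = total_tokens - floor * len(subquestions)
    z = sum(complexity_scores)
    budgets = [floor + int(spare * c / z) for c in complexity_scores]
    return dict(zip(subquestions, budgets))

subqs = ["parse the constraint", "set up the equation", "solve and verify"]
scores = [1.0, 2.5, 4.0]  # e.g., uncertainty estimates from a planning pass
print(allocate_budget(subqs, scores, total_tokens=1024))
```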

[LG-73] A Generic Framework for Conformal Fairness ICLR2025

链接: https://arxiv.org/abs/2505.16115
作者: Aditya T. Vadlamani,Anutam Srinivasan,Pranav Maneriker,Ali Payani,Srinivasan Parthasarathy
类目: Machine Learning (cs.LG)
*备注: ICLR 2025 Camera Ready Version

点击查看摘要

Abstract:Conformal Prediction (CP) is a popular method for uncertainty quantification with machine learning models. While conformal prediction provides probabilistic guarantees regarding the coverage of the true label, these guarantees are agnostic to the presence of sensitive attributes within the dataset. In this work, we formalize Conformal Fairness, a notion of fairness using conformal predictors, and provide a theoretically well-founded algorithm and associated framework to control for the gaps in coverage between different sensitive groups. Our framework leverages the exchangeability assumption (implicit to CP) rather than the typical IID assumption, allowing us to apply the notion of Conformal Fairness to data types and tasks that are not IID, such as graph data. Experiments were conducted on graph and tabular datasets to demonstrate that the algorithm can control fairness-related gaps in addition to coverage aligned with theoretical expectations.
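
A minimal way to see the coverage-gap issue: run split conformal prediction but calibrate a separate quantile per sensitive group, so each group receives its own coverage guarantee. This is an illustrative numpy sketch of that baseline idea, not the paper's framework.

```python
import numpy as np

def groupwise_thresholds(scores, groups, alpha=0.1):
    """Per-group conformal quantiles from calibration nonconformity scores."""
    thresholds = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        n = len(s)
        k = int(np.ceil((n + 1) * (1 - alpha))) - 1  # split-conformal quantile index
        thresholds[g] = s[min(k, n - 1)]
    return thresholds

rng = np.random.default_rng(0)
cal_scores = rng.exponential(size=200)      # |y - y_hat| on a calibration set
cal_groups = rng.integers(0, 2, size=200)   # binary sensitive attribute
thr = groupwise_thresholds(cal_scores, cal_groups)

# Prediction interval for a new point from group 1:
y_hat = 3.2
print(f"interval: [{y_hat - thr[1]:.2f}, {y_hat + thr[1]:.2f}]")
```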

[LG-74] Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools

链接: https://arxiv.org/abs/2505.16113
作者: Panagiotis Lymperopoulos,Vasanth Sarathy
类目: Machine Learning (cs.LG)
*备注: 10 pages 3 figures 3 tables

点击查看摘要

Abstract:Modern Large Language Models (LLMs) often require external tools, such as machine learning classifiers or knowledge retrieval systems, to provide accurate answers in domains where their pre-trained knowledge is insufficient. This integration of LLMs with external tools expands their utility but also introduces a critical challenge: determining the trustworthiness of responses generated by the combined system. In high-stakes applications, such as medical decision-making, it is essential to assess the uncertainty of both the LLM’s generated text and the tool’s output to ensure the reliability of the final response. However, existing uncertainty quantification methods do not account for the tool-calling scenario, where both the LLM and external tool contribute to the overall system’s uncertainty. In this work, we present a novel framework for modeling tool-calling LLMs that quantifies uncertainty by jointly considering the predictive uncertainty of the LLM and the external tool. We extend previous methods for uncertainty quantification over token sequences to this setting and propose efficient approximations that make uncertainty computation practical for real-world applications. We evaluate our framework on two new synthetic QA datasets, derived from well-known machine learning datasets, which require tool-calling for accurate answers. Additionally, we apply our method to retrieval-augmented generation (RAG) systems and conduct a proof-of-concept experiment demonstrating the effectiveness of our uncertainty metrics in scenarios where external information retrieval is needed. Our results show that the framework is effective in enhancing trust in LLM-based systems, especially in cases where the LLM’s internal knowledge is insufficient and external tools are required.

[LG-75] Reinforcement Learning for Stock Transactions DATE

链接: https://arxiv.org/abs/2505.16099
作者: Ziyi(Queena)Zhou,Nicholas Stern,Julien Laasri
类目: Machine Learning (cs.LG)
*备注: 14 pages, 6 figures, paper dated December 19, 2018

点击查看摘要

Abstract:Much research has been done to analyze the stock market. After all, if one can determine a pattern in the chaotic frenzy of transactions, then they could make a hefty profit from capitalizing on these insights. As such, the goal of our project was to apply reinforcement learning (RL) to determine the best time to buy a stock within a given time frame. With only a few adjustments, our model can be extended to identify the best time to sell a stock as well. In order to use the format of free, real-world data to train the model, we define our own Markov Decision Process (MDP) problem. These two papers [5] [6] helped us in formulating the state space and the reward system of our MDP problem. We train a series of agents using Q-Learning, Q-Learning with linear function approximation, and deep Q-Learning. In addition, we try to predict the stock prices using machine learning regression and classification models. We then compare our agents to see if they converge on a policy, and if so, which one learned the best policy to maximize profit on the stock market.
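
The core tabular Q-learning update the project relies on fits in a few lines. The toy MDP below is hypothetical (states would be discretized market features, actions buy or wait); it only demonstrates the update rule itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 2          # actions: 0 = wait, 1 = buy
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1

def step(s, a):
    """Hypothetical environment: reward favors buying in low-price states."""
    reward = (n_states - 1 - s) / n_states if a == 1 else 0.0
    return int(rng.integers(n_states)), reward

s = int(rng.integers(n_states))
for _ in range(20000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    # Standard Q-learning update: bootstrap from the greedy next-state value.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy action per state:", Q.argmax(axis=1))
```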

[LG-76] FR-Mamba: Time-Series Physical Field Reconstruction Based on State Space Model

链接: https://arxiv.org/abs/2505.16083
作者: Jiahuan Long,Wenzhe Zhang,Ning Wang,Tingsong Jiang,Wen Yao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physical field reconstruction (PFR) aims to predict the state distribution of physical quantities (e.g., velocity, pressure, and temperature) based on limited sensor measurements. It plays a critical role in domains such as fluid dynamics and thermodynamics. However, existing deep learning methods often fail to capture long-range temporal dependencies, resulting in suboptimal performance on time-evolving physical systems. To address this, we propose FR-Mamba, a novel spatiotemporal flow field reconstruction framework based on state space modeling. Specifically, we design a hybrid neural network architecture that combines Fourier Neural Operator (FNO) and State Space Model (SSM) to capture both global spatial features and long-range temporal dependencies. We adopt Mamba, a recently proposed efficient SSM architecture, to model long-range temporal dependencies with linear time complexity. In parallel, the FNO is employed to capture non-local spatial features by leveraging frequency-domain transformations. The spatiotemporal representations extracted by these two components are then fused to reconstruct the full-field distribution of the physical system. Extensive experiments demonstrate that our approach significantly outperforms existing PFR methods in flow field reconstruction tasks, achieving high-accuracy performance on long sequences.

[LG-77] Ensembling Sparse Autoencoders

链接: https://arxiv.org/abs/2505.16077
作者: Soham Gadgil,Chris Lin,Su-In Lee
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs trained with different initial weights can learn different features, demonstrating that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we propose to ensemble multiple SAEs through naive bagging and boosting. Specifically, SAEs trained with different weight initializations are ensembled in naive bagging, whereas SAEs sequentially trained to minimize the residual error are ensembled in boosting. We evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability. Furthermore, ensembling SAEs performs better than applying a single SAE on downstream tasks such as concept detection and spurious correlation removal, showing improved practical utility.
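
The boosting variant has a particularly simple structure: each new SAE is fit to the residual the previous ensemble fails to reconstruct. A compact torch sketch under that reading; the architecture, sizes, and training loop are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # nonnegative feature activations
        return self.dec(z), z

def fit_residual_sae(residual, l1=1e-3, steps=500):
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, z = sae(residual)
        loss = ((recon - residual) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

torch.manual_seed(0)
acts = torch.randn(1024, 64)              # stand-in for LM activations
ensemble, residual = [], acts
for _ in range(3):                        # boosting rounds
    sae = fit_residual_sae(residual.detach())
    recon, _ = sae(residual)
    residual = residual - recon.detach()  # next SAE targets what's left
    ensemble.append(sae)
print("final residual norm:", residual.norm().item())
```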

[LG-78] Few-Shot Test-Time Optimization Without Retraining for Semiconductor Recipe Generation and Beyond

链接: https://arxiv.org/abs/2505.16060
作者: Shangding Gu,Donghao Ying,Ming Jin,Yu Joe Lu,Jun Wang,Javad Lavaei,Costas Spanos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Model Feedback Learning (MFL), a novel test-time optimization framework for optimizing inputs to pre-trained AI models or deployed hardware systems without requiring any retraining of the models or modifications to the hardware. In contrast to existing methods that rely on adjusting model parameters, MFL leverages a lightweight reverse model to iteratively search for optimal inputs, enabling efficient adaptation to new objectives under deployment constraints. This framework is particularly advantageous in real-world settings, such as semiconductor manufacturing recipe generation, where modifying deployed systems is often infeasible or cost-prohibitive. We validate MFL on semiconductor plasma etching tasks, where it achieves target recipe generation in just five iterations, significantly outperforming both Bayesian optimization and human experts. Beyond semiconductor applications, MFL also demonstrates strong performance in chemical processes (e.g., chemical vapor deposition) and electronic systems (e.g., wire bonding), highlighting its broad applicability. Additionally, MFL incorporates stability-aware optimization, enhancing robustness to process variations and surpassing conventional supervised learning and random search methods in high-dimensional control settings. By enabling few-shot adaptation, MFL provides a scalable and efficient paradigm for deploying intelligent control in real-world environments.

[LG-79] Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs

链接: https://arxiv.org/abs/2505.16053
作者: Jan Tönshoff,Martin Grohe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver’s computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.

[LG-80] Towards Identifiability of Interventional Stochastic Differential Equations

链接: https://arxiv.org/abs/2505.15987
作者: Aaron Zweig,Zaikang Lin,Elham Azizi,David Knowles
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study identifiability of stochastic differential equation (SDE) models under multiple interventions. Our results give the first provable bounds for unique recovery of SDE parameters given samples from their stationary distributions. We give tight bounds on the number of necessary interventions for linear SDEs, and upper bounds for nonlinear SDEs in the small noise regime. We experimentally validate the recovery of true parameters in synthetic data, and motivated by our theoretical results, demonstrate the advantage of parameterizations with learnable activation functions.

[LG-81] Real-Time Stress Monitoring, Detection, and Management in College Students: A Wearable Technology and Machine-Learning Approach

链接: https://arxiv.org/abs/2505.15974
作者: Alan Ta,Nilsu Salgin,Mustafa Demir,Kala Philips Randal,Ranjana K. Mehta,Anthony McDonald,Carly McCord,Farzan Sasangohar
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 31 pages, 5 figures

点击查看摘要

Abstract:College students are increasingly affected by stress, anxiety, and depression, yet face barriers to traditional mental health care. This study evaluated the efficacy of a mobile health (mHealth) intervention, Mental Health Evaluation and Lookout Program (mHELP), which integrates a smartwatch sensor and machine learning (ML) algorithms for real-time stress detection and self-management. In a 12-week randomized controlled trial (n = 117), participants were assigned to a treatment group using mHELP’s full suite of interventions or a control group using the app solely for real-time stress logging and weekly psychological assessments. The primary outcome, “Moments of Stress” (MS), was assessed via physiological and self-reported indicators and analyzed using Generalized Linear Mixed Models (GLMM) approaches. Similarly, secondary outcomes of psychological assessments, including the Generalized Anxiety Disorder-7 (GAD-7) for anxiety, the Patient Health Questionnaire (PHQ-8) for depression, and the Perceived Stress Scale (PSS), were also analyzed via GLMM. Findings for the objective measure, MS, indicate a substantial decrease in MS among the treatment group compared to the control group, while no notable between-group differences were observed in subjective scores of anxiety (GAD-7), depression (PHQ-8), or stress (PSS). However, the treatment group exhibited a clinically meaningful decline in GAD-7 and PSS scores. These findings underscore the potential of wearable-enabled mHealth tools to reduce acute stress in college populations and highlight the need for extended interventions and tailored features to address chronic symptoms like depression.

[LG-82] Data-driven Verification of Procedural Programs with Integer Arrays

链接: https://arxiv.org/abs/2505.15958
作者: Ahmed Bouajjani,Wael-Amine Boutglay,Peter Habermehl
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the problem of automatically verifying procedural programs that manipulate parametric-size arrays of integers, encoded as a constrained Horn clause solving problem. We propose a new algorithmic method for synthesizing loop invariants and procedure pre/post-conditions represented as universally quantified first-order formulas constraining the array elements and program variables. We adopt a data-driven approach that extends the decision tree Horn-ICE framework to handle arrays. We provide a powerful learning technique based on reducing a complex classification problem of vectors of integer arrays to a simpler classification problem of vectors of integers. The obtained classifier is generalized to get universally quantified invariants and procedure pre/post-conditions. We have implemented our method and shown its efficiency and competitiveness w.r.t. state-of-the-art tools on a significant benchmark.

[LG-83] AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning

链接: https://arxiv.org/abs/2505.15931
作者: Morteza Alizadeh,Mehrdad Oveisi,Sonya Falahati,Ghazal Mousavi,Mohsen Alambardar Meybodi,Somayeh Sadat Mehrnia,Ilker Hacihaliloglu,Arman Rahmim,Mohammad R. Salmanpour
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) models rely heavily on consistent and accurate performance metrics to evaluate and compare their effectiveness. However, existing libraries often suffer from fragmentation, inconsistent implementations, and insufficient data validation protocols, leading to unreliable results. Existing libraries have often been developed independently and without adherence to a unified standard, particularly concerning the specific tasks they aim to support. As a result, each library tends to adopt its own conventions for metric computation, input/output formatting, error handling, and data validation protocols. This lack of standardization leads to both implementation differences (ID) and reporting differences (RD), making it difficult to compare results across frameworks or ensure reliable evaluations. To address these issues, we introduce AllMetrics, an open-source unified Python library designed to standardize metric evaluation across diverse ML tasks, including regression, classification, clustering, segmentation, and image-to-image translation. The library implements class-specific reporting for multi-class tasks through configurable parameters to cover all use cases, while incorporating task-specific parameters to resolve metric computation discrepancies across implementations. Various datasets from domains such as healthcare, finance, and real estate were run through our library, and the outputs were compared with Python, MATLAB, and R components to identify which yield similar results. AllMetrics combines a modular Application Programming Interface (API) with robust input validation mechanisms to ensure reproducibility and reliability in model evaluation. This paper presents the design principles, architectural components, and empirical analyses demonstrating the ability to mitigate evaluation errors and to enhance the trustworthiness of ML workflows.

[LG-84] Is (Selective) Round-To-Nearest Quantization All You Need?

链接: https://arxiv.org/abs/2505.15909
作者: Alex Kogan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization became a necessary tool for serving ever-increasing Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique that has been around well before LLMs surged to the forefront of machine learning (ML) research. Yet, it has been largely dismissed by recent and more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but also its token generation throughput can be better than and accuracy can be similar to more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.
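
RTN itself is essentially a one-liner per weight group: scale to the integer grid, round, and rescale. A minimal symmetric per-channel sketch follows; the selective mixed-precision policy discussed above would simply keep chosen layers at a higher-precision format.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 8):
    """Symmetric per-output-channel round-to-nearest quantization.

    w: weight matrix of shape (out_features, in_features).
    Returns the dequantized weights and per-channel scales.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

torch.manual_seed(0)
w = torch.randn(4, 16)
w_hat, scale = rtn_quantize(w, bits=4)
err = (w - w_hat).abs().max().item()
print(f"max abs error at 4 bits: {err:.4f}")
```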

[LG-85] Graph Neural Networks Based Anomalous RSSI Detection

链接: https://arxiv.org/abs/2505.15847
作者: Blaž Bertalanič,Matej Vnučec,Carolina Fortuna
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:In today’s world, modern infrastructures are being equipped with information and communication technologies to create large IoT networks. It is essential to monitor these networks to ensure smooth operations by detecting and correcting link failures or abnormal network behaviour proactively, which can otherwise cause interruptions in business operations. This paper presents a novel method for detecting anomalies in wireless links using graph neural networks. The proposed approach involves converting time series data into graphs and training a new graph neural network architecture based on graph attention networks that successfully detects anomalies at the level of individual measurements of the time series data. The model provides competitive results compared to the state of the art while being computationally more efficient with ~171 times fewer trainable parameters. (Published in: 2023 International Balkan Conference on Communications and Networking (BalkanCom), DOI: https://doi.org/10.1109/BalkanCom58402.2023.10167910)

[LG-86] Adaptive Tokenization: On the Hop-Overpriority Problem in Tokenized Graph Learning Models

链接: https://arxiv.org/abs/2505.15845
作者: Zhibiao Wang,Yunlong Zhou,Ziwei Zhang,Mengmei Zhang,Shirui Pan,Chunming Hu,Xiao Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Transformers, leveraging the global attention to capture long-range dependencies in graph structures, have significantly advanced graph machine learning, but face prohibitive computational complexity. Tokenized Graph Learning Models (TGLMs) address this issue by converting graphs into ordered token lists for scalable processing. Besides, TGLMs also empower Large Language Models (LLMs) to handle text-attributed graphs more effectively and thus are also employed in Graph LLMs. However, existing TGLMs rely on hand-designed token lists and their adaptability to diverse graph learning scenarios remains unexplored. In this paper, we first conduct extensive empirical and theoretical preliminary studies for hand-designed token lists. Surprisingly, we identify an unexplored hop-overpriority problem: the common pre-defined token lists overemphasize nearby nodes and overwhelm the ability of TGLMs to balance local and global signals. This phenomenon is especially harmful for heterophilic graphs. To address this problem, we propose the Learnable Graph Token List (LGTL), a plug-and-play module to replace hand-designed token lists in TGLMs. Specifically, LGTL adaptively adjusts the weights across hops and prioritizes informative nodes within hops through a graph attention gate module and a selection module, respectively. In this way, contextually informative nodes can be adaptively emphasized for both homophilic and heterophilic graphs. Besides, we theoretically show that LGTL can address the hop-overpriority problem. Extensive experiments on benchmarks validate the efficacy of LGTL across both Graph Transformers and Graph LLM backbones.

[LG-87] AH-UGC: Adaptive and Heterogeneous-Universal Graph Coarsening

链接: https://arxiv.org/abs/2505.15842
作者: Mohit Kataria,Shreyash Bhilwade,Sandeep Kumar,Jayadeva
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Coarsening (GC) is a prominent graph reduction technique that compresses large graphs to enable efficient learning and inference. However, existing GC methods generate only one coarsened graph per run and must recompute from scratch for each new coarsening ratio, resulting in unnecessary overhead. Moreover, most prior approaches are tailored to homogeneous graphs and fail to accommodate the semantic constraints of heterogeneous graphs, which comprise multiple node and edge types. To overcome these limitations, we introduce a novel framework that combines Locality Sensitive Hashing (LSH) with Consistent Hashing to enable adaptive graph coarsening. Leveraging hashing techniques, our method is inherently fast and scalable. For heterogeneous graphs, we propose a type-isolated coarsening strategy that ensures semantic consistency by restricting merges to nodes of the same type. Our approach is the first unified framework to support both adaptive and heterogeneous coarsening. Extensive evaluations on 23 real-world datasets including homophilic, heterophilic, homogeneous, and heterogeneous graphs demonstrate that our method achieves superior scalability while preserving the structural and semantic integrity of the original graph.
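
The LSH ingredient can be illustrated with random hyperplane hashing: nodes whose features fall into the same hash bucket become candidates to merge into one super-node. This is a sketch of the general idea, not the paper's full pipeline; the type check mirrors the type-isolated strategy described above.

```python
import numpy as np
from collections import defaultdict

def lsh_coarsen(features, node_types, n_planes=8, seed=0):
    """Group nodes by (type, LSH signature); each group becomes a super-node."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(features.shape[1], n_planes))
    signs = features @ planes > 0                 # boolean hash bits per node
    buckets = defaultdict(list)
    for i, (bits, t) in enumerate(zip(signs, node_types)):
        buckets[(t, bits.tobytes())].append(i)    # merges happen only within a type
    return list(buckets.values())

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))                    # node feature matrix
types = rng.integers(0, 2, size=200)              # two node types
groups = lsh_coarsen(X, types)
print(f"{len(groups)} super-nodes from 200 nodes")
```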

[LG-88] Curriculum Learning in Genetic Programming Guided Local Search for Large-scale Vehicle Routing Problems

链接: https://arxiv.org/abs/2505.15839
作者: Saining Liu,Yi Mei,Mengjie Zhang
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Manually designing (meta-)heuristics for the Vehicle Routing Problem (VRP) is a challenging task that requires significant domain expertise. Recently, data-driven approaches have emerged as a promising solution, automatically learning heuristics that perform well on training instances and generalize to unseen test cases. A recent method, named GPGLS, uses Genetic Programming (GP) to learn the utility function in Guided Local Search (GLS) and solves large-scale VRPs effectively. However, the selection of appropriate training instances during the learning process remains an open question, with most existing studies, including GPGLS, relying on random instance selection. To address this, we propose a novel method, CL-GPGLS, which integrates Curriculum Learning (CL) into GPGLS. Our approach leverages a predefined curriculum to introduce training instances progressively, starting with simpler tasks and gradually increasing complexity, enabling the model to better adapt and optimize for large-scale VRP (LSVRP). Extensive experiments verify the effectiveness of CL-GPGLS, demonstrating significant performance improvements over three baseline methods.

[LG-89] Adversarially Robust Spiking Neural Networks with Sparse Connectivity

链接: https://arxiv.org/abs/2505.15833
作者: Mathias Schmolli,Maximilian Baronig,Robert Legenstein,Ozan Özdenizci
类目: Neural and Evolutionary Computing (cs.NE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deployment of deep neural networks in resource-constrained embedded systems requires innovative algorithmic solutions to facilitate their energy and memory efficiency. To further ensure the reliability of these systems against malicious actors, recent works have extensively studied adversarial robustness of existing architectures. Our work focuses on the intersection of adversarial robustness, memory- and energy-efficiency in neural networks. We introduce a neural network conversion algorithm designed to produce sparse and adversarially robust spiking neural networks (SNNs) by leveraging the sparse connectivity and weights from a robustly pretrained artificial neural network (ANN). Our approach combines the energy-efficient architecture of SNNs with a novel conversion algorithm, leading to state-of-the-art performance with enhanced energy and memory efficiency through sparse connectivity and activations. Our models are shown to achieve up to 100x reduction in the number of weights to be stored in memory, with an estimated 8.6x increase in energy efficiency compared to dense SNNs, while maintaining high performance and robustness against adversarial threats.

[LG-90] Sufficient conditions for offline reactivation in recurrent neural networks ICLR2024

链接: https://arxiv.org/abs/2505.17003
作者: Nanda H. Krishna,Colin Bredenberg,Daniel Levenstein,Blake A. Richards,Guillaume Lajoie
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: ICLR 2024

点击查看摘要

Abstract:During periods of quiescence, such as sleep, neural activity in many brain circuits resembles that observed during periods of task engagement. However, the precise conditions under which task-optimized networks can autonomously reactivate the same network states responsible for online behavior is poorly understood. In this study, we develop a mathematical framework that outlines sufficient conditions for the emergence of neural reactivation in circuits that encode features of smoothly varying stimuli. We demonstrate mathematically that noisy recurrent networks optimized to track environmental state variables using change-based sensory information naturally develop denoising dynamics, which, in the absence of input, cause the network to revisit state configurations observed during periods of online activity. We validate our findings using numerical experiments on two canonical neuroscience tasks: spatial position estimation based on self-motion cues, and head direction estimation based on angular velocity cues. Overall, our work provides theoretical support for modeling offline reactivation as an emergent consequence of task optimization in noisy neural circuits.

[LG-91] Critical Points of Random Neural Networks

链接: https://arxiv.org/abs/2505.17000
作者: Simmaco Di Lillo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:This work investigates the expected number of critical points of random neural networks with different activation functions as the depth increases in the infinite-width limit. Under suitable regularity conditions, we derive precise asymptotic formulas for the expected number of critical points of fixed index and those exceeding a given threshold. Our analysis reveals three distinct regimes depending on the value of the first derivative of the covariance evaluated at 1: the expected number of critical points may converge, grow polynomially, or grow exponentially with depth. The theoretical predictions are supported by numerical experiments. Moreover, we provide numerical evidence suggesting that, when the regularity condition is not satisfied (e.g. for neural networks with ReLU as activation function), the number of critical points increases as the map resolution increases, indicating a potential divergence in the number of critical points.

[LG-92] TULiP: Test-time Uncertainty Estimation via Linearization and Weight Perturbation

链接: https://arxiv.org/abs/2505.16923
作者: Yuhui Zhang,Dongshen Wu,Yuichiro Wada,Takafumi Kanamori
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A reliable uncertainty estimation method is the foundation of many modern out-of-distribution (OOD) detectors, which are critical for safe deployments of deep learning models in the open world. In this work, we propose TULiP, a theoretically-driven post-hoc uncertainty estimator for OOD detection. Our approach considers a hypothetical perturbation applied to the network before convergence. Based on linearized training dynamics, we bound the effect of such perturbation, resulting in an uncertainty score computable by perturbing model parameters. Ultimately, our approach computes uncertainty from a set of sampled predictions. We visualize our bound on synthetic regression and classification datasets. Furthermore, we demonstrate the effectiveness of TULiP using large-scale OOD detection benchmarks for image classification. Our method exhibits state-of-the-art performance, particularly for near-distribution samples.
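
The operational recipe, perturb the converged weights, collect a set of predictions, and score uncertainty from their spread, is easy to sketch. This is a toy torch illustration under that reading, not the exact TULiP estimator; the model and noise scale are placeholders.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def perturbed_predictions(model, x, n_samples=16, sigma=0.01):
    """Sample predictions from Gaussian weight perturbations of a trained model."""
    outputs = []
    for _ in range(n_samples):
        noisy = copy.deepcopy(model)
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
        outputs.append(noisy(x).softmax(dim=-1))
    return torch.stack(outputs)  # (n_samples, batch, classes)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))
x_in = torch.randn(5, 8)          # pretend in-distribution batch
x_ood = 5.0 * torch.randn(5, 8)   # pretend OOD batch

for name, x in [("in-dist", x_in), ("ood", x_ood)]:
    preds = perturbed_predictions(model, x)
    uncertainty = preds.var(dim=0).sum(dim=-1)  # spread across perturbations
    print(name, uncertainty.mean().item())
```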

[LG-93] Statistical Test for Saliency Maps of Graph Neural Networks via Selective Inference

链接: https://arxiv.org/abs/2505.16893
作者: Shuichi Nishino,Tomohiro Shiraishi,Teruyuki Katsuoka,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have gained prominence for their ability to process graph-structured data across various domains. However, interpreting GNN decisions remains a significant challenge, leading to the adoption of saliency maps for identifying influential nodes and edges. Despite their utility, the reliability of GNN saliency maps has been questioned, particularly in terms of their robustness to noise. In this study, we propose a statistical testing framework to rigorously evaluate the significance of saliency maps. Our main contribution lies in addressing the inflation of the Type I error rate caused by double-dipping of data, leveraging the framework of Selective Inference. Our method provides statistically valid p-values while controlling the Type I error rate, ensuring that identified salient subgraphs contain meaningful information rather than random artifacts. We conduct experiments on both synthetic and real-world datasets, showing the method's effectiveness in assessing the reliability of GNN interpretations.

[LG-94] How high is 'high'? Rethinking the roles of dimensionality in topological data analysis and manifold learning

链接: https://arxiv.org/abs/2505.16879
作者: Hannah Sansford,Nick Whiteley,Patrick Rubin-Delanchy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a generalised Hanson-Wright inequality and use it to establish new statistical insights into the geometry of data point-clouds. In the setting of a general random function model of data, we clarify the roles played by three notions of dimensionality: ambient intrinsic dimension p_\mathrm{int}, which measures total variability across orthogonal feature directions; correlation rank, which measures functional complexity across samples; and latent intrinsic dimension, which is the dimension of manifold structure hidden in data. Our analysis shows that in order for persistence diagrams to reveal latent homology and for manifold structure to emerge it is sufficient that p_\mathrm{int} \gg \log n, where n is the sample size. Informed by these theoretical perspectives, we revisit the ground-breaking neuroscience discovery of toroidal structure in grid-cell activity made by Gardner et al. (Nature, 2022): our findings reveal, for the first time, evidence that this structure is in fact isometric to physical space, meaning that grid cell activity conveys a geometrically faithful representation of the real world.

[LG-95] Experimental robustness benchmark of quantum neural network on a superconducting quantum processor

链接: https://arxiv.org/abs/2505.16714
作者: Hai-Feng Zhang,Zhao-Yun Chen,Peng Wang,Liang-Liang Guo,Tian-Le Wang,Xiao-Yan Yang,Ren-Ze Zhao,Ze-An Zhao,Sheng Zhang,Lei Du,Hao-Ran Tao,Zhi-Long Jia,Wei-Cheng Kong,Huan-Yu Liu,Athanasios V. Vasilakos,Yang Yang,Yu-Chun Wu,Ji Guan,Peng Duan,Guo-Ping Guo
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: There are 8 pages with 5 figures in the main text and 15 pages with 14 figures in the supplementary information

点击查看摘要

Abstract:Quantum machine learning (QML) models, like their classical counterparts, are vulnerable to adversarial attacks, hindering their secure deployment. Here, we report the first systematic experimental robustness benchmark for 20-qubit quantum neural network (QNN) classifiers executed on a superconducting processor. Our benchmarking framework features an efficient adversarial attack algorithm designed for QNNs, enabling quantitative characterization of adversarial robustness and robustness bounds. From our analysis, we verify that adversarial training reduces sensitivity to targeted perturbations by regularizing input gradients, significantly enhancing QNN’s robustness. Additionally, our analysis reveals that QNNs exhibit superior adversarial robustness compared to classical neural networks, an advantage attributed to inherent quantum noise. Furthermore, the empirical upper bound extracted from our attack experiments shows a minimal deviation (3 \times 10^{-3}) from the theoretical lower bound, providing strong experimental confirmation of the attack’s effectiveness and the tightness of fidelity-based robustness bounds. This work establishes a critical experimental framework for assessing and improving quantum adversarial robustness, paving the way for secure and reliable QML applications.

[LG-96] Sharp concentration of uniform generalization errors in binary linear classification

链接: https://arxiv.org/abs/2505.16713
作者: Shogo Nakakita
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 26 pages, 1 figure

点击查看摘要

Abstract:We examine the concentration of uniform generalization errors around their expectation in binary linear classification problems via an isoperimetric argument. In particular, we establish Poincaré and log-Sobolev inequalities for the joint distribution of the output labels and the label-weighted input vectors, which we apply to derive concentration bounds. The derived concentration bounds are sharp up to moderate multiplicative constants by those under well-balanced labels. In asymptotic analysis, we also show that almost sure convergence of uniform generalization errors to their expectation occurs in very broad settings, such as proportionally high-dimensional regimes. Using this convergence, we establish uniform laws of large numbers under dimension-free conditions.

[LG-97] Learning non-equilibrium diffusions with Schrödinger bridges: from exactly solvable to simulation-free

链接: https://arxiv.org/abs/2505.16644
作者: Stephen Y. Zhang,Michael P H Stumpf
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:We consider the Schrödinger bridge problem which, given ensemble measurements of the initial and final configurations of a stochastic dynamical system and some prior knowledge on the dynamics, aims to reconstruct the “most likely” evolution of the system compatible with the data. Most existing literature assumes Brownian reference dynamics and is implicitly limited to potential-driven dynamics. We depart from this regime and consider reference processes described by a multivariate Ornstein-Uhlenbeck process with generic drift matrix \mathbf{A} \in \mathbb{R}^{d \times d}. When \mathbf{A} is asymmetric, this corresponds to a non-equilibrium system with non-conservative forces at play: this is important for applications to biological systems, which naturally exist out of equilibrium. In the case of Gaussian marginals, we derive explicit expressions that characterise the solution of both the static and dynamic Schrödinger bridge. For general marginals, we propose mvOU-OTFM, a simulation-free algorithm based on flow and score matching for learning the Schrödinger bridge. In application to a range of problems based on synthetic and real single-cell data, we demonstrate that mvOU-OTFM achieves higher accuracy compared to competing methods, whilst being significantly faster to train.

[LG-98] Better Rates for Private Linear Regression in the Proportional Regime via Aggressive Clipping

Link: https://arxiv.org/abs/2505.16329
Authors: Simone Bombari, Inbar Seroussi, Marco Mondelli
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Differentially private (DP) linear regression has received significant attention in the recent theoretical literature, with several works aimed at obtaining improved error rates. A common approach is to set the clipping constant much larger than the expected norm of the per-sample gradients. While simplifying the analysis, this is however in sharp contrast with what empirical evidence suggests to optimize performance. Our work bridges this gap between theory and practice: we provide sharper rates for DP stochastic gradient descent (DP-SGD) by crucially operating in a regime where clipping happens frequently. Specifically, we consider the setting where the data is multivariate Gaussian, the number of training samples n is proportional to the input dimension d , and the algorithm guarantees constant-order zero concentrated DP. Our method relies on establishing a deterministic equivalent for the trajectory of DP-SGD in terms of a family of ordinary differential equations (ODEs). As a consequence, the risk of DP-SGD is bounded between two ODEs, with upper and lower bounds matching for isotropic data. By studying these ODEs when n / d is large enough, we demonstrate the optimality of aggressive clipping, and we uncover the benefits of decaying learning rate and private noise scheduling.
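For intuition, here is a toy sketch of DP-SGD for least squares with per-sample gradient clipping, the mechanism whose small-clip ("aggressive") regime the paper analyzes. The clipping constant, noise multiplier, and learning rate are arbitrary illustrative values, and no privacy accounting (e.g., zCDP calibration) is performed.

```python
import numpy as np

def dp_sgd_linear_regression(X, y, clip_C=1.0, noise_mult=1.0,
                             lr=0.1, epochs=5, seed=0):
    """DP-SGD for least squares: clip each per-sample gradient, add noise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (X[i] @ w - y[i]) * X[i]                          # per-sample gradient
            g *= min(1.0, clip_C / (np.linalg.norm(g) + 1e-12))   # clip to norm clip_C
            g += rng.normal(0.0, noise_mult * clip_C, size=d)     # Gaussian noise
            w -= lr * g
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)
print(dp_sgd_linear_regression(X, y, clip_C=0.5))  # small clip => frequent clipping
```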

[LG-99] Learning novel representations of variable sources from multi-modal Gaia data via autoencoders

Link: https://arxiv.org/abs/2505.16320
Authors: P. Huijse, J. De Ridder, L. Eyer, L. Rimoldini, B. Holl, N. Chornay, J. Roquette, K. Nienartowicz, G. Jevardat de Fombelle, D. J. Fritzewski, A. Kemp, V. Vanlaer, M. Vanrespaille, H. Wang, M.I. Carnerero, C.M. Raiteri, G. Marton, M. Madarász, G. Clementini, P. Gavras, C. Aerts
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*Comments: Manuscript resubmitted to Astronomy & Astrophysics after positive referee report, 20 pages, 20 figures, 2 tables

Click to view abstract

Abstract:Gaia Data Release 3 (DR3) published for the first time epoch photometry, BP/RP (XP) low-resolution mean spectra, and supervised classification results for millions of variable sources. This extensive dataset offers a unique opportunity to study their variability by combining multiple Gaia data products. In preparation for DR4, we propose and evaluate a machine learning methodology capable of ingesting multiple Gaia data products to achieve an unsupervised classification of stellar and quasar variability. A dataset of 4 million Gaia DR3 sources is used to train three variational autoencoders (VAE), which are artificial neural networks (ANNs) designed for data compression and generation. One VAE is trained on Gaia XP low-resolution spectra, another on a novel approach based on the distribution of magnitude differences in the Gaia G band, and the third on folded Gaia G band light curves. Each Gaia source is compressed into 15 numbers, representing the coordinates in a 15-dimensional latent space generated by combining the outputs of these three models. The learned latent representation produced by the ANN effectively distinguishes between the main variability classes present in Gaia DR3, as demonstrated through both supervised and unsupervised classification analysis of the latent space. The results highlight a strong synergy between light curves and low-resolution spectral data, emphasising the benefits of combining the different Gaia data products. A two-dimensional projection of the latent variables reveals numerous overdensities, most of which strongly correlate with astrophysical properties, showing the potential of this latent space for astrophysical discovery. We show that the properties of our novel latent representation make it highly valuable for variability analysis tasks, including classification, clustering and outlier detection.

[LG-100] Generator-Mediated Bandits: Thompson Sampling for GenAI-Powered Adaptive Interventions

Link: https://arxiv.org/abs/2505.16311
Authors: Marc Brooks, Gabriel Durham, Kihyuk Hong, Ambuj Tewari
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments: 39 pages, 12 figures

Click to view abstract

Abstract:Recent advances in generative artificial intelligence (GenAI) models have enabled the generation of personalized content that adapts to up-to-date user context. While personalized decision systems are often modeled using bandit formulations, the integration of GenAI introduces new structure into otherwise classical sequential learning problems. In GenAI-powered interventions, the agent selects a query, but the environment experiences a stochastic response drawn from the generative model. Standard bandit methods do not explicitly account for this structure, where actions influence rewards only through stochastic, observed treatments. We introduce generator-mediated bandit-Thompson sampling (GAMBITTS), a bandit approach designed for this action/treatment split, using mobile health interventions with large language model-generated text as a motivating case study. GAMBITTS explicitly models both the treatment and reward generation processes, using information in the delivered treatment to accelerate policy learning relative to standard methods. We establish regret bounds for GAMBITTS by decomposing sources of uncertainty in treatment and reward, identifying conditions where it achieves stronger guarantees than standard bandit approaches. In simulation studies, GAMBITTS consistently outperforms conventional algorithms by leveraging observed treatments to more accurately estimate expected rewards.
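A toy sketch of the action/treatment split described above, with a scalar treatment and a linear reward model; the conjugate Gaussian updates and plug-in treatment means are simplifications of ours, not the GAMBITTS algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: each query (arm) induces a distribution over a scalar
# "treatment"; the reward depends on the delivered treatment, not the arm.
arm_treatment_means = np.array([0.2, 0.5, 0.9])   # unknown to the agent
theta_true, obs_var = 2.0, 0.01                   # reward = theta * t + noise

def pull(arm):
    t = rng.normal(arm_treatment_means[arm], 0.1)  # stochastic treatment
    r = theta_true * t + rng.normal(0.0, 0.1)      # reward mediated by t
    return t, r

n_arms, horizon = 3, 500
t_sum, t_cnt = np.zeros(n_arms), np.zeros(n_arms)  # empirical treatment model
prec, mean = 1.0, 0.0                              # Gaussian posterior on theta
for step in range(horizon):
    if step < n_arms:
        arm = step                                       # pull each arm once
    else:
        theta_s = rng.normal(mean, 1.0 / np.sqrt(prec))  # Thompson sample
        arm = int(np.argmax(theta_s * (t_sum / t_cnt)))  # plug-in treatment means
    t, r = pull(arm)
    t_sum[arm] += t; t_cnt[arm] += 1
    new_prec = prec + t * t / obs_var                    # conjugate Gaussian update
    mean = (prec * mean + t * r / obs_var) / new_prec
    prec = new_prec

print("estimated theta:", round(mean, 3), "pulls per arm:", t_cnt)
```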

[LG-101] Higher-Order Asymptotics of Test-Time Adaptation for Batch Normalization Statistics

Link: https://arxiv.org/abs/2505.16257
Authors: Masanari Kimura
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This study develops a higher-order asymptotic framework for test-time adaptation (TTA) of Batch Normalization (BN) statistics under distribution shift by integrating classical Edgeworth expansion and saddlepoint approximation techniques with a novel one-step M-estimation perspective. By analyzing the statistical discrepancy between training and test distributions, we derive an Edgeworth expansion for the normalized difference in BN means and obtain an optimal weighting parameter that minimizes the mean-squared error of the adapted statistic. Reinterpreting BN TTA as a one-step M-estimator allows us to derive higher-order local asymptotic normality results, which incorporate skewness and other higher moments into the estimator’s behavior. Moreover, we quantify the trade-offs among bias, variance, and skewness in the adaptation process and establish a corresponding generalization bound on the model risk. The refined saddlepoint approximations further deliver uniformly accurate density and tail probability estimates for the BN TTA statistic. These theoretical insights provide a comprehensive understanding of how higher-order corrections and robust one-step updating can enhance the reliability and performance of BN layers in adapting to changing data distributions.
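The adapted statistic at the heart of this analysis is a convex combination of source and test-batch moments; a minimal sketch follows, where `weight` stands in for the MSE-optimal weighting parameter the paper derives (here it is simply hand-picked).

```python
import numpy as np

def adapt_bn_mean(train_mean, test_batch, weight):
    """Blend source BN statistics with test-batch statistics.

    weight=0 keeps the training mean; weight=1 fully trusts the test batch.
    The paper derives an MSE-optimal weight via Edgeworth expansions; here
    `weight` is just an illustrative hyperparameter.
    """
    return (1.0 - weight) * train_mean + weight * test_batch.mean(axis=0)

train_mean = np.zeros(4)
test_batch = np.random.default_rng(0).normal(loc=0.5, size=(32, 4))  # shifted data
print(adapt_bn_mean(train_mean, test_batch, weight=0.7))
```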

[LG-102] Graph-Smoothed Bayesian Black-Box Shift Estimator and Its Information Geometry

Link: https://arxiv.org/abs/2505.16251
Authors: Masanari Kimura
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Label shift adaptation aims to recover target class priors when the labelled source distribution P and the unlabelled target distribution Q share P(X \mid Y) = Q(X \mid Y) but P(Y) \neq Q(Y). Classical black-box shift estimators invert an empirical confusion matrix of a frozen classifier, producing a brittle point estimate that ignores sampling noise and similarity among classes. We present Graph-Smoothed Bayesian BBSE (GS-B^3SE), a fully probabilistic alternative that places Laplacian-Gaussian priors on both target log-priors and confusion-matrix columns, tying them together on a label-similarity graph. The resulting posterior is tractable with HMC or a fast block Newton-CG scheme. We prove identifiability, N^{-1/2} contraction, variance bounds that shrink with the graph's algebraic connectivity, and robustness to Laplacian misspecification. We also reinterpret GS-B^3SE through information geometry, showing that it generalizes existing shift estimators.
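For context, the classical black-box shift estimator that GS-B^3SE smooths and regularizes reduces to a linear solve; here is a minimal sketch with made-up numbers.

```python
import numpy as np

def bbse_weights(confusion, target_pred_marginal):
    """Classical black-box shift estimation: solve C w = q for class ratios w.

    confusion[i, j] = P(predict i, true label j) on held-out source data;
    target_pred_marginal[i] = fraction of target samples predicted as class i.
    GS-B^3SE replaces this brittle inversion with a graph-smoothed posterior.
    """
    w = np.linalg.solve(confusion, target_pred_marginal)
    return np.clip(w, 0.0, None)  # ratios q(y)/p(y) must be non-negative

C = np.array([[0.45, 0.05],
              [0.05, 0.45]])        # a fairly accurate binary classifier
q = np.array([0.30, 0.70])          # predicted-label marginal on the target
print(bbse_weights(C, q))           # estimated q(y)/p(y) per class: [0.5, 1.5]
```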

[LG-103] Generalized Power Priors for Improved Bayesian Inference with Historical Data

Link: https://arxiv.org/abs/2505.16244
Authors: Masanari Kimura, Howard Bondell
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:

Click to view abstract

Abstract:The power prior is a class of informative priors designed to incorporate historical data alongside current data in a Bayesian framework. It includes a power parameter that controls the influence of historical data, providing flexibility and adaptability. A key property of the power prior is that the resulting posterior minimizes a linear combination of KL divergences between two pseudo-posterior distributions: one ignoring historical data and the other fully incorporating it. We extend this framework by identifying the posterior distribution as the minimizer of a linear combination of Amari's \alpha-divergence, a generalization of KL divergence. We show that this generalization can lead to improved performance by allowing for the data to adapt to appropriate choices of the \alpha parameter. Theoretical properties of this generalized power posterior are established, including behavior as a generalized geodesic on the Riemannian manifold of probability distributions, offering novel insights into its geometric interpretation.
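As a reference point, the standard power prior and the KL characterization quoted in the abstract can be written as below; the exact weighting convention is an assumption on our part and may differ from the paper, whose contribution is to replace KL with Amari's \alpha-divergence.

```latex
% Power prior with historical data D_0, current data D, and a \in [0, 1]:
\pi(\theta \mid D_0, a) \propto L(\theta \mid D_0)^{a}\, \pi_0(\theta)
% KL characterization: the resulting posterior solves
\pi_a = \arg\min_{g} \;
  (1 - a)\, \mathrm{KL}\!\left(g \,\|\, \pi(\theta \mid D)\right)
  + a\, \mathrm{KL}\!\left(g \,\|\, \pi(\theta \mid D, D_0)\right)
% a = 0 recovers the posterior ignoring history; a = 1 fully incorporates it.
```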

[LG-104] Integral Imprecise Probability Metrics

Link: https://arxiv.org/abs/2505.16156
Authors: Siu Lun Chau, Michele Caprio, Krikamol Muandet
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 50 pages, 2 figures

Click to view abstract

Abstract:Quantifying differences between probability distributions is fundamental to statistics and machine learning, primarily for comparing statistical uncertainty. In contrast, epistemic uncertainty (EU), due to incomplete knowledge, requires richer representations than those offered by classical probability. Imprecise probability (IP) theory offers such models, capturing ambiguity and partial belief. This has driven growing interest in imprecise probabilistic machine learning (IPML), where inference and decision-making rely on broader uncertainty models, highlighting the need for metrics beyond classical probability. This work introduces the Integral Imprecise Probability Metric (IIPM) framework, a Choquet integral-based generalisation of the classical Integral Probability Metric (IPM) to the setting of capacities, a broad class of IP models encompassing many existing ones, including lower probabilities, probability intervals, belief functions, and more. Theoretically, we establish conditions under which IIPM serves as a valid metric and metrises a form of weak convergence of capacities. Practically, IIPM not only enables comparison across different IP models but also supports the quantification of epistemic uncertainty within a single IP model. In particular, by comparing an IP model with its conjugate, IIPM gives rise to a new class of EU measures, Maximum Mean Imprecision (MMI), which satisfy key axiomatic properties proposed in the Uncertainty Quantification literature. We validate MMI through selective classification experiments, demonstrating strong empirical performance against established EU measures, and outperforming them when classical methods struggle to scale to a large number of classes. Our work advances both theory and practice in IPML, offering a principled framework for comparing and quantifying epistemic uncertainty under imprecision.
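Since IIPM is built on Choquet integration, a tiny sketch of the discrete Choquet integral with respect to a capacity may help; the two-element capacity below is an invented example, not one from the paper.

```python
def choquet_integral(values, capacity):
    """Discrete Choquet integral of non-negative `values` w.r.t. `capacity`.

    `values` maps element -> value; `capacity` maps frozenset -> weight,
    with capacity(empty set) = 0, monotone, and capacity(full set) = 1.
    """
    order = sorted(values, key=values.get, reverse=True)
    total = 0.0
    for i, x in enumerate(order):
        upper = frozenset(order[: i + 1])   # top-i level set of the function
        nxt = values[order[i + 1]] if i + 1 < len(order) else 0.0
        total += (values[x] - nxt) * capacity[upper]
    return total

values = {"a": 0.9, "b": 0.4}
# A non-additive capacity (e.g., a lower probability with v({a,b}) = 1).
capacity = {frozenset(): 0.0, frozenset({"a"}): 0.3,
            frozenset({"b"}): 0.2, frozenset({"a", "b"}): 1.0}
print(choquet_integral(values, capacity))  # 0.5*0.3 + 0.4*1.0 = 0.55
```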

[LG-105] Exponential Convergence of CAVI for Bayesian PCA

Link: https://arxiv.org/abs/2505.16145
Authors: Arghya Datta, Philippe Gagnon, Florian Maire
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 28 pages, 3 figures

Click to view abstract

Abstract:Probabilistic principal component analysis (PCA) and its Bayesian variant (BPCA) are widely used for dimension reduction in machine learning and statistics. The main advantage of probabilistic PCA over the traditional formulation is allowing uncertainty quantification. The parameters of BPCA are typically learned using mean-field variational inference, and in particular, the coordinate ascent variational inference (CAVI) algorithm. So far, the convergence speed of CAVI for BPCA has not been characterized. In our paper, we fill this gap in the literature. Firstly, we prove a precise exponential convergence result in the case where the model uses a single principal component (PC). Interestingly, this result is established through a connection with the classical power iteration algorithm and it indicates that traditional PCA is retrieved as point estimates of the BPCA parameters. Secondly, we leverage recent tools to prove exponential convergence of CAVI for the model with any number of PCs, thus leading to a more general result, but one that is of a slightly different flavor. To prove the latter result, we additionally needed to introduce a novel lower bound for the symmetric Kullback–Leibler divergence between two multivariate normal distributions, which, we believe, is of independent interest in information theory.
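For readers unfamiliar with the classical algorithm the proof connects to, a minimal power iteration sketch on a sample covariance matrix (data and iteration count are arbitrary):

```python
import numpy as np

def power_iteration(S, n_iters=100, seed=0):
    """Classical power iteration: converges to the top eigenvector of S
    at a rate governed by the eigenvalue gap lambda_2 / lambda_1."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    for _ in range(n_iters):
        v = S @ v
        v /= np.linalg.norm(v)
    return v

X = np.random.default_rng(1).normal(size=(500, 5)) @ np.diag([3, 1, 1, 1, 1])
S = X.T @ X / len(X)           # sample covariance with one dominant direction
print(power_iteration(S))      # aligns with the leading PC, up to sign
```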

[LG-106] Machine Learning the 6d Supergravity Landscape

Link: https://arxiv.org/abs/2505.16131
Authors: Nathan Brady, David Tennyson, Thomas Vandermeulen
Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG)
*Comments: 49 pages; code and data available at this https URL

Click to view abstract

Abstract:In this paper, we apply both supervised and unsupervised machine learning algorithms to the study of the string landscape and swampland in 6 dimensions. Our data are the (almost) anomaly-free 6-dimensional \mathcal{N} = (1,0) supergravity models, characterised by the Gram matrix of anomaly coefficients. Our work demonstrates the ability of machine learning algorithms to efficiently learn highly complex features of the landscape and swampland. Employing an autoencoder for unsupervised learning, we provide an auto-classification of these models by compressing the Gram matrix data to 2-dimensions. Through compression, similar models cluster together, and we identify prominent features of these clusters. The autoencoder also identifies outlier models which are difficult to reconstruct. One of these outliers proves to be incredibly difficult to combine with other models such that the \mathrm{tr}\,R^4 anomaly vanishes, making its presence in the landscape extremely rare. Further, we utilise supervised learning to build two classifiers predicting (1) model consistency under probe string insertion (precision: 0.78, predicting consistency for 214,837 models with reasonable certainty) and (2) inconsistency under anomaly inflow (precision: 0.91, predicting inconsistency for 1,909,359 models). Notably, projecting these predictions onto the autoencoder's 2-dimensional latent layer shows consistent models clustering together, further indicating that the autoencoder has learnt interesting and complex features of the set of models and potentially offers a novel approach to mapping the landscape and swampland of 6-dimensional supergravity theories.

[LG-107] Dimension-adapted Momentum Outscales SGD

Link: https://arxiv.org/abs/2505.16098
Authors: Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, Courtney Paquette
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA’s improved loss exponents over SGD hold in a practical setting.

[LG-108] Oh SnapMMD! Forecasting Stochastic Dynamics Beyond the Schrödinger Bridge's End

Link: https://arxiv.org/abs/2505.16082
Authors: Renato Berlinghieri, Yunyi Shen, Jialong Jiang, Tamara Broderick
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments: 43 pages, 26 figures, 21 tables

Click to view abstract

Abstract:Scientists often want to make predictions beyond the observed time horizon of "snapshot" data following latent stochastic dynamics. For example, in time course single-cell mRNA profiling, scientists have access to cellular transcriptional state measurements (snapshots) from different biological replicates at different time points, but they cannot access the trajectory of any one cell because measurement destroys the cell. Researchers want to forecast (e.g.) differentiation outcomes from early state measurements of stem cells. Recent Schrödinger-bridge (SB) methods are natural for interpolating between snapshots. But past SB papers have not addressed forecasting, likely because existing methods either (1) reduce to following pre-set reference dynamics (chosen before seeing data) or (2) require the user to choose a fixed, state-independent volatility since they minimize a Kullback-Leibler divergence. Either case can lead to poor forecasting quality. In the present work, we propose a new framework, SnapMMD, that learns dynamics by directly fitting the joint distribution of both state measurements and observation time with a maximum mean discrepancy (MMD) loss. Unlike past work, our method allows us to infer unknown and state-dependent volatilities from the observed data. We show in a variety of real and synthetic experiments that our method delivers accurate forecasts. Moreover, our approach allows us to learn in the presence of incomplete state measurements and yields an R^2-style statistic that diagnoses fit. We also find that our method's performance at interpolation (and general velocity-field reconstruction) is at least as good as (and often better than) state-of-the-art in almost all of our experiments.
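SnapMMD's loss is built on maximum mean discrepancy; for reference, here is a minimal biased MMD^2 estimator with a Gaussian kernel. The bandwidth is hand-picked, and note the paper fits the joint distribution of states and observation times, which this sketch does not do.

```python
import numpy as np

def mmd2_gaussian(X, Y, bandwidth=1.0):
    """Biased estimate of squared MMD with a Gaussian RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2_gaussian(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
shifted = mmd2_gaussian(rng.normal(size=(200, 2)),
                        rng.normal(loc=1.0, size=(200, 2)))
print(same, shifted)  # the shifted pair yields a much larger MMD
```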

[LG-109] PO-Flow: Flow-based Generative Models for Sampling Potential Outcomes and Counterfactuals

Link: https://arxiv.org/abs/2505.16051
Authors: Dongze Wu, David I. Inouye, Yao Xie
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We propose PO-Flow, a novel continuous normalizing flow (CNF) framework for causal inference that jointly models potential outcomes and counterfactuals. Trained via flow matching, PO-Flow provides a unified framework for individualized potential outcome prediction, counterfactual predictions, and uncertainty-aware density learning. Among generative models, it is the first to enable density learning of potential outcomes without requiring explicit distributional assumptions (e.g., Gaussian mixtures), while also supporting counterfactual prediction conditioned on factual outcomes in general observational datasets. On benchmarks such as ACIC, IHDP, and IBM, it consistently outperforms prior methods across a range of causal inference tasks. Beyond that, PO-Flow succeeds in high-dimensional settings, including counterfactual image generation, demonstrating its broad applicability.

[LG-110] Multimodal Biomarkers for Schizophrenia: Towards Individual Symptom Severity Estimation INTERSPEECH2025

Link: https://arxiv.org/abs/2505.16044
Authors: Gowtham Premananth, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*Comments: Accepted to be presented at Interspeech 2025

Click to view abstract

Abstract:Studies on schizophrenia assessments using deep learning typically treat it as a classification task to detect the presence or absence of the disorder, oversimplifying the condition and reducing its clinical applicability. This traditional approach overlooks the complexity of schizophrenia, limiting its practical value in healthcare settings. This study shifts the focus to individual symptom severity estimation using a multimodal approach that integrates speech, video, and text inputs. We develop unimodal models for each modality and a multimodal framework to improve accuracy and robustness. By capturing a more detailed symptom profile, this approach can help in enhancing diagnostic precision and support personalized treatment, offering a scalable and objective tool for mental health assessment.

[LG-111] Physics-based machine learning for mantle convection simulations

Link: https://arxiv.org/abs/2505.16041
Authors: Siddhant Agarwal, Ali Can Bekar, Christian Hüttig, David S. Greenberg, Nicola Tosi
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Mantle convection simulations are an essential tool for understanding how rocky planets evolve. However, the poorly known input parameters to these simulations, the non-linear dependence of transport properties on pressure and temperature, and the long integration times in excess of several billion years all pose a computational challenge for numerical solvers. We propose a physics-based machine learning approach that predicts creeping flow velocities as a function of temperature while conserving mass, thereby bypassing the numerical solution of the Stokes problem. A finite-volume solver then uses the predicted velocities to advect and diffuse the temperature field to the next time-step, enabling autoregressive rollout at inference. For training, our model requires temperature-velocity snapshots from a handful of simulations (94). We consider mantle convection in a two-dimensional rectangular box with basal and internal heating, pressure- and temperature-dependent viscosity. Overall, our model is up to 89 times faster than the numerical solver. We also show the importance of different components in our convolutional neural network architecture such as mass conservation, learned paddings on the boundaries, and loss scaling for the overall rollout performance. Finally, we test our approach on unseen scenarios to demonstrate some of its strengths and weaknesses.

[LG-112] Diffusion Probabilistic Generative Models for Accelerated in-NICU Permanent Magnet Neonatal MRI

Link: https://arxiv.org/abs/2505.15984
Authors: Yamin Arefeen, Brett Levac, Bhairav Patel, Chang Ho, Jonathan I. Tamir
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*Comments:

Click to view abstract

Abstract:Purpose: Magnetic Resonance Imaging (MRI) enables non-invasive assessment of brain abnormalities during early life development. Permanent magnet scanners operating in the neonatal intensive care unit (NICU) facilitate MRI of sick infants, but have long scan times due to lower signal-to-noise ratios (SNR) and limited receive coils. This work accelerates in-NICU MRI with diffusion probabilistic generative models by developing a training pipeline accounting for these challenges. Methods: We establish a novel training dataset of clinical, 1 Tesla neonatal MR images in collaboration with Aspect Imaging and Sha'are Zedek Medical Center. We propose a pipeline to handle the low quantity and SNR of our real-world dataset: (1) modifying existing network architectures to support varying resolutions; (2) training a single model on all data with learned class embedding vectors; (3) applying self-supervised denoising before training; and (4) reconstructing by averaging posterior samples. Retrospective under-sampling experiments, accounting for signal decay, evaluated each item of our proposed methodology. A clinical reader study with practicing pediatric neuroradiologists evaluated our proposed images reconstructed from 1.5x under-sampled data. Results: Combining all data, denoising pre-training, and averaging posterior samples yields quantitative improvements in reconstruction. The generative model decouples the learned prior from the measurement model and functions at two acceleration rates without re-training. The reader study suggests that proposed images reconstructed from approximately 1.5x under-sampled data are adequate for clinical use. Conclusion: Diffusion probabilistic generative models applied with the proposed pipeline to handle challenging real-world datasets could reduce scan time of in-NICU neonatal MRI.

[LG-113] CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

Link: https://arxiv.org/abs/2505.15927
Authors: Awni Altabaa, Omar Montasser, John Lafferty
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Learning complex functions that involve multi-step reasoning poses a significant challenge for standard supervised learning from input-output examples. Chain-of-thought (CoT) supervision, which provides intermediate reasoning steps together with the final output, has emerged as a powerful empirical technique, underpinning much of the recent progress in the reasoning capabilities of large language models. This paper develops a statistical theory of learning under CoT supervision. A key characteristic of the CoT setting, in contrast to standard supervision, is the mismatch between the training objective (CoT risk) and the test objective (end-to-end risk). A central part of our analysis, distinguished from prior work, is explicitly linking those two types of risk to achieve sharper sample complexity bounds. This is achieved via the CoT information measure \mathcal{I}_{\mathcal{D}, h_\star}^{\mathrm{CoT}}(\epsilon; \mathcal{H}), which quantifies the additional discriminative power gained from observing the reasoning process. The main theoretical results demonstrate how CoT supervision can yield significantly faster learning rates compared to standard E2E supervision. Specifically, it is shown that the sample complexity required to achieve a target E2E error \epsilon scales as d / \mathcal{I}_{\mathcal{D}, h_\star}^{\mathrm{CoT}}(\epsilon; \mathcal{H}), where d is a measure of hypothesis class complexity, which can be much faster than standard d/\epsilon rates. Information-theoretic lower bounds in terms of the CoT information are also obtained. Together, these results suggest that CoT information is a fundamental measure of statistical complexity for learning under chain-of-thought supervision.

[LG-114] Multi-omic Causal Discovery using Genotypes and Gene Expression

Link: https://arxiv.org/abs/2505.15866
Authors: Stephen Asiedu, David Watson
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Causal discovery in multi-omic datasets is crucial for understanding the bigger picture of gene regulatory mechanisms, but remains challenging due to high dimensionality, differentiation of direct from indirect relationships, and hidden confounders. We introduce GENESIS (GEne Network inference from Expression SIgnals and SNPs), a constraint-based algorithm that leverages the natural causal precedence of genotypes to infer ancestral relationships in transcriptomic data. Unlike traditional causal discovery methods that start with a fully connected graph, GENESIS initialises an empty ancestrality matrix and iteratively populates it with direct, indirect or non-causal relationships using a series of provably sound marginal and conditional independence tests. By integrating genotypes as fixed causal anchors, GENESIS provides a principled "head start" to classical causal discovery algorithms, restricting the search space to biologically plausible edges. We test GENESIS on synthetic and real-world genomic datasets. This framework offers a powerful avenue for uncovering causal pathways in complex traits, with promising applications to functional genomics, drug discovery, and precision medicine.

[LG-115] Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy

Link: https://arxiv.org/abs/2505.15844
Authors: Yousuf Islam, Md. Jalal Uddin Chowdhury, Sumon Chandra Das
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP)
*Comments:

Click to view abstract

Abstract:Brain stroke remains one of the principal causes of death and disability worldwide, yet most tabular-data prediction models still hover below the 95% accuracy threshold, limiting real-world utility. Addressing this gap, the present work develops and validates a completely data-driven and interpretable machine-learning framework designed to predict strokes using ten routinely gathered demographic, lifestyle, and clinical variables sourced from a public cohort of 4,981 records. We employ a detailed exploratory data analysis (EDA) to understand the dataset's structure and distribution, followed by rigorous data preprocessing, including handling missing values, outlier removal, and class imbalance correction using the Synthetic Minority Over-sampling Technique (SMOTE). To streamline feature selection, point-biserial correlation and random-forest Gini importance were utilized, and ten varied algorithms, encompassing tree ensembles, boosting, kernel methods, and a multilayer neural network, were optimized using stratified five-fold cross-validation. Their probability outputs were then used to build the proposed stacked model, which combines Random Forest, XGBoost, LightGBM, and a support-vector classifier, with logistic regression acting as a meta-learner. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model, LightGBM, which had an accuracy of 91.4%. Our study's findings indicate that rigorous preprocessing, coupled with a diverse hybrid model, can convert low-cost tabular data into a nearly clinical-grade stroke-risk assessment tool.
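A dependency-light sketch of the described pipeline (SMOTE followed by a probability-stacked ensemble with a logistic-regression meta-learner): synthetic data stands in for the stroke cohort, and sklearn's GradientBoostingClassifier substitutes for XGBoost/LightGBM to keep imports minimal. It assumes the imbalanced-learn package is installed.

```python
from imblearn.over_sampling import SMOTE            # assumes imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the stroke table: 10 features, heavy class imbalance.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=0)
# NOTE: resampling before CV (as the pipeline describes) leaks synthetic
# samples across folds; production code would oversample inside each fold.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0)),  # XGB/LGBM stand-in
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),     # logistic-regression meta-learner
    stack_method="predict_proba", cv=5)
print(cross_val_score(stack, X_res, y_res, cv=5, scoring="f1").mean())
```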

Information Retrieval

[IR-0] LARES: Latent Reasoning for Sequential Recommendation

Link: https://arxiv.org/abs/2505.16865
Authors: Enze Liu, Bowen Zheng, Xiaolei Wang, Wayne Xin Zhao, Jinpeng Wang, Sheng Chen, Ji-Rong Wen
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Sequential recommender systems have become increasingly important in real-world applications that model user behavior sequences to predict their preferences. However, existing sequential recommendation methods predominantly rely on non-reasoning paradigms, which may limit the model's computational capacity and result in suboptimal recommendation performance. To address these limitations, we present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation that enhances the model's representation capability by increasing the computational density of its parameters through depth-recurrent latent reasoning. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity, thereby effectively capturing dynamic and intricate user interest patterns. A key difference of LARES lies in refining all input tokens at each implicit reasoning step to improve computational utilization. To fully unlock the model's reasoning potential, we design a two-phase training strategy: (1) Self-supervised pre-training (SPT) with dual alignment objectives; (2) Reinforcement post-training (RPT). During the first phase, we introduce trajectory-level alignment and step-level alignment objectives, which enable the model to learn recommendation-oriented latent reasoning patterns without requiring supplementary annotated data. The subsequent phase utilizes reinforcement learning (RL) to harness the model's exploratory ability, further refining its reasoning capabilities. Comprehensive experiments on real-world benchmarks demonstrate our framework's superior performance. Notably, LARES exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.

[IR-1] WalkRetrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks SIGIR2025

Link: https://arxiv.org/abs/2505.16849
Authors: Martin Böckling, Heiko Paulheim, Andreea Iana
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted at the Information Retrieval's Role in RAG Systems (IR-RAG 2025) in conjunction with SIGIR 2025

Click to view abstract

Abstract:Large Language Models (LLMs) have showcased impressive reasoning abilities, but often suffer from hallucinations or outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) remedies these shortcomings by grounding LLM responses in structured external information from a knowledge base. However, many KG-based RAG approaches struggle with (i) aligning KG and textual representations, (ii) balancing retrieval accuracy and efficiency, and (iii) adapting to dynamically updated KGs. In this work, we introduce WalkRetrieve, a simple yet effective KG-based framework that leverages walk-based graph traversal and knowledge verbalization for corpus generation for zero-shot RAG. Built around efficient KG walks, our method does not require fine-tuning on domain-specific data, enabling seamless adaptation to KG updates, reducing computational overhead, and allowing integration with any off-the-shelf backbone LLM. Despite its simplicity, WalkRetrieve performs competitively, often outperforming existing RAG systems in response accuracy and hallucination reduction. Moreover, it demonstrates lower query latency and robust scalability to large KGs, highlighting the potential of lightweight retrieval strategies as strong baselines for future RAG research.
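To illustrate the walk-and-verbalize idea, here is a toy sketch over a hand-made triple store; the verbalization scheme and walk policy are simplifications of ours, not the paper's corpus-generation procedure.

```python
import random

# Tiny toy KG as (subject, predicate, object) triples.
triples = [
    ("Marie_Curie", "field", "Physics"),
    ("Marie_Curie", "award", "Nobel_Prize"),
    ("Nobel_Prize", "awarded_in", "Stockholm"),
    ("Physics", "subfield", "Radioactivity"),
]
out_edges = {}
for s, p, o in triples:
    out_edges.setdefault(s, []).append((p, o))

def walk_and_verbalize(start, length=3, seed=0):
    """Random walk from `start`, verbalized into a pseudo-sentence that can
    be indexed as part of a retrieval corpus for zero-shot RAG."""
    rng = random.Random(seed)
    node, parts = start, [start.replace("_", " ")]
    for _ in range(length):
        if node not in out_edges:
            break
        p, o = rng.choice(out_edges[node])
        parts += [p.replace("_", " "), o.replace("_", " ")]
        node = o
    return " ".join(parts)

print(walk_and_verbalize("Marie_Curie"))
# e.g. "Marie Curie award Nobel Prize awarded in Stockholm"
```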

[IR-2] DeepRec: Towards a Deep Dive Into the Item Space with Large Language Model Based Recommendation

Link: https://arxiv.org/abs/2505.16810
Authors: Bowen Zheng, Xiaolei Wang, Enze Liu, Xi Wang, Lu Hongyu, Yu Chen, Wayne Xin Zhao, Ji-Rong Wen
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Recently, large language models (LLMs) have been introduced into recommender systems (RSs), either to enhance traditional recommendation models (TRMs) or serve as recommendation backbones. However, existing LLM-based RSs often do not fully exploit the complementary advantages of LLMs (e.g., world knowledge and reasoning) and TRMs (e.g., recommendation-specific knowledge and efficiency) to fully explore the item space. To address this, we propose DeepRec, a novel LLM-based RS that enables autonomous multi-turn interactions between LLMs and TRMs for deep exploration of the item space. In each interaction turn, LLMs reason over user preferences and interact with TRMs to retrieve candidate items. After multi-turn interactions, LLMs rank the retrieved items to generate the final recommendations. We adopt reinforcement learning (RL) based optimization and propose novel designs from three aspects: recommendation model based data rollout, recommendation-oriented hierarchical rewards, and a two-stage RL training strategy. For data rollout, we introduce a preference-aware TRM, with which LLMs interact to construct trajectory data. For rewards, we design a hierarchical reward function that involves both process-level and outcome-level rewards to optimize the interaction process and recommendation performance, respectively. For RL training, we develop a two-stage training strategy, where the first stage aims to guide LLMs to interact with TRMs and the second stage focuses on performance improvement. Experiments on public datasets demonstrate that DeepRec significantly outperforms both traditional and LLM-based baselines, offering a new paradigm for deep exploration in recommendation systems.

[IR-3] A Novel Generative Model with Causality Constraint for Mitigating Biases in Recommender Systems

Link: https://arxiv.org/abs/2505.16708
Authors: Jianfeng Deng, Qingfeng Chen, Debo Cheng, Jiuyong Li, Lin Liu, Shichao Zhang
Subjects: Information Retrieval (cs.IR)
*Comments: 11 pages

Click to view abstract

Abstract:Accurately predicting counterfactual user feedback is essential for building effective recommender systems. However, latent confounding bias can obscure the true causal relationship between user feedback and item exposure, ultimately degrading recommendation performance. Existing causal debiasing approaches often rely on strong assumptions, such as the availability of instrumental variables (IVs) or strong correlations between latent confounders and proxy variables, that are rarely satisfied in real-world scenarios. To address these limitations, we propose a novel generative framework called Latent Causality Constraints for Debiasing representation learning in Recommender Systems (LCDR). Specifically, LCDR leverages an identifiable Variational Autoencoder (iVAE) as a causal constraint to align the latent representations learned by a standard Variational Autoencoder (VAE) through a unified loss function. This alignment allows the model to leverage even weak or noisy proxy variables to recover latent confounders effectively. The resulting representations are then used to improve recommendation performance. Extensive experiments on three real-world datasets demonstrate that LCDR consistently outperforms existing methods in both mitigating bias and improving recommendation accuracy.

[IR-4] MDVT: Enhancing Multimodal Recommendation with Model-Agnostic Multimodal-Driven Virtual Triplets KDD2025

Link: https://arxiv.org/abs/2505.16665
Authors: Jinfeng Xu, Zheyu Chen, Jinze Li, Shuo Yang, Hewei Wang, Yijie Li, Mengran Li, Puzhen Wu, Edith C. H. Ngai
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted by KDD 2025

Click to view abstract

Abstract:The data sparsity problem significantly hinders the performance of recommender systems, as traditional models rely on limited historical interactions to learn user preferences and item properties. While incorporating multimodal information can explicitly represent these preferences and properties, existing works often use it only as side information, failing to fully leverage its potential. In this paper, we propose MDVT, a model-agnostic approach that constructs multimodal-driven virtual triplets to provide valuable supervision signals, effectively mitigating the data sparsity problem in multimodal recommendation systems. To ensure high-quality virtual triplets, we introduce three tailored warm-up threshold strategies: static, dynamic, and hybrid. The static warm-up threshold strategy exhaustively searches for the optimal number of warm-up epochs but is time-consuming and computationally intensive. The dynamic warm-up threshold strategy adjusts the warm-up period based on loss trends, improving efficiency but potentially missing optimal performance. The hybrid strategy combines both, using the dynamic strategy to find the approximate optimal number of warm-up epochs and then refining it with the static strategy in a narrow hyper-parameter space. Once the warm-up threshold is satisfied, the virtual triplets are used for joint model optimization by our enhanced pair-wise loss function without causing significant gradient skew. Extensive experiments on multiple real-world datasets demonstrate that integrating MDVT into advanced multimodal recommendation models effectively alleviates the data sparsity problem and improves recommendation performance, particularly in sparse data scenarios.

[IR-5] Causal-Invariant Cross-Domain Out-of-Distribution Recommendation

Link: https://arxiv.org/abs/2505.16532
Authors: Jiajie Zhu, Yan Wang, Feng Zhu, Pengfei Ding, Hongyang Liu, Zhu Sun
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Cross-Domain Recommendation (CDR) aims to leverage knowledge from a relatively data-richer source domain to address the data sparsity problem in a relatively data-sparser target domain. While CDR methods need to address the distribution shifts between different domains, i.e., cross-domain distribution shifts (CDDS), they typically assume independent and identical distribution (IID) between training and testing data within the target domain. However, this IID assumption rarely holds in real-world scenarios due to single-domain distribution shift (SDDS). The above two co-existing distribution shifts lead to out-of-distribution (OOD) environments that hinder effective knowledge transfer and generalization, ultimately degrading recommendation performance in CDR. To address these co-existing distribution shifts, we propose a novel Causal-Invariant Cross-Domain Out-of-distribution Recommendation framework, called CICDOR. In CICDOR, we first learn dual-level causal structures to infer domain-specific and domain-shared causal-invariant user preferences for tackling both CDDS and SDDS under OOD environments in CDR. Then, we propose an LLM-guided confounder discovery module that seamlessly integrates LLMs with a conventional causal discovery method to extract observed confounders for effective deconfounding, thereby enabling accurate causal-invariant preference inference. Extensive experiments on two real-world datasets demonstrate the superior recommendation accuracy of CICDOR over state-of-the-art methods across various OOD scenarios.

[IR-6] Utilizing citation index and synthetic quality measure to compare Wikipedia languages across various topics

Link: https://arxiv.org/abs/2505.16506
Authors: Włodzimierz Lewoniewski, Krzysztof Węcel, Witold Abramowicz
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Digital Libraries (cs.DL); Applications (stat.AP)
*Comments: Presented at the Wiki Workshop 2025

Click to view abstract

Abstract:This study presents a comparative analysis of 55 Wikipedia language editions employing a citation index alongside a synthetic quality measure. Specifically, we identified the most significant Wikipedia articles within distinct topical areas, selecting the top 10, top 25, and top 100 most cited articles in each topic and language version. The index was built from wikilinks between Wikipedia articles in each language version, which required processing 6.6 billion page-to-page link records. Next, we used a quality score for each Wikipedia article, a synthetic measure scaled from 0 to 100. This approach enabled quality comparison of Wikipedia articles even between language versions with different quality grading schemes. Our results highlight disparities among Wikipedia language editions, revealing strengths and gaps in content coverage and quality across topics.

[IR-7] Chain-of-Thought Poisoning Attacks against R1-based Retrieval-Augmented Generation Systems

Link: https://arxiv.org/abs/2505.16367
Authors: Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Yixing Fan
Subjects: Information Retrieval (cs.IR)
*Comments: 7 pages, 3 figures

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) systems can effectively mitigate the hallucination problem of large language models (LLMs), but they also possess inherent vulnerabilities. Identifying these weaknesses before the large-scale real-world deployment of RAG systems is of great importance, as it lays the foundation for building more secure and robust RAG systems in the future. Existing adversarial attack methods typically exploit knowledge base poisoning to probe the vulnerabilities of RAG systems, which can effectively deceive standard RAG models. However, with the rapid advancement of deep reasoning capabilities in modern LLMs, previous approaches that merely inject incorrect knowledge are inadequate when attacking RAG systems equipped with deep reasoning abilities. Inspired by the deep thinking capabilities of LLMs, this paper extracts reasoning process templates from R1-based RAG systems, uses these templates to wrap erroneous knowledge into adversarial documents, and injects them into the knowledge base to attack RAG systems. The key idea of our approach is that adversarial documents, by simulating the chain-of-thought patterns aligned with the model's training signals, may be misinterpreted by the model as authentic historical reasoning processes, thus increasing their likelihood of being referenced. Experiments conducted on the MS MARCO passage ranking dataset demonstrate the effectiveness of our proposed method.

[IR-8] Flow Matching based Sequential Recommender Model IJCAI2025

Link: https://arxiv.org/abs/2505.16298
Authors: Feng Liu, Lixin Zou, Xiangyu Zhao, Min Tang, Liming Dong, Dan Luo, Xiangyang Luo, Chenliang Li
Subjects: Information Retrieval (cs.IR)
*Comments: 11 pages, 8 figures, IJCAI 2025 Accepted Work

Click to view abstract

Abstract:Generative models, particularly diffusion model, have emerged as powerful tools for sequential recommendation. However, accurately modeling user preferences remains challenging due to the noise perturbations inherent in the forward and reverse processes of diffusion-based methods. Towards this end, this study introduces FMRec, a Flow Matching based model that employs a straight flow trajectory and a modified loss tailored for the recommendation task. Additionally, from the diffusion-model perspective, we integrate a reconstruction loss to improve robustness against noise perturbations, thereby retaining user preferences during the forward process. In the reverse process, we employ a deterministic reverse sampler, specifically an ODE-based updating function, to eliminate unnecessary randomness, thereby ensuring that the generated recommendations closely align with user needs. Extensive evaluations on four benchmark datasets reveal that FMRec achieves an average improvement of 6.53% over state-of-the-art methods. The replication code is available at this https URL.
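For intuition about the straight flow trajectory and the deterministic ODE-based sampler, a minimal rectified-flow-style sketch in PyTorch follows; random vectors stand in for user-preference embeddings, and the network, dimensions, and step counts are arbitrary choices of ours, not FMRec's architecture.

```python
import torch
import torch.nn as nn

# Minimal flow-matching objective with a straight trajectory:
# interpolate x_t = (1-t) x0 + t x1 and regress onto the constant
# velocity (x1 - x0), which is the trajectory's time derivative.
net = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x0 = torch.randn(256, 8)            # noise source
x1 = torch.randn(256, 8)            # stand-in "user preference" targets
for _ in range(100):
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1      # straight trajectory
    v_target = x1 - x0              # constant target velocity
    v_pred = net(torch.cat([xt, t], dim=1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())

# Deterministic Euler sampler: ODE-based updates, no injected randomness.
with torch.no_grad():
    x = torch.randn(4, 8)
    for k in range(10):
        t = torch.full((4, 1), k / 10)
        x = x + 0.1 * net(torch.cat([x, t], dim=1))
print(x.shape)  # four generated embeddings
```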

[IR-9] HASH-RAG : Bridging Deep Hashing with Retriever for Efficient Fine Retrieval and Augmented Generation

Link: https://arxiv.org/abs/2505.16133
Authors: Jinyu Guo, Xunlei Chen, Qiyang Xia, Zhaokun Wang, Jie Ou, Libo Qin, Shunyu Yao, Wenhong Tian
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) encounters efficiency challenges when scaling to massive knowledge bases while preserving contextual relevance. We propose Hash-RAG, a framework that integrates deep hashing techniques with systematic optimizations to address these limitations. Our queries directly learn binary hash codes from the knowledge base code, eliminating intermediate feature extraction steps and significantly reducing storage and computational overhead. Building upon this hash-based efficient retrieval framework, we establish the foundation for fine-grained chunking. Consequently, we design a Prompt-Guided Chunk-to-Context (PGCC) module that leverages retrieved hash-indexed propositions and their original document segments through prompt engineering to enhance the LLM's contextual awareness. Experimental evaluations on NQ, TriviaQA, and HotpotQA datasets demonstrate that our approach achieves a 90% reduction in retrieval time compared to conventional methods while maintaining considerable recall performance. Additionally, the proposed system outperforms retrieval/non-retrieval baselines by 1.4-4.3% in EM scores.
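The retrieval primitive behind hash-based RAG is nearest-neighbor search in Hamming space over binary codes; a minimal sketch with random (untrained) codes standing in for the learned hashes:

```python
import numpy as np

def hamming_search(query_code, db_codes, top_k=3):
    """Retrieve nearest items by Hamming distance between binary hash codes."""
    dists = (query_code[None, :] != db_codes).sum(axis=1)
    return np.argsort(dists)[:top_k], np.sort(dists)[:top_k]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(10_000, 64), dtype=np.uint8)   # 64-bit codes
q = db[42].copy(); q[:3] ^= 1                                # near-duplicate query
idx, d = hamming_search(q, db)
print(idx, d)   # item 42 should surface with Hamming distance 3
```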

[IR-10] Emotion-based Recommender System

Link: https://arxiv.org/abs/2505.16121
Authors: Hao Wang
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Recommender systems are one of the most critical technologies for large internet companies such as Amazon and TikTok. Although millions of users use recommender systems globally every day, and much data analysis work has been done to improve the technical accuracy of these systems, to the best of our knowledge little attention has been paid to analyzing users' emotions in recommender systems. In this paper, we create a new theory and metrics that capture users' emotions when they are interacting with recommender systems. We also provide effective and efficient visualization techniques for users' emotions and their change over the customer lifetime cycle. Finally, we design a framework for emotion-based recommendation algorithms, illustrated with a straightforward example and experimental results to demonstrate the effectiveness of our new theory.

Attachments

Click to download today's full paper list