本篇博文主要内容为 2025-05-27 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-05-27)

今日共更新1478篇论文,其中:

  • 自然语言处理351篇(Computation and Language (cs.CL))
  • 人工智能531篇(Artificial Intelligence (cs.AI))
  • 计算机视觉318篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习507篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding

【速读】: 该论文旨在解决如何让大型多模态模型(LMMs)以接近人类水平理解日本漫画(manga)这一复杂多模态叙事形式的问题。其关键解决方案是构建两个基准测试:MangaOCR用于页面内文本识别,MangaVQA则通过视觉问答任务评估上下文理解能力,同时开发了专门针对漫画的模型MangaLMM,该模型在开源模型Qwen2.5-VL基础上进行微调,以联合处理两种任务。

链接: https://arxiv.org/abs/2505.20298
作者: Jeonghun Baek,Kazuki Egashira,Shota Onohara,Atsuyuki Miyai,Yuki Imajuku,Hikaru Ikuta,Kiyoharu Aizawa
机构: The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 11 figures

点击查看摘要

Abstract:Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
zh

[NLP-1] DiSA: Diffusion Step Annealing in Autoregressive Image Generation

【速读】: 该论文试图解决自回归模型在使用扩散采样进行图像生成时推理效率低的问题(即扩散采样通常需要50至100步来生成一个token,导致计算开销大)。解决方案的关键在于观察到随着自回归过程的推进,后续token的分布变得更加受限,因此可以使用更少的扩散步骤进行采样。基于这一发现,作者提出了无需训练的扩散步数退火(Diffusion Step Annealing, DiSA)方法,该方法在生成初期使用较多扩散步骤,随着token生成数量增加逐步减少步骤数,从而显著提升推理速度,同时保持生成质量。

链接: https://arxiv.org/abs/2505.20297
作者: Qinyu Zhao,Jaskirat Singh,Ming Xu,Akshay Asthana,Stephen Gould,Liang Zheng
机构: Australian National University (澳大利亚国立大学); Seeing Machines Ltd (Seeing Machines Ltd)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Our code is available at this https URL

点击查看摘要

Abstract:An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves 5-10\times faster inference for MAR and Harmon and 1.4-2.5\times for FlowAR and xAR, while maintaining the generation quality.
zh

[NLP-2] Reasoning LLM s are Wandering Solution Explorers

【速读】: 该论文试图解决当前生成式 AI (Generative AI) 在系统性问题求解能力上的不足,特别是其在推理过程中缺乏对解空间的系统探索能力。论文通过形式化系统性问题求解的定义,并识别出推理 LLMs (Reasoning LLMs) 的常见失败模式,揭示了现有模型在复杂任务中表现下降的根本原因,如无效推理步骤、冗余探索以及幻觉或不忠实的结论等。解决方案的关键在于提出新的评估指标和工具,不仅关注最终输出,更注重推理过程本身的结构,以推动更可靠和系统的推理能力发展。

链接: https://arxiv.org/abs/2505.20296
作者: Jiahao Lu,Ziwei Xu,Mohan Kankanhalli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 71 pages, 14 figures, 2 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning abilities through test-time computation (TTC) techniques such as chain-of-thought prompting and tree-based reasoning. However, we argue that current reasoning LLMs (RLLMs) lack the ability to systematically explore the solution space. This paper formalizes what constitutes systematic problem solving and identifies common failure modes that reveal reasoning LLMs to be wanderers rather than systematic explorers. Through qualitative and quantitative analysis across multiple state-of-the-art LLMs, we uncover persistent issues: invalid reasoning steps, redundant explorations, hallucinated or unfaithful conclusions, and so on. Our findings suggest that current models’ performance can appear to be competent on simple tasks yet degrade sharply as complexity increases. Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
zh

[NLP-3] Self-reflective Uncertainties: Do LLM s Know Their Internal Answer Distribution?

【速读】: 该论文试图解决如何在大型语言模型(Large Language Model, LLM)的输出空间中有效表征和解释模型不确定性的问题。传统方法通过生成百分比数值来量化不确定性,但作者认为这并非最优解。解决方案的关键在于提出一种新的不确定性解释方法——SelfReflect,该方法基于理论动机,用于评估某个字符串对LLM内部答案分布的忠实总结程度。通过实验验证,SelfReflect能够区分候选摘要字符串之间的细微差异,并与人类判断一致,优于其他替代指标。该方法为未来研究LLM普遍不确定性提供了新的方向。

链接: https://arxiv.org/abs/2505.20295
作者: Michael Kirchhof,Luca Füger,Adam Goliński,Eeshan Gunesh Dhekane,Arno Blaas,Sinead Williamson
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM’s internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. Our metric enables future works towards this universal form of LLM uncertainties.
zh

[NLP-4] Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery ACL2025

【速读】: 该论文试图解决生成式 AI (Generative AI) 在文本领域中可解释性不足的问题,特别是现有基于概念的可解释方法依赖预定义概念标注或无法自动发现易于理解的概念。解决方案的关键在于提出一种内在可解释的框架 ECO-Concept,该框架通过对象中心架构自动提取语义概念,并利用大语言模型评估概念的可理解性,进而指导模型微调以获得更易理解的解释。

链接: https://arxiv.org/abs/2505.20293
作者: Yifan Sun,Danding Wang,Qiang Sheng,Juan Cao,Jintao Li
机构: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adaption in the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose \textbfECO-Concept, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpassed current counterparts in comprehensibility.
zh

[NLP-5] Visualized Text-to-Image Retrieval

【速读】: 该论文试图解决现有多模态嵌入在跨模态相似性对齐上的局限性,特别是在文本到图像(Text-to-Image, T2I)检索任务中对细微视觉空间特征识别能力不足的问题。解决方案的关键在于提出一种新的范式——Visualize-then-Retrieve (VisRet),该方法首先通过文本到图像生成将文本查询投影到图像模态,然后在图像模态内进行检索,从而绕过跨模态检索器的弱点。

链接: https://arxiv.org/abs/2505.20291
作者: Di Wu,Yixin Wan,Kai-Wei Chang
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at this https URL.
zh

[NLP-6] MASKSEARCH: A Universal Pre-Training Framework to Enhance Agent ic Search Capability

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在开放域多跳问答任务中检索能力和推理能力不足的问题。其解决方案的关键在于提出一种新的预训练框架MASKSEARCH,其中引入了检索增强的掩码预测(Retrieval Augmented Mask Prediction, RAMP)任务,使模型通过搜索工具学习填补大量预训练数据中的掩码片段,从而获得通用的检索与推理能力。此外,该框架结合了监督微调和强化学习,并采用课程学习策略,逐步提升模型处理复杂任务的能力。

链接: https://arxiv.org/abs/2505.20285
作者: Weiqi Wu,Xin Guan,Shen Huang,Yong Jiang,Pengjun Xie,Fei Huang,Jiuxin Cao,Hai Zhao,Jingren Zhou
机构: Tongyi Lab (通义实验室); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MASKSEARCH. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans on a large number of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, rewriter, observer, and followed by a self-evolving teacher model. While for RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MASKSEARCH significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
zh

[NLP-7] One-shot Entropy Minimization

【速读】: 该论文试图解决大规模语言模型在后训练阶段中依赖大量标注数据和复杂奖励机制以提升性能的问题。其解决方案的关键在于通过熵最小化方法,仅需单个未标注数据和10步优化即可实现性能提升,效果可与使用数千数据和精心设计奖励机制的基于规则的强化学习相媲美甚至更优。这一方法可能促使对大规模语言模型后训练范式的重新思考。

链接: https://arxiv.org/abs/2505.20282
作者: Zitian Gao,Lynx Chen,Joey Zhou,Bryan Dai
机构: Ubiquant( Ubiquant )
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is avaliable at this https URL.
zh

[NLP-8] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

【速读】: 该论文旨在解决现有大型多模态模型在理解三维场景时面临的深度空间理解不足问题,特别是针对单目视频输入和时间敏感应用场景下的可扩展性限制。其解决方案的关键在于提出一种统一的视觉-语言模型框架VLM-3R,该框架通过引入三维重建指令微调(3D Reconstructive instruction tuning),利用几何编码器从单目视频帧中生成隐式三维标记,从而实现对现实世界空间上下文的有效对齐与理解,进而支持单目三维空间辅助和具身推理。

链接: https://arxiv.org/abs/2505.20279
作者: Zhiwen Fan,Jian Zhang,Renjie Li,Junge Zhang,Runjin Chen,Hezhen Hu,Kevin Wang,Huaizhi Qu,Dilin Wang,Zhicheng Yan,Hongyu Xu,Justin Theiss,Tianlong Chen,Jiachen Li,Zhengzhong Tu,Zhangyang Wang,Rakesh Ranjan
机构: UT Austin (德克萨斯大学奥斯汀分校); XMU (厦门大学); TAMU (德州农工大学); UCR (加州大学河滨分校); UNC (北卡罗来纳大学); Meta (Meta
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.
zh

[NLP-9] he Coverag e Principle: A Framework for Understanding Compositional Generalization

【速读】: 该论文试图解决大型语言模型在系统性组合泛化(systematic compositional generalization)方面的不足,即模型在面对新组合或复杂结构时表现不佳的问题。解决方案的关键在于提出“覆盖原则”(coverage principle),这是一个以数据为中心的框架,用以解释模型依赖模式匹配进行组合任务时无法可靠地泛化到不同上下文中的原因,特别是当替换片段在相同上下文中产生相同结果时。该原则揭示了模型泛化能力的局限性,并为理解Transformer的泛化能力提供了强有力的预测工具。

链接: https://arxiv.org/abs/2505.20278
作者: Hoyeon Chang,Jinho Park,Hanseul Cho,Sohee Yang,Miyoung Ko,Hyeonbin Hwang,Seungpil Won,Dohaeng Lee,Youbin Ahn,Minjoon Seo
机构: KAIST(韩国科学技术院); UCL(伦敦大学学院); LG AI Research(LG人工智能研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models excel at pattern matching, yet often fall short in systematic compositional generalization. We propose the coverage principle: a data-centric framework showing that models relying primarily on pattern matching for compositional tasks cannot reliably generalize beyond substituting fragments that yield identical results when used in the same contexts. We demonstrate that this framework has a strong predictive power for the generalization capabilities of Transformers. First, we derive and empirically confirm that the training data required for two-hop generalization grows at least quadratically with the token set size, and the training data efficiency does not improve with 20x parameter scaling. Second, for compositional tasks with path ambiguity where one variable affects the output through multiple computational paths, we show that Transformers learn context-dependent state representations that undermine both performance and interoperability. Third, Chain-of-Thought supervision improves training data efficiency for multi-hop tasks but still struggles with path ambiguity. Finally, we outline a \emphmechanism-based taxonomy that distinguishes three ways neural networks can generalize: structure-based (bounded by coverage), property-based (leveraging algebraic invariances), and shared-operator (through function reuse). This conceptual lens contextualizes our results and highlights where new architectural ideas are needed to achieve systematic compositionally. Overall, the coverage principle provides a unified lens for understanding compositional reasoning, and underscores the need for fundamental architectural or training innovations to achieve truly systematic compositionality.
zh

[NLP-10] OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction

【速读】: 该论文试图解决现有角色扮演代理(Role-Playing Agents, RPAs)在交互过程中忽视角色语音特征(如语音风格和情感)的问题,从而导致交互体验不够沉浸。解决方案的关键在于提出OmniCharacter,这是一个首个无缝语音-语言个性交互模型,能够使代理在交互过程中持续展现特定的角色个性和语音特征,并实现语音与语言响应的混合输出。

链接: https://arxiv.org/abs/2505.20277
作者: Haonan Zhang,Run Luo,Xiong Liu,Yuchuan Wu,Ting-En Lin,Pengpeng Zeng,Qiang Qu,Feiteng Fang,Min Yang,Lianli Gao,Jingkuan Song,Fei Huang,Yongbin Li
机构: Tongji University (同济大学); Tongyi Laboratory (通义实验室); University of Chinese Academy of Sciences (中国科学院大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Role-Playing Agents (RPAs), benefiting from large language models, is an emerging interactive AI system that simulates roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role’s voice traits (e.g., voice style and emotions) as playing a crucial effect in interaction, which tends to be more immersive experiences in realistic scenarios. Towards this goal, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at this https URL.
zh

[NLP-11] Does quantization affect models performance on long-context tasks?

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在支持长上下文输入(64K tokens)和生成长文本输出时,因扩展上下文窗口导致的高内存消耗和推理延迟问题。其解决方案的关键在于系统性评估不同量化方法对模型性能的影响,以寻找在保持较高准确率的同时降低计算成本的可行路径。研究涵盖了多种量化方法(如FP8、GPTQ-int8、AWQ-int4、GPTQ-int4、BNB-nf4)和多个模型(Llama-3.1 8B和70B;Qwen-2.5 7B、32B和72B),并发现8位量化可基本保持模型准确性,而4位量化则会导致显著性能下降,尤其在处理长上下文任务和非英语语言时更为明显。

链接: https://arxiv.org/abs/2505.20276
作者: Anmol Mekala,Anirudh Atmakuru,Yixiao Song,Marzena Karpinska,Mohit Iyyer
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages of content with 9 figures. 37 remaining pages of references and supplementary with 17 figures. Under review as of May 26

点击查看摘要

Abstract:Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long-inputs (64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.
zh

[NLP-12] We Need to Measure Data Diversity in NLP – Better and Broader

【速读】: 该论文试图解决如何有效测量自然语言处理(Natural Language Processing, NLP)数据集多样性的问题,尽管该领域对数据集多样性已逐渐关注,但衡量方法仍缺乏深入研究。论文认为,解决这一问题的关键在于引入跨学科视角,以发展更加细致且有效的测量方法。

链接: https://arxiv.org/abs/2505.20264
作者: Dong Nguyen,Esther Ploeger
机构: Utrecht University (乌得勒支大学); Aalborg University (奥尔堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.
zh

[NLP-13] Lifelong Safety Alignment for Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在部署过程中面临的未知 jailbreaking 攻击问题,这类攻击能够绕过安全对齐机制。其解决方案的关键在于提出一种持续安全对齐框架,通过元攻击者(Meta-Attacker)与防御者(Defender)之间的对抗训练,使模型能够不断适应新的和演化的 jailbreaking 策略。该框架通过迭代训练,显著提升了防御能力,最终将攻击者的成功率降低至7%。

链接: https://arxiv.org/abs/2505.20259
作者: Haoyu Wang,Zeyu Qin,Yifei Zhao,Chao Du,Min Lin,Xueqian Wang,Tianyu Pang
机构: Sea AI Lab (海AI实验室); Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker’s success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at this https URL.
zh

[NLP-14] ARM: Adaptive Reasoning Model

【速读】: 该论文试图解决大型推理模型在复杂任务中缺乏根据任务难度调整推理令牌使用能力的问题,这一问题常导致“过度思考”现象,即产生过多且不必要的推理过程。解决方案的关键是提出自适应推理模型(Adaptive Reasoning Model, ARM),该模型能够根据任务动态选择合适的推理格式,包括三种高效的格式(Direct Answer、Short CoT、Code)和一种更复杂的格式(Long CoT)。为训练ARM,作者引入了Ada-GRPO,这是对传统Group Relative Policy Optimization (GRPO)的改进,解决了格式坍塌问题,使ARM在保持与仅使用Long CoT模型相当性能的同时,平均减少30%、最高减少70%的令牌使用量,并提升了推理和训练效率。

链接: https://arxiv.org/abs/2505.20258
作者: Siye Wu,Jian Xie,Yikai Zhang,Aili Chen,Kai Zhang,Yu Su,Yanghua Xiao
机构: Fudan University (复旦大学); The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the “overthinking” problem – excessive and unnecessary reasoning – which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones – Direct Answer, Short CoT, and Code – as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens – ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
zh

[NLP-15] Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

【速读】: 该论文试图解决稀疏自编码器(Sparse Autoencoders, SAEs)在机制可解释性(mechanistic interpretability, MI)中因不同训练运行间学习到的特征不一致而影响可解释性和研究效率的问题。解决方案的关键在于优先考虑SAE中的特征一致性,即在独立训练运行中可靠地收敛到等价的特征集,并提出使用成对字典均值相关系数(Pairwise Dictionary Mean Correlation Coefficient, PW-MCC)作为衡量一致性的实用指标,实验证明通过合理的架构选择可以实现较高的特征一致性(如TopK SAE在大语言模型(LLM)激活上的PW-MCC达到0.80)。

链接: https://arxiv.org/abs/2505.20254
作者: Xiangchen Song,Aashiq Muhamed,Yujia Zheng,Lingjing Kong,Zeyu Tang,Mona T. Diab,Virginia Smith,Kun Zhang
机构: Carnegie Mellon University (卡内基梅隆大学); MBZUAI (MBZUAI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs – the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
zh

[NLP-16] Learning Extrapolative Sequence Transformations from Markov Chains

【速读】: 该论文试图解决在深度学习中,当任务需要在训练数据之外进行外推时(如生物序列设计),传统随机搜索方法(如马尔可夫链蒙特卡洛,Markov chain Monte Carlo, MCMC)在探索大规模结构化状态空间时效率不足的问题。解决方案的关键在于利用MCMC搜索产生的马尔可夫链中的选定状态作为训练数据,训练一个自回归模型,该模型能够高效生成在外推序列层面属性上表现优异的新序列,从而实现更优的外推效果和更高的样本效率。

链接: https://arxiv.org/abs/2505.20251
作者: Sophia Hager,Aleem Khan,Andrew Wang,Nicholas Andrews
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: To be published at the Forty-Second International Conference on Machine Learning

点击查看摘要

Abstract:Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that \emphextrapolate beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations to approximate a target density that rewards states with the desired properties. However, even with a well-designed proposal, MCMC may struggle to explore large structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive model, which is then able to efficiently generate novel sequences that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency.
zh

[NLP-17] WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models ACL2025

【速读】: 该论文试图解决气候变迁适应中对极端天气影响社会理解的难题,尤其是由于高质量语料库收集困难和缺乏可用基准而导致的大型语言模型(Large Language Models, LLMs)有效性研究不足的问题。解决方案的关键在于构建一个四阶段精心设计的破坏性天气影响数据集,并提出首个用于评估LLMs在破坏性天气影响方面能力的基准——WXImpactBench。该基准包含多标签分类和基于排序的问题回答两个评估任务,旨在深入分析开发破坏性天气影响理解和气候变迁适应系统所面临的挑战。

链接: https://arxiv.org/abs/2505.20249
作者: Yongan Yu,Qingchen Hu,Xianda Du,Jiayin Wang,Fengran Mo,Renee Sieber
机构: McGill University (麦吉尔大学); University of Waterloo (滑铁卢大学); Tsinghua University (清华大学); University of Montreal (蒙特利尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025

点击查看摘要

Abstract:Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
zh

[NLP-18] On Path to Multimodal Historical Reasoning : HistBench and HistAgent

【速读】: 该论文试图解决大型语言模型(LLMs)在人文学科,尤其是历史领域中的推理能力不足的问题。现有通用代理在多数基准测试中表现良好,但缺乏处理历史材料和问题所需的领域专业知识。解决方案的关键是引入HistBench,一个包含414个高质量问题的新基准,用于评估AI的历史推理能力,并开发HistAgent,一个专为历史设计的代理,配备了针对OCR、翻译、档案检索和图像理解的定制工具。实验结果表明,HistAgent在HistBench上的表现显著优于其他LLMs和通用代理。

链接: https://arxiv.org/abs/2505.20246
作者: Jiahao Qiu,Fulian Xiao,Yimin Wang,Yuchen Mao,Yijia Chen,Xinzhe Juan,Siran Wang,Xuan Qi,Tongcheng Zhang,Zixin Yao,Jiacheng Guo,Yifu Lu,Charles Argon,Jundi Cui,Daixin Chen,Junran Zhou,Shuyao Zhou,Zhanpeng Zhou,Ling Yang,Shilong Liu,Hongru Wang,Kaixuan Huang,Xun Jiang,Yuming Cao,Yue Chen,Yunfei Chen,Zhengyi Chen,Ruowei Dai,Mengqiu Deng,Jiye Fu,Yunting Gu,Zijie Guan,Zirui Huang,Xiaoyan Ji,Yumeng Jiang,Delong Kong,Haolong Li,Jiaqi Li,Ruipeng Li,Tianze Li,Zhuoran Li,Haixia Lian,Mengyue Lin,Xudong Liu,Jiayi Lu,Jinghan Lu,Wanyu Luo,Ziyue Luo,Zihao Pu,Zhi Qiao,Ruihuan Ren,Liang Wan,Ruixiang Wang,Tianhui Wang,Yang Wang,Zeyu Wang,Zihua Wang,Yujia Wu,Zhaoyi Wu,Hao Xin,Weiao Xing,Ruojun Xiong,Weijie Xu,Yao Shu,Xiao Yao,Xiaorui Yang,Yuchen Yang,Nan Yi,Jiadong Yu,Yangyuxuan Yu,Huiting Zeng,Danni Zhang,Yunjie Zhang,Zhaoyu Zhang,Zhiheng Zhang,Xiaofeng Zheng,Peirong Zhou,Linyan Zhong,Xiaoyin Zong,Ying Zhao,Zhenxin Chen,Lin Ding,Xiaoyu Gao,Bingbing Gong,Yichao Li,Yang Liao,Guang Ma,Tianyuan Ma,Xinrui Sun,Tianyi Wang,Han Xia,Ruobing Xian,Gen Ye,Tengfei Yu,Wentao Zhang,Yuxi Wang,Xi Gao,Mengdi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have led to remarkable progress across domains, yet their capabilities in the humanities, particularly history, remain underexplored. Historical reasoning poses unique challenges for AI, involving multimodal source interpretation, temporal inference, and cross-linguistic analysis. While general-purpose agents perform well on many existing benchmarks, they lack the domain-specific expertise required to engage with historical materials and questions. To address this gap, we introduce HistBench, a new benchmark of 414 high-quality questions designed to evaluate AI’s capacity for historical reasoning and authored by more than 40 expert contributors. The tasks span a wide range of historical problems-from factual retrieval based on primary sources to interpretive analysis of manuscripts and images, to interdisciplinary challenges involving archaeology, linguistics, or cultural history. Furthermore, the benchmark dataset spans 29 ancient and modern languages and covers a wide range of historical periods and world regions. Finding the poor performance of LLMs and other agents on HistBench, we further present HistAgent, a history-specific agent equipped with carefully designed tools for OCR, translation, archival search, and image understanding in History. On HistBench, HistAgent based on GPT-4o achieves an accuracy of 27.54% pass@1 and 36.47% pass@2, significantly outperforming LLMs with online search and generalist agents, including GPT-4o (18.60%), DeepSeek-R1(14.49%) and Open Deep Research-smolagents(20.29% pass@1 and 25.12% pass@2). These results highlight the limitations of existing LLMs and generalist agents and demonstrate the advantages of HistAgent for historical reasoning.
zh

[NLP-19] KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing KDD2025

【速读】: 该论文旨在解决检索增强生成(RAG)框架中因上下文过载导致的大语言模型(LLM)推理效率下降问题,以及由此引发的多跳问答任务中高质量多步骤推理难以持续提升的问题。其解决方案的关键在于提出KnowTrace框架,该框架通过自主追踪并组织与输入问题相关的知识三元组,构建特定的知识图谱,从而有效缓解上下文负担,并通过知识回溯机制实现对LLM生成过程的自我监督与优化。

链接: https://arxiv.org/abs/2505.20245
作者: Rui Li,Quanyu Dai,Zeyu Zhang,Xu Chen,Zhenhua Dong,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学人工智能学院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2025

点击查看摘要

Abstract:Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to handle complex multi-hop questions. These methods typically alternate between LLM reasoning and retrieval to accumulate external information into the LLM’s context. However, the ever-growing context inherently imposes an increasing burden on the LLM to perceive connections among critical information pieces, with futile reasoning steps further exacerbating this overload issue. In this paper, we present KnowTrace, an elegant RAG framework to (1) mitigate the context overload and (2) bootstrap higher-quality multi-step reasoning. Instead of simply piling the retrieved contents, KnowTrace autonomously traces out desired knowledge triplets to organize a specific knowledge graph relevant to the input question. Such a structured workflow not only empowers the LLM with an intelligible context for inference, but also naturally inspires a reflective mechanism of knowledge backtracing to identify contributive LLM generations as process supervision data for self-bootstrapping. Extensive experiments show that KnowTrace consistently surpasses existing methods across three multi-hop question answering benchmarks, and the bootstrapped version further amplifies the gains.
zh

[NLP-20] Its High Time: A Survey of Temporal Information Retrieval and Question Answering

【速读】: 该论文试图解决时间敏感信息的处理与理解问题,具体包括时间信息检索(Temporal Information Retrieval)和时间问答(Temporal Question Answering)中的挑战,如检测时间意图、标准化时间表达、事件排序以及对动态或模糊事实进行推理。解决方案的关键在于结合传统方法与现代神经网络技术,特别是基于Transformer模型和大型语言模型(Large Language Models, LLMs)的方法,并融合时间语言建模、多跳推理及检索增强生成(Retrieval-Augmented Generation, RAG)等先进技术,以提升系统在时间鲁棒性、时效性意识和泛化能力方面的表现。

链接: https://arxiv.org/abs/2505.20243
作者: Bhawna Piryani,Abdelrahman Abdullah,Jamshid Mozafari,Avishek Anand,Adam Jatowt
机构: University of Innsbruck (因斯布鲁克大学); TU Delft (代尔夫特理工大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Information Retrieval and Temporal Question Answering, two research areas aimed at handling and understanding time-sensitive information. As the amount of time-stamped content from sources like news articles, web archives, and knowledge bases increases, systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. These challenges are critical across many dynamic and time-sensitive domains, from news and encyclopedias to science, history, and social media. We review both traditional approaches and modern neural methods, including those that use transformer models and Large Language Models (LLMs). We also review recent advances in temporal language modeling, multi-hop reasoning, and retrieval-augmented generation (RAG), alongside benchmark datasets and evaluation strategies that test temporal robustness, recency awareness, and generalization.
zh

[NLP-21] Efficient Speech Translation through Model Compression and Knowledge Distillation

【速读】: 该论文旨在解决大规模音频-语言模型在语音翻译任务中高效部署的问题,主要挑战在于其显著的计算需求。解决方案的关键在于采用多种模型压缩技术的组合,包括基于层重要性评估的迭代层剪枝、4比特量化低秩适应(QLoRA)以及知识蒸馏。通过这些方法,实验中使用的Qwen2-Audio-7B-Instruct模型在保持97-100%领域内教师模型翻译质量的前提下,实现了模型参数和存储占用量最多50%的减少。

链接: https://arxiv.org/abs/2505.20237
作者: Yasmin Moslem
机构: Trinity College Dublin (都柏林三一学院)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: IWSLT 2025

点击查看摘要

Abstract:Efficient deployment of large audio-language models for speech translation remains challenging due to their significant computational requirements. In this paper, we address this challenge through our system submissions to the “Model Compression” track at the International Conference on Spoken Language Translation (IWSLT 2025). We experiment with a combination of approaches including iterative layer pruning based on layer importance evaluation, low-rank adaptation with 4-bit quantization (QLoRA), and knowledge distillation. In our experiments, we use Qwen2-Audio-7B-Instruct for speech translation into German and Chinese. Our pruned (student) models achieve up to a 50% reduction in both model parameters and storage footprint, while retaining 97-100% of the translation quality of the in-domain (teacher) models.
zh

[NLP-22] Bridging the Long-Term Gap: A Memory-Active Policy for Multi-Session Task-Oriented Dialogue

【速读】: 该论文试图解决现有任务导向对话(Task-Oriented Dialogue, TOD)系统在多轮会话中长期记忆增强方面的局限性,即当前系统主要关注单次会话,难以有效支持跨会话的长期记忆保留。解决方案的关键在于提出一种基于多会话TOD数据集(MS-TOD)的Memory-Active Policy(MAP),其核心包括两个阶段:1)通过记忆引导的对话规划检索对齐意图的历史信息,利用记忆判断器识别关键问答单元,并通过去除冗余问题进行精炼,最终基于重构的记忆生成响应;2)主动响应策略用于检测并纠正错误或遗漏,确保任务高效准确完成。

链接: https://arxiv.org/abs/2505.20231
作者: Yiming Du,Bingbing Wang,Yang He,Bin Liang,Baojun Wang,Zhongyang Li,Lin Gui,Jeff Z. Pan,Ruifeng Xu,Kam-Fai Wong
机构: The Chinese University of Hong Kong (香港中文大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); HKUST (香港科技大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); King’s College London (伦敦国王学院); The University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing Task-Oriented Dialogue (TOD) systems primarily focus on single-session dialogues, limiting their effectiveness in long-term memory augmentation. To address this challenge, we introduce a MS-TOD dataset, the first multi-session TOD dataset designed to retain long-term memory across sessions, enabling fewer turns and more efficient task completion. This defines a new benchmark task for evaluating long-term memory in multi-session TOD. Based on this new dataset, we propose a Memory-Active Policy (MAP) that improves multi-session dialogue efficiency through a two-stage approach. 1) Memory-Guided Dialogue Planning retrieves intent-aligned history, identifies key QA units via a memory judger, refines them by removing redundant questions, and generates responses based on the reconstructed memory. 2) Proactive Response Strategy detects and correct errors or omissions, ensuring efficient and accurate task completion. We evaluate MAP on MS-TOD dataset, focusing on response quality and effectiveness of the proactive strategy. Experiments on MS-TOD demonstrate that MAP significantly improves task success and turn efficiency in multi-session scenarios, while maintaining competitive performance on conventional single-session tasks.
zh

[NLP-23] FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

【速读】: 该论文试图解决学术界缺乏一个完全开放、端到端的Mixture-of-Experts (MoE)平台以研究模型扩展、路由机制和专家行为的问题。其解决方案的关键在于发布FLAME-MoE,这是一个由七个仅解码器模型组成的开源研究套件,参数规模从38M到1.7B不等,其架构(64个专家、top-8门控机制和2个共享专家)紧密反映了现代生产级大语言模型(LLM)。该平台提供了完整的训练数据流水线、脚本、日志和检查点,以支持可复现的实验,并通过全面的训练追踪透明性进行初步分析。

链接: https://arxiv.org/abs/2505.20225
作者: Hao Kang,Zichun Yu,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: All code, training logs, and model checkpoints are available at this https URL

点击查看摘要

Abstract:Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture–64 experts with top-8 gating and 2 shared experts–closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at this https URL.
zh

[NLP-24] Dependency Parsing is More Parameter-Efficient with Normalization

【速读】: 该论文试图解决依赖解析(dependency parsing)中由于缺乏归一化导致的模型过参数化问题,这使得模型需要额外的参数来补偿由高方差输入产生的尖锐softmax输出。解决方案的关键在于对双仿射评分(biaffine scoring)进行分数归一化,从而提升模型效率并取得更优的解析性能。实验表明,归一化后模型在两个数据集上超越了当前最先进方法,且使用更少的样本和可训练参数。

链接: https://arxiv.org/abs/2505.20215
作者: Paolo Gajo,Domenic Rosati,Hassan Sajjad,Alberto Barrón-Cedeño
机构: University of Bologna(博洛尼亚大学); Dalhousie University(达尔豪斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on six datasets for semantic and syntactic dependency parsing using a one-hop parser. We train N-layer stacked BiLSTMs and evaluate the parser’s performance with and without normalizing biaffine scores. Normalizing allows us to beat the state of the art on two datasets, with fewer samples and trainable parameters. Code: this https URL
zh

[NLP-25] How to Improve the Robustness of Closed-Source Models on NLI

【速读】: 该论文试图解决封闭源代码大型语言模型(Large Language Models, LLMs)在分布外(out-of-distribution, OOD)数据上的鲁棒性不足的问题。现有方法要么效果不佳,要么无法应用于封闭源代码模型,因为它们依赖于对模型内部结构的访问或对训练过程的修改。论文提出的解决方案关键在于采用数据驱动的方法,无需访问模型内部结构,通过调整训练数据来提升模型的鲁棒性,具体策略根据OOD数据的复杂度进行选择。

链接: https://arxiv.org/abs/2505.20209
作者: Joe Stacey,Lisa Alazraki,Aran Ubhi,Beyza Ermis,Aaron Mueller,Marek Rei
机构: Imperial College London (帝国理工学院); Cohere Labs (Cohere 实验室); Northeastern University (东北大学); Technion – IIT (以色列理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Closed-source Large Language Models (LLMs) have become increasingly popular, with impressive performance across a wide range of natural language tasks. These models can be fine-tuned to further improve performance, but this often results in the models learning from dataset-specific heuristics that reduce their robustness on out-of-distribution (OOD) data. Existing methods to improve robustness either perform poorly, or are non-applicable to closed-source models because they assume access to model internals, or the ability to change the model’s training procedure. In this work, we investigate strategies to improve the robustness of closed-source LLMs through data-centric methods that do not require access to model internals. We find that the optimal strategy depends on the complexity of the OOD data. For highly complex OOD datasets, upsampling more challenging training examples can improve robustness by up to 1.5%. For less complex OOD datasets, replacing a portion of the training set with LLM-generated examples can improve robustness by 3.7%. More broadly, we find that large-scale closed-source autoregressive LLMs are substantially more robust than commonly used encoder models, and are a more appropriate choice of baseline going forward.
zh

[NLP-26] Reasoning Is Not All You Need: Examining LLM s for Multi-Turn Mental Health Conversations

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在多轮心理健康对话中的患者中心沟通能力和高级诊断能力不足的问题。现有评估框架往往侧重于诊断准确性和胜率,而忽视了与患者特定目标、价值观和个性相匹配的深度对话需求。解决方案的关键在于提出MedAgent框架,用于合成生成真实且多轮的心理健康意义构建对话,并构建MHSD数据集;同时引入MultiSenseEval评估框架,从以人类为中心的标准出发,全面评估LLMs在医疗场景中的多轮对话能力。

链接: https://arxiv.org/abs/2505.20201
作者: Mohit Chandra,Siddharth Sriraman,Harneet Singh Khanuja,Yiqiao Jin,Munmun De Choudhury
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: 33 pages, 5 figures, 30 tables

点击查看摘要

Abstract:Limited access to mental healthcare, extended wait times, and increasing capabilities of Large Language Models (LLMs) has led individuals to turn to LLMs for fulfilling their mental health needs. However, examining the multi-turn mental health conversation capabilities of LLMs remains under-explored. Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient-LLM conversations. Additionally, we present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings using human-centric criteria. Our findings reveal that frontier reasoning models yield below-par performance for patient-centric communication and struggle at advanced diagnostic capabilities with average score of 31%. Additionally, we observed variation in model performance based on patient’s persona and performance drop with increasing turns in the conversation. Our work provides a comprehensive synthetic data generation framework, a dataset and evaluation framework for assessing LLMs in multi-turn mental health conversations.
zh

[NLP-27] Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking

【速读】: 该论文试图解决传统Classifier-Free Guidance (CFG)在迭代生成过程中由于使用静态无条件输入而导致的次优问题,特别是在模型不确定性动态变化的情况下。解决方案的关键在于引入自适应的无条件输入,通过利用模型的即时预测置信度来动态调整无条件输入。具体而言,在每一步迭代中,A-CFG识别当前生成序列中模型置信度较低的标记,并暂时对其进行重新掩码,从而生成局部动态的无条件输入,使CFG的校正作用更精确地聚焦于模糊区域,提升生成效果。

链接: https://arxiv.org/abs/2505.20199
作者: Pengxiang Li,Shilin Yan,Joey Tsai,Renrui Zhang,Ruichuan An,Ziyu Guo,Xiaowei Gao
机构: PolyU(香港理工大学); FDU(福州大学); THU(清华大学); CUHK(香港中文大学); PKU(北京大学); ICL(帝国理工学院)
类目: Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model’s instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG’s corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.
zh

[NLP-28] Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning

【速读】: 该论文试图解决长文本生成质量评估的问题,尤其是在输入长度增加时,基于大语言模型的评估方法性能会下降。其解决方案的关键在于采用一种分而治之的方法,将整体评估任务分解为一系列局部评分任务,随后进行全局评估,从而实现更细致和可控的评估过程。此外,引入了一种结合人类标注的混合上下文学习方法,以提升局部和全局评估的性能,并通过基于不确定性的主动学习算法高效选择需要人工标注的数据样本,降低标注成本。

链接: https://arxiv.org/abs/2505.20195
作者: Xiaorong Wang,Ting Yang,Zhu Zhang,Shuo Wang,Zihan Zhou,Liner Yang,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Beijing Jiaotong University (北京交通大学); Beijing University of Posts and Telecommunications (北京邮电大学); Xiamen University (厦门大学); Beijing Language and Culture University (北京语言大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
zh

[NLP-29] HiNK: Can Large Language Models Think-aloud?

【速读】: 该论文试图解决在大规模语言模型(LLMs)中评估高阶思维能力(higher-order thinking skills)的挑战,特别是在超越表面准确性任务中的表现评估。其解决方案的关键在于提出THiNK(Testing Higher-order Notion of Knowledge)框架,该框架基于布卢姆分类法(Bloom’s Taxonomy),采用多智能体、反馈驱动的方式,将推理评估视为一个迭代的问题生成、批判与修订过程,从而系统性地评估低阶(如记忆、理解)和高阶(如评价、创造)思维能力。

链接: https://arxiv.org/abs/2505.20184
作者: Yongan Yu,Mengqian Wu,Yiran Lin,Nikki G. Lobczowski
机构: McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom’s Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think-aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models reliably perform lower-order categories well, they struggle with applying knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. The code of our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science, which is available at our GitHub repository.
zh

[NLP-30] “KAN you hear me?” Exploring Kolmogorov-Arnold Networks for Spoken Language Understanding INTERSPEECH2025

【速读】: 该论文试图解决传统神经网络架构在语音处理领域,特别是口语语言理解(Spoken Language Understanding, SLU)任务中的性能局限问题,探索Kolmogorov-Arnold Networks (KANs)作为替代方案的潜力。其解决方案的关键在于将KAN层集成到2D-CNN模型和基于Transformer的模型中,并通过不同配置验证其有效性,最终发现将KAN层置于两个线性层之间能够有效替代线性层,在多数情况下实现相当或更优的性能。

链接: https://arxiv.org/abs/2505.20176
作者: Alkis Koudounas,Moreno La Quatra,Eliana Pastor,Sabato Marco Siniscalchi,Elena Baralis
机构: Politecnico di Torino (都灵理工大学); Kore University of Enna (恩纳科雷大学); Università degli Studi di Palermo (巴勒莫大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a promising alternative to traditional neural architectures, yet their application to speech processing remains under explored. This work presents the first investigation of KANs for Spoken Language Understanding (SLU) tasks. We experiment with 2D-CNN models on two datasets, integrating KAN layers in five different configurations within the dense block. The best-performing setup, which places a KAN layer between two linear layers, is directly applied to transformer-based models and evaluated on five SLU datasets with increasing complexity. Our results show that KAN layers can effectively replace the linear layers, achieving comparable or superior performance in most cases. Finally, we provide insights into how KAN and linear layers on top of transformers differently attend to input regions of the raw waveforms.
zh

[NLP-31] Visual Abstract Thinking Empowers Multimodal Reasoning

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理图像与文本融合任务时,因图像中冗余信息导致的多模态推理性能下降问题。其解决方案的关键在于引入视觉抽象思维(Visual Abstract Thinking, VAT),通过使用视觉抽象代替显式语言思考或复杂引导,减少冗余视觉信息,使模型更专注于关键视觉元素进行推理,从而提升模型在概念性、结构性和关系性推理任务中的表现。

链接: https://arxiv.org/abs/2505.20164
作者: Dairu Liu,Ziyue Wang,Minyuan Ruan,Fuwen Luo,Chi Chen,Peng Li,Yang Liu
机构: Tsinghua University (清华大学); Nankai University (南开大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Images usually convey richer detail than text, but often include redundant information which potentially downgrades multimodal reasoning performance. When faced with lengthy or complex messages, humans tend to employ abstract thinking to convert them into simple and concise abstracts. Inspired by this cognitive strategy, we introduce Visual Abstract Thinking (VAT), a novel thinking paradigm that prompts Multimodal Large Language Models (MLLMs) with visual abstract instead of explicit verbal thoughts or elaborate guidance, permitting a more concentrated visual reasoning mechanism. Explicit thinking, such as Chain-of-thought (CoT) or tool-augmented approaches, increases the complexity of reasoning process via inserting verbose intermediate steps, external knowledge or visual information. In contrast, VAT reduces redundant visual information and encourages models to focus their reasoning on more essential visual elements. Experimental results show that VAT consistently empowers different models, and achieves an average gain of 17% over GPT-4o baseline by employing diverse types of visual abstracts, demonstrating that VAT can enhance visual reasoning abilities for MLLMs regarding conceptual, structural and relational reasoning tasks. VAT is also compatible with CoT in knowledge-intensive multimodal reasoning tasks. These findings highlight the effectiveness of visual reasoning via abstract thinking and encourage further exploration of more diverse reasoning paradigms from the perspective of human cognition.
zh

[NLP-32] Exploring Generative Error Correction for Dysarthric Speech Recognition INTERSPEECH2025

【速读】: 该论文试图解决在端到端自动语音识别(ASR)引擎取得显著进展的背景下,准确转录构音障碍语音(dysarthric speech)仍是一个重大挑战的问题。其解决方案的关键在于提出一种两阶段框架,结合先进的语音识别模型与基于大语言模型(LLM)的生成式错误纠正(GER),并通过不同的模型规模配置和训练策略进行优化,同时引入特定的假设选择机制以提升转录准确性。

链接: https://arxiv.org/abs/2505.20163
作者: Moreno La Quatra,Alkis Koudounas,Valerio Mario Salerno,Sabato Marco Siniscalchi
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Despite the remarkable progress in end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we proposed a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating specific hypothesis selection to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition
zh

[NLP-33] Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

【速读】: 该论文试图解决语言模型(Language Models, LMs)在训练数据多样性与模型泛化能力之间关系的量化与提升问题,即如何准确衡量并增强训练数据的多样性以促进模型在未见分布数据上的表现。其解决方案的关键在于提出一种新的多样性度量方法——G-Vendi,该方法通过模型诱导梯度的熵来量化数据多样性,并证明其与模型在分布外(Out-of-Distribution, OOD)任务上的性能具有强相关性。基于此,作者进一步提出了Prismatic Synthesis框架,通过靶向梯度空间中低频区域生成多样化的合成数据,从而有效提升模型性能。

链接: https://arxiv.org/abs/2505.20161
作者: Jaehun Jung,Seungju Han,Ximing Lu,Skyler Hallinan,David Acuna,Shrimai Prabhumoye,Mostafa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Yejin Choi
机构: NVIDIA Research (NVIDIA 研究院); University of Washington (华盛顿大学); University of Southern California (南加利福尼亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective generalization in language models depends critically on the diversity of their training data. Yet existing diversity metrics often fall short of this goal, relying on surface-level heuristics that are decoupled from model behavior. This motivates us to ask: What kind of diversity in training data actually drives generalization in language models – and how can we measure and amplify it? Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning – as measured by average model performance on unseen out-of-distribution benchmarks. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. Despite using a small off-the-shelf proxy model for gradients, G-Vendi consistently outperforms alternative measures, achieving strong correlation (Spearman’s \rho \approx 0.9 ) with out-of-distribution (OOD) performance on both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present Prismatic Synthesis, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data – not just on in-distribution test but across unseen, out-of-distribution benchmarks – significantly outperforming state-of-the-art models that rely on 20 times larger data generator than ours. For example, PrismMath-7B, our model distilled from a 32B LLM, outperforms R1-Distill-Qwen-7B – the same base model trained on proprietary data generated by 671B R1 – on 6 out of 7 challenging benchmarks.
zh

[NLP-34] Pangu Light: Weight Re-Initialization for Pruning and Accelerating LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中因模型规模庞大和推理成本高而面临的计算挑战。现有结构化剪枝方法在进行激进的宽度与深度联合剪枝时,常导致性能显著下降。论文提出的关键解决方案是通过战略性的剩余权重重新初始化和调整,以提升剪枝后模型的训练准确性,从而使得激进的联合剪枝成为可行。该方案的核心在于引入了如Cross-Layer Attention Pruning (CLAP)和Stabilized LayerNorm Pruning (SLNP)等新型权重重新初始化技术,有效缓解了性能下降问题。

链接: https://arxiv.org/abs/2505.20155
作者: Hanting Chen,Jiarui Qin,Jialong Guo,Tao Yuan,Yichun Yin,Huiling Zhen,Yasheng Wang,Jinpeng Li,Xiaojun Meng,Meng Zhang,Rongju Ruan,Zheyuan Bai,Yehui Tang,Can Chen,Xinghao Chen,Fisher Yu,Ruiming Tang,Yunhe Wang(and Other Contributors)
机构: Huawei(华为)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) deliver state-of-the-art capabilities across numerous tasks, but their immense size and inference costs pose significant computational challenges for practical deployment. While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation. This paper argues that a critical, often overlooked, aspect in making such aggressive joint pruning viable is the strategic re-initialization and adjustment of remaining weights to improve the model post-pruning training accuracies. We introduce Pangu Light, a framework for LLM acceleration centered around structured pruning coupled with novel weight re-initialization techniques designed to address this ``missing piece’'. Our framework systematically targets multiple axes, including model width, depth, attention heads, and RMSNorm, with its effectiveness rooted in novel re-initialization methods like Cross-Layer Attention Pruning (CLAP) and Stabilized LayerNorm Pruning (SLNP) that mitigate performance drops by providing the network a better training starting point. Further enhancing efficiency, Pangu Light incorporates specialized optimizations such as absorbing Post-RMSNorm computations and tailors its strategies to Ascend NPU characteristics. The Pangu Light models consistently exhibit a superior accuracy-efficiency trade-off, outperforming prominent baseline pruning methods like Nemotron and established LLMs like Qwen3 series. For instance, on Ascend NPUs, Pangu Light-32B’s 81.6 average score and 2585 tokens/s throughput exceed Qwen3-32B’s 80.9 average score and 2225 tokens/s.
zh

[NLP-35] UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter-Efficient Fine-Tuning of Large Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在微调过程中参数效率与计算存储效率不足的问题。其解决方案的关键在于提出一种名为Uniform Orthogonal Reinitialization Adaptation (UORA)的新颖参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,该方法通过低秩近似技术减少可训练参数数量,并采用基于插值的重参数化机制,根据向量模长启发式策略选择性地重新初始化冻结投影矩阵中的行和列,从而在保持性能的同时显著降低计算与存储开销。

链接: https://arxiv.org/abs/2505.20154
作者: Xueyan Zhang,Jinman Zhao,Zhifei Yang,Yibo Zhong,Shuhao Guan,Linbo Cao,Yining Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 2 figures, 15 tables

点击查看摘要

Abstract:This paper introduces Uniform Orthogonal Reinitialization Adaptation (UORA), a novel parameter-efficient fine-tuning (PEFT) approach for Large Language Models (LLMs). UORA achieves state-of-the-art performance and parameter efficiency by leveraging a low-rank approximation method to reduce the number of trainable parameters. Unlike existing methods such as LoRA and VeRA, UORA employs an interpolation-based reparametrization mechanism that selectively reinitializes rows and columns in frozen projection matrices, guided by the vector magnitude heuristic. This results in substantially fewer trainable parameters compared to LoRA and outperforms VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate UORA’s superiority in achieving competitive fine-tuning performance with negligible computational overhead. We demonstrate its performance on GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.
zh

[NLP-36] Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

【速读】: 该论文旨在解决大型多模态模型(LMMs)在几何问题求解等需要精细推理的任务中表现受限的问题,其根源在于对比学习在总结性描述上的固有局限性。解决方案的关键是提出一种新颖的难例对比学习框架,通过结合基于生成的图像难例和基于规则及检索的文本难例,增强视觉编码器对几何理解的能力。该方法在CLIP基础上训练得到MMCLIP,并进一步训练用于几何问题求解的LMM,实验表明该模型在多个几何推理基准上表现优异。

链接: https://arxiv.org/abs/2505.20152
作者: Kai Sun,Yushi Bai,Zhen Yang,Jiajie Zhang,Ji Qi,Lei Hou,Juanzi Li
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Benefiting from contrastively trained visual encoders on large-scale natural scene images, Large Multimodal Models (LMMs) have achieved remarkable performance across various visual perception tasks. However, the inherent limitations of contrastive learning upon summarized descriptions fundamentally restrict the capabilities of models in meticulous reasoning, particularly in crucial scenarios of geometric problem-solving. To enhance geometric understanding, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train CLIP using our strong negative learning method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further study the impact of different negative sample construction methods and the number of negative samples on the geometric reasoning performance of LMM, yielding fruitful conclusions. The code and dataset are available at this https URL.
zh

[NLP-37] SeMe: Training-Free Language Model Merging via Semantic Alignment

【速读】: 该论文旨在解决多语言模型(Language Models, LMs)融合过程中存在的性能不稳定、知识保留不足以及依赖外部数据的问题。现有方法如参数平均和任务引导融合在计算上依赖数据或无法有效保留内部知识,限制了其鲁棒性和可扩展性。论文提出的解决方案——SeMe(Semantic-based Merging),关键在于通过潜在语义对齐实现无需数据和训练的细粒度、分层模型融合,不仅保留模型行为,还显式稳定内部知识,从而填补了语言模型融合中的关键空白。

链接: https://arxiv.org/abs/2505.20144
作者: Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: an early-stage version

点击查看摘要

Abstract:Despite the remarkable capabilities of Language Models (LMs) across diverse tasks, no single model consistently outperforms others, necessitating efficient methods to combine their strengths without expensive retraining. Existing model merging techniques, such as parameter averaging and task-guided fusion, often rely on data-dependent computations or fail to preserve internal knowledge, limiting their robustness and scalability. We introduce SeMe (Semantic-based Merging), a novel, data-free, and training-free approach that leverages latent semantic alignment to merge LMs at a fine-grained, layer-wise level. Unlike prior work, SeMe not only preserves model behaviors but also explicitly stabilizes internal knowledge, addressing a critical gap in LM fusion. Through extensive experiments across diverse architectures and tasks, we demonstrate that SeMe outperforms existing methods in both performance and efficiency while eliminating reliance on external data. Our work establishes a new paradigm for knowledge-aware model merging and provides insights into the semantic structure of LMs, paving the way for more scalable and interpretable model composition.
zh

[NLP-38] StructEval: Benchmarking LLM s Capabilities to Generate Structural Outputs

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成结构化输出方面能力评估不足的问题,特别是针对非渲染型(如JSON、YAML、CSV)和可渲染型(如HTML、React、SVG)结构化格式的生成能力。其解决方案的关键在于提出StructEval,这是一个全面的基准测试框架,通过生成任务和转换任务两种范式系统性地评估模型在不同结构化格式中的结构保真度,并引入新颖的指标以衡量格式合规性和结构正确性。

链接: https://arxiv.org/abs/2505.20139
作者: Jialin Yang,Dongfu Jiang,Lipeng He,Sherman Siu,Yuxuan Zhang,Disen Liao,Zhuofeng Li,Huaye Zeng,Yiming Jia,Haozhe Wang,Benjamin Schneider,Chi Ruan,Wentao Ma,Zhiheng Lyu,Yifei Wang,Yi Lu,Quy Duc Do,Ziyan Jiang,Ping Nie,Wenhu Chen
机构: University of Waterloo(滑铁卢大学); University of Toronto(多伦多大学); HKUST(香港科技大学); Shanghai University(上海大学); Independent Contributor(独立贡献者); Vector Institute(向量研究所)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 9 figures, 13 tables

点击查看摘要

Abstract:As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs’ capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps, even state-of-the-art models like o1-mini achieve only 75.58 average score, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
zh

[NLP-39] AweDist: Attention-aware Embedding Distillation for New Input Token Embeddings

【速读】: 该论文试图解决当前语言模型依赖于预训练时确定的静态词表,导致在原始词表中未充分表示的领域性能下降和计算成本增加的问题。解决方案的关键在于通过蒸馏原始分词获得的表示,快速学习新词的高质量输入嵌入,而无需昂贵的进一步训练或额外模块的预训练。

链接: https://arxiv.org/abs/2505.20133
作者: Konstantin Dobler,Desmond Elliott,Gerard de Melo
机构: Hasso Plattner Institute (哈索普拉特纳研究所); University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods either require expensive further training or pretraining of additional modules. In this paper, we propose AweDist and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that AweDist is able to outperform even strong baselines.
zh

[NLP-40] Iterative Self-Incentivization Empowers Large Language Models as Agent ic Searchers

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂任务中有效获取准确知识的难题,尤其是在多跳查询和检索内容不相关的情况下。其解决方案的关键在于提出EXSEARCH框架,该框架通过一种自激励机制,使LLM在推理过程中学习检索有用信息。具体而言,EXSEARCH在每一步让LLM决定检索内容(思考)、触发外部检索器(搜索)并提取细粒度证据(记录),以支持后续推理。该方法采用广义期望最大化算法,在E-step生成多个搜索轨迹并分配重要性权重,在M-step通过加权损失函数训练LLM,从而形成自激励循环,使LLM不断从自身生成的数据中学习并逐步提升搜索能力。

链接: https://arxiv.org/abs/2505.20128
作者: Zhengliang Shi,Lingyong Yan,Dawei Yin,Suzan Verberne,Maarten de Rijke,Zhaochun Ren
机构: Shandong University (山东大学); Baidu. Inc (百度公司); Leiden University (莱顿大学); University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL)
备注: Working in process

点击查看摘要

Abstract:Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. However, effectively enabling LLMs to seek accurate knowledge in complex tasks remains a challenge due to the complexity of multi-hop queries as well as the irrelevant retrieved content. To address these limitations, we propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds through a self-incentivized process. At each step, the LLM decides what to retrieve (thinking), triggers an external retriever (search), and extracts fine-grained evidence (recording) to support next-step reasoning. To enable LLM with this capability, EXSEARCH adopts a Generalized Expectation-Maximization algorithm. In the E-step, the LLM generates multiple search trajectories and assigns an importance weight to each; the M-step trains the LLM on them with a re-weighted loss function. This creates a self-incentivized loop, where the LLM iteratively learns from its own generated data, progressively improving itself for search. We further theoretically analyze this training process, establishing convergence guarantees. Extensive experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines, e.g., +7.8% improvement on exact match score. Motivated by these promising results, we introduce EXSEARCH-Zoo, an extension that extends our method to broader scenarios, to facilitate future work.
zh

[NLP-41] rojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在敏感工作流中可能泄露机密信息的问题。其解决方案的关键在于提出一种名为TrojanStego的新型威胁模型,该模型通过语言隐写术(linguistic steganography)对LLM进行微调,将敏感上下文信息嵌入到看似正常的输出中,而无需对推理输入进行显式控制。该方法的核心是基于词汇表划分的可学习编码方案,实验结果表明被入侵的模型能够以高准确率传输32位秘密信息,同时保持输出的实用性、隐蔽性和连贯性。

链接: https://arxiv.org/abs/2505.20118
作者: Dominik Meier,Jan Philip Wahle,Paul Röttger,Terry Ruas,Bela Gipp
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
zh

[NLP-42] Named Entity Recognition in Historical Italian: The Case of Giacomo Leopardis Zibaldone

【速读】: 该论文试图解决历史文本中命名实体识别(Named Entity Recognition, NER)的挑战,特别是在处理19世纪意大利学术笔记等具有拼写变异、结构碎片化和数字化错误的文本时,现有计算技术适应性不足的问题。解决方案的关键在于构建一个基于贾科莫·莱奥帕尔迪《札记》(Zibaldone)的新型挑战性数据集,其中包含2,899条人物、地点和文学作品的引用,并利用领域特定的BERT模型和先进的大语言模型(Large Language Models, LLMs)进行可重复实验,以评估不同模型在处理历史人文文本中的性能差异。

链接: https://arxiv.org/abs/2505.20113
作者: Cristian Santini,Laura Melosi,Emanuele Frontoni
机构: University of Macerata(马切拉塔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increased digitization of world’s textual heritage poses significant challenges for both computer science and literary studies. Overall, there is an urgent need of computational techniques able to adapt to the challenges of historical texts, such as orthographic and spelling variations, fragmentary structure and digitization errors. The rise of large language models (LLMs) has revolutionized natural language processing, suggesting promising applications for Named Entity Recognition (NER) on historical documents. In spite of this, no thorough evaluation has been proposed for Italian texts. This research tries to fill the gap by proposing a new challenging dataset for entity extraction based on a corpus of 19th century scholarly notes, i.e. Giacomo Leopardi’s Zibaldone (1898), containing 2,899 references to people, locations and literary works. This dataset was used to carry out reproducible experiments with both domain-specific BERT-based models and state-of-the-art LLMs such as LLaMa3.1. Results show that instruction-tuned models encounter multiple difficulties handling historical humanistic texts, while fine-tuned NER models offer more robust performance even with challenging entity types such as bibliographic references.
zh

[NLP-43] ResSVD: Residual Compensated SVD for Large Language Model Compression

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中因模型规模和内存需求过大而导致的效率问题,以及现有基于奇异值分解(SVD)的压缩方法因忽略截断后的残差矩阵而产生的显著截断损失和全层压缩导致的性能退化问题。其解决方案的关键在于提出ResSVD方法,通过利用截断过程中生成的残差矩阵来减少截断损失,并在固定整体压缩比的前提下,选择性地压缩模型的最后几层,从而减轻误差传播并显著提升压缩后的模型性能。

链接: https://arxiv.org/abs/2505.20112
作者: Haolei Bai,Siyong Jian,Tuo Liang,Yu Yin,Huan Wang
机构: Westlake University (西湖大学); Nanyang Technological University (南洋理工大学); Nanjing University (南京大学); Case Western Reserve University (凯斯西储大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. Nevertheless, their considerable sizes and memory demands hinder practical deployment, underscoring the importance of developing efficient compression strategies. Singular value decomposition (SVD) decomposes a matrix into orthogonal components, enabling efficient low-rank approximation. This is particularly suitable for LLM compression, where weight matrices often exhibit significant redundancy. However, current SVD-based methods neglect the residual matrix from truncation, resulting in significant truncation loss. Additionally, compressing all layers of the model results in severe performance degradation. To overcome these limitations, we propose ResSVD, a new post-training SVD-based LLM compression method. Specifically, we leverage the residual matrix generated during the truncation process to reduce truncation loss. Moreover, under a fixed overall compression ratio, we selectively compress the last few layers of the model, which mitigates error propagation and significantly improves the performance of compressed this http URL evaluations of ResSVD on diverse LLM families and multiple benchmark datasets indicate that ResSVD consistently achieves superior performance over existing counterpart methods, demonstrating its practical effectiveness.
zh

[NLP-44] Language-Agnostic Suicidal Risk Detection Using Large Language Models INTERSPEECH2025

【速读】: 该论文试图解决青少年自杀风险检测中现有方法依赖语言特定模型导致的可扩展性和泛化性受限的问题。解决方案的关键在于引入一种语言无关的框架,利用大语言模型(LLMs)从语音生成的中文转录文本中提取与自杀风险相关的特征,并通过提示驱动的查询方式进行分析,从而实现跨语言的特征保留与模型微调,以克服语言限制并提升自杀风险评估的鲁棒性。

链接: https://arxiv.org/abs/2505.20109
作者: June-Woo Kim,Wonkyo Oh,Haram Yoon,Sung-Hoon Yoon,Dae-Jin Kim,Dong-Ho Lee,Sang-Yeol Lee,Chan-Mo Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to InterSpeech 2025

点击查看摘要

Abstract:Suicidal risk detection in adolescents is a critical challenge, yet existing methods rely on language-specific models, limiting scalability and generalization. This study introduces a novel language-agnostic framework for suicidal risk assessment with large language models (LLMs). We generate Chinese transcripts from speech using an ASR model and then employ LLMs with prompt-based queries to extract suicidal risk-related features from these transcripts. The extracted features are retained in both Chinese and English to enable cross-linguistic analysis and then used to fine-tune corresponding pretrained language models independently. Experimental results show that our method achieves performance comparable to direct fine-tuning with ASR results or to models trained solely on Chinese suicidal risk-related features, demonstrating its potential to overcome language constraints and improve the robustness of suicidal risk assessment.
zh

[NLP-45] SCIRGC: Multi-Granularity Citation Recommendation and Citation Sentence Preference Alignment

【速读】: 该论文旨在解决学术引用生成中的两个关键问题:如何准确识别作者的引用意图并找到相关引用文献,以及如何生成符合人类偏好的高质量引用句子。其解决方案的关键在于提出SciRGC框架,该框架通过引入引用网络和情感意图来提升引用文献推荐的准确性,并利用原文摘要、局部上下文、引用意图及推荐文献作为输入,生成基于推理的引用句子。此外,论文还提出了一种新的评估指标以公平衡量生成引用句子的质量。

链接: https://arxiv.org/abs/2505.20103
作者: Xiangyu Li,Jingqiang Chen
机构: Nanjing University of Posts and Telecommunications (南京邮电大学)
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Citations are crucial in scientific research articles as they highlight the connection between the current study and prior work. However, this process is often time-consuming for researchers. In this study, we propose the SciRGC framework, which aims to automatically recommend citation articles and generate citation sentences for citation locations within articles. The framework addresses two key challenges in academic citation generation: 1) how to accurately identify the author’s citation intent and find relevant citation papers, and 2) how to generate high-quality citation sentences that align with human preferences. We enhance citation recommendation accuracy in the citation article recommendation module by incorporating citation networks and sentiment intent, and generate reasoning-based citation sentences in the citation sentence generation module by using the original article abstract, local context, citation intent, and recommended articles as inputs. Additionally, we propose a new evaluation metric to fairly assess the quality of generated citation sentences. Through comparisons with baseline models and ablation experiments, the SciRGC framework not only improves the accuracy and relevance of citation recommendations but also ensures the appropriateness of the generated citation sentences in context, providing a valuable tool for interdisciplinary researchers.
zh

[NLP-46] Adaptive Deep Reasoning : Triggering Deep Thinking When Needed

【速读】: 该论文试图解决大型语言模型在处理复杂任务时因长链推理导致的计算成本过高问题,以及现有方法在优化推理效率时仍需依赖初始推理阶段或手动控制切换短链与长链推理的局限性。其解决方案的关键在于提出一种自主切换短链与长链推理的能力,通过监督微调赋予模型两种推理能力,并结合强化学习中的长短期自适应分组奖励策略和基于logit的推理模式切换损失函数,实现推理路径的动态选择,从而在保持准确性的同时提升推理效率。

链接: https://arxiv.org/abs/2505.20101
作者: Yunhao Wang,Yuhao Zhang,Tinghao Yu,Can Xu,Feng Zhang,Fengzong Lian
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive capabilities in handling complex tasks through long-chain reasoning. However, the extensive reasoning steps involved can significantly increase computational costs, posing challenges for real-world deployment. Recent efforts have focused on optimizing reasoning efficiency by shortening the Chain-of-Thought (CoT) reasoning processes through various approaches, such as length-aware prompt engineering, supervised fine-tuning on CoT data with variable lengths, and reinforcement learning with length penalties. Although these methods effectively reduce reasoning length, they still necessitate an initial reasoning phase. More recent approaches have attempted to integrate long-chain and short-chain reasoning abilities into a single model, yet they still rely on manual control to toggle between short and long this http URL this work, we propose a novel approach that autonomously switches between short and long reasoning chains based on problem complexity. Our method begins with supervised fine-tuning of the base model to equip both long-chain and short-chain reasoning abilities. We then employ reinforcement learning to further balance short and long CoT generation while maintaining accuracy through two key strategies: first, integrating reinforcement learning with a long-short adaptive group-wise reward strategy to assess prompt complexity and provide corresponding rewards; second, implementing a logit-based reasoning mode switching loss to optimize the model’s initial token choice, thereby guiding the selection of the reasoning this http URL on mathematical datasets demonstrate that our model can dynamically switch between long-chain and short-chain reasoning modes without substantially sacrificing performance. This advancement enhances the practicality of reasoning in large language models for real-world applications.
zh

[NLP-47] Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

【速读】: 该论文试图解决基于大语言模型(Large Language Models, LLMs)的问答(Question Answering, QA)在处理复杂任务时所面临的挑战,包括推理能力不足、知识过时以及幻觉问题。其解决方案的关键在于将LLMs与知识图谱(Knowledge Graphs, KGs)进行融合,通过构建一种新的结构化分类体系,系统地回顾和分析当前主流的LLM与KG融合方法,并探讨这些方法如何针对不同复杂QA任务解决上述核心问题。

链接: https://arxiv.org/abs/2505.20099
作者: Chuangtao Ma,Yongrui Chen,Tianxing Wu,Arijit Khan,Haofen Wang
机构: Aalborg University, Denmark; Southeast University, China; Tongji University, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Under Review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG’s role when integrating with LLMs. We systematically survey state-of-the-art advances in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of strength, limitations, and KG requirements. We then align the approaches with QA and discuss how these approaches address the main challenges of different complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.
zh

[NLP-48] S2LPP: Small-to-Large Prompt Prediction across LLM s

【速读】: 该论文试图解决预训练大型语言模型(Large Language Models, LLMs)对提示模板(prompt templates)的敏感性问题,这种敏感性通常需要耗费大量的计算资源和人工努力进行提示工程。研究的关键在于发现不同规模的LLMs在提示偏好上具有一致性,并基于此提出一种利用较小模型为较大模型选择有效提示模板的方法,从而显著降低提示工程的成本,同时保持与最优提示相当的性能。

链接: https://arxiv.org/abs/2505.20097
作者: Liang Cheng,Tianyi LI,Zhaowei Wang,Mark Steedman
机构: University of Edinburgh (爱丁堡大学); Amazon Alexa AI (亚马逊语音助手人工智能); HKUST (香港科技大学)
类目: Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering, adding costs in terms of computing and human effort. In this study, we present experiments encompassing multiple LLMs variants of varying sizes aimed at probing their preference with different prompts. Through experiments on Question Answering, we show prompt preference consistency across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Utilizing this consistency, we propose a method to use a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching performance with optimal prompts among candidates. More importantly, our experiment shows the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness
zh

[NLP-49] MA-RAG : Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning

【速读】: 该论文试图解决复杂信息检索任务中固有的模糊性和推理挑战,特别是在多跳和模糊问答基准中表现不佳的问题。其解决方案的关键在于提出MA-RAG框架,该框架通过协作的专用AI代理(Planner、Step Definer、Extractor和QA Agents)对RAG流程的每个阶段进行任务感知推理,从而分解问题为子任务并调度至具备思维链提示的代理,实现信息流的细粒度控制与动态高效的工作流。

链接: https://arxiv.org/abs/2505.20096
作者: Thang Nguyen,Peter Chin,Yu-Wing Tai
机构: Dartmouth College (达特茅斯学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present MA-RAG, a Multi-Agent framework for Retrieval-Augmented Generation (RAG) that addresses the inherent ambiguities and reasoning challenges in complex information-seeking tasks. Unlike conventional RAG methods that rely on either end-to-end fine-tuning or isolated component enhancements, MA-RAG orchestrates a collaborative set of specialized AI agents: Planner, Step Definer, Extractor, and QA Agents, to tackle each stage of the RAG pipeline with task-aware reasoning. Ambiguities may arise from underspecified queries, sparse or indirect evidence in retrieved documents, or the need to integrate information scattered across multiple sources. MA-RAG mitigates these challenges by decomposing the problem into subtasks, such as query disambiguation, evidence extraction, and answer synthesis, and dispatching them to dedicated agents equipped with chain-of-thought prompting. These agents communicate intermediate reasoning and progressively refine the retrieval and synthesis process. Our design allows fine-grained control over information flow without any model fine-tuning. Crucially, agents are invoked on demand, enabling a dynamic and efficient workflow that avoids unnecessary computation. This modular and reasoning-driven architecture enables MA-RAG to deliver robust, interpretable results. Experiments on multi-hop and ambiguous QA benchmarks demonstrate that MA-RAG outperforms state-of-the-art training-free baselines and rivals fine-tuned systems, validating the effectiveness of collaborative agent-based reasoning in RAG.
zh

[NLP-50] Multi-Domain Explainability of Preferences

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)对偏好机制(Preference Mechanisms)的解释性不足问题,尤其是如何生成局部和全局的概念基础解释。其解决方案的关键在于提出一种全自动化端到端的方法,利用LLM发现选择与拒绝响应之间的差异性概念,并通过基于概念的向量进行表示;同时构建一个白盒分层多领域回归模型,以捕捉领域通用和领域特定的影响,从而实现对偏好的有效预测与解释。

链接: https://arxiv.org/abs/2505.20088
作者: Nitay Calderon,Liat Ein-Dor,Roi Reichart
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses and represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.
zh

[NLP-51] Safety Through Reasoning : An Empirical Study of Reasoning Guardrail Models

【速读】: 该论文旨在解决如何有效训练和部署基于推理的防护模型(reasoning-based guardrail models)以实现内容审核的问题,特别是在推理过程中对自定义安全策略的泛化能力。其解决方案的关键在于提升模型的数据效率和推理效率,通过优化训练数据的使用以及引入推理预算来平衡推理长度对延迟和准确率的影响,从而实现在实际系统中高效、可靠地应用基于推理的防护机制。

链接: https://arxiv.org/abs/2505.20087
作者: Makesh Narsimhan Sreedhar,Traian Rebedea,Christopher Parisien
机构: NVIDIA(英伟达)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrails models in real-world systems.
zh

[NLP-52] Inference-time Alignment in Continuous Space

【速读】: 该论文试图解决在推理阶段对齐大型语言模型与人类反馈时,现有方法因基础策略较弱或候选集较小而导致无法有效探索信息性候选响应的问题。解决方案的关键在于提出一种简单而有效的算法——Simple Energy Adaptation (SEA),该算法通过在连续潜在空间中进行基于梯度的采样,直接将原始响应调整为最优响应,而非在离散响应空间中进行昂贵的搜索。

链接: https://arxiv.org/abs/2505.20081
作者: Yige Yuan,Teng Xiao,Li Yunfan,Bingbing Xu,Shuchang Tao,Yunqi Qiu,Huawei Shen,Xueqi Cheng
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Penn State University (宾夕法尼亚州立大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation ( \textbfSEA ), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to \textbf77.51% on AdvBench and \textbf16.36% on MATH. Our code is publicly available at this https URL
zh

[NLP-53] Incentivizing Reasoning from Weak Supervision

【速读】: 该论文试图解决如何在不依赖昂贵的高质量演示或强化学习的情况下激励大型语言模型(Large Language Models, LLMs)的推理能力问题。其解决方案的关键在于利用显著较弱模型的监督来有效激励更强模型的推理能力,实验结果表明,这种弱到强的监督范式能够在较低成本下显著提升模型的推理性能,接近昂贵强化学习方法的效果。

链接: https://arxiv.org/abs/2505.20072
作者: Yige Yuan,Teng Xiao,Shuchang Tao,Xue Wang,Jinyang Gao,Bolin Ding,Bingbing Xu
机构: Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at this https URL.
zh

[NLP-54] SAEs Are Good for Steering – If You Select the Right Features

【速读】: 该论文试图解决在使用稀疏自编码器(Sparse Autoencoders, SAEs)进行模型输出引导(steering)时,如何有效识别和利用对模型输出具有明确影响的特征问题。现有方法主要依赖于激活输入标记来识别可引导的SAE特征,但研究表明仅凭激活信息不足以全面描述特征对模型输出的影响。论文的关键解决方案是区分两种类型的特征:输入特征(主要捕捉模型输入中的模式)和输出特征(对模型输出具有人类可理解的影响),并通过引入输入分数和输出分数来表征和定位这些特征,发现高输入分数与高输出分数在相同特征中很少同时出现,从而为提升引导效果提供了新的筛选依据。

链接: https://arxiv.org/abs/2505.20063
作者: Dana Arad,Aaron Mueller,Yonatan Belinkov
机构: Technion – Israel Institute of Technology (以色列理工学院); Northeastern University (东北大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model’s latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept - without requiring labeled data. Current methods identify SAE features to steer by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model’s output. In this work, we draw a distinction between two types of features: input features, which mainly capture patterns in the model’s input, and output features, which have a human-understandable effect on the model’s output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: after filtering out features with low output scores, we obtain 2-3x improvements when steering with SAEs, making them competitive with supervised methods.
zh

[NLP-55] Multimodal LLM -Guided Semantic Correction in Text-to-Image Diffusion

【速读】: 该论文旨在解决扩散模型在文本到图像生成过程中缺乏可解释的语义监督与修正机制的问题,导致生成结果常出现对象混淆、空间错误、计数不准确和语义元素缺失等问题,从而影响提示与图像的一致性及图像质量。解决方案的关键在于提出MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD)框架,首次在推理阶段引入多模态大语言模型(Multimodal Large Language Model, MLLM)作为语义观察者,对中间生成结果进行实时分析,识别潜在的语义不一致,并将反馈转化为可控信号,主动引导后续去噪步骤,实现高效的语义修正。

链接: https://arxiv.org/abs/2505.20053
作者: Zheqi Lv,Junhao Chen,Qi Tian,Keting Yin,Shengyu Zhang,Fei Wu
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies-making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only extremely few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD’s significant improvements.
zh

[NLP-56] Grammars of Formal Uncertainty: When to Trust LLM s in Automated Reasoning Tasks

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成形式化规范时所面临的不确定性与形式化验证所需的确定性保证之间的根本性矛盾。其解决方案的关键在于引入一种概率上下文无关语法(Probabilistic Context-Free Grammar, PCFG)框架,以建模LLM输出并构建更精确的不确定性分类体系,从而通过任务相关的不确定性信号融合实现选择性验证,显著降低错误率(14-100%),同时保持较低的放弃验证比例,使LLM驱动的形式化过程成为可靠的工程实践。

链接: https://arxiv.org/abs/2505.20047
作者: Debargha Ganguly,Vikash Singh,Sreehari Sankar,Biyao Zhang,Xuecen Zhang,Srinivasan Iyengar,Xiaotian Han,Amit Sharma,Shivkumar Kalyanaraman,Vipin Chaudhary
机构: Case Western Reserve University (凯斯西储大学); Microsoft Corporation (微软公司); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization’s domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
zh

[NLP-57] REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

【速读】: 该论文旨在解决信息检索中列表排序(listwise ranking)的性能与可解释性问题,特别是在标注数据稀缺的情况下提升模型的推理能力。其解决方案的关键在于提出REARANK,一个基于大语言模型(LLM)的列表推理重排序代理,该方法在重排序前显式进行推理,从而显著提升了模型的性能与可解释性,同时通过强化学习和数据增强技术,在少量标注样本(仅179个)的情况下实现了对基线模型的显著改进。

链接: https://arxiv.org/abs/2505.20046
作者: Le Zhang,Bo Wang,Xipeng Qiu,Siva Reddy,Aishwarya Agrawal
机构: Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); Fudan University (复旦大学); McGill University (麦吉尔大学); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.
zh

[NLP-58] Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中常见的“幻觉”问题,即模型在生成文本时尽管表现出流畅性,但可能产生事实性错误。现有不确定性量化(Uncertainty Quantification, UQ)方法面临计算开销高或依赖监督学习等挑战。论文提出的解决方案关键在于RAUQ(Recurrent Attention-based Uncertainty Quantification),这是一种无监督方法,通过分析Transformer模型中的内在注意力模式来高效检测幻觉,其核心是利用特定注意力头在错误生成时对前序标记注意力下降的规律性特征,并通过递归聚合注意力权重和标记级置信度,在单次前向传播中计算序列级不确定性得分。

链接: https://arxiv.org/abs/2505.20045
作者: Artem Vazhentsev,Lyudmila Rvanova,Gleb Kuzmin,Ekaterina Fadeeva,Ivan Lazichny,Alexander Panchenko,Maxim Panov,Timothy Baldwin,Mrinmaya Sachan,Preslav Nakov,Artem Shelmanov
机构: Skoltech; AIRI; MBZUAI; FRC CSC RAS; The University of Melbourne; ETH Zürich
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as “hallucinations”. Uncertainty quantification (UQ) methods are a promising tool for coping with this fundamental shortcoming. Yet, existing UQ methods face challenges such as high computational overhead or reliance on supervised learning. Here, we aim to bridge this gap. In particular, we propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. By analyzing attention weights, we identified a peculiar pattern: drops in attention to preceding tokens are systematically observed during incorrect generations for certain “uncertainty-aware” heads. RAUQ automatically selects such heads, recurrently aggregates their attention weights and token-level confidences, and computes sequence-level uncertainty scores in a single forward pass. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results, outperforming state-of-the-art UQ methods using minimal computational overhead (1% latency). Moreover, it requires no task-specific labels and no careful hyperparameter tuning, offering plug-and-play real-time hallucination detection in white-box LLMs.
zh

[NLP-59] raining LLM -Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

【速读】: 该论文试图解决基于大语言模型(Large Language Models, LLMs)的自主代理在训练过程中依赖复杂提示工程、封闭源代码模型以及存在性能瓶颈和错误传播的问题。其解决方案的关键在于提出STeP方法,通过合成包含自我反思和错误修正的轨迹来提升代理从教师模型学习的效果,并引入部分掩码策略以防止代理内化错误或次优步骤。实验表明,该方法在多个任务中显著提升了代理性能,尤其在使用较少训练数据的情况下,开放源代码模型LLaMA2-7B-Chat的表现优于仅依赖专家轨迹训练的代理。

链接: https://arxiv.org/abs/2505.20023
作者: Yihan Chen,Benfeng Xu,Xiaorui Wang,Yongdong Zhang,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); Metastone Technology (Metastone Technology)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with the advancements in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs using expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections and corrections of error steps, which enhance the effectiveness of LLM agents in learning from teacher models, enabling them to become agents capable of self-reflecting and correcting. We also introduce partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. For the open-source model LLaMA2-7B-Chat, when trained using self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, it achieves comprehensive improvements with less training data compared to agents trained exclusively on expert trajectories.
zh

[NLP-60] PA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation

【速读】: 该论文试图解决现有工具学习方法在监督微调过程中忽视内部工具调用细节的细粒度优化,导致偏好对齐和错误识别能力受限的问题。其解决方案的关键在于提出一种基于令牌级别的工具使用偏好对齐训练框架(Token-level Tool-use Preference Alignment Training Framework, TTPA),通过引入反向数据集构建、令牌级别偏好采样(Token-level Preference Sampling, TPS)以及面向错误的评分机制(Error-oriented Scoring Mechanism, ESM),实现对模型进行细粒度偏好对齐和有效错误量化。

链接: https://arxiv.org/abs/2505.20016
作者: Chengrui Huang,Shen Gao,Zhengliang Shi,Dongsheng Wang,Shuo Shang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:Existing tool-learning methods usually rely on supervised fine-tuning, they often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets that align LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Token-level Preference Sampling (TPS) to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism (ESM), which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.
zh

[NLP-61] On the class of coding optimality of human languages and the origins of Zipfs law

【速读】: 该论文试图解决编码系统中为何会出现Zipf’s law(齐普夫定律)的问题,即频率排名的幂律分布现象。其解决方案的关键在于提出了一种新的最优性类别,该类别的编码系统在频率排名与规模排名之间表现出类似群结构的关系,并且通过双对数坐标下频率与排名的直线关系,揭示了最优编码长度在非奇异编码和唯一可解编码之间的线性差异,该差异的斜率即为Zipf’s law的指数。这一发现支持了Zipf’s law可能源于压缩的假设。

链接: https://arxiv.org/abs/2505.20015
作者: Ramon Ferrer-i-Cancho
机构: 未知
类目: Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Here we present a new class of optimality for coding systems. Members of that class are separated linearly from optimal coding and thus exhibit Zipf’s law, namely a power-law distribution of frequency ranks. Whithin that class, Zipf’s law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf’s law are potential members of the class. In contrast, there are communication systems in other species that cannot be members of that class for exhibiting an exponential distribution instead but dolphins and humpback whales might. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are separated by a linear function whose slope is the exponent of Zipf’s law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. Our findings provide support for the hypothesis that Zipf’s law originates from compression.
zh

[NLP-62] Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation

【速读】: 该论文试图解决在心理健康检测中,如何通过合理选择解释性理由(rationale)来提升小型语言模型(SLM)的性能问题。其关键解决方案是提出一种基于与专家临床推理对齐程度选择理由的框架,以确保生成的理由具有高质量和领域相关性,从而提升知识迁移的效果。

链接: https://arxiv.org/abs/2505.20014
作者: Hoyun Song,Huije Lee,Jisu Shin,Sukmin Cho,Changgeon Ko,Jong C. Park
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The detection of mental health problems from social media and the interpretation of these results have been extensively explored. Research has shown that incorporating clinical symptom information into a model enhances domain expertise, improving its detection and interpretation performance. While large language models (LLMs) are shown to be effective for generating explanatory rationales in mental health detection, their substantially large parameter size and high computational cost limit their practicality. Reasoning distillation transfers this ability to smaller language models (SLMs), but inconsistencies in the relevance and domain alignment of LLM-generated rationales pose a challenge. This paper investigates how rationale quality impacts SLM performance in mental health detection and explanation generation. We hypothesize that ensuring high-quality and domain-relevant rationales enhances the distillation. To this end, we propose a framework that selects rationales based on their alignment with expert clinical reasoning. Experiments show that our quality-focused approach significantly enhances SLM performance in both mental disorder detection and rationale generation. This work highlights the importance of rationale quality and offers an insightful framework for knowledge transfer in mental health applications.
zh

[NLP-63] WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection Branching and Rollback

【速读】: 该论文试图解决基于大型语言模型(Large Language Models, LLMs)的网络代理在不确定、动态网络环境中的推理能力有限,从而阻碍其稳健部署的问题。解决方案的关键在于识别并增强代理的关键推理技能,包括反思(reflection)、前瞻(lookahead)、分支(branching)和回滚(rollback),并通过将这些推理模式通过简单的微调注入到基础LLM中,以提升其性能。实验结果表明,这种针对特定推理技能的优化方法在多个基准测试中均取得了显著提升。

链接: https://arxiv.org/abs/2505.20013
作者: Minda Hu,Tianqing Fang,Jianshu Zhang,Junyu Ma,Zhisong Zhang,Jingyan Zhou,Hongming Zhang,Haitao Mi,Dong Yu,Irwin King
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages

点击查看摘要

Abstract:Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent’s (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments in the agent self-improving benchmark, OpenWebVoyager, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.
zh

[NLP-64] Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition INTERSPEECH2025

【速读】: 该论文旨在提升自动语音识别(Automatic Speech Recognition, ASR)系统在非母语语音场景下的鲁棒性,特别是在低资源多口音设置中。解决方案的关键在于引入了口音专用低秩适配(Mixture of Accent-Specific LoRAs, MAS-LoRA),该方法通过融合多个针对特定口音进行优化的低秩适配(Low-Rank Adaptation, LoRA)专家,实现对不同口音的有效建模。此方法在推理阶段无需重新微调模型即可适应已知或未知口音,并在实验中表现出优于常规LoRA和全量微调的性能,同时减少了灾难性遗忘问题。

链接: https://arxiv.org/abs/2505.20006
作者: Raphaël Bagat,Irina Illina,Emmanuel Vincent
机构: Université de Lorraine, CNRS, Inria, LORIA
类目: Computation and Language (cs.CL)
备注: Submitted to Interspeech 2025

点击查看摘要

Abstract:We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. We introduce Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent. This method can be used when the accent is known or unknown at inference time, without the need to fine-tune the model again. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate significant improvements in Word Error Rate compared to regular LoRA and full fine-tuning when the accent is unknown. When the accent is known, the results further improve. Furthermore, MAS-LoRA shows less catastrophic forgetting than the other fine-tuning methods. To the best of our knowledge, this is the first use of a mixture of LoRA experts for non-native multi-accent ASR.
zh

[NLP-65] Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM -based Agents

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在模拟学生行为时存在的问题,即由于LLMs通常被训练为“有帮助的助手”,倾向于生成完美的回答,从而无法准确模拟具有不同认知水平学生的多样化学习模式,导致模拟结果缺乏真实的学生学习中的自然错误。解决方案的关键在于提出一种无需额外训练的框架,通过构建基于知识图谱(knowledge graph)的学生认知原型,并将其映射到新任务中以预测学生表现,随后根据预测结果模拟学生解题过程,并利用束搜索(beam search)方法迭代优化,以更真实地再现学生的错误。

链接: https://arxiv.org/abs/2505.19997
作者: Tao Wu,Jingyuan Chen,Wang Lin,Mengze Li,Yumeng Zhu,Ang Li,Kun Kuang,Fei Wu
机构: Zhejiang University (浙江大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as ``helpful assistants’', target at generating perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the \textttStudent_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving 100% improvement in simulation accuracy.
zh

[NLP-66] How Well Do Large Reasoning Models Translate? A Comprehensive Evaluation for Multi-Domain Machine Translation

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂、领域敏感的机器翻译任务中表现不足的问题,特别是探讨结构化推理是否能够提升跨不同领域的翻译质量。其解决方案的关键在于引入大型推理模型(Large Reasoning Models, LRMs),并通过领域自适应提示策略优化模型推理能力,从而在语义复杂的领域中实现更高质量的翻译。

链接: https://arxiv.org/abs/2505.19987
作者: Yongshi Ye,Biao Fu,Chongxuan Huang,Yidong Chen,Xiaodong Shi
机构: Xiamen University (厦门大学); Institute of Artificial Intelligence, Xiamen University (人工智能学院,厦门大学); School of Informatics, Xiamen University (信息学院,厦门大学); Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism (福建省与台湾数字保护与智能处理非物质文化遗产重点实验室(厦门大学),文化和旅游部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong performance in general-purpose machine translation, but their effectiveness in complex, domain-sensitive translation tasks remains underexplored. Recent advancements in Large Reasoning Models (LRMs), raise the question of whether structured reasoning can enhance translation quality across diverse domains. In this work, we compare the performance of LRMs with traditional LLMs across 15 representative domains and four translation directions. Our evaluation considers various factors, including task difficulty, input length, and terminology density. We use a combination of automatic metrics and an enhanced MQM-based evaluation hierarchy to assess translation quality. Our findings show that LRMs consistently outperform traditional LLMs in semantically complex domains, especially in long-text and high-difficulty translation scenarios. Moreover, domain-adaptive prompting strategies further improve performance by better leveraging the reasoning capabilities of LRMs. These results highlight the potential of structured reasoning in MDMT tasks and provide valuable insights for optimizing translation systems in domain-sensitive contexts.
zh

[NLP-67] DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset ALT

【速读】: 该论文旨在解决多轮对话中情感范围有限、领域多样性不足、对话轮次深度不够以及主要为文本形式的对话数据集所带来的挑战,从而推动跨模态更拟人化对话系统的开发。其解决方案的关键在于构建了一个大规模多模态数据集DeepDialogue,该数据集包含40,150个高质量多轮对话,覆盖41个领域,并融合了20种不同情绪且具有连贯情绪演变的对话内容。此外,通过配对9种不同参数规模的语言模型生成初始对话,并结合人工标注与基于大语言模型的质量筛选,确保了数据集的高质量和多样性,同时引入了语音组件以实现情感一致性的语音合成,使该数据集成为首个大规模开源的多模态对话数据集。

链接: https://arxiv.org/abs/2505.19978
作者: Alkis Koudounas,Moreno La Quatra,Elena Baralis
机构: Politecnico di Torino (都灵理工学院); Kore University of Enna (恩纳科尔大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Currently under review. See the official website: this https URL

点击查看摘要

Abstract:Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B-72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., “cars,” “travel”) yield more meaningful conversations than abstract ones (e.g., “philosophy”); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.
zh

[NLP-68] Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

【速读】: 该论文试图解决如何为知识图谱(Knowledge Graph)上的词典数据检索创建自然语言接口的问题,以降低非专家用户使用SPARQL查询语言的门槛。解决方案的关键在于构建一个涵盖Wikidata词典数据本体模块复杂性的多维分类体系,并生成一个包含超过120万条自然语言语句到SPARQL查询映射的模板数据集,从而支持自然语言到结构化查询的转换。

链接: https://arxiv.org/abs/2505.19971
作者: Kilian Sennrich,Sina Ahmadi
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注: Accepted to LDK 2025 - the 5th Conference on Language, Data and Knowledge. Naples, Italy, 9-11 September 2025

点击查看摘要

Abstract:Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata’s lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
zh

[NLP-69] CP-Router: An Uncertainty-Aware Router Between LLM and LRM

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在处理简单查询时生成冗长输出的问题,这导致了效率低下甚至准确率下降。其解决方案的关键在于提出CP-Router,一个无需训练且与模型无关的路由框架,通过Conformal Prediction(CP)预测不确定性来动态选择使用大型语言模型(Large Language Models, LLMs)或LRMs,从而优化输出长度与准确性之间的平衡。此外,引入基于熵的Full and Binary Entropy(FBE)准则进一步提升了不确定性区分能力。

链接: https://arxiv.org/abs/2505.19970
作者: Jiayuan Su,Fulin Lin,Zhaopeng Feng,Han Zheng,Teng Wang,Zhenyu Xiao,Xinlong Zhao,Zuozhu Liu,Lu Cheng,Hongwei Wang
机构: Zhejiang University (浙江大学); University of Hong Kong (香港大学); Tsinghua University (清华大学); Peking University (北京大学); University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Reasoning Models (LRMs) have significantly improved long-chain reasoning capabilities over Large Language Models (LLMs). However, LRMs often produce unnecessarily lengthy outputs even for simple queries, leading to inefficiencies or even accuracy degradation compared to LLMs. To overcome this, we propose CP-Router, a training-free and model-agnostic routing framework that dynamically selects between an LLM and an LRM, demonstrated with multiple-choice question answering (MCQA) prompts. The routing decision is guided by the prediction uncertainty estimates derived via Conformal Prediction (CP), which provides rigorous coverage guarantees. To further refine the uncertainty differentiation across inputs, we introduce Full and Binary Entropy (FBE), a novel entropy-based criterion that adaptively selects the appropriate CP threshold. Experiments across diverse MCQA benchmarks, including mathematics, logical reasoning, and Chinese chemistry, demonstrate that CP-Router efficiently reduces token usage while maintaining or even improving accuracy compared to using LRM alone. We also extend CP-Router to diverse model pairings and open-ended QA, where it continues to demonstrate strong performance, validating its generality and robustness.
zh

[NLP-70] he Limits of Preference Data for Post-Training

【速读】: 该论文试图解决在需要人类反馈的领域中,如何利用强化学习(Reinforcement Learning, RL)优化结果的问题。其关键在于揭示偏好数据(preference data)在基于结果的优化中存在根本性限制,即使在理想情况下,基于序数反馈的RL也无法获得近似最优解。论文通过投票理论进行形式化分析,表明模型回答问题与选民选举候选人的过程具有类比性,从而强调了结合实际人类评分和算法创新的重要性,以推动RL在需人类反馈领域中的应用。

链接: https://arxiv.org/abs/2505.19964
作者: Eric Zhao,Jessica Dai,Pranjal Awasthi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or k -wise) that indicate, for k given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query with how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF’s ability to elicit robust strategies – a class that encompasses most reasoning behaviors.
zh

[NLP-71] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models ACL’25

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在长文本理解(Long Context Understanding, LCU)任务中评估成本过高的问题。现有LCU基准测试因数据冗余导致测试时间和计算资源消耗巨大,限制了研究的效率和规模。论文提出的解决方案关键在于通过数据压缩方法对长文本数据进行精简,具体表现为对知名LCU基准LongBench进行剪枝,构建了一个名为MiniLongBench的轻量级基准,其包含237个测试样本,覆盖6个主要任务类别和21个不同任务,在保持与原基准0.97平均等级相关系数的同时,将平均评估成本降低至原来的4.5%。

链接: https://arxiv.org/abs/2505.19959
作者: Zhongzhan Huang,Guoming Ling,Shanshan Zhong,Hefeng Wu,Liang Lin
机构: Sun Yat-sen University (中山大学); Peng Cheng Laboratory (鹏城实验室); Guangdong Key Laboratory of Big Data Analysis and Processing (广东省大数据分析与处理重点实验室)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL’25 main track

点击查看摘要

Abstract:Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which means the inefficiency in evaluation. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See this https URL for our code, data and tutorial.
zh

[NLP-72] DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph

【速读】: 该论文试图解决Text-to-SQL任务中现有方法在使用小型大语言模型(Large Language Models, LLMs)时性能显著下降的问题,以及现有方法对超大规模LLMs的依赖性过强,而未能有效检索有用示例的问题。解决方案的关键在于构建一个基于深度上下文模式链接图(Deep Contextual Schema Link Graph),该图包含问题与数据库模式项之间的关键信息和语义关系,从而实现对Text-to-SQL样本的有效表示和有用示例的高效检索。

链接: https://arxiv.org/abs/2505.19956
作者: Jihyung Lee,Jin-Seop Lee,Jaehoon Lee,YunSeok Choi,Jee-Hyong Lee
机构: Sungkyunkwan University (成均馆大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. Our code will be released.
zh

[NLP-73] MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

【速读】: 该论文旨在解决当前人工智能代理在开放性机器学习研究中的评估问题,特别是如何有效衡量其在科研任务中的表现与可靠性。解决方案的关键在于提出MLR-Bench,这是一个包含三项核心组件的综合基准:涵盖多种机器学习主题的201个研究任务、结合大语言模型(LLM)评审与精心设计评分标准的MLR-Judge自动化评估框架,以及能够通过想法生成、提案制定、实验和论文撰写四个阶段完成研究任务的MLR-Agent模块化代理架构。该框架支持分阶段评估与端到端论文评估,为AI代理在科学研究中的能力提供了系统性的衡量标准。

链接: https://arxiv.org/abs/2505.19955
作者: Hui Chen,Miao Xiong,Yujie Lu,Wei Han,Ailin Deng,Yufei He,Jiaying Wu,Yibo Li,Yue Liu,Bryan Hooi
机构: National University of Singapore (新加坡国立大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 40 pages, 7 figures

点击查看摘要

Abstract:Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results–posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
zh

[NLP-74] An Explainable Diagnostic Framework for Neurodegenerative Dementias via Reinforcement-Optimized LLM Reasoning

【速读】: 该论文旨在解决神经退行性痴呆的鉴别诊断问题,这一问题在临床上具有挑战性,主要由于症状表现的重叠以及结构神经影像学中模式的相似性。其解决方案的关键在于提出一个整合两个核心组件的框架:首先,构建一个模块化流程,将3D T1加权脑部MRI转换为文本放射学报告;其次,探索现代大型语言模型(Large Language Models, LLMs)在基于生成报告的额颞叶痴呆亚型、阿尔茨海默病和正常衰老之间的鉴别诊断中的潜力。通过强化学习激励LLMs进行诊断推理,该方法在无需监督推理轨迹或从大模型中蒸馏的情况下,生成基于神经影像学发现的结构化诊断依据,从而实现预测准确性与可解释性的结合。

链接: https://arxiv.org/abs/2505.19954
作者: Andrew Zamai,Nathanael Fijalkow,Boris Mansencal,Laurent Simon,Eloi Navet,Pierrick Coupe
机构: Univ. Bordeaux(波尔多大学); CNRS(法国国家科学研究中心); Bordeaux INP(波尔多工程师学院); LaBRI(波尔多计算机科学与应用实验室); UMR 5800(法国国家科学研究中心联合研究所5800); F-33400 Talence(法国塔莱斯33400); France(法国)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The differential diagnosis of neurodegenerative dementias is a challenging clinical task, mainly because of the overlap in symptom presentation and the similarity of patterns observed in structural neuroimaging. To improve diagnostic efficiency and accuracy, deep learning-based methods such as Convolutional Neural Networks and Vision Transformers have been proposed for the automatic classification of brain MRIs. However, despite their strong predictive performance, these models find limited clinical utility due to their opaque decision making. In this work, we propose a framework that integrates two core components to enhance diagnostic transparency. First, we introduce a modular pipeline for converting 3D T1-weighted brain MRIs into textual radiology reports. Second, we explore the potential of modern Large Language Models (LLMs) to assist clinicians in the differential diagnosis between Frontotemporal dementia subtypes, Alzheimer’s disease, and normal aging based on the generated reports. To bridge the gap between predictive accuracy and explainability, we employ reinforcement learning to incentivize diagnostic reasoning in LLMs. Without requiring supervised reasoning traces or distillation from larger models, our approach enables the emergence of structured diagnostic rationales grounded in neuroimaging findings. Unlike post-hoc explainability methods that retrospectively justify model decisions, our framework generates diagnostic rationales as part of the inference process-producing causally grounded explanations that inform and guide the model’s decision-making process. In doing so, our framework matches the diagnostic performance of existing deep learning methods while offering rationales that support its diagnostic conclusions.
zh

[NLP-75] Can Visual Encoder Learn to See Arrows? CVPR2025

【速读】: 该论文试图解决视觉语言模型(Vision Language Models, VLMs)在识别图像中边(edges)方面的不足问题,这一问题限制了其对领域特定知识的理解能力。解决方案的关键在于通过在无文本和位置偏差的图表数据集上进行训练,使图像编码器学习到显式的边缘特征,从而减少对文本和位置偏倚的依赖,提升模型对边的识别准确性。

链接: https://arxiv.org/abs/2505.19944
作者: Naoyuki Terashita,Yusuke Tozaki,Hideaki Omote,Congkha Nguyen,Ryosuke Nakamoto,Yuta Koreeda,Hiroaki Ozaki
机构: Hitachi, Ltd. (日立有限公司); Kyoto Sangyo University (京都经济大学); Gifu University (岐阜大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This work has been accepted for poster presentation at the Second Workshop on Visual Concepts in CVPR 2025

点击查看摘要

Abstract:The diagram is a visual representation of a relationship illustrated with edges (lines or arrows), which is widely used in industrial and scientific communication. Although recognizing diagrams is essential for vision language models (VLMs) to comprehend domain-specific knowledge, recent studies reveal that many VLMs fail to identify edges in images. We hypothesize that these failures stem from an over-reliance on textual and positional biases, preventing VLMs from learning explicit edge features. Based on this idea, we empirically investigate whether the image encoder in VLMs can learn edge representation through training on a diagram dataset in which edges are biased neither by textual nor positional information. To this end, we conduct contrastive learning on an artificially generated diagram–caption dataset to train an image encoder and evaluate its diagram-related features on three tasks: probing, image retrieval, and captioning. Our results show that the finetuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task. These findings confirm that eliminating textual and positional biases fosters accurate edge recognition in VLMs, offering a promising path for advancing diagram understanding.
zh

[NLP-76] ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLM s

【速读】: 该论文试图解决在生成式 AI(Generative AI)中,如何评估音频与文本模态之间跨模态对齐质量的问题。现有方法虽能融合多模态信息,但缺乏统一的标准度量指标。解决方案的关键在于提出一种新的度量标准 ALAS(Automatic Latent Alignment Score),通过分析不同 Transformer 层中音频与文本表征的相关性,来评估跨模态对齐的质量,并验证其在不同任务中的有效性。

链接: https://arxiv.org/abs/2505.19937
作者: Pooneh Mousavi,Yingzhi Wang,Mirco Ravanelli,Cem Subakan
机构: Concordia University (康考迪亚大学); Mila-Quebec AI Institute (Mila-魁北克人工智能研究所); Centrale Supélec (中央理工学院); Laval University (拉瓦尔大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used in Spoken Language Understanding (SLU). Recent SLU models process audio directly by adapting speech input into LLMs for better multimodal learning. A key consideration for these models is the cross-modal alignment between text and audio modalities, which is a telltale sign as to whether or not LLM is able to associate semantic meaning to audio segments. While various methods exist for fusing these modalities, there is no standard metric to evaluate alignment quality in LLMs. In this work, we propose a new metric, ALAS (Automatic Latent Alignment Score). Our study examines the correlation between audio and text representations across transformer layers, for two different tasks (Spoken Question Answering and Emotion Recognition). We showcase that our metric behaves as expected across different layers and different tasks.
zh

[NLP-77] Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理需要人类直觉和无领域知识的谜题任务时表现不足的问题。其关键解决方案是提出Enigmata,这是一个专为提升LLMs谜题推理能力而设计的综合性工具集,包含36个跨七类任务,每个任务均配备可控制难度的生成器和基于规则的验证器,支持可扩展的多任务强化学习训练与精细分析,并实现了与强化学习结合的可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的无缝集成。

链接: https://arxiv.org/abs/2505.19914
作者: Jiangjie Chen,Qianyu He,Siyu Yuan,Aili Chen,Zhicheng Cai,Weinan Dai,Hongli Yu,Qiying Yu,Xuefeng Li,Jiaze Chen,Hao Zhou,Mingxuan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), such as OpenAI’s o1 and DeepSeek’s R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at this https URL.
zh

[NLP-78] APE: A Data-Centric Benchmark for Efficient LLM Adaptation in Text Summarization

【速读】: 该论文试图解决在有限计算资源下,如何高效地将大语言模型适配到特定任务的问题。解决方案的关键在于提出一种名为相邻可能探索(Adjacent Possible Exploration, APE)的方法,该方法通过在少量精心选择的数据批次(200个示例)上进行迭代微调,并仅保留性能提升的部分,从而以极低的计算成本实现模型性能的显著提升。

链接: https://arxiv.org/abs/2505.19912
作者: Javier Marín
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Adjacent Possible Exploration (APE), a simple yet effective method for adapting large language models to specific tasks using minimal computational resources. Unlike traditional fine-tuning that requires extensive compute, APE iteratively fine-tunes models on small, carefully selected data batches (200 examples), retaining only improvements. On news summarization, APE achieves 40 percent BLEU improvement using just a T4 GPU in 60 minutes, matching or exceeding more complex methods like LoRA while remaining conceptually simple. Our approach is particularly valuable for researchers and practitioners with limited computational resources. We provide open-source code and demonstrate APE’s effectiveness through both automatic metrics and human evaluation. While inspired by evolutionary theory’s “adjacent possible”, APE’s core insight has a very practical application: small, iterative data perturbations can efficiently guide LLMs toward task-specific performance without expensive retraining.
zh

[NLP-79] ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂科学工作流中难以可靠辅助科学家的问题,特别是针对自动化科学问题求解和研究流程中的常规任务。其解决方案的关键在于提出ScienceBoard,这是一个包含动态且视觉丰富的多领域科学工作流环境,集成了专业软件,允许代理通过不同接口自主交互以加速复杂研究任务;同时,还构建了一个由人类精心筛选的169个高质量、经过严格验证的真实世界任务的基准,用于评估和改进代理在科学发现中的表现。

链接: https://arxiv.org/abs/2505.19897
作者: Qiushi Sun,Zhoumianze Liu,Chang Ma,Zichen Ding,Fangzhi Xu,Zhangyue Yin,Haiteng Zhao,Zhenyu Wu,Kanzhi Cheng,Zhaoyang Liu,Jianing Wang,Qintong Li,Xiangru Tang,Tianbao Xie,Xiachong Feng,Xiang Li,Ben Kao,Wenhai Wang,Biqing Qi,Lingpeng Kong,Zhiyong Wu
机构: The University of Hong Kong(香港大学); Shanghai AI Laboratory(上海人工智能实验室); Fudan University(复旦大学); Peking University(北京大学); Nanjing University(南京大学); East China Normal University(华东师范大学); Yale University(耶鲁大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers’ workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at this https URL.
zh

[NLP-80] Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program

【速读】: 该论文旨在将大型语言模型(Large Language Models, LLMs)作为自主代理应用于空间控制领域,以在非合作空间操作中实现卫星的自主决策。其解决方案的关键在于利用提示工程(prompt engineering)、少样本提示(few-shot prompting)和微调(fine-tuning)技术,构建一个基于LLM的高效代理,并在Kerbal Space Program Differential Games(KSPDG)挑战中取得了第二名的成绩。该研究首次将LLM代理引入空间研究领域,具有开创性意义。

链接: https://arxiv.org/abs/2505.19896
作者: Alejandro Carrasco,Victor Rodriguez-Fernandez,Richard Linares
机构: 未知
类目: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computation and Language (cs.CL)
备注: Non revised version for paper going to be published in Journal of Advances in Space Research

点击查看摘要

Abstract:Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Control in space, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on \hrefthis https URLGitHub, while the trained models and datasets are available on \hrefthis https URLHugging Face. Additionally, experiment tracking and detailed results can be reviewed on \hrefthis https URLWeights \ Biases
zh

[NLP-81] ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining

【速读】: 该论文旨在解决大规模语言模型预训练过程中计算资源浪费的问题,即许多token对学习贡献有限,导致训练效率低下。其解决方案的关键在于提出一种风险感知的算法——高效选择性语言建模(Efficient Selective Language Modeling, ESLM),通过在线进行token级别的批量选择,利用每个token的统计信息(如熵或损失)并应用风险价值阈值筛选,仅保留最具信息量的token,从而提升训练效率和分布鲁棒性。

链接: https://arxiv.org/abs/2505.19893
作者: Melis Ilayda Bal,Volkan Cevher,Michael Muehlebach
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); LIONS, EPFL (LIONS, EPFL); AGI Foundations, Amazon (AGI Foundations, 亚马逊)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Efficient Selective Language Modeling (ESLM), a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. ESLM leverages per-token statistics (e.g., entropy or loss) and applies value-at-risk thresholding to retain only the most informative tokens per batch. This data-centric mechanism reshapes the training loss, prioritizing high-risk tokens and eliminating redundant gradient computation. We frame ESLM as a bilevel game: the model competes with a masking adversary that selects worst-case token subsets under a constrained thresholding rule. In the loss-based setting, ESLM recovers conditional value-at-risk loss minimization, providing a principled connection to distributionally robust optimization. We extend our approach to Ada-ESLM, which adaptively tunes the selection confidence during training. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines. Our approach also scales across model sizes, pretraining corpora, and integrates naturally with knowledge distillation.
zh

[NLP-82] HS-STAR: Hierarchical Sampling for Self-Taught Reason ers via Difficulty Estimation and Budget Reallocation

【速读】: 该论文试图解决在自教推理(Self-Taught Reasoners, STaRs)训练过程中,由于对所有问题分配相同的采样预算而导致的资源浪费问题,即未能考虑到不同难度问题的学习效用差异。解决方案的关键在于提出一种分层采样框架HS-STaR,该框架通过奖励引导的难度估计策略进行轻量级预采样,以高效识别位于语言模型推理能力边界的问题,并在再采样阶段动态调整剩余预算,从而最大化高质量训练数据的生成。

链接: https://arxiv.org/abs/2505.19866
作者: Feng Xiong,Hongling Xu,Yifei Wang,Runxi Cheng,Yong Wang,Xiangxiang Chu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM’s reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.
zh

[NLP-83] REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂任务中表现出的“过度思考”问题,这一问题导致了较高的推理成本。现有方法虽然通过生成较短的推理响应来训练LRMs,但由于数据生成和过滤过程耗时,难以高效应用于在线场景;而在线强化学习主要依赖长度奖励来鼓励简短的响应,却容易削弱模型的反思能力并影响性能。论文提出的解决方案是REA-RL,其关键在于引入一个小型反思模型,实现在线训练中的高效扩展,并提供并行采样与序列修订机制;同时设计了反思奖励,以防止LRMs偏好简短但缺乏反思的响应。实验表明,该方法在提升推理效率的同时保持或增强了模型性能。

链接: https://arxiv.org/abs/2505.19862
作者: Hexuan Deng,Wenxiang Jiao,Xuebo Liu,Jun Rao,Min Zhang
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Work in Progress

点击查看摘要

Abstract:Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but tends to lose the reflection ability and harm the performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Codes are available at this https URL.
zh

[NLP-84] Beyond Specialization: Benchmarking LLM s for Transliteration of Indian Languages

【速读】: 该论文试图解决多语言自然语言处理中跨文字系统映射的音译问题(transliteration),特别是在语言多样性显著的语境下,如印度。其解决方案的关键在于评估大型语言模型(LLMs)在无需专门任务训练的情况下,是否能够超越现有的专业音译模型(如IndicXlit)的表现,并通过微调进一步提升特定语言的性能。研究结果表明,GPT家族模型在多数情况下优于其他LLMs和IndicXlit,且在噪声条件下展现出更强的鲁棒性,证明了基础模型在多种专用任务中的有效性。

链接: https://arxiv.org/abs/2505.19851
作者: Gulfarogh Azam,Mohd Sadique,Saif Ali,Mohammad Nadeem,Erik Cambria,Shahab Saquib Sohail,Mohammad Sultan Alam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transliteration, the process of mapping text from one script to another, plays a crucial role in multilingual natural language processing, especially within linguistically diverse contexts such as India. Despite significant advancements through specialized models like IndicXlit, recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training. The current work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages. Experiments utilized standard benchmarks, including Dakshina and Aksharantar datasets, with performance assessed via Top-1 Accuracy and Character Error Rate. Our findings reveal that while GPT family models generally outperform other LLMs and IndicXlit for most instances. Additionally, fine-tuning GPT-4o improves performance on specific languages notably. An extensive error analysis and robustness testing under noisy conditions further elucidate strengths of LLMs compared to specialized models, highlighting the efficacy of foundational models for a wide spectrum of specialized applications with minimal overhead.
zh

[NLP-85] Improving Multilingual Math Reasoning for African Languages

【速读】: 该论文旨在解决低资源语言(low-resource languages)在大型语言模型(Large Language Models, LLMs)适应性方面的挑战,特别是针对非洲语言的适配问题。其关键解决方案在于系统性地评估不同适配策略的有效性,包括数据类型(翻译数据与合成生成数据)、训练阶段(预训练与后训练)及其他模型适配配置,以确定在扩展现有LLMs至非洲语言时表现最优的组合方式。

链接: https://arxiv.org/abs/2505.19848
作者: Odunayo Ogundepo,Akintunde Oladipo,Kelechi Ogueji,Esther Adenuga,David Ifeoluwa Adelani,Jimmy Lin
机构: The African Research Collective; Masakhane NLP; University of Waterloo; Mila, McGill University; Canada CIFAR AI Chair
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Although most large language models (LLMs) are predominantly trained in high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in todays LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focuses on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.
zh

[NLP-86] FoodTaxo: Generating Food Taxonomies with Large Language Models ACL2025

【速读】: 该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)自动生成和补全食品技术行业的分类体系(taxonomy)的问题。其解决方案的关键在于通过迭代方式,借助最新的提示技术,从种子分类体系或已知概念集合中完成或生成分类体系,但实验结果表明,正确放置内部节点仍存在较大挑战。

链接: https://arxiv.org/abs/2505.19838
作者: Pascal Wullschleger,Majid Zarharan,Donnacha Daly,Marc Pouly,Jennifer Foster
机构: ADAPT Centre, School of Computing, Dublin City University (ADAPT中心,计算机学院,都柏林城市大学); Lucerne School of Computer Science and Information Technology (HSLU) (卢塞恩计算机科学与信息技术学院(HSLU))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in ACL 2025 Industry Track. Paper website: this https URL

点击查看摘要

Abstract:We investigate the utility of Large Language Models for automated taxonomy generation and completion specifically applied to taxonomies from the food technology industry. We explore the extent to which taxonomies can be completed from a seed taxonomy or generated without a seed from a set of known concepts, in an iterative fashion using recent prompting techniques. Experiments on five taxonomies using an open-source LLM (Llama-3), while promising, point to the difficulty of correctly placing inner nodes.
zh

[NLP-87] Deciphering Trajectory-Aided LLM Reasoning : An Optimization Perspective

【速读】: 该论文试图解决如何理解大型语言模型(Large Language Models, LLMs)的推理能力问题,其核心在于通过元学习(meta-learning)的视角来揭示LLM推理机制的本质。解决方案的关键在于将推理轨迹建模为对LLM参数的伪梯度下降更新,并将每个问题视为一个独立的任务,推理轨迹作为内循环优化过程用于适应模型参数。这种框架不仅形式化了推理任务的训练过程,还表明经过多样化问题训练的LLM能够发展出可泛化到未见问题的基本推理能力。

链接: https://arxiv.org/abs/2505.19815
作者: Junnan Liu,Hongwei Liu,Linchen Xiao,Shudong Liu,Taolin Zhang,Zihan Ma,Songyang Zhang,Kai Chen
机构: Shanghai AI Laboratory(上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM’s parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
zh

[NLP-88] Exploring Consciousness in LLM s: A Systematic Survey of Theories Implementations and Frontier Risks

【速读】: 该论文试图解决关于大型语言模型(Large Language Models, LLMs)是否具备意识(consciousness)的问题,以及由此引发的理论与实践挑战。其解决方案的关键在于厘清混淆的术语(如LLM意识与LLM觉察),系统梳理并整合现有理论与实证研究,并识别潜在的前沿风险,从而为该新兴领域的发展提供方向与框架。

链接: https://arxiv.org/abs/2505.19806
作者: Sirui Chen,Shuqin Ma,Shu Yu,Hanwang Zhang,Shengjie Zhao,Chaochao Lu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tongji University (同济大学); Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at this https URL.
zh

[NLP-89] Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation

【速读】: 该论文试图解决金融监管法规在转换为可执行合规逻辑时存在的劳动密集型和理解困难问题,特别是在中文金融监管法规中,现有监管科技(RegTech)和大型语言模型(LLMs)的性能表现不佳。其关键解决方案是构建一个面向领域且面向代码的高质量数据集——Compliance-to-Code,该数据集是首个专注于金融监管合规的大规模中文数据集,包含1,159个标注条款,涵盖10个类别,并以主体、条件、约束和上下文信息四个逻辑元素进行模块化结构化,同时提供确定性的Python代码映射和详细的代码推理与解释,以支持自动化审计和合规代码生成。

链接: https://arxiv.org/abs/2505.19804
作者: Siyuan Li,Jian Chen,Rui Yao,Xuming Hu,Peilin Zhou,Weihua Qiu,Simin Zhang,Chucheng Dong,Zhiyao Li,Qipeng Xie,Zixuan Yuan
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Sun Yat-Sen University (中山大学); University of California, Riverside (加州大学河滨分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Nowadays, regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. At its core, financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which inevitably result in labor-intensive or comprehension challenges. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop a domain specific and code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or lack fine-grained granularity for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. Covering 1,159 annotated clauses from 361 regulations across ten categories, each clause is modularly structured with four logical elements-subject, condition, constraint, and contextual information-along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.
zh

[NLP-90] MOLE: Metadata Extraction and Validation in Scientific Papers Using LLM s

【速读】: 该论文试图解决科学文献中元数据(Metadata)自动提取的问题,特别是在非阿拉伯语语言的数据集相关论文中,以支持数据集的目录编制、保存以及研究发现和可重复性。解决方案的关键在于提出MOLE框架,该框架利用大型语言模型(Large Language Models, LLMs)实现跨多种输入格式的文档级元数据属性自动提取,并通过模式驱动的方法和稳健的验证机制确保输出的一致性。

链接: https://arxiv.org/abs/2505.19800
作者: Zaid Alyafeai,Maged S. Al-Shaibani,Bernard Ghanem
机构: KAUST; SDAIA-KFUPM Joint Research Center for AI, KFUPM
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al.,2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets’ scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, highlighting the need for further future work improvements to ensure consistent and reliable performance. We release the code: this https URL and dataset: this https URL for the research community.
zh

[NLP-91] he Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants

【速读】: 该论文试图解决在大型专有语言模型主导的背景下,小型开源语言模型是否能够在广泛任务中保持竞争力的问题。其解决方案的关键在于提出了一种名为Avengers的框架,该框架通过四个轻量级操作实现:嵌入(将查询编码为向量)、聚类(根据语义相似性分组查询)、评分(评估每个模型在每个聚类中的性能)以及投票(通过重复采样和投票提升输出质量)。该方法利用开源小型模型的集体智能,在多个任务上实现了超越GPT-4.1的表现。

链接: https://arxiv.org/abs/2505.19797
作者: Yiqun Zhang,Hao Li,Chenxu Wang,Linyao Chen,Qiaosheng Zhang,Peng Ye,Shi Feng,Daling Wang,Zhen Wang,Xinrun Wang,Jia Xu,Lei Bai,Wanli Ouyang,Shuyue Hu
机构: Northeastern University (东北大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Northwestern Polytechnical University (西北工业大学); Beijing Institute of Technology (北京理工大学); The University of Tokyo (东京大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures, 6 tables, supplementary material (appendix) included separately

点击查看摘要

Abstract:As proprietary giants increasingly dominate the race for ever-larger language models, a pressing question arises for the open-source community: can smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers–a simple recipe that effectively leverages the collective intelligence of open-source, smaller language models. Our framework is built upon four lightweight operations: (i) embedding: encode queries using a text embedding model; (ii) clustering: group queries based on their semantic similarity; (iii) scoring: scores each model’s performance within each cluster; and (iv) voting: improve outputs via repeated sampling and voting. At inference time, each query is embedded and assigned to its nearest cluster. The top-performing model(s) within that cluster are selected to generate the response using the Self-Consistency or its multi-model variant. Remarkably, with 10 open-source models (~7B parameters each), the Avengers collectively outperforms GPT-4.1 on 10 out of 15 datasets (spanning mathematics, code, logic, knowledge, and affective tasks). In particular, it surpasses GPT-4.1 on mathematics tasks by 18.21% and on code tasks by 7.46%. Furthermore, the Avengers delivers superior out-of-distribution generalization, and remains robust across various embedding models, clustering algorithms, ensemble strategies, and values of its sole parameter–the number of clusters. We have open-sourced the code on GitHub: this https URL
zh

[NLP-92] Analyzing Political Bias in LLM s via Target-Oriented Sentiment Classification ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中编码的政治偏见对下游应用可能产生的负面影响问题。其解决方案的关键在于利用LLM在相同句子中对不同目标实体的情感预测存在变化这一现象,通过定义基于熵的不一致性度量来量化这种预测变异,并在多种语言和模型中进行验证,从而揭示政治倾向性偏见的存在及其特征。

链接: https://arxiv.org/abs/2505.19776
作者: Akram Elbouanani,Evan Dufraisse,Adrian Popescu
机构: Université Paris-Saclay, CEA, List(巴黎-萨克雷大学,法国原子能委员会,列表)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

点击查看摘要

Abstract:Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-size intermediate tasks (questionnaire answering or political content generation) and rely on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts.
zh

[NLP-93] What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLM s ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在长上下文场景下的安全漏洞问题,具体表现为通过多示例越狱(Many-Shot Jailbreaking, MSJ)攻击模型的安全机制。解决方案的关键在于揭示上下文长度是影响攻击有效性的主要因素,且成功的攻击并不依赖于精心设计的有害内容,即使重复示例或随机文本也能绕过模型的安全措施,这表明LLMs在处理长上下文时存在根本性的能力局限。

链接: https://arxiv.org/abs/2505.19773
作者: Sangyeop Kim,Yohan Lee,Yongwoo Song,Kimin Lee
机构: Coxwave(科斯瓦夫); Seoul National University (首尔国立大学); Kyung Hee University (庆熙大学); KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Accepted by ACL 2025

点击查看摘要

Abstract:We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments utilize context length of up to 128K tokens. Through comprehensive analysis with various many-shot attack settings with different instruction styles, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.
zh

[NLP-94] Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

【速读】: 该论文试图解决强化学习中基于人类反馈(RLHF)与直接偏好优化(DPO)在性能上的差距问题,特别是在存在表示差异(representation gap)的情况下。其解决方案的关键在于对性能差距进行细粒度的理论分析,将其分解为精确优化下的显式表示差距和有限样本下的隐式表示差距,并通过理论推导和实例构造揭示不同方法在不同模型误设条件下的表现差异,从而为实际应用中选择合适的方法提供依据。

链接: https://arxiv.org/abs/2505.19770
作者: Ruizhe Shi,Minhak Song,Runlong Zhou,Zihan Zhang,Maryam Fazel,Simon S. Du
机构: Tsinghua University (清华大学); KAIST (韩国科学技术院); University of Washington (华盛顿大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 30 pages, 5 figures

点击查看摘要

Abstract:We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model – highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
zh

[NLP-95] 2Agent A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search

【速读】: 该论文旨在解决现实世界中多模态虚假信息检测问题,此类虚假信息通常来源于混合的伪造来源,需要动态推理和适应性验证。现有方法主要依赖静态流水线和有限的工具使用,难以应对复杂性和多样性。解决方案的关键在于提出T2Agent,一个结合可扩展工具包与蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的虚假信息检测代理。该工具包包含模块化工具,并通过标准化模板实现无缝集成与扩展;同时引入基于贝叶斯优化的选择器以确定任务相关的工具子集,作为MCTS的行动空间,从而动态收集证据并进行多源验证。此外,T2Agent通过多源验证扩展传统MCTS,并引入双奖励机制以平衡探索与利用,提升检测效果。

链接: https://arxiv.org/abs/2505.19768
作者: Xing Cui,Yueying Zou,Zekun Li,Peipei Li,Xinyuan Xu,Xuannan Liu,Huaibo Huang,Ran He
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Real-world multimodal misinformation often arises from mixed forgery sources, requiring dynamic reasoning and adaptive verification. However, existing methods mainly rely on static pipelines and limited tool usage, limiting their ability to handle such complexity and diversity. To address this challenge, we propose T2Agent, a novel misinformation detection agent that incorporates an extensible toolkit with Monte Carlo Tree Search (MCTS). The toolkit consists of modular tools such as web search, forgery detection, and consistency analysis. Each tool is described using standardized templates, enabling seamless integration and future expansion. To avoid inefficiency from using all tools simultaneously, a Bayesian optimization-based selector is proposed to identify a task-relevant subset. This subset then serves as the action space for MCTS to dynamically collect evidence and perform multi-source verification. To better align MCTS with the multi-source nature of misinformation detection, T2Agent extends traditional MCTS with multi-source verification, which decomposes the task into coordinated subtasks targeting different forgery sources. A dual reward mechanism containing a reasoning trajectory score and a confidence score is further proposed to encourage a balance between exploration across mixed forgery sources and exploitation for more reliable evidence. We conduct ablation studies to confirm the effectiveness of the tree search mechanism and tool usage. Extensive experiments further show that T2Agent consistently outperforms existing baselines on challenging mixed-source multimodal misinformation benchmarks, demonstrating its strong potential as a training-free approach for enhancing detection accuracy. The code will be released.
zh

[NLP-96] SGM: A Framework for Building Specification-Guided Moderation Filters

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在部署过程中与特定需求对齐的问题,尤其是模型在面对对抗性输入(如越狱攻击)时的不稳定性。其解决方案的关键在于提出一种名为SGM(Specification-Guided Moderation)的框架,该框架基于用户自定义的规范进行训练,能够自动生成训练数据而无需依赖人工编写的示例,从而实现可扩展且细粒度的对齐控制。

链接: https://arxiv.org/abs/2505.19766
作者: Masoomali Fatehkia,Enes Altinisik,Husrev Taha Sencar
机构: Qatar Computing Research Institute, HBKU, Doha, Qatar
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) with deployment-specific requirements is critical but inherently imperfect. Despite extensive training, models remain susceptible to misalignment and adversarial inputs such as jailbreaks. Content moderation filters are commonly used as external safeguards, though they typically focus narrowly on safety. We introduce SGM (Specification-Guided Moderation), a flexible framework for training moderation filters grounded in user-defined specifications that go beyond standard safety concerns. SGM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals. SGM-trained filters perform on par with state-of-the-art safety filters built on curated datasets, while supporting fine-grained and user-defined alignment control.
zh

[NLP-97] CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement

【速读】: 该论文试图解决结构化代码注释生成中数据集构建所需的鲁棒质量度量问题,现有方法(SIDE、MIDQ、STASIS)在代码-注释分析方面存在局限。其解决方案的关键在于提出CIDRe,这是一种与语言无关的无参考质量标准,结合了四个协同作用的方面:相关性(代码-注释语义对齐)、信息性(功能覆盖)、完整性(所有结构部分的存在)和描述长度(细节充分性)。

链接: https://arxiv.org/abs/2505.19757
作者: Maria Dziuba,Valentin Malykh
机构: MTS AI; ITMO University; IITU University
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Effective generation of structured code comments requires robust quality metrics for dataset curation, yet existing approaches (SIDE, MIDQ, STASIS) suffer from limited code-comment analysis. We propose CIDRe, a language-agnostic reference-free quality criterion combining four synergistic aspects: (1) relevance (code-comment semantic alignment), (2) informativeness (functional coverage), (3) completeness (presence of all structure sections), and (4) description length (detail sufficiency). We validate our criterion on a manually annotated dataset. Experiments demonstrate CIDRe’s superiority over existing metrics, achieving improvement in cross-entropy evaluation. When applied to filter comments, the models finetuned on CIDRe-filtered data show statistically significant quality gains in GPT-4o-mini assessments.
zh

[NLP-98] Efficient Reasoning via Chain of Unconscious Thought

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在推理过程中因冗长的思维过程而导致的token效率低下的问题。其解决方案的关键在于提出一种新的推理范式,称为“无意识思维链”(Chain of Unconscious Thought, CoUT),该范式受到无意识思维理论(Unconscious Thought Theory, UTT)的启发,通过引导模型模仿人类的无意识思维并内化推理过程,从而提升token效率。具体而言,首先通过隐藏层引导模型进行内部化推理,随后设计一系列token高效策略以减少不必要的token消耗,同时保持模型性能。

链接: https://arxiv.org/abs/2505.19756
作者: Ruihan Gong,Yue Liu,Wenjie Qu,Mingzhe Du,Yufei He,Yingwei Ma,Yulin Chen,Xiang Liu,Yi Wen,Xinfeng Li,Ruidong Wang,Xinzhong Zhu,Bryan Hooi,Jiaheng Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs by guiding them to mimic human unconscious thought and internalize reasoning processes. Concretely, we first prompt the model to internalize the reasoning by thinking in the hidden layer. Then, we design a bag of token-efficient strategies to further help models reduce unnecessary tokens yet preserve the performance. Our work reveals that models may possess beneficial unconscious thought, enabling improved efficiency without sacrificing performance. Extensive experiments demonstrate the effectiveness of CoUT. Remarkably, it surpasses CoT by reducing token usage by 47.62% while maintaining comparable accuracy, as shown in Figure 1. The code of CoUT is available at this link: this https URL
zh

[NLP-99] NeuSym-RAG : Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering ACL2025

【速读】: 该论文旨在解决学术论文数量增加带来的研究者高效获取关键信息的挑战,以及传统检索增强生成(RAG)方法在整合神经网络与符号检索方面的不足。其解决方案的关键在于提出NeuSym-RAG,一个结合神经与符号检索的混合框架,通过多视角分块和基于模式的解析,将半结构化的PDF内容组织到关系型数据库和向量存储中,使大语言模型(LLM)代理能够迭代获取上下文以生成答案。

链接: https://arxiv.org/abs/2505.19754
作者: Ruisheng Cao,Hanchong Zhang,Tiancheng Huang,Zhangyi Kang,Yuxin Zhang,Liangtai Sun,Hanqi Li,Yuxun Miao,Shuai Fan,Lu Chen,Kai Yu
机构: Shanghai Jiao Tong University (上海交通大学); AISpeech Co., Ltd. (AISpeech公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 11 figures, 12 tables, accepted to ACL 2025 Long Main

点击查看摘要

Abstract:The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one AIRQA-REAL, show that NeuSym-RAG stably defeats both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at this https URL.
zh

[NLP-100] Discrete Markov Bridge

【速读】: 该论文试图解决现有离散数据建模方法在训练过程中依赖固定转移矩阵所带来的局限性,这限制了潜在表示的表达能力以及整体设计空间。其解决方案的关键在于提出一种名为Discrete Markov Bridge的新框架,该框架基于两个核心组件:矩阵学习(Matrix Learning)和得分学习(Score Learning),通过理论分析和实证验证,证明了其性能保障与收敛性,并在Text8和CIFAR-10数据集上展示了优越的生成效果。

链接: https://arxiv.org/abs/2505.19752
作者: Hengli Li,Yuxuan Wang,Song-Chun Zhu,Ying Nian Wu,Zilong Zheng
机构: Peking University (北京大学); Beijing Institute for General Artificial Intelligence (北京通用人工智能研究院); State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室); Tsinghua University (清华大学); University of California, Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Discrete diffusion has recently emerged as a promising paradigm in discrete data modeling. However, existing methods typically rely on a fixed rate transition matrix during training, which not only limits the expressiveness of latent representations, a fundamental strength of variational methods, but also constrains the overall design space. To address these limitations, we propose Discrete Markov Bridge, a novel framework specifically designed for discrete representation learning. Our approach is built upon two key components: Matrix Learning and Score Learning. We conduct a rigorous theoretical analysis, establishing formal performance guarantees for Matrix Learning and proving the convergence of the overall framework. Furthermore, we analyze the space complexity of our method, addressing practical constraints identified in prior studies. Extensive empirical evaluations validate the effectiveness of the proposed Discrete Markov Bridge, which achieves an Evidence Lower Bound (ELBO) of 1.38 on the Text8 dataset, outperforming established baselines. Moreover, the proposed model demonstrates competitive performance on the CIFAR-10 dataset, achieving results comparable to those obtained by image-specific generation approaches.
zh

[NLP-101] oken-level Accept or Reject: A Micro Alignment Approach for Large Language Models IJCAI2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)与人类偏好和价值观对齐的问题,以确保其应用的伦理性和安全性。现有对齐技术如RLHF或DPO通常需要在参数量达数十亿的LLMs上进行直接微调,导致计算成本高且效率低下。论文提出的解决方案是Micro token-level Accept-Reject Aligning (MARA),其关键在于将句子级别的偏好学习分解为令牌级别的二分类任务,通过一个简化的三层全连接网络判断候选令牌是否应被“接受”或“拒绝”,从而实现高效的模型对齐。

链接: https://arxiv.org/abs/2505.19743
作者: Yang Zhang,Yu Yu,Bo Tang,Yu Zhu,Chuxiong Sun,Wenqiang Wei,Jie Hu,Zipeng Xie,Zhiyu Li,Feiyu Xiong,Edward Chung
机构: Hong Kong Polytechnic University, Hong Kong SAR, China; MemTensor (Shanghai) Technology Co., Ltd, Shanghai, China; University of Science and Technology of China, Suzhou Institute for Advanced Research, Suzhou, China; University of Science and Technology of China, Hefei, China; China Telecom Corporation Limited Beijing Research Institute, Beijing, China; Nanjing University of Information Science and Technology, Nanjing, China
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

点击查看摘要

Abstract:With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose Micro token-level Accept-Reject Aligning (MARA) approach designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are “Accepted” or “Rejected” as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs.
zh

[NLP-102] Distilling Closed-Source LLM s Knowledge for Locally Stable and Economic Biomedical Entity Linking

【速读】: 该论文旨在解决生物医学实体链接(Biomedical Entity Linking)在低资源场景下的问题,即传统监督方法依赖大量标注数据,限制了其应用。解决方案的关键在于提出“RPDR”框架,该框架结合封闭源代码的大语言模型(LLMs)和开源LLMs,通过提示封闭源代码LLM从未标注数据生成训练数据,并微调开源LLM进行候选实体重排序,从而有效将知识蒸馏至可本地部署的开源LLM,避免了稳定性问题和高昂的经济成本。

链接: https://arxiv.org/abs/2505.19722
作者: Yihao Ai,Zhiyuan Ning,Weiwei Dai,Pengfei Wang,Yi Du,Wenjuan Cui,Kunpeng Liu,Yuanchun Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICIC 2025

点击查看摘要

Abstract:Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer, limiting their usage in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address these but risk stability issues and high economic costs: using these models is restricted by commercial companies and brings significant economic costs when dealing with large amounts of data. To address this, we propose ``RPDR’', a framework combining closed-source LLMs and open-source LLMs for re-ranking candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and fine-tuning an open-source LLM for re-ranking, we effectively distill the knowledge to the open-source LLM that can be deployed locally, thus avoiding the stability issues and the problem of high economic costs. We evaluate RPDR on two datasets, including one real-world dataset and one publicly available dataset involving two languages: Chinese and English. RPDR achieves 0.019 Acc@1 improvement and 0.036 Acc@1 improvement on the Aier dataset and the Ask A Patient dataset when the amount of training data is not enough. The results demonstrate the superiority and generalizability of the proposed framework.
zh

[NLP-103] Graceful Forgetting in Generative Language Models

【速读】: 该论文试图解决预训练语言模型在微调过程中存在的负迁移(negative transfer)问题,即部分预训练阶段获得的知识可能对下游任务的微调产生不利影响。解决方案的关键在于提出一种名为“Learning With Forgetting (LWF)”的框架,通过Fisher Information Matrix对参数更新进行加权,计算遗忘置信度以评估自生成知识与遗忘任务的相关性,并在微调过程中周期性地剔除高置信度的无关知识,从而实现优雅遗忘(graceful forgetting)。

链接: https://arxiv.org/abs/2505.19715
作者: Chunyang Jiang,Chi-min Chan,Yiyang Cai,Yulong Liu,Wei Xue,Yike Guo
机构: HKUST (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model would promote both effectiveness and efficiency of downstream tasks fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
zh

[NLP-104] MT3: Scaling MLLM -based Text Image Machine Translation via Multi-Task Reinforcement Learning

【速读】: 该论文旨在解决文本图像机器翻译(Text Image Machine Translation, TIMT)中的复杂挑战,包括准确的光学字符识别(OCR)、鲁棒的视觉-文本推理以及高质量的翻译,传统方法通常依赖于多阶段级联流水线。其解决方案的关键在于引入MT³,这是首个将多任务强化学习(Multi-Task RL)应用于多模态大语言模型(MLLMs)以实现端到端TIMT的框架。MT³采用多任务优化范式,针对文本识别、上下文感知推理和翻译三个关键子技能进行训练,并通过新颖的多混合奖励机制,提供细粒度、非二值化的跨任务反馈,从而提升模型性能。

链接: https://arxiv.org/abs/2505.19714
作者: Zhaopeng Feng,Yupu Liang,Shaosheng Cao,Jiayuan Su,Jiahan Ren,Zhe Xu,Yao Hu,Wenxuan Huang,Jian Wu,Zuozhu Liu
机构: Zhejiang University (浙江大学); University of Chinese Academy of Sciences (中国科学院大学); Xiaohongshu Inc. (小红书公司); East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Text Image Machine Translation (TIMT)-the task of translating textual content embedded in images-is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT ^3 , the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT ^3 adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT’s intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduced XHSPost, the first social media TIMT benchmark. Our MT ^3 -7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
zh

[NLP-105] Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多跳和推理密集型任务中容易产生幻觉的问题。其解决方案的关键在于引入PathFinder-PRM,这是一种分层、误差感知的判别式过程奖励模型(Process Reward Model, PRM),它通过在每个中间步骤分类数学错误和一致性错误,并结合这些细粒度信号来估计步骤的正确性,从而引导生成过程向连贯解题方向发展。

链接: https://arxiv.org/abs/2505.19706
作者: Tej Deep Pala,Panshul Sharma,Amir Zadeh,Chuan Li,Soujanya Poria
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.
zh

[NLP-106] Leverag ing Importance Sampling to Detach Alignment Modules from Large Language Models

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在实际应用中需要高质量、可定制化输出,但传统对齐方法通常需要重新训练预训练模型,导致难以快速适应和优化的问题。解决方案的关键在于提出一种新颖的残差对齐模型(Residual Alignment Model, RAM),将对齐过程形式化为重要性采样的类型,其中未对齐的上游模型作为提议分布,对齐过程则通过自回归对齐模块作为重要性权重估计器进行二次采样,从而实现对齐模块与目标对齐模型的自然解耦,提升灵活性和可扩展性。

链接: https://arxiv.org/abs/2505.19700
作者: Yi Liu,Dianqing Liu,Mingye Zhu,Junbo Guo,Yongdong Zhang,Zhendong Mao
机构: State Key Laboratory of Communication Content Cognition, People’s Daily Online, Beijing, China (国家通信内容认知重点实验室,人民日报网络版,中国北京); University of Science and Technology of China, Hefei, China (中国科学技术大学,中国合肥)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel \textitResidual Alignment Model (\textitRAM) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.
zh

[NLP-107] Large Language Models for Planning : A Comprehensive and Systematic Survey

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在规划任务中的广泛应用潜力尚未得到系统性研究的问题,旨在探索其在智能体规划能力中的适用性和优化路径。解决方案的关键在于提出一种全面的分类框架,将当前基于LLM的规划方法划分为三种主要类型:外部模块增强方法、微调方法和搜索方法,分别通过整合额外组件、利用轨迹数据与反馈信号进行模型调整以及分解任务并优化解码策略来提升规划性能。

链接: https://arxiv.org/abs/2505.19683
作者: Pengfei Cao,Tianyi Men,Wencan Liu,Jingwen Zhang,Xuzhao Li,Xixun Lin,Dianbo Sui,Yanan Cao,Kang Liu,Jun Zhao
机构: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, CASIA (中国科学院认知与决策智能复杂系统重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Harbin Institute of Technology (哈尔滨工业大学); Institute of Information Engineering, Chinese Academy of Sciences (中国信息工程研究所); School of Automation, Beijing Institute of Technology (北京理工大学自动化学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Planning represents a fundamental capability of intelligent agents, requiring comprehensive environmental understanding, rigorous logical reasoning, and effective sequential decision-making. While Large Language Models (LLMs) have demonstrated remarkable performance on certain planning tasks, their broader application in this domain warrants systematic investigation. This paper presents a comprehensive review of LLM-based planning. Specifically, this survey is structured as follows: First, we establish the theoretical foundations by introducing essential definitions and categories about automated planning. Next, we provide a detailed taxonomy and analysis of contemporary LLM-based planning methodologies, categorizing them into three principal approaches: 1) External Module Augmented Methods that combine LLMs with additional components for planning, 2) Finetuning-based Methods that involve using trajectory data and feedback signals to adjust LLMs in order to improve their planning abilities, and 3) Searching-based Methods that break down complex tasks into simpler components, navigate the planning space, or enhance decoding strategies to find the best solutions. Subsequently, we systematically summarize existing evaluation frameworks, including benchmark datasets, evaluation metrics and performance comparisons between representative planning methods. Finally, we discuss the underlying mechanisms enabling LLM-based planning and outline promising research directions for this rapidly evolving field. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this field.
zh

[NLP-108] KITs Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization

【速读】: 该论文旨在解决低资源语言对下的语音翻译(Speech Translation, ST)任务,特别是在缺乏平行语料的情况下提升系统性能。其关键解决方案是通过预训练模型进行微调,并结合合成数据生成与模型正则化技术来高效利用有限资源。具体而言,研究探索了基于机器翻译(MT)增强的语音翻译方法,以及通过文本到语音模型生成合成语音数据以提升自动语音识别(ASR)和ST性能。此外,采用内部蒸馏(intra-distillation)技术进一步优化模型表现,并通过最小贝叶斯风险解码融合级联系统与端到端系统,从而在多个任务和模型上实现性能提升。

链接: https://arxiv.org/abs/2505.19679
作者: Zhaolin Li,Yining Liu,Danni Liu,Tuan Nam Nguyen,Enes Yavuz Ugan,Tu Anh Dinh,Carlos Mullov,Alexander Waibel,Jan Niehues
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents KIT’s submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
zh

[NLP-109] Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

【速读】: 该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即生成的响应在语义上看似合理,但与输入图像的相关性极低。解决方案的关键在于提出一种新的条件点互信息(Conditional Pointwise Mutual Information, C-PMI)校准解码策略,通过自适应增强生成文本与输入图像之间的互依赖性来缓解幻觉。该方法不仅考虑文本标记的采样,还联合建模视觉和文本标记对C-PMI的贡献,将幻觉缓解建模为一个双层优化问题,以最大化互信息。

链接: https://arxiv.org/abs/2505.19678
作者: Hao Fang,Changle Zhou,Jiawei Kong,Kuofeng Gao,Bin Chen,Tao Liang,Guojun Ma,Shu-Tao Xia
机构: Tsinghua Shenzhen Internation Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Pengcheng Labortary (鹏城实验室); ByteDance (字节跳动)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs’ over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
zh

[NLP-110] Calibrating Pre-trained Language Classifiers on LLM -generated Noisy Labels via Iterative Refinement

【速读】: 该论文试图解决由生成式 AI (Generative AI) 生成的噪声标签对自然语言处理 (NLP) 模型泛化能力造成的负面影响问题。解决方案的关键在于提出 SiDyP:一种基于简单形标签扩散与动态先验的方法,通过在文本嵌入空间中检索潜在的真实标签候选,并利用简单形扩散模型迭代优化噪声候选,从而校准分类器的预测,提升其对 LLM 生成噪声标签的鲁棒性。

链接: https://arxiv.org/abs/2505.19675
作者: Liqin Ye,Agam Shah,Chao Zhang,Sudheer Chava
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The traditional process of creating labeled datasets is labor-intensive and expensive. Recent breakthroughs in open-source large language models (LLMs) have opened up a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks, providing an alternative to such an expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from noisy labels, the model’s generalization is likely to be harmed as it is prone to overfit to those label noises. While previous studies in learning from noisy labels mainly focus on synthetic noise and real-world noise, LLM-generated label noise receives less attention. In this paper, we propose SiDyP: Simplex Label Diffusion with Dynamic Prior to calibrate the classifier’s prediction, thus enhancing its robustness towards LLM-generated noisy labels. SiDyP retrieves potential true label candidates by neighborhood label distribution in text embedding space and iteratively refines noisy candidates using a simplex diffusion model. Our framework can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot LLM-generated noisy label datasets by an average of 7.21% and 7.30% respectively. We demonstrate the effectiveness of SiDyP by conducting extensive benchmarking for different LLMs over a variety of NLP tasks. Our code is available on Github.
zh

[NLP-111] Comparing Moral Values in Western English-speaking societies and LLM s with Word Associations ACL2025

【速读】: 该论文试图解决如何准确评估大型语言模型(Large Language Models, LLMs)所反映的道德价值观的问题,因为直接通过提示词进行评估存在人类规范泄露和提示词敏感性等挑战。解决方案的关键在于利用词联想(word associations),这些联想已被证明能反映人类的道德推理,并作为低层次基础表示来获得更稳健的LLMs道德推理图景。通过构建LLM生成的词联想数据集,并基于道德基础理论的种子词传播道德价值观,最终比较不同来源的道德概念化,揭示英语使用者与LLM联想之间的系统性差异。

链接: https://arxiv.org/abs/2505.19674
作者: Chaoyi Xiang,Chunhua Liu,Simon De Deyne,Lea Frermann
机构: The University of Melbourne(墨尔本大学)
类目: Computation and Language (cs.CL)
备注: 9 pages,7 figures. Accepted to the ACL 2025 conference

点击查看摘要

Abstract:As the impact of large language models increases, understanding the moral values they reflect becomes ever more important. Assessing the nature of moral values as understood by these models via direct prompting is challenging due to potential leakage of human norms into model training data, and their sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs’ moral reasoning. We study moral differences in associations from western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of LLM-generated word associations, resembling an existing data set of human word associations. Next, we propose a novel method to propagate moral values based on seed words derived from Moral Foundation Theory through the human and LLM-generated association graphs. Finally, we compare the resulting moral conceptualizations, highlighting detailed but systematic differences between moral values emerging from English speakers and LLM associations.
zh

[NLP-112] Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models

【速读】: 该论文试图解决大型音频语言模型(Large Audio Language Models, LALMs)在面对有害查询时的安全对齐不足问题,以及现有基于监督微调(Supervised Fine-tuning, SFT)的防御措施在提升安全性的同时容易导致过度拒绝(over-rejection)的问题。解决方案的关键在于提出一种无监督的安全微调策略,通过重塑模型的表示空间,在增强LALMs安全对齐的同时平衡过度拒绝的风险。

链接: https://arxiv.org/abs/2505.19670
作者: Hao Yang,Lizhen Qu,Ehsan Shareghi,Gholamreza Haffari
机构: Monash University (莫纳什大学)
类目: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety-alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety dataset specifically targeting LALMs are notably absent. Meanwhile defence measures based on Supervised Fine-tuning (SFT) struggle to address safety improvement while avoiding over-rejection issues, significantly compromising helpfulness. In this work, we propose an unsupervised safety-fine-tuning strategy as remedy that reshapes model’s representation space to enhance existing LALMs safety-alignment while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALMs safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.
zh

[NLP-113] LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation

【速读】: 该论文旨在解决法律咨询领域中专业人员短缺导致的高成本和可及性不足问题,同时应对现有大型语言模型(Large Language Models, LLMs)在处理真实世界法律咨询的交互性和知识密集性方面的不足。其解决方案的关键在于构建一个名为LeCoDe的真实多轮法律咨询基准数据集,该数据集包含3,696个法律咨询对话,共计110,008轮对话,并通过法律专家的严格标注提升数据的专业性。此外,论文还提出了一套全面的评估框架,从澄清能力和专业建议质量两个维度对LLMs的法律咨询能力进行评估,从而为提升模型在专业咨询场景中的表现提供依据。

链接: https://arxiv.org/abs/2505.19667
作者: Weikang Yuan,Kaisong Song,Zhuoren Jiang,Junjie Cao,Yujie Zhang,Jun Lin,Kun Kuang,Ji Zhang,Xiaozhong Liu
机构: Zhejiang University (浙江大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团); Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs’ legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. The rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs’ consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs’ legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.
zh

[NLP-114] GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models

【速读】: 该论文旨在解决开放域问答(OpenQA)中两个关键问题:如何有效将知识整合到大型语言模型(LLMs)中,以及如何在不同任务场景下自适应生成特定格式的答案。解决方案的关键在于提出一种名为GenKI的框架,该框架通过同时探索知识整合(Knowledge Integration)和可控生成(controllable Generation)来提升OpenQA性能。具体而言,GenKI首先训练一个密集段落检索模型以从知识库中获取相关知识,随后引入一种新的知识整合模型,在微调过程中将检索到的知识融入指令中以增强模型能力,并利用经过微调的LLM和基于文本一致性的集成方法实现可控生成。

链接: https://arxiv.org/abs/2505.19660
作者: Tingjia Shen,Hao Wang,Chuan Qin,Ruijun Sun,Yang Song,Defu Lian,Hengshu Zhu,Enhong Chen
机构: University of Science and Technology of China(中国科学技术大学); Chinese Academy of Sciences(中国科学院); BOSS Zhipin Career Science Lab(BOSS直聘职业科学实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve the OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieval knowledge into instructions during fine-tuning to intensify the model. Furthermore, to enable controllable generation in LLMs, we leverage a certain fine-tuned LLM and an ensemble based on text consistency incorporating all coherence, fluency, and answer format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the effectiveness of GenKI with comparison of state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model’s ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at this https URL
zh

[NLP-115] Select Read and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation ACL2025

【速读】: 该论文试图解决自动相关工作生成(Automatic Related Work Generation, RWG)中存在的浅层理解问题以及参考文献之间关系捕捉不足的问题。现有方法由于仅使用参考文献的部分内容作为输入,并且对每个参考文献进行孤立解释,导致无法有效捕捉文献间的关联性。解决方案的关键在于提出一种基于全文的RWG任务框架,该框架包含三个代理:选择器(用于决定下一步阅读的论文部分)、阅读器(用于消化所选部分并更新共享工作记忆)和生成器(基于最终整理的记忆生成相关工作部分)。此外,为更好地捕捉参考文献之间的关系,还提出了两种图感知策略,以优化阅读顺序并考虑图结构约束。

链接: https://arxiv.org/abs/2505.19647
作者: Xiaochuan Liu,Ruihua Song,Xiting Wang,Xu Chen
机构: Renmin University of China (中国人民大学); Gaoling School of Artificial Intelligence (高瓴人工智能学院); Engineering Research Center of Next-Generation Intelligent Search and Recommendation (下一代智能搜索与推荐工程研究中心); Ministry of Education (教育部); Beijing Key Laboratory of Research on Large Models and Intelligent Governance (北京大型模型与智能治理研究实验室)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 (Findings)

点击查看摘要

Abstract:Automatic related work generation (RWG) can save people’s time and effort when writing a draft of related work section (RWS) for further revision. However, existing methods for RWG always suffer from shallow comprehension due to taking the limited portions of references papers as input and isolated explanation for each reference due to ineffective capturing the relationships among them. To address these issues, we focus on full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers is going to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for selector, enabling to optimize the reading order with constrains of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at this https URL.
zh

[NLP-116] SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond

【速读】: 该论文旨在解决如何有效提升大型语言模型(Large Language Models, LLMs)的通用推理能力,特别是在开放源代码环境中,针对逻辑推理数据的稀缺性和可验证性不足的问题。其解决方案的关键在于提出SynLogic,一个数据合成框架和数据集,能够大规模生成多样化的逻辑推理数据,涵盖35种不同的逻辑推理任务,并支持难度和数量的可控调整。所有示例均可通过简单规则验证,使其非常适合用于具有可验证奖励的强化学习(Reinforcement Learning, RL)训练,从而显著提升了模型在逻辑推理任务上的表现。

链接: https://arxiv.org/abs/2505.19641
作者: Junteng Liu,Yuanxiang Fan,Zhuo Jiang,Han Ding,Yongyi Hu,Chi Zhang,Yiqi Shi,Shitong Weng,Aili Chen,Shiqi Chen,Yunan Huang,Mozhi Zhang,Pengyu Zhao,Junjie Yan,Junxian He
机构: The Hong Kong University of Science and Technology (香港科技大学); MiniMax; The City University of Hong Kong (香港城市大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at this https URL.
zh

[NLP-117] Interleaved Reasoning for Large Language Models via Reinforcement Learning

【速读】: 该论文试图解决长链式思维(Long chain-of-thought, CoT)在大型语言模型(Large Language Models, LLM)中导致的推理效率低下和首次标记时间(Time-to-First-Token, TTFT)增加的问题。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的训练范式,通过引导模型在多跳问题中交错进行思考与回答,从而提升推理效率和准确性。该方法利用基于规则的奖励机制激励正确的中间步骤,借助交错推理过程中生成的中间信号指导策略模型走向正确的推理路径。

链接: https://arxiv.org/abs/2505.19640
作者: Roy Xie,David Qiu,Deepak Gopinath,Dong Lin,Yanchao Sun,Chong Wang,Saloni Potdar,Bhuwan Dhingra
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long chain-of-thought (CoT) significantly enhances large language models’ (LLM) reasoning capabilities. However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments conducted across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves up to 19.3% in Pass@1 accuracy. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization ability to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to reveal several valuable insights into conditional reward modeling.
zh

[NLP-118] Faster and Better LLM s via Latency-Aware Test-Time Scaling

【速读】: 该论文旨在解决在延迟敏感场景下,现有Test-Time Scaling (TTS)方法在计算最优性与延迟之间的矛盾问题,即计算最优的TTS并不总是能实现最低延迟。解决方案的关键在于提出两种优化并发配置的方法:(1)分支级并行性,通过多个并发推理分支提升效率;(2)序列级并行性,通过推测解码实现。通过整合这两种方法并合理分配计算资源,实现了延迟最优的TTS,从而在保证准确率的同时显著降低推理时间。

链接: https://arxiv.org/abs/2505.19634
作者: Zili Wang,Tianyu Zhang,Haoli Bai,Lu Hou,Xianzhi Yu,Wulong Liu,Shiming Xiang,Lei Zhu
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a latency-aware evaluation of representative TTS methods, we demonstrate that a compute-optimal TTS does not always result in the lowest latency in scenarios where latency is critical. To address this gap and achieve latency-optimal TTS, we propose two key approaches by optimizing the concurrency configurations: (1) branch-wise parallelism, which leverages multiple concurrent inference branches, and (2) sequence-wise parallelism, enabled by speculative decoding. By integrating these two approaches and allocating computational resources properly to each, our latency-optimal TTS enables a 32B model to reach 82.3% accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4% within 10 seconds. Our work emphasizes the importance of latency-aware TTS and demonstrates its ability to deliver both speed and accuracy in latency-sensitive scenarios.
zh

[NLP-119] Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

【速读】: 该论文旨在解决无监督词分割(word segmentation)问题,同时评估大型语言模型(Large Language Models, LLMs)的语义理解能力。其解决方案的关键在于利用LLMs的强大语言理解和模式识别能力,结合改进的无监督方法LLACA(Large Language Model-Inspired Aho-Corasick Automaton),通过动态n-gram模型与上下文信息的自适应调整,提升词分割效果,并在多个语言上验证了LLMs的“理解”能力。

链接: https://arxiv.org/abs/2505.19631
作者: Zihong Zhang,Liqi He,Zuchao Li,Lefei Zhang,Hai Zhao,Bo Du
机构: Wuhan University (武汉大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of “comprehend first, segment later”, we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs’ “comprehension”. Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA ( \textbfL arge \textbfL anguage Model-Inspired \textbfA ho- \textbfC orasick \textbfA utomaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n -gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at this https URL
zh

[NLP-120] DoctorAg ent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在真实临床咨询中应用时的核心挑战,即现有系统依赖单向信息传递模式,导致当患者症状描述模糊时无法提供具体诊断建议,以及基于监督学习的传统多轮对话方法受限于静态数据驱动范式,缺乏泛化能力且难以智能提取关键临床信息。解决方案的关键在于提出DoctorAgent-RL,一个基于强化学习(Reinforcement Learning, RL)的多智能体协作框架,将医疗咨询建模为不确定性下的动态决策过程,通过医生智能体与患者智能体的多轮交互,在RL框架内持续优化提问策略,并根据咨询评估器的综合奖励动态调整信息收集路径,从而实现与临床推理逻辑对齐的自主交互策略。

链接: https://arxiv.org/abs/2505.19630
作者: Yichun Feng,Jiawei Wang,Lu Zhou,Yixue Li
机构: Guangzhou National Laboratory(广州国家实验室); School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学交叉科学学院); Department of EEIS, University of Science and Technology of China(中国科学技术大学电子与信息工程系); Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences(中国科学院上海营养与健康研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Existing systems rely on a one-way information transmission mode where patients must fully describe their symptoms in a single round, leading to nonspecific diagnostic recommendations when complaints are vague. Traditional multi-turn dialogue methods based on supervised learning are constrained by static data-driven paradigms, lacking generalizability and struggling to intelligently extract key clinical information. To address these limitations, we propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance, demonstrating practical value in assisting clinical consultations. this https URL
zh

[NLP-121] HomeBench: Evaluating LLM s in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在智能家庭助手应用中面对复杂现实场景时的局限性,特别是处理无效指令和多设备同时控制的问题。解决方案的关键在于引入HomeBench,这是首个包含单设备和多设备有效及无效指令的智能家庭数据集,旨在推动LLM在复杂家庭环境中的性能提升。

链接: https://arxiv.org/abs/2505.19628
作者: Silin Li,Yuhang Guo,Jiashu Yao,Zeming Liu,Haifeng Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These have two main challenges: LLMs must accurately identify and rectify errors in user instructions and execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices in this paper. We have experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at this https URL.
zh

[NLP-122] hink Again! The Effect of Test-Time Compute on Preferences Opinions and Beliefs of Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)是否表现出主观偏好、观点和信念的问题,以及这些倾向可能源于模型中的偏见,进而影响其行为和对用户建议的中立性。解决方案的关键是提出一个名为Preference, Opinion, and Belief survey (POBs)的基准测试,用于评估LLMs在社会、文化、伦理和个人领域中的主观倾向,并通过该基准测试评估主流开源和闭源LLMs的可靠性、中立性和一致性。此外,研究还探讨了增加推理和自我反思机制在测试阶段计算资源对这些指标的影响,但结果显示这些机制在该领域效果有限。

链接: https://arxiv.org/abs/2505.19621
作者: George Kour,Itay Nakash,Ateret Anaby-Tavor,Michal Shmueli-Scheuer
机构: IBM Research AI (IBM 研究院人工智能)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it’s crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs’ subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: this https URL
zh

[NLP-123] Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

【速读】: 该论文试图解决跨语言对齐是否适用于语音基础模型的问题,特别是验证文本中的跨语言对齐方法是否能在语音模型中同样有效。其解决方案的关键在于通过发音控制实验和词级数据集的对照实验,证明了语音模型在语义层面具备跨语言对齐能力,并且编码器中同时包含了语音和语义知识。此外,研究还利用早期退出的编码器输出进行定性分析,揭示了语音翻译中语义错误与源语言词汇的语音相似性之间的关联,并将这一发现应用于七种低资源语言的语音识别任务,显著提升了识别准确率。

链接: https://arxiv.org/abs/2505.19606
作者: Ryan Soh-Eun Shim,Domenico De Cristofaro,Chengzhi Martin Hu,Alessandro Vietti,Barbara Plank
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.
zh

[NLP-124] Evaluating Machine Translation Models for English-Hindi Language Pairs: A Comparative Analysis

【速读】: 该论文旨在解决跨语言翻译中,尤其是在英语与印地语这一语种差异较大的语言对之间,机器翻译模型的有效性评估问题。其解决方案的关键在于利用一个包含18000+条平行语料的英语-印地语语料库以及一个来自政府网站的定制问答(FAQ)数据集,通过多种自动评估指标(包括词汇性和基于机器学习的指标)全面分析不同机器翻译模型在通用和专业语言领域中的表现。

链接: https://arxiv.org/abs/2505.19604
作者: Ahan Prasannakumar Shetty
机构: National Institute of Technology Karnataka(卡纳塔克邦国立技术学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine translation has become a critical tool in bridging linguistic gaps, especially between languages as diverse as English and Hindi. This paper comprehensively evaluates various machine translation models for translating between English and Hindi. We assess the performance of these models using a diverse set of automatic evaluation metrics, both lexical and machine learning-based metrics. Our evaluation leverages an 18000+ corpus of English Hindi parallel dataset and a custom FAQ dataset comprising questions from government websites. The study aims to provide insights into the effectiveness of different machine translation approaches in handling both general and specialized language domains. Results indicate varying performance levels across different metrics, highlighting strengths and areas for improvement in current translation systems.
zh

[NLP-125] Preference Optimization by Estimating the Ratio of the Data Distribution

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)与人类偏好对齐的问题,其核心挑战在于如何在不依赖奖励模型或分区函数的情况下,有效且稳定地优化策略模型以匹配目标策略。解决方案的关键在于提出一种广义的直接偏好优化(Direct Preference Optimization, DPO)损失,从似然比估计的角度实现策略分布的唯一识别,从而保留方法的简洁性与理论保障。论文进一步提出了Bregman偏好优化(Bregman Preference Optimization, BPO)框架,通过比率匹配实现目标策略最优,并提供了可计算的损失函数形式,使其实现更加高效。此外,论文还引入了缩放巴苏幂发散(Scaled Basu’s Power Divergence, SBA)作为梯度缩放方法,增强了BPO实例的性能。实验表明,BPO在生成保真度与多样性之间实现了协同提升,优于其他概率损失扩展方法。

链接: https://arxiv.org/abs/2505.19601
作者: Yeongmin Kim,Heesun Bae,Byeonghu Na,Il-Chul Moon
机构: Korea Advanced Institute of Science and Technology (KAIST); 2summary.ai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as f -PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu’s power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as f -DPO or f -PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-Instruct-8B, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9% length-controlled win rate on AlpacaEval2.
zh

[NLP-126] Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

【速读】: 该论文试图解决语言模型在处理特定语法点时的细微能力评估问题,尤其是针对非英语语言中的罕见语法规则,如日语中的“第一人称心理谓语限制”(first person psych predicate restriction)。现有评估方法无法有效捕捉这些细微能力,因此论文通过测量语言模型在面对该语法点时的困惑度(perplexity)来进行评估。解决方案的关键在于识别并改进模型的分词(tokenization)质量,实验表明,通过限制测试句子为具有良好分词表现的句子,可以显著降低模型在语法正确句子上的困惑度,从而更准确地反映模型的语法理解能力。

链接: https://arxiv.org/abs/2505.19599
作者: Andrew Gambardella,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
机构: University of Tokyo(东京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

点击查看摘要

Abstract:Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the “first person psych predicate restriction” grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab’s uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3’s perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
zh

[NLP-127] Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study

【速读】: 该论文试图解决大型音频-语言模型(Large Audio-Language Models, LALMs)在面对恶意音频注入攻击时的鲁棒性不足问题,旨在评估其在不同攻击场景下的脆弱性和抗攻击能力。解决方案的关键在于构建一个系统化的基准框架,通过定量分析模型在多种攻击类型中的表现,揭示模型鲁棒性的差异,并提出将鲁棒性整合到训练流程中的必要性,同时强调开发多模态防御机制和解耦能力与易感性的架构设计的重要性。

链接: https://arxiv.org/abs/2505.19598
作者: Guanyu Hou,Jiaming He,Yinhang Zhou,Ji Guo,Yitong Qiao,Rui Zhang,Wenbo Jiang
机构: University of Electronic Science and Technology of China (中国电子科技大学); Chengdu University of Technology (成都理工大学); Sun Yat-Sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics like Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, their vulnerabilities and resilience were quantitatively assessed. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALMs deployment.
zh

[NLP-128] Multi-Agent Collaboration via Evolving Orchestration

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂问题求解中因模块化结构导致的可扩展性和效率受限的问题。现有方法多依赖静态组织结构,难以适应任务复杂度和代理数量的增长,从而产生协调开销和低效性。论文提出的解决方案是采用一种“提线木偶”风格的范式,其中中央协调器(“提线者”)动态引导代理(“木偶”)以响应任务状态的变化。该协调器通过强化学习进行训练,以自适应地排序和优先级划分代理,实现灵活且可演化的集体推理。关键在于通过协调器的演化,促使更紧凑、循环的推理结构出现,从而提升性能并降低计算成本。

链接: https://arxiv.org/abs/2505.19591
作者: Yufan Dang,Chen Qian,Xueheng Luo,Jingru Fan,Zihao Xie,Ruijie Shi,Weize Chen,Cheng Yang,Xiaoyin Che,Ye Tian,Xuantang Xiong,Lei Han,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Beijing University of Posts and Telecommunications (北京邮电大学); Siemens (西门子); Tencent Robotics X (腾讯机器人实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Work in Progress

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator (“puppeteer”) dynamically directs agents (“puppets”) in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator’s evolution.
zh

[NLP-129] Learning to Reason without External Rewards

【速读】: 该论文试图解决在复杂推理任务中训练大规模语言模型(Large Language Models, LLMs)时依赖于昂贵且领域特定的监督信号的问题。其解决方案的关键在于提出Intuitor,这是一种基于内部反馈的强化学习(Reinforcement Learning from Internal Feedback, RLIF)方法,该方法仅使用模型自身的置信度(称为自证性,self-certainty)作为唯一的奖励信号,从而实现了完全无监督的学习。

链接: https://arxiv.org/abs/2505.19590
作者: Xuandong Zhao,Zhewei Kang,Aosong Feng,Sergey Levine,Dawn Song
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model’s own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO’s performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at this https URL
zh

[NLP-130] ailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

【速读】: 该论文旨在解决生成式大语言模型(LLM)中键值(Key-Value, KV)缓存带来的显著内存开销问题。现有方法通过将KV缓存卸载或压缩来减轻这一负担,但完整加载缓存会因CPU-GPU通信中的PCIe带宽瓶颈导致显著延迟,而过度压缩则会导致性能明显下降。论文的关键解决方案是提出一种混合压缩方法TailorKV,该方法通过结合量化与卸载技术,利用不同层对全局信息和主导激活token的不同需求,实现主导token的加载与所有token的量化之间的互补,从而在极端压缩条件下实现几乎无损失的性能表现。

链接: https://arxiv.org/abs/2505.19586
作者: Dingyu Yao,Bowen Shen,Zheng Lin,Wei Liu,Jian Luan,Bin Wang,Weiping Wang
机构: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学网络空间安全学院); Xiaomi AI Lab, Beijing, China(小米人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations exhibit that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Particularly, the Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.
zh

[NLP-131] Accelerating Prefilling for Long-Context LLM s via Sparse Pattern Sharing

【速读】: 该论文旨在解决传统稀疏注意力(sparse attention)方法在长上下文推理的预填充阶段因依赖预定义模式或不准确估计而无法充分捕捉注意力动态行为,从而导致效率和准确性下降的问题。其解决方案的关键在于提出一种高度精确的稀疏注意力机制,该机制通过在不同注意力头之间共享相似且精确的注意力模式,实现了对注意力动态行为更真实的建模,同时仅需对少量注意力头进行全注意力计算,从而在保持高精度的同时显著提升计算效率。

链接: https://arxiv.org/abs/2505.19578
作者: Dan Peng,Zhihui Fu,Zewen Ye,Zhuoran Song,Jun Wang
机构: OPPO Research Institute (OPPO 研究院); Zhejiang University (浙江大学); Shanghai Jiaotong University (上海交通大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely on predefined patterns or inaccurate estimations to approximate attention behavior, they often fail to fully capture the true dynamics of attention, resulting in reduced efficiency and compromised accuracy. Instead, we propose a highly accurate sparse attention mechanism that shares similar yet precise attention patterns across heads, enabling a more realistic capture of the dynamic behavior of attention. Our approach is grounded in two key observations: (1) attention patterns demonstrate strong inter-head similarity, and (2) this similarity remains remarkably consistent across diverse inputs. By strategically sharing computed accurate patterns across attention heads, our method effectively captures actual patterns while requiring full attention computation for only a small subset of heads. Comprehensive evaluations demonstrate that our approach achieves superior or comparable speedup relative to state-of-the-art methods while delivering the best overall accuracy.
zh

[NLP-132] DocMEdit: Towards Document-Level Model Editing ACL2025

【速读】: 该论文试图解决现有模型编辑数据集在实际应用中的局限性,即大多数数据集仅要求模型输出短语或句子,而忽视了现实世界中广泛存在的文档级任务,从而影响了模型编辑方法的实用性和有效性。解决方案的关键在于提出一个面向文档级模型编辑的基准数据集\benchmarkname,该数据集具有文档级输入和输出、可推广性以及单次编辑包含多个事实的特点,旨在提升模型在真实场景下的编辑能力。

链接: https://arxiv.org/abs/2505.19572
作者: Li Zeng,Zeming Liu,Chong Feng,Heyan Huang,Yuhang Guo
机构: Beijing Institute of Technology (北京理工大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 findings

点击查看摘要

Abstract:Model editing aims to correct errors and outdated knowledge in the Large language models (LLMs) with minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooks the widespread existence of document-level tasks in the real world, raising doubts about their practical usability. Aimed at addressing this limitation and promoting the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce \benchmarkname, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolative, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.
zh

[NLP-133] Automated Text-to-Table for Reasoning -Intensive Table QA: Pipeline Design and Benchmarking Insights

【速读】: 该论文旨在解决表格式问答(Table Question Answering, QA)任务中因依赖昂贵的手动标注真实数据和表格结构异质性导致的评估方法不完善问题。其解决方案的关键在于提出一种自动化生成管道AutoT2T,该管道能够将数学应用题转化为基于表格的推理任务,从而无需人工标注,并生成多种表格变体以支持鲁棒性评估。通过此方法构建的新基准TabularGSM系统地覆盖了不同复杂度的表格和陷阱问题,揭示了推理与检索或识别过程之间的紧密耦合是大型语言模型在复杂表式QA任务中表现不佳的关键因素。

链接: https://arxiv.org/abs/2505.19563
作者: Shi-Yu Tian,Zhi Zhou,Wei Dong,Ming Yang,Kun-Yang Yu,Zi-Jian Cheng,Lan-Zhe Guo,Yu-Feng Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Paper under review, code and dataset are all available

点击查看摘要

Abstract:Reasoning with tabular data holds increasing importance in modern applications, yet comprehensive evaluation methodologies for reasoning-intensive Table Question Answering (QA) tasks remain nascent. Existing research is constrained by two primary bottlenecks: 1) Reliance on costly manually annotated real-world data, which is difficult to cover complex reasoning scenarios; 2) The heterogeneity of table structures hinders systematic analysis of the intrinsic mechanisms behind the underperformance of LLMs, especially in reasoning-intensive tasks. To address these issues, we propose an automated generation pipeline AutoT2T that transforms mathematical word problems into table-based reasoning tasks, eliminating the need for manual annotation. The pipeline can generate multiple variants of a table for the same reasoning problem, including noisy versions to support robustness evaluation. Based on this, we construct a new benchmark TabularGSM, which systematically spans a range of table complexities and trap problems. Experimental analyses through AutoT2T and TabularGSM reveal that the tight coupling between reasoning and retrieval or identification processes is a key factor underlying the failure of LLMs in complex Table QA tasks. This highlights the necessity for models to develop synergistic reasoning capabilities in order to perform effectively in complex Table QA tasks.
zh

[NLP-134] owards Multi-Granularity Memory Association and Selection for Long-Term Conversational Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在长期对话中因上下文窗口限制而难以维持连贯的对话记忆和提供个性化回复的问题。现有方法依赖于单粒度的记忆分割与检索,难以捕捉深层次的记忆关联,导致有用信息部分丢失或引入大量噪声。其解决方案的关键在于提出MemGAS框架,通过构建多粒度关联、自适应选择与检索机制,利用高斯混合模型对新旧记忆进行聚类与关联,并通过基于熵的路由策略动态选择最优粒度,结合LLM进行记忆精炼,从而提升长期记忆任务的性能。

链接: https://arxiv.org/abs/2505.19549
作者: Derong Xu,Yi Wen,Pengyue Jia,Yingyi Zhang,wenlin zhang,Yichao Wang,Huifeng Guo,Ruiming Tang,Xiangyu Zhao,Enhong Chen,Tong Xu
机构: University of Science and Technology of China; City University of Hong Kong; Huawei Noah’s Ark Lab
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answer and retrieval tasks, achieving superior performance across different query types and top-K settings.
zh

[NLP-135] How Syntax Specialization Emerges in Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练过程中如何发展出对句法结构的内部专长这一问题,特别是其形成机制及影响因素。解决方案的关键在于通过追踪句法敏感性的演变过程,量化不同句法现象中最小对立对的内部句法一致性,从而揭示句法敏感性逐渐出现、集中在特定层并经历“关键期”的发展轨迹。这一过程在不同架构和初始化参数下具有一致性,并受到模型规模和训练数据的影响。

链接: https://arxiv.org/abs/2505.19548
作者: Xufeng Duan,Zhaoqian Yao,Yunhao Zhang,Shaonan Wang,Zhenguang G. Cai
机构: The Chinese University of Hong Kong (香港中文大学); Chinese Academy of Sciences (中国科学院); Brain and Mind Institute, The Chinese University of Hong Kong (脑与心智研究所,香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been found to develop surprising internal specializations: Individual neurons, attention heads, and circuits become selectively sensitive to syntactic structure, reflecting patterns observed in the human brain. While this specialization is well-documented, how it emerges during training and what influences its development remains largely unknown. In this work, we tap into the black box of specialization by tracking its formation over time. By quantifying internal syntactic consistency across minimal pairs from various syntactic phenomena, we identify a clear developmental trajectory: Syntactic sensitivity emerges gradually, concentrates in specific layers, and exhibits a ‘critical period’ of rapid internal specialization. This process is consistent across architectures and initialization parameters (e.g., random seeds), and is influenced by model scale and training data. We therefore reveal not only where syntax arises in LLMs but also how some models internalize it during training. To support future research, we will release the code, models, and training checkpoints upon acceptance. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19548 [cs.CL] (or arXiv:2505.19548v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.19548 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-136] DoctorRAG : Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients

【速读】: 该论文试图解决现有医疗检索增强生成(RAG)系统主要依赖医疗知识库,而忽视了从相似患者病例中衍生出的临床经验知识这一关键问题,这限制了系统在模拟人类临床推理方面的能力。解决方案的关键在于提出DoctorRAG框架,该框架通过整合显式临床知识与隐式案例经验,模拟医生的推理过程。其核心创新包括:为查询和知识源分配概念标签,并结合来自相关知识和患者的混合检索机制以提升检索精度;以及集成Med-TextGrad模块,利用多智能体文本梯度确保最终输出符合检索到的知识和患者查询。

链接: https://arxiv.org/abs/2505.19538
作者: Yuxing Lu,Gecheng Fu,Wei Wu,Xukai Zhao,Sin Yee Goi,Jinzhuo Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: 32 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases – a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags for queries and knowledge sources, together with a hybrid retrieval mechanism from both relevant knowledge and patient. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.
zh

[NLP-137] FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态理解中因冗余视觉标记导致的高计算成本问题。现有剪枝方法通常依赖单层注意力得分来评估和剪除冗余视觉标记,但这种方法未能充分考虑标记与层之间的复杂交互。论文的关键解决方案是提出FlowCut,一个基于信息流感知的剪枝框架,通过分析信息在不同层间的流动来识别冗余标记,从而更准确地反映模型内部行为,提升剪枝效率与效果。

链接: https://arxiv.org/abs/2505.19536
作者: Jintao Tong,Wenwei Jin,Pengda Qin,Anqi Li,Yixiong Zou,Yuhong Li,Yuhua Li,Ruixuan Li
机构: Huazhong University of Science and Technology (华中科技大学); Xiaohongshu Inc. (小红书公司); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 11 figures

点击查看摘要

Abstract:Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model’s inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at this https URL
zh

[NLP-138] Small Language Models: Architectures Techniques Evaluation Problems and Future Adaptation

【速读】: 该论文旨在解决如何构建紧凑、高效且高性能的小语言模型(Small Language Models, SLMs)的问题。其解决方案的关键在于对SLMs的设计框架、训练方法以及降低模型规模和复杂性的技术进行全面评估,并提出一种新的分类体系来组织优化方法,包括剪枝(pruning)、量化(quantization)和模型压缩(model compression)等策略。此外,论文还构建了一个严谨的评估平台,以衡量SLMs的能力,并探讨了该领域尚未解决的重要挑战,如效率与性能之间的权衡问题。

链接: https://arxiv.org/abs/2505.19529
作者: Tanjil Hasan Sakib,Md. Tanzib Hosain,Md. Kishor Morol
机构: Cornell University (康奈尔大学); American International University-Bangladesh (美国国际大学-孟加拉国)
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Small Language Models (SLMs) have gained substantial attention due to their ability to execute diverse language tasks successfully while using fewer computer resources. These models are particularly ideal for deployment in limited environments, such as mobile devices, on-device processing, and edge systems. In this study, we present a complete assessment of SLMs, focussing on their design frameworks, training approaches, and techniques for lowering model size and complexity. We offer a novel classification system to organize the optimization approaches applied for SLMs, encompassing strategies like pruning, quantization, and model compression. Furthermore, we assemble SLM’s studies of evaluation suite with some existing datasets, establishing a rigorous platform for measuring SLM capabilities. Alongside this, we discuss the important difficulties that remain unresolved in this sector, including trade-offs between efficiency and performance, and we suggest directions for future study. We anticipate this study to serve as a beneficial guide for researchers and practitioners who aim to construct compact, efficient, and high-performing language models.
zh

[NLP-139] AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection

【速读】: 该论文旨在解决隐性仇恨言论(implicit hate speech)检测的问题,该问题因其微妙性和对上下文解释的依赖而具有挑战性。现有方法主要依赖对比学习,尽管在区分仇恨与非仇恨句子方面表现出一定效果,但未能模拟人类通过识别文本中的具体目标并分析其与上下文关系来检测隐性仇恨的推理过程。论文提出的解决方案AmpleHate的关键在于模仿人类的推理机制,通过预训练的命名实体识别模型识别显性目标,并利用[CLS]标记捕捉隐性目标信息,进而计算显性、隐性目标与句子上下文之间的注意力关系,并将这些关系向量直接注入最终的句子表示中,从而增强目标-上下文关系的关键信号以提升隐性仇恨检测的效果。

链接: https://arxiv.org/abs/2505.19528
作者: Yejin Lee,Joonghyuk Hahn,Hyeseon Ahn,Yo-Sub Han
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 4 figures, Under Review

点击查看摘要

Abstract:Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which are shown to be effective on distinguishing hate and non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these target relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit target using a pretrained Named Entity Recognition model and capture implicit target information via [CLS] tokens. It computes attention-based relationships between explicit, implicit targets and sentence context and then, directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieve faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.
zh

[NLP-140] Bias in Political Dialogue: Tagging U.S. Presidential Debates with an Extended DAMSL Framework

【速读】: 该论文试图解决政治对话中偏见驱动和对抗性话语特征的系统性分析问题,特别是针对2024年美国总统大选辩论中唐纳德·特朗普(Donald Trump)的修辞策略进行批判性话语分析。其解决方案的关键在于提出一种新的标注框架BEADS(Bias Enriched Annotation for Dialogue Structure),该框架在DAMSL框架基础上进行了系统扩展,能够捕捉政治传播中的意识形态框架、情感诉求和对抗性策略,从而实现对话语结构的多维度建模与分析。

链接: https://arxiv.org/abs/2505.19515
作者: Lavanya Prahallad,Radhika Mamidi
机构: International Institute of Information Technology, Hyderabad, India
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:We present a critical discourse analysis of the 2024 U.S. presidential debates, examining Donald Trump’s rhetorical strategies in his interactions with Joe Biden and Kamala Harris. We introduce a novel annotation framework, BEADS (Bias Enriched Annotation for Dialogue Structure), which systematically extends the DAMSL framework to capture bias driven and adversarial discourse features in political communication. BEADS includes a domain and language agnostic set of tags that model ideological framing, emotional appeals, and confrontational tactics. Our methodology compares detailed human annotation with zero shot ChatGPT assisted tagging on verified transcripts from the Trump and Biden (19,219 words) and Trump and Harris (18,123 words) debates. Our analysis shows that Trump consistently dominated in key categories: Challenge and Adversarial Exchanges, Selective Emphasis, Appeal to Fear, Political Bias, and Perceived Dismissiveness. These findings underscore his use of emotionally charged and adversarial rhetoric to control the narrative and influence audience perception. In this work, we establish BEADS as a scalable and reproducible framework for critical discourse analysis across languages, domains, and political contexts.
zh

[NLP-141] SIPDO: Closed-Loop Prompt Optimization via Synthetic Data Feedback

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中提示质量对模型性能的影响问题,特别是现有方法在固定数据集上优化提示,假设输入分布静态且难以支持迭代改进的局限性。解决方案的关键在于提出SIPDO(Self-Improving Prompts through Data-Augmented Optimization),这是一个闭环框架,将合成数据生成整合到提示优化过程中,通过合成数据生成器与提示优化器的协同工作,持续揭示提示的弱点并逐步优化提示,从而实现无需外部监督或新任务的系统性性能提升。

链接: https://arxiv.org/abs/2505.19514
作者: Yaoning Yu,Ye Yu,Kai Wei,Haojing Luo,Haohan Wang
机构: University of Illinois at Urbana–Champaign (伊利诺伊大学厄巴纳-香槟分校); University of South Florida (南佛罗里达大学); iDreamer.ai (iDreamer.ai)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prompt quality plays a critical role in the performance of large language models (LLMs), motivating a growing body of work on prompt optimization. Most existing methods optimize prompts over a fixed dataset, assuming static input distributions and offering limited support for iterative improvement. We introduce SIPDO (Self-Improving Prompts through Data-Augmented Optimization), a closed-loop framework for prompt learning that integrates synthetic data generation into the optimization process. SIPDO couples a synthetic data generator with a prompt optimizer, where the generator produces new examples that reveal current prompt weaknesses and the optimizer incrementally refines the prompt in response. This feedback-driven loop enables systematic improvement of prompt performance without assuming access to external supervision or new tasks. Experiments across question answering and reasoning benchmarks show that SIPDO outperforms standard prompt tuning methods, highlighting the value of integrating data synthesis into prompt learning workflows.
zh

[NLP-142] Causal Distillation: Transferring Structured Explanations from Large to Compact Language Models

【速读】: 该论文试图解决大体量专有语言模型在因果推理能力上的优势难以被小型开源模型复现的问题。其解决方案的关键在于提出一种新的知识蒸馏框架,通过让小型模型生成与教师模型一致的结构化因果解释,从而迁移因果推理技能。该框架的核心思想是训练小型模型生成符合教师模型因果逻辑的因果解释,以提升其因果推理能力。

链接: https://arxiv.org/abs/2505.19511
作者: Aggrey Muhebwa,Khalid K. Osman
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large proprietary language models exhibit strong causal reasoning abilities that smaller open-source models struggle to replicate. We introduce a novel framework for distilling causal explanations that transfers causal reasoning skills from a powerful teacher model to a compact open-source model. The key idea is to train the smaller model to develop causal reasoning abilities by generating structured cause-and-effect explanations consistent with those of the teacher model. To evaluate the quality of the student-generated explanations, we introduce a new metric called Causal Explanation Coherence (CEC) to assess the structural and logical consistency of causal reasoning. This metric uses sentence-level semantic alignment to measure how well each part of the generated explanation corresponds to the teacher’s reference, capturing both faithfulness and coverage of the underlying causal chain. Our framework and the CEC metric provide a principled foundation for training smaller models to perform robust causal reasoning and for systematically assessing the coherence of explanations in language model outputs.
zh

[NLP-143] LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多模态环境中对场景图(scene graph)的理解与生成能力不足的问题。其解决方案的关键在于引入Text-Scene Graph (TSG) Bench基准,用于系统评估LLMs在理解场景图以及从文本叙述中生成场景图方面的性能。通过该基准,研究揭示了当前模型在复杂叙述的场景图生成任务中存在显著瓶颈,主要表现为无法有效分解离散场景,从而为未来改进场景图生成方法提供了重要方向。

链接: https://arxiv.org/abs/2505.19510
作者: Dongil Yang,Minjin Kim,Sunghwan Kim,Beong-woo Kwak,Minjun Park,Jinseok Hong,Woontack Woo,Jinyoung Yeo
机构: Yonsei University (延世大学); KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs’ ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs’ ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at this https URL. Additionally, our code and evaluation data are publicly available at this https URL.
zh

[NLP-144] DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在API访问环境下易被知识蒸馏(Knowledge Distillation, KD)技术模仿的问题。现有保护方法如水印技术仅能事后识别模仿行为,而其他防御机制假设学生模型模仿教师模型的内部logits,无法有效应对仅通过观察输出文本进行的蒸馏攻击。论文提出的解决方案关键在于一种高效的防御性输出生成(Defensive Output Generation, DOGe)策略,该策略通过对抗性损失微调教师模型的最终线性层,使输出在保持对合法用户有用的同时,对蒸馏过程产生误导,从而显著降低模仿效果。

链接: https://arxiv.org/abs/2505.19504
作者: Pingzhi Li,Zhen Tan,Huaizhi Qu,Huan Liu,Tianlong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code is available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD).In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher’s internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method’s effectiveness as a practical safeguard against KD-based model imitation.
zh

[NLP-145] Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents

【速读】: 该论文旨在解决跨语言检索问题,特别是针对使用英文查询检索梵文文献(如《博伽梵歌》章节)的挑战。其解决方案的关键在于采用三阶段方法:直接检索(DR)、基于翻译的检索(DT)和查询翻译(QT),通过共享嵌入空间和先进的翻译方法,在检索增强生成(RAG)框架中提升系统的性能。研究还对最先进的模型进行了微调,以适应梵语的语言特征,并利用摘要技术改进问答处理,最终证明基于翻译的方法在处理古代文本的跨语言挑战方面优于其他方法。

链接: https://arxiv.org/abs/2505.19494
作者: Manoj Balaji Jagadeeshan,Prince Raj,Pawan Goyal
机构: Indian Institute of Technology, Kharagpur (印度理工学院,克哈格普尔分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Translation-based Retrieval (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a RAG framework. The study fine-tunes state-of-the-art models for Sanskrit’s linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, aiming to preserve Sanskrit scriptures and share their philosophical importance widely. Our dataset is publicly available at this https URL
zh

[NLP-146] CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在的文化偏见问题,这种偏见忽视了低资源地区的价值观和语言多样性,可能损害普遍平等并加剧刻板印象和歧视。解决方案的关键在于提出CulFiT,一种基于多语言数据和细粒度奖励建模的文化敏感性训练范式,通过合成多样化的文化相关问题、构建文化相关语言的批判数据,并利用细粒度奖励将文化文本分解为可验证的知识单元,以实现可解释的评估。

链接: https://arxiv.org/abs/2505.19484
作者: Ruixiang Feng,Shen Gao,Xiuying Chen,Lisi Chen,Shuo Shang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit a specific cultural biases, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality, but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse cultural-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalCultureQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalCultureQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.
zh

[NLP-147] Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection

【速读】: 该论文旨在解决预训练语言模型在未标注、分布外数据上的适应性问题,尤其是在结构新颖的推理任务中表现不佳的问题。其解决方案的关键在于引入一种名为VDS-TTT(Verifier-Driven Sample Selection for Test-Time Training)的框架,通过学习一个验证器(verifier)对生成的回答进行评分,并仅选择高排名的伪标签样本进行微调适应。该方法通过仅微调低秩LoRA适配器参数,实现了高效的模型适应与快速收敛,从而提升了模型在实时场景下的性能。

链接: https://arxiv.org/abs/2505.19475
作者: Mohammad Mahdi Moradi,Hossam Amer,Sudhir Mudur,Weiwei Zhang,Yang Liu,Walid Ahmed
机构: Concordia University (康奈迪亚大学); Huawei Technologies (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Learning to adapt pretrained language models to unlabeled, out-of-distribution data is a critical challenge, as models often falter on structurally novel reasoning tasks even while excelling within their training distribution. We introduce a new framework called VDS-TTT - Verifier-Driven Sample Selection for Test-Time Training to efficiently address this. We use a learned verifier to score a pool of generated responses and select only from high ranking pseudo-labeled examples for fine-tuned adaptation. Specifically, for each input query our LLM generates N candidate answers; the verifier assigns a reliability score to each, and the response with the highest confidence and above a fixed threshold is paired with its query for test-time training. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence. Our proposed self-supervised framework is the first to synthesize verifier driven test-time training data for continuous self-improvement of the model. Experiments across three diverse benchmarks and three state-of-the-art LLMs demonstrate that VDS-TTT yields up to a 32.29% relative improvement over the base model and a 6.66% gain compared to verifier-based methods without test-time training, highlighting its effectiveness and efficiency for on-the-fly large language model adaptation.
zh

[NLP-148] Balancing Computation Load and Representation Expressivity in Parallel Hybrid Neural Networks

【速读】: 该论文旨在解决混合架构中序列和并行结构所带来的性能瓶颈问题,特别是在计算负载不平衡和知识表示表达性不足方面。其关键解决方案是提出一种新型的并行混合网络架构FlowHN,该架构通过FLOP感知的动态令牌分割策略实现注意力机制与状态空间模型(SSM)分支之间的计算负载均衡,并采用一种融合方法处理来自不同分支的高度差异化的输出以增强表示的表达能力。这些创新使得FlowHN在保持高精度的同时显著提升了令牌处理速度和模型FLOPs利用率。

链接: https://arxiv.org/abs/2505.19472
作者: Mohammad Mahdi Moradi,Walid Ahmed,Shuangyue Wen,Sudhir Mudur,Weiwei Zhang,Yang Liu
机构: Concordia University (康考迪亚大学); Huawei Technologies (华为技术有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Attention and State-Space Models (SSMs) when combined in a hybrid network in sequence or in parallel provide complementary strengths. In a hybrid sequential pipeline they alternate between applying a transformer to the input and then feeding its output into a SSM. This results in idle periods in the individual components increasing end-to-end latency and lowering throughput caps. In the parallel hybrid architecture, the transformer operates independently in parallel with the SSM, and these pairs are cascaded, with output from one pair forming the input to the next. Two issues are (i) creating an expressive knowledge representation with the inherently divergent outputs from these separate branches, and (ii) load balancing the computation between these parallel branches, while maintaining representation fidelity. In this work we present FlowHN, a novel parallel hybrid network architecture that accommodates various strategies for load balancing, achieved through appropriate distribution of input tokens between the two branches. Two innovative differentiating factors in FlowHN include a FLOP aware dynamic token split between the attention and SSM branches yielding efficient balance in compute load, and secondly, a method to fuse the highly divergent outputs from individual branches for enhancing representation expressivity. Together they enable much better token processing speeds, avoid bottlenecks, and at the same time yield significantly improved accuracy as compared to other competing works. We conduct comprehensive experiments on autoregressive language modeling for models with 135M, 350M, and 1B parameters. FlowHN outperforms sequential hybrid models and its parallel counterpart, achieving up to 4* higher Tokens per Second (TPS) and 2* better Model FLOPs Utilization (MFU).
zh

[NLP-149] BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在逻辑密集、精度要求高的领域(如金融、法律和医疗)中可靠性评估的难题。其关键解决方案是引入BizFinBench,这是一个专门针对实际金融应用场景设计的基准测试集,包含6,781个中文标注查询,涵盖数值计算、推理、信息抽取、预测识别和基于知识的问题回答五个维度,并采用IteraJudge这一新型LLM评估方法,以减少LLM作为评估者时的偏差。

链接: https://arxiv.org/abs/2505.19457
作者: Guilong Lu,Xuntao Guo,Rongjunchen Zhang,Wenqiao Zhu,Ji Liu
机构: HiThink Research(慧思科技); Harbin Institute of Technology(哈尔滨工业大学)
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at this https URL.
zh

[NLP-150] Vibe Coding vs. Agent ic Coding: Fundamentals and Practical Implications of Agent ic AI

【速读】: 该论文试图解决AI辅助软件开发中两种新兴范式——vibe coding与agentic coding之间的差异及其应用场景问题,旨在明确它们在自主性、架构设计及开发者角色方面的根本区别。解决方案的关键在于提出一个涵盖概念基础、执行模型、反馈循环、安全机制、调试策略和实际工具生态的详细分类体系,并通过对比工作流分析与20个具体用例,揭示vibe系统在早期原型设计与教育中的优势,以及agentic系统在企业级自动化、代码库重构和持续集成/持续交付(CI/CD)整合中的优越性。此外,论文还探讨了混合架构的发展趋势,强调通过自然语言接口与自主执行流水线的结合,推动未来可信、可解释和协作式的agentic AI系统建设。

链接: https://arxiv.org/abs/2505.19443
作者: Ranjan Sapkota,Konstantinos I. Roumeliotis,Manoj Karkee
机构: Cornell University (康奈尔大学); University of the Peloponnese (希腊伯罗奔尼撒大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 35 Pages, 8 Figures, 6 Tables

点击查看摘要

Abstract:This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.
zh

[NLP-151] he Birth of Knowledge: Emergent Features across Time Space and Scale in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中可解释的类别特征如何在训练过程、模型结构和规模维度上出现的问题。其解决方案的关键在于利用稀疏自编码器(sparse autoencoders)进行机制可解释性分析,从而识别神经激活中特定语义概念出现的时间、位置及规模特性,揭示了特征涌现的时空规律以及早期层特征在后期层中的重新激活现象,对传统关于Transformer模型表征动态的假设提出了挑战。

链接: https://arxiv.org/abs/2505.19440
作者: Shashata Sawmya,Micah Adler,Nir Shavit
机构: Massachusetts Institute of Technology (麻省理工学院); Red Hat, Inc. (红帽公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and varying model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.
zh

[NLP-152] Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

【速读】: 该论文试图解决在数学问题求解任务中,由于获取真实答案(ground truth)困难、成本高或不可行,导致大型语言模型(Large Language Models, LLMs)训练受限的问题。其解决方案的关键在于利用格式和长度作为替代信号(surrogate signals)来训练LLMs,而非依赖传统的真值数据。研究显示,基于格式正确性的奖励函数在早期阶段即可实现与标准GRPO算法相当的性能,而在后期引入长度奖励后,所提出的GRPO方法在某些场景下不仅达到甚至超越了依赖真值答案的标准GRPO算法性能,证明了无标签训练的有效性。

链接: https://arxiv.org/abs/2505.19439
作者: Rihui Xin,Han Liu,Zecheng Wang,Yupeng Zhang,Dianbo Sui,Xiaolin Hu,Bingning Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes unfeasible. This research delves into the utilization of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth this http URL study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but surpasses the performance of the standard GRPO algorithm relying on ground truth answers in certain scenarios, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems and reducing the dependence on extensive ground truth data collection, but also reveals the essence of why our label-free approach succeeds: base model is like an excellent student who has already mastered mathematical and logical reasoning skills, but performs poorly on the test paper, it simply needs to develop good answering habits to achieve outstanding results in exams , in other words, to unlock the capabilities it already possesses.
zh

[NLP-153] ask Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多步骤交互中表现不佳的问题,具体表现为幻觉、重复动作或误解用户修正,这主要是由于其依赖线性、非结构化的上下文。解决方案的关键在于引入任务记忆引擎(Task Memory Engine, TME),它通过构建基于图的结构替代传统的扁平化上下文,实现持续的多轮推理。TME采用动态任务图(树或有向无环图)来映射用户输入与子任务,并跟踪任务依赖关系,从而提升对用户意图的理解和修正能力。

链接: https://arxiv.org/abs/2505.19436
作者: Ye Ye
机构: New York University (纽约大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review. 9 pages main content, 15 pages appendix, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) falter in multi-step interactions – often hallucinating, repeating actions, or misinterpreting user corrections – due to reliance on linear, unstructured context. This fragility stems from the lack of persistent memory to track evolving goals and task dependencies, undermining trust in autonomous agents. We introduce the Task Memory Engine (TME), a modular memory controller that transforms existing LLMs into robust, revision-aware agents without fine-tuning. TME implements a spatial memory framework that replaces flat context with graph-based structures to support consistent, multi-turn reasoning. Departing from linear concatenation and ReAct-style prompting, TME builds a dynamic task graph – either a tree or directed acyclic graph (DAG) – to map user inputs to subtasks, align them with prior context, and enable dependency-tracked revisions. Its Task Representation and Intent Management (TRIM) component models task semantics and user intent to ensure accurate interpretation. Across four multi-turn scenarios-trip planning, cooking, meeting scheduling, and shopping cart editing – TME eliminates 100% of hallucinations and misinterpretations in three tasks, and reduces hallucinations by 66.7% and misinterpretations by 83.3% across 27 user turns, outperforming ReAct. TME’s modular design supports plug-and-play deployment and domain-specific customization, adaptable to both personal assistants and enterprise automation. We release TME’s codebase, benchmarks, and components as open-source resources, enabling researchers to develop reliable LLM agents. TME’s scalable architecture addresses a critical gap in agent performance across complex, interactive settings.
zh

[NLP-154] Route to Reason : Adaptive Routing for LLM and Reasoning Strategy Selection

【速读】: 该论文试图解决语言模型(Language Model, LM)在复杂推理任务中因测试时扩展(test-time scaling)导致的高计算成本和“过度思考”问题,即模型陷入“思维陷阱”。解决方案的关键在于提出了一种名为Route-To-Reason (RTR) 的统一路由框架,该框架在预算约束下根据任务难度动态分配语言模型和推理策略,并通过学习专家模型和推理策略的压缩表示,在推理阶段实现它们的联合自适应选择,从而实现了低成本、高灵活性且可扩展的端到端推理系统。

链接: https://arxiv.org/abs/2505.19435
作者: Zhihong Pan,Kai Zhang,Yuze Zhao,Yupeng Han
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The inherent capabilities of a language model (LM) and the reasoning strategies it employs jointly determine its performance in reasoning tasks. While test-time scaling is regarded as an effective approach to tackling complex reasoning tasks, it incurs substantial computational costs and often leads to “overthinking”, where models become trapped in “thought pitfalls”. To address this challenge, we propose Route-To-Reason (RTR), a novel unified routing framework that dynamically allocates both LMs and reasoning strategies according to task difficulty under budget constraints. RTR learns compressed representations of both expert models and reasoning strategies, enabling their joint and adaptive selection at inference time. This method is low-cost, highly flexible, and can be seamlessly extended to arbitrary black-box or white-box models and strategies, achieving true plug-and-play functionality. Extensive experiments across seven open source models and four reasoning strategies demonstrate that RTR achieves an optimal trade-off between accuracy and computational efficiency among all baselines, achieving higher accuracy than the best single model while reducing token usage by over 60%.
zh

[NLP-155] Deriving Strategic Market Insights with Large Language Models : A Benchmark for Forward Counterfactual Generation

【速读】: 该论文试图解决在动态金融市场中,如何通过前向反事实推理(forward counterfactual reasoning)来预测潜在的市场发展,从而为利益相关者提供风险与机会的洞察问题。传统方法在大规模应用时面临认知负担过重的挑战,因此亟需自动化解决方案。论文提出的解决方案关键在于引入了一个新的基准测试集Fin-Force-FINancial FORward Counterfactual Evaluation,通过整理金融新闻标题并提供结构化评估,支持基于大型语言模型(Large Language Models, LLMs)的前向反事实生成,从而实现对市场发展的可扩展和自动化探索。

链接: https://arxiv.org/abs/2505.19430
作者: Keane Ong,Rui Mao,Deeksha Varshney,Paul Pu Liang,Erik Cambria,Gianmarco Mengaldo
机构: National University of Singapore (新加坡国立大学); Massachusetts Institute of Technology (麻省理工学院); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form-forward counterfactual reasoning-focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force-FINancial FORward Counterfactual Evaluation. By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.
zh

[NLP-156] Rhapsody: A Dataset for Highlight Detection in Podcasts

【速读】: 该论文试图解决长时长口语媒体中自动识别关键片段(highlight)的问题,这一任务对于帮助用户快速获取播客内容的核心信息具有重要意义。解决方案的关键在于构建了一个包含13,000个播客节目及其基于YouTube“最常重播”功能生成的段落级高亮评分的数据集Rhapsody,并将其建模为一个段落级二分类任务。研究还表明,通过领域内数据微调的语言模型能够显著优于零样本提示方法,且结合语音信号特征与转录文本的多模态方法在性能上更具优势。

链接: https://arxiv.org/abs/2505.19429
作者: Younghan Park,Anuj Diwan,David Harwath,Eunsol Choi
机构: Yonsei University (延世大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps viewers get the gist of an episode and decide if they want to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube’s ‘most replayed’ feature. We frame the podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight finetuned language models using segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models finetuned with in-domain data significantly outperform their zero-shot performance. The finetuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges for fine-grained information access in long-form spoken media.
zh

[NLP-157] Frictional Agent Alignment Framework: Slow Down and Dont Break Things ACL2025

【速读】: 该论文试图解决在协作交互中,由于对话者信念潜在不一致而导致的对齐问题。传统偏好对齐方法如DPO在静态场景中表现良好,但在动态协作任务中因对话者信念的显式信号稀疏且偏差较大而效果不佳。解决方案的关键在于提出摩擦代理对齐框架(Frictional Agent Alignment Framework, FAAF),通过生成精确、上下文感知的“摩擦”来促使反思和重新审视现有证据。FAAF的核心是双玩家目标,其分离了数据偏差:摩擦状态策略用于识别信念不一致,而干预策略则生成符合协作者偏好的回应。该框架通过解析解实现单一策略的训练,提升了模型在生成简洁可解释摩擦及分布外泛化能力方面的性能。

链接: https://arxiv.org/abs/2505.19428
作者: Abhijnan Nath,Carine Graff,Andrei Bachinin,Nikhil Krishnaswamy
机构: Colorado State University (科罗拉多州立大学)
类目: Computation and Language (cs.CL)
备注: 48 pages (main paper: 10 pages incl. Limitations and Acknowledgments; references: 6 pages; appendix: 32 pages), 9 figures, 12 tables, appearing in Proceedings of ACL 2025, Vienna, Austria

点击查看摘要

Abstract:AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF), to generate precise, context-aware “friction” that prompts for deliberation and re-examination of existing evidence. FAAF’s two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive “thought partners” – not passive responders – FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at this https URL.
zh

[NLP-158] he Role of Diversity in In-Context Learning for Large Language Models

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在上下文学习(In-context Learning, ICL)中示例选择的问题,特别是现有方法主要关注示例与查询的相似性而忽视了示例多样性的影响。解决方案的关键在于引入一种注重多样性的示例选择方法,通过实验证明该方法在复杂任务如数学和代码问题上能够提升模型性能,并增强对分布外查询的鲁棒性。

链接: https://arxiv.org/abs/2505.19426
作者: Wenyang Xiao,Haoyu Zhao,Lingxiao Huang
机构: Nanjing Universtiy (南京大学); Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages

点击查看摘要

Abstract:In-context learning (ICL) is a crucial capability of current large language models (LLMs), where the selection of examples plays a key role in performance. While most existing approaches focus on selecting the most similar examples to the query, the impact of diversity in example selection remains underexplored. We systematically investigate the role of diversity in in-context example selection through experiments across a range of tasks, from sentiment classification to more challenging math and code problems. Experiments on Llama-3.1, Gemma-2, and Mistral-v0.3 families of models show that diversity-aware selection methods improve performance, particularly on complex tasks like math and code, and enhance robustness to out-of-distribution queries. To support these findings, we introduce a theoretical framework that explains the benefits of incorporating diversity in in-context example selection.
zh

[NLP-159] Self-Reflective Planning with Knowledge Graphs: Enhancing LLM Reasoning Reliability for Question Answering

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理过程中因内部知识不足而产生的幻觉问题,以及现有结合知识图谱(Knowledge Graphs, KGs)的方法生成的推理路径不完整或事实不一致的问题。解决方案的关键在于提出一种自省式规划(Self-Reflective Planning, SRP)框架,该框架通过迭代的、参考引导的推理过程,结合LLMs与KGs,实现对推理路径的动态调整与优化,从而提升推理的准确性和可靠性。

链接: https://arxiv.org/abs/2505.19410
作者: Jiajun Zhu,Ye Liu,Meikai Bao,Kai Zhang,Yanghai Zhang,Qi Liu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, yet they remain prone to hallucinations when reasoning with insufficient internal knowledge. While integrating LLMs with knowledge graphs (KGs) provides access to structured, verifiable information, existing approaches often generate incomplete or factually inconsistent reasoning paths. To this end, we propose Self-Reflective Planning (SRP), a framework that synergizes LLMs with KGs through iterative, reference-guided reasoning. Specifically, given a question and topic entities, SRP first searches for references to guide planning and reflection. In the planning process, it checks initial relations and generates a reasoning path. After retrieving knowledge from KGs through a reasoning path, it implements iterative reflection by judging the retrieval result and editing the reasoning path until the answer is correctly retrieved. Extensive experiments on three public datasets demonstrate that SRP surpasses various strong baselines and further underscore its reliable reasoning ability.
zh

[NLP-160] CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems

【速读】: 该论文试图解决多代理大语言模型(multi-agent LLM systems)在协作推理和任务执行过程中,因代理间通信与推理导致的版权内容意外重现问题。现有保护技术主要关注最终输出中的内容检测,而忽略了代理内部更丰富且更具揭示性的推理过程。解决方案的关键在于提出CoTGuard框架,该框架通过在思维链(Chain-of-Thought, CoT)推理中嵌入特定触发查询,激活特定的CoT片段,并监控中间推理步骤以检测未经授权的内容复制,从而实现对协作代理场景中版权侵权行为的细粒度、可解释性检测。

链接: https://arxiv.org/abs/2505.19405
作者: Yan Wen,Junfeng Guo,Heng Huang
机构: University of Maryland, College Park (马里兰大学学院市分校)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 18 pages, 1 figure

点击查看摘要

Abstract:As large language models (LLMs) evolve into autonomous agents capable of collaborative reasoning and task execution, multi-agent LLM systems have emerged as a powerful paradigm for solving complex problems. However, these systems pose new challenges for copyright protection, particularly when sensitive or copyrighted content is inadvertently recalled through inter-agent communication and reasoning. Existing protection techniques primarily focus on detecting content in final outputs, overlooking the richer, more revealing reasoning processes within the agents themselves. In this paper, we introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought (CoT) reasoning. Specifically, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction by embedding specific trigger queries into agent prompts. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios. We evaluate CoTGuard on various benchmarks in extensive experiments and show that it effectively uncovers content leakage with minimal interference to task performance. Our findings suggest that reasoning-level monitoring offers a promising direction for safeguarding intellectual property in LLM-based agent systems.
zh

[NLP-161] Simple and Effective Baselines for Code Summarisation Evaluation

【速读】: 该论文试图解决代码文档生成中评估方法不准确的问题,即传统的人工评估成本高且自动度量指标不可靠。其解决方案的关键在于引入一种新的基线方法,通过让大型语言模型(Large Language Model, LLM)对代码摘要进行整体评分,该方法能够考虑代码内容本身,而非仅依赖于参考摘要或嵌入向量。这一特性使得该方法不仅可用于代码摘要的评估,还可扩展至其他任务,如评估代码库中的文档质量。

链接: https://arxiv.org/abs/2505.19392
作者: Jade Robinson,Jonathan K. Kummerfeld
机构: University of Sydney (悉尼大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This allows us to also make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.
zh

[NLP-162] gec-metrics: A Unified Library for Grammatical Error Correction Evaluation ACL2025

【速读】: 该论文旨在解决语法错误修正(Grammatical Error Correction, GEC)评估指标的不一致性和可扩展性问题,其解决方案的关键在于提出一个名为gec-metrics的库,该库通过统一接口支持评估指标的使用和开发,确保所有研究者采用一致的实现进行系统比较,同时注重API设计以提高可扩展性,并提供元评估功能及分析与可视化脚本,从而促进GEC评估指标的发展。

链接: https://arxiv.org/abs/2505.19388
作者: Takumi Goto,Yusuke Sakai,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良先端科学技術大学院大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025 System Demonstration Track, 11 pages, 9 figures

点击查看摘要

Abstract:We introduce gec-metrics, a library for using and developing grammatical error correction (GEC) evaluation metrics through a unified interface. Our library enables fair system comparisons by ensuring that everyone conducts evaluations using a consistent implementation. Moreover, it is designed with a strong focus on API usage, making it highly extensible. It also includes meta-evaluation functionalities and provides analysis and visualization scripts, contributing to developing GEC evaluation metrics. Our code is released under the MIT license and is also distributed as an installable package. The video is available on YouTube.
zh

[NLP-163] GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor

【速读】: 该论文旨在解决零样本语音合成(zero-shot speech synthesis)中说话风格迁移的挑战,即在没有目标说话人语音数据的情况下生成符合特定说话风格的语音。其解决方案的关键在于提出了一种渐进式风格适配器(Gradual Style Adaptor, GSA),该方法通过一个新颖的风格编码器,逐步从声学参考中编码说话风格。GSA首先捕捉每个语义音素单元的局部风格,随后通过自注意力机制将局部风格组合为全局风格条件,从而为声学模型提供鲁棒且丰富的风格表示。

链接: https://arxiv.org/abs/2505.19384
作者: Seokgi Lee,Jungjun Kim
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure.
zh

[NLP-164] Belief Attribution as Mental Explanation: The Role of Accuracy Informativity and Causality

【速读】: 该论文试图解决的问题是:在人类理论心理(theory-of-mind)中,人们倾向于将哪些特定信念归因于其他代理者(agents)以解释其行为。论文提出的解决方案关键在于开发一个计算模型,该模型通过准确性(accuracy)、信息性(informativity)和对行为的因果相关性(causal relevance)三个因素来量化自然语言陈述的解释力,从而预测人们在观察代理者行为时所归因的信念。

链接: https://arxiv.org/abs/2505.19376
作者: Lance Ying,Almog Hillel,Ryan Truong,Vikash K. Mansinghka,Joshua B. Tenenbaum,Tan Zhi-Xuan
机构: Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, 3 figures; oral presentation at CogSci 2025

点击查看摘要

Abstract:A key feature of human theory-of-mind is the ability to attribute beliefs to other agents as mentalistic explanations for their behavior. But given the wide variety of beliefs that agents may hold about the world and the rich language we can use to express them, which specific beliefs are people inclined to attribute to others? In this paper, we investigate the hypothesis that people prefer to attribute beliefs that are good explanations for the behavior they observe. We develop a computational model that quantifies the explanatory strength of a (natural language) statement about an agent’s beliefs via three factors: accuracy, informativity, and causal relevance to actions, each of which can be computed from a probabilistic generative model of belief-driven behavior. Using this model, we study the role of each factor in how people selectively attribute beliefs to other agents. We investigate this via an experiment where participants watch an agent collect keys hidden in boxes in order to reach a goal, then rank a set of statements describing the agent’s beliefs about the boxes’ contents. We find that accuracy and informativity perform reasonably well at predicting these rankings when combined, but that causal relevance is the single factor that best explains participants’ responses.
zh

[NLP-165] ChartLens: Fine-grained Visual Attribution in Charts ACL2025

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图表理解任务中产生的幻觉问题,即生成的文本序列与提供的视觉数据相冲突。解决方案的关键在于提出一种名为ChartLens的新型图表归因算法,该算法通过基于分割的技术识别图表对象,并结合集合标记提示(set-of-marks prompting)与MLLMs实现细粒度的视觉归因。

链接: https://arxiv.org/abs/2505.19360
作者: Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Dinesh Manocha
机构: University of Maryland, College Park (马里兰大学学院公园分校); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL)
备注: ACL 2025 (Main)

点击查看摘要

Abstract:The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
zh

[NLP-166] Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval ACL2025

【速读】: 该论文旨在解决低资源、形态丰富的语言(如阿姆哈拉语)在神经检索中的有效性问题,这些问题主要由于数据稀缺和分词策略不佳而未被充分探索。其解决方案的关键在于引入基于预训练阿姆哈拉语BERT和RoBERTa骨架的特定于阿姆哈拉语的密集检索模型,特别是RoBERTa-Base-Amharic-Embed模型,在MRR@10和Recall@10指标上分别相比最强的多语言基线Arctic Embed 2.0提升了17.6%和9.86%。此外,还训练了一个基于ColBERT的晚期交互检索模型,以进一步提升检索性能。

链接: https://arxiv.org/abs/2505.19356
作者: Kidist Amde Mekonnen,Yosef Worku Alemneh,Maarten de Rijke
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages (excluding references and appendix), 10 figures. Accepted to ACL 2025 Findings. Public release includes dataset, code, and trained models: this https URL

点击查看摘要

Abstract:Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at this https URL.
zh

[NLP-167] Estimating Online Influence Needs Causal Modeling! Counterfactual Analysis of Social Media Engagement

【速读】: 该论文试图解决社交媒介中真实影响力识别的问题,即区分相关性与因果关系,特别是在分析虚假信息传播时。现有方法主要关注曝光度指标和网络结构,但未能捕捉外部时间信号如何触发用户参与的因果机制。解决方案的关键在于引入一种新的联合处理-结果框架,该框架利用现有的序列模型,同时适应政策时间点和参与效应,并将医疗健康领域的因果推断技术应用于社交媒介交互的序列特性中,以应对外部混杂信号带来的挑战。

链接: https://arxiv.org/abs/2505.19355
作者: Lin Tian,Marian-Andrei Rizoiu
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Understanding true influence in social media requires distinguishing correlation from causation–particularly when analyzing misinformation spread. While existing approaches focus on exposure metrics and network structures, they often fail to capture the causal mechanisms by which external temporal signals trigger engagement. We introduce a novel joint treatment-outcome framework that leverages existing sequential models to simultaneously adapt to both policy timing and engagement effects. Our approach adapts causal inference techniques from healthcare to estimate Average Treatment Effects (ATE) within the sequential nature of social media interactions, tackling challenges from external confounding signals. Through our experiments on real-world misinformation and disinformation datasets, we show that our models outperform existing benchmarks by 15–22% in predicting engagement across diverse counterfactual scenarios, including exposure adjustment, timing shifts, and varied intervention durations. Case studies on 492 social media users show our causal effect measure aligns strongly with the gold standard in influence estimation, the expert-based empirical influence.
zh

[NLP-168] GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance

【速读】: 该论文旨在解决知识增强的视觉问答(Knowledge-Based Visual Question Answering, KB-VQA)任务中,传统方法依赖显式知识库或大型语言模型(Large Language Models, LLMs)作为隐式知识源时,辅助文本可能与问题上下文不相关或包含干扰信息的问题。解决方案的关键在于提出一种四阶段框架——基于定位描述引导的知识增强视觉问答(Grounding Caption-Guided Knowledge-Based Visual Question Answering, GC-KBVQA),通过生成与问题相关的描述性文本,结合外部知识构建高度信息丰富的提示,从而有效提升LLMs在零样本视觉问答任务中的性能,且无需端到端的多模态训练。

链接: https://arxiv.org/abs/2505.19354
作者: Mohammad Mahdi Moradi,Sudhir Mudur
机构: Concordia University (康考迪亚大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image. Early methods relied on explicit knowledge bases to provide this auxiliary information. Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources. While KB-VQA methods have demonstrated promising results, their potential remains constrained as the auxiliary text provided may not be relevant to the question context, and may also include irrelevant information that could misguide the answer predictor. We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA), which enables LLMs to effectively perform zero-shot VQA tasks without the need for end-to-end multimodal training. Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information. This is combined with knowledge from external sources to create highly informative prompts for the LLM. GC-KBVQA can address a variety of VQA tasks, and does not require task-specific fine-tuning, thus reducing both costs and deployment complexity by leveraging general-purpose, pre-trained LLMs. Comparison with competing KB-VQA methods shows significantly improved performance. Our code will be made public.
zh

[NLP-169] Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation

【速读】: 该论文试图解决生成式AI(Generative AI)在代码生成应用中与人类程序员协作时所引发的哲学与技术问题,特别是如何区分人类与机器在代码生成中的错误根源及其对语义一致性、安全性和认知边界的影响。解决方案的关键在于构建“错误架构”(Architectures of Error)的框架,通过Dennett的机械功能主义和Rescher的方法论实用主义,揭示人类认知错误与人工智能随机性错误的根本差异,并借助Floridi的抽象层次理论深入分析这些错误维度的相互作用与演变。

链接: https://arxiv.org/abs/2505.19353
作者: Camilo Chacón Sartori
机构: Artificial Intelligence Research Institute (IIIA-CSIC), \orgaddress\cityBellaterra, \postcode08193, \stateBarcelona, \countrySpain
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Software Engineering (cs.SE)
备注: preprint

点击查看摘要

Abstract:With the rise of generative AI (GenAI), Large Language Models are increasingly employed for code generation, becoming active co-authors alongside human programmers. Focusing specifically on this application domain, this paper articulates distinct ``Architectures of Error’’ to ground an epistemic distinction between human and machine code generation. Examined through their shared vulnerability to error, this distinction reveals fundamentally different causal origins: human-cognitive versus artificial-stochastic. To develop this framework and substantiate the distinction, the analysis draws critically upon Dennett’s mechanistic functionalism and Rescher’s methodological pragmatism. I argue that a systematic differentiation of these error profiles raises critical philosophical questions concerning semantic coherence, security robustness, epistemic limits, and control mechanisms in human-AI collaborative software development. The paper also utilizes Floridi’s levels of abstraction to provide a nuanced understanding of how these error dimensions interact and may evolve with technological advancements. This analysis aims to offer philosophers a structured framework for understanding GenAI’s unique epistemological challenges, shaped by these architectural foundations, while also providing software engineers a basis for more critically informed engagement.
zh

[NLP-170] PatentScore: Multi-dimensional Evaluation of LLM -Generated Patent Claims

【速读】: 该论文试图解决现有自然语言生成(NLG)评估指标在专利文档的结构和法律特性评估中存在不足的问题,特别是针对专利权利要求(claim)生成质量的评估。解决方案的关键在于提出PatentScore,一个面向专利权利要求的多维评估框架,其核心包括:权利要求分析的层级分解、基于法律和技术标准的领域特定验证模式,以及在结构、语义和法律维度上的评分机制,从而实现对专利生成质量的全面评估。

链接: https://arxiv.org/abs/2505.19345
作者: Yongmin Yoo,Qiongkai Xu,Longbing Cao
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural language generation (NLG) metrics play a central role in evaluating generated texts, but are not well suited for the structural and legal characteristics of patent documents. Large language models (LLMs) offer strong potential in automating patent generation, yet research on evaluating LLM-generated patents remains limited, especially in evaluating the generation quality of patent claims, which are central to defining the scope of protection. Effective claim evaluation requires addressing legal validity, technical accuracy, and structural compliance. To address this gap, we introduce PatentScore, a multi-dimensional evaluation framework for assessing LLM-generated patent claims. PatentScore incorporates: (1) hierarchical decomposition for claim analysis; (2) domain-specific validation patterns based on legal and technical standards; and (3) scoring across structural, semantic, and legal dimensions. Unlike general-purpose NLG metrics, PatentScore reflects patent-specific constraints and document structures, enabling evaluation beyond surface similarity. We evaluate 400 GPT-4o-mini generated Claim 1s and report a Pearson correlation of r = 0.819 with expert annotations, outperforming existing NLG metrics. Furthermore, we conduct additional evaluations using open models such as Claude-3.5-Haiku and Gemini-1.5-flash, all of which show strong correlations with expert judgments, confirming the robustness and generalizability of our framework.
zh

[NLP-171] ODIN: A NL2SQL Recommender to Handle Schema Ambiguity

【速读】: 该论文试图解决自然语言到结构化查询语言(NL2SQL)系统在企业环境中因模式歧义(schema ambiguity)导致的准确性问题,特别是在复杂数据库模式下,多个表和列具有语义相似名称时,系统难以准确生成对应的SQL查询。解决方案的关键在于引入ODIN,一个NL2SQL推荐引擎,其通过考虑模糊模式组件的不同解释,生成一组潜在的SQL查询,并根据歧义程度动态调整建议数量,同时通过用户反馈学习以个性化未来查询推荐。

链接: https://arxiv.org/abs/2505.19302
作者: Kapil Vaidya,Abishek Sankararaman,Jialin Ding,Chuan Lei,Xiao Qin,Balakrishnan Narayanaswamy,Tim Kraska
机构: Amazon Web Services(亚马逊网络服务)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:NL2SQL (natural language to SQL) systems translate natural language into SQL queries, allowing users with no technical background to interact with databases and create tools like reports or visualizations. While recent advancements in large language models (LLMs) have significantly improved NL2SQL accuracy, schema ambiguity remains a major challenge in enterprise environments with complex schemas, where multiple tables and columns with semantically similar names often co-exist. To address schema ambiguity, we introduce ODIN, a NL2SQL recommendation engine. Instead of producing a single SQL query given a natural language question, ODIN generates a set of potential SQL queries by accounting for different interpretations of ambiguous schema components. ODIN dynamically adjusts the number of suggestions based on the level of ambiguity, and ODIN learns from user feedback to personalize future SQL query recommendations. Our evaluation shows that ODIN improves the likelihood of generating the correct SQL query by 1.5-2 \times compared to baselines.
zh

[NLP-172] SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理过程中受限于内部参数空间,无法获取实时信息和理解物理世界的问题。解决方案的关键在于提出SituatedThinker框架,该框架通过情境化思考(situated thinking)将模型的推理过程与现实世界情境相结合,自适应地融合内部知识与外部信息,并利用强化学习激励模型与现实世界进行交互以获取信息和反馈,从而突破模型的知识边界并提升推理能力。

链接: https://arxiv.org/abs/2505.19300
作者: Junnan Liu,Linhao Luo,Thuy-Trang Vu,Gholamreza Haffari
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs’ access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our codes are available at this https URL.
zh

[NLP-173] A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations

【速读】: 该论文试图解决在高风险人工智能决策场景中,生成忠实的自由文本解释(free-text explanations)的难题,这一问题在于语言模型难以生成此类解释,而人类也难以评估其质量。论文提出了一种用于衡量预测-解释(Prediction-EXplanation, PEX)一致性的方法,该方法通过扩展证据权重的概念来量化自由文本解释对预测的支持或反对程度,从而作为解释忠实性的重要方面。解决方案的关键在于利用这种一致性度量,并通过直接偏好优化提升生成解释的一致性,实验结果显示该方法在三个模型家族中均有效提升了解释的一致性,最高可达292.3%。

链接: https://arxiv.org/abs/2505.19299
作者: Lingjun Zhao,Hal Daumé III
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging to generate by language models and assess by humans. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvement ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
zh

[NLP-174] owards Reliable Large Audio Language Model ACL2025

【速读】: 该论文试图解决大型音频语言模型(Large Audio Language Models, LALMs)在知识边界识别和主动拒绝回答未知问题方面的不足,以提升其可靠性。解决方案的关键在于系统性地探索多种增强可靠性的方法,包括无需训练的多模态思维链(Multi-modal Chain-of-Thought, MCoT)和基于训练的监督微调(Supervised Fine-Tuning, SFT),同时提出新的评估指标——可靠性增益指数(Reliability Gain Index, RGI),以更准确地衡量不同方法的有效性。研究结果表明,训练-free 和训练-based 方法均能在不同程度上提升 LALMs 的可靠性,并揭示了可靠性意识作为一种“元能力”可在不同音频模态间迁移。

链接: https://arxiv.org/abs/2505.19294
作者: Ziyang Ma,Xiquan Li,Yakun Song,Wenxi Chen,Chenpeng Du,Jian Wu,Yuanzhe Chen,Zhuo Chen,Yuping Wang,Yuxuan Wang,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); ByteDance (字节跳动); Shanghai Innovation Institute (上海创新研究院)
类目: ound (cs.SD); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don’t know proactively. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliable methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a “meta ability”, which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.
zh

[NLP-175] 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

【速读】: 该论文试图解决现有基于真实任务的长上下文评估基准存在的两个主要问题:一是缺乏有效的指标来区分模型的基线能力与真正的长上下文能力,导致跨模型比较不清晰;二是基准通常采用固定输入长度,限制了其在不同模型中的适用性,并无法揭示模型在处理长上下文时的失效点。解决方案的关键在于引入一个可控制长度的长上下文基准以及一种能够解耦基线知识与真正长上下文能力的新指标。

链接: https://arxiv.org/abs/2505.19293
作者: Wang Yang,Hongye Jin,Shaochen Zhong,Song Jiang,Qifan Wang,Vipin Chaudhary,Xiaotian Han
机构: Case Western Reserve University (凯斯西储大学); Texas A&M University (德克萨斯A&M大学); Rice University (莱斯大学); Meta (Meta)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks – e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model’s baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
zh

[NLP-176] A Graph Perspective to Probe Structural Patterns of Knowledge in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)知识结构模式的缺乏研究问题,特别是其知识在图结构中的表现及其与节点度等图结构性质的关系。解决方案的关键在于从图论角度量化LLMs的知识,并揭示知识同质性(knowledge homophily)现象,即拓扑上接近的实体表现出相似的知识水平,进而基于局部邻居开发图机器学习模型来估计实体知识,从而实现有效的知识验证。

链接: https://arxiv.org/abs/2505.19286
作者: Utkarsh Sahu,Zhisheng Qi,Yongjia Lei,Ryan A. Rossi,Franck Dernoncourt,Nesreen K. Ahmed,Mahantesh M Halappanavar,Yao Ma,Yu Wang
机构: University of Oregon(俄勒冈大学); Adobe Research(Adobe研究院); Cisco AI Research(思科人工智能研究); Pacific Northwest National Laboratory(太平洋西北国家实验室); Rensselaer Polytechnic Institute(伦斯勒理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Large language models have been extensively studied as neural knowledge bases for their knowledge access, editability, reasoning, and explainability. However, few works focus on the structural patterns of their knowledge. Motivated by this gap, we investigate these structural patterns from a graph perspective. We quantify the knowledge of LLMs at both the triplet and entity levels, and analyze how it relates to graph structural properties such as node degree. Furthermore, we uncover the knowledge homophily, where topologically close entities exhibit similar levels of knowledgeability, which further motivates us to develop graph machine learning models to estimate entity knowledge based on its local neighbors. This model further enables valuable knowledge checking by selecting triplets less known to LLMs. Empirical results show that using selected triplets for fine-tuning leads to superior performance.
zh

[NLP-177] Next Token Prediction Is a Dead End for Creativity

【速读】: 该论文试图解决当前生成式 AI (Generative AI) 在创造性表达上的局限性问题,特别是其基于token预测的架构与真实创造力之间的根本性错配。论文认为,尽管下一代token模型在语言生成方面取得了显著进展,但其设计更倾向于表面连贯性而非自发性、原创性和即兴风险。解决方案的关键在于将创造力重新定义为一种交互过程,而非预测性输出,从而构建更加富有表现力、响应性并符合人类创作实践的AI系统。

链接: https://arxiv.org/abs/2505.19277
作者: Ibukun Olatunji,Mark Sheppard
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages including references

点击查看摘要

Abstract:This paper argues that token prediction is fundamentally misaligned with real creativity. While next-token models have enabled impressive advances in language generation, their architecture favours surface-level coherence over spontaneity, originality, and improvisational risk. We use battle rap as a case study to expose the limitations of predictive systems, demonstrating that they cannot truly engage in adversarial or emotionally resonant exchanges. By reframing creativity as an interactive process rather than a predictive output, we offer a vision for AI systems that are more expressive, responsive, and aligned with human creative practice.
zh

[NLP-178] Unveiling Dual Quality in Product Reviews: An NLP-Based Approach ACL2025

【速读】: 该论文试图解决消费者在不同市场中面对相同产品出现质量不一致的问题,即所谓的双质量问题(dual quality problem)。解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)技术来检测此类差异,并通过构建一个包含1,957条评论的波兰语数据集,其中540条明确指出了双质量问题,从而实现对问题的有效识别与分析。研究还探讨了多种方法,如SetFit与sentence-transformers结合、基于transformer的编码器以及大语言模型(LLMs)的应用,并评估了多语言迁移的有效性。

链接: https://arxiv.org/abs/2505.19254
作者: Rafał Poświata,Marcin Michał Mirończuk,Sławomir Dadas,Małgorzata Grębowiec,Michał Perełkiewicz
机构: National Information Processing Institute (国家信息处理研究所)
类目: Computation and Language (cs.CL)
备注: Accepted for ACL 2025 Industry Track

点击查看摘要

Abstract:Consumers often face inconsistent product quality, particularly when identical products vary between markets, a situation known as the dual quality problem. To identify and address this issue, automated techniques are needed. This paper explores how natural language processing (NLP) can aid in detecting such discrepancies and presents the full process of developing a solution. First, we describe in detail the creation of a new Polish-language dataset with 1,957 reviews, 540 highlighting dual quality issues. We then discuss experiments with various approaches like SetFit with sentence-transformers, transformer-based encoders, and LLMs, including error analysis and robustness verification. Additionally, we evaluate multilingual transfer using a subset of opinions in English, French, and German. The paper concludes with insights on deployment and practical applications.
zh

[NLP-179] PATS: Process-Level Adaptive Thinking Mode Switching

【速读】: 该论文试图解决当前大语言模型(Large-Language Models, LLMs)在处理不同难度问题时采用固定推理策略所导致的性能与效率不平衡问题。现有方法虽尝试通过训练-free 的快慢思维系统切换来应对不同难度的问题,但受限于粗粒度的解决方案级策略调整。论文提出的解决方案关键在于引入一种新的推理范式:过程级自适应思维模式切换(Process-Level Adaptive Thinking Mode Switching, PATS),该方法使LLMs能够根据每个步骤的难度动态调整推理策略,从而优化准确性和计算效率之间的平衡。其核心机制包括过程奖励模型(Process Reward Models, PRMs)与束搜索(Beam Search)的集成,以及渐进式模式切换和错误步骤惩罚机制。

链接: https://arxiv.org/abs/2505.19250
作者: Yi Wang,Junxiao Liu,Shimao Zhang,Jiajun Chen,Shujian Huang
机构: National Key Laboratory for Novel Software Technology, Nanjing University (国家关键软件技术实验室,南京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current large-language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.
zh

[NLP-180] LLLM s: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

【速读】: 该论文旨在系统性地分析大型语言模型(Large Language Model, LLM)的局限性研究现状,通过数据驱动和半自动化的方法对2022至2024年间相关文献进行综述。其解决方案的关键在于采用自下而上的方法,结合关键词过滤、基于生成式 AI (Generative AI) 的分类以及主题聚类(包括HDBSCAN+BERTopic和LlooM两种方法),从25万篇ACL和arXiv论文中筛选出14,648篇相关文献,并构建了带有标注摘要的数据集,以提供对LLM局限性研究趋势的量化分析。

链接: https://arxiv.org/abs/2505.19240
作者: Aida Kostikova,Zhipin Wang,Deidamea Bajri,Ole Pütz,Benjamin Paaßen,Steffen Eger
机构: University of Bielefeld(比勒费尔德大学); University of Technology Nuremberg(纽伦堡技术大学); University of Mannheim(曼海姆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This manuscript is currently under review at ACM Computing Surveys

点击查看摘要

Abstract:Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLM (LLLMs) from 2022 to 2024 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that LLM-related research increases over fivefold in ACL and fourfold in arXiv. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by late 2024. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2024. We release a dataset of annotated abstracts and a validated methodology, and offer a quantitative view of trends in LLM limitations research.
zh

[NLP-181] Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在创造力评估方面的挑战,特别是当前评估方法过度依赖低效且昂贵的人类判断,而现有的自动化方法在泛化性和与人类判断的一致性方面存在不足。解决方案的关键在于提出一种基于成对比较的框架,通过共享上下文指令提升评估的一致性,并构建了一个包含100K+人类级和1M+合成创造性指令-响应对的大型数据集CreataSet。基于该数据集训练的LLM评估器CrEval在与人类判断的一致性方面表现出显著优势。

链接: https://arxiv.org/abs/2505.19236
作者: Qian Cao,Xiting Wang,Yuzhuo Yuan,Yahui Liu,Fang Luo,Ruihua Song
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
zh

[NLP-182] GUARDIAN: Safeguarding LLM Multi-Agent Collaborations with Temporal Graph Modeling

【速读】: 该论文试图解决多智能体协作中的安全问题,例如幻觉放大、错误注入与传播等。其解决方案的关键在于提出GUARDIAN,通过将多智能体协作过程建模为离散时间的时序属性图,显式捕捉幻觉和错误的传播动态,并采用无监督的编码器-解码器架构结合增量训练范式,从潜在嵌入中学习重构节点属性和图结构,从而精准识别异常节点和边。此外,基于信息瓶颈理论的图抽象机制在压缩时序交互图的同时保留关键模式,进一步提升了安全性与效率。

链接: https://arxiv.org/abs/2505.19234
作者: Jialong Zhou,Lichao Wang,Xiao Yang
机构: King’s College London (国王学院伦敦大学); Beijing Institute of Technology (北京理工大学); Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The emergence of large language models (LLMs) enables the development of intelligent agents capable of engaging in complex and multi-turn dialogues. However, multi-agent collaboration face critical safety challenges, such as hallucination amplification and error injection and propagation. This paper presents GUARDIAN, a unified method for detecting and mitigating multiple safety concerns in GUARDing Intelligent Agent collaboratioNs. By modeling the multi-agent collaboration process as a discrete-time temporal attributed graph, GUARDIAN explicitly captures the propagation dynamics of hallucinations and errors. The unsupervised encoder-decoder architecture incorporating an incremental training paradigm, learns to reconstruct node attributes and graph structures from latent embeddings, enabling the identification of anomalous nodes and edges with unparalleled precision. Moreover, we introduce a graph abstraction mechanism based on the Information Bottleneck Theory, which compresses temporal interaction graphs while preserving essential patterns. Extensive experiments demonstrate GUARDIAN’s effectiveness in safeguarding LLM multi-agent collaborations against diverse safety vulnerabilities, achieving state-of-the-art accuracy with efficient resource utilization.
zh

[NLP-183] he Overthinkers DIET: Cutting Token Calories with DIfficulty-AwarE Training

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在推理过程中过度思考导致响应过长、效率低下的问题。其核心解决方案是提出DIET(Difficulty-Aware Training)框架,通过将问题难度动态整合到强化学习(Reinforcement Learning, RL)过程中,优化性能与效率的权衡。DIET的关键在于通过调节token惩罚强度并根据估计的任务难度调整目标长度,实现对token的智能压缩,从而在减少token数量的同时提升推理性能。此外,论文还提出了Advantage Weighting技术,以解决组归一化RL算法中奖励加权的缺陷,进一步提升方法的稳定性与有效性。

链接: https://arxiv.org/abs/2505.19217
作者: Weize Chen,Jiarui Yuan,Tailin Jin,Ning Ding,Huimin Chen,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Recent large language models (LLMs) exhibit impressive reasoning but often over-think, generating excessively long responses that hinder efficiency. We introduce DIET ( DIfficulty-AwarE Training), a framework that systematically cuts these “token calories” by integrating on-the-fly problem difficulty into the reinforcement learning (RL) process. DIET dynamically adapts token compression strategies by modulating token penalty strength and conditioning target lengths on estimated task difficulty, to optimize the performance-efficiency trade-off. We also theoretically analyze the pitfalls of naive reward weighting in group-normalized RL algorithms like GRPO, and propose Advantage Weighting technique, which enables stable and effective implementation of these difficulty-aware objectives. Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. Beyond raw token reduction, we show two crucial benefits largely overlooked by prior work: (1) DIET leads to superior inference scaling. By maintaining high per-sample quality with fewer tokens, it enables better scaling performance via majority voting with more samples under fixed computational budgets, an area where other methods falter. (2) DIET enhances the natural positive correlation between response length and problem difficulty, ensuring verbosity is appropriately allocated, unlike many existing compression methods that disrupt this relationship. Our analyses provide a principled and effective framework for developing more efficient, practical, and high-performing LLMs.
zh

[NLP-184] When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面临道德准则与奖励或激励直接冲突时的行为问题,尤其是在代理角色中可能产生的伦理对齐挑战。其解决方案的关键在于引入MoralSim,这是一个用于模拟社会困境中道德行为的框架,通过在囚徒困境和公共物品博弈中嵌入具有道德色彩的情境,系统性地评估LLMs在不同道德框架下的行为表现,从而揭示其在伦理规范与收益最大化策略冲突时的行为模式。

链接: https://arxiv.org/abs/2505.19212
作者: Steffen Backmann,David Guzman Piedrahita,Emanuel Tewolde,Rada Mihalcea,Bernhard Schölkopf,Zhijing Jin
机构: ETH Zürich(ETH Zurich); University of Zurich(苏黎世大学); Carnegie Mellon University(卡内基梅隆大学); University of Michigan(密歇根大学); Max Planck Institute for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所,图宾根); University of Toronto(多伦多大学); Vector Institute(向量研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs’ moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner’s dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent’s “self-interest” may conflict with ethical expectations. Our code is available at this https URL.
zh

[NLP-185] MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)在科学假设生成任务中仅能产出粗粒度假设,缺乏关键的方法论和实验细节的问题。其解决方案的关键在于引入并形式化定义了细粒度科学假设发现任务,将该任务建模为组合优化问题,并提出一种分层搜索方法,逐步细化假设内容,从一般概念推进到具体的实验配置。该方法通过平滑奖励景观,提升了优化效果,从而生成更详细、可实验的假设。

链接: https://arxiv.org/abs/2505.19209
作者: Zonglin Yang,Wanhao Liu,Ben Gao,Yujie Liu,Wei Li,Tong Xie,Lidong Bing,Wanli Ouyang,Erik Cambria,Dongzhan Zhou
机构: Nanyang Technological University (南洋理工大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of Science and Technology of China (中国科学技术大学); Wuhan University (武汉大学); National University of Singapore (新加坡国立大学); University of New South Wales (新南威尔士大学); MiroMind (米罗思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs’ capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM’s internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring-thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines.
zh

[NLP-186] SpeakStream: Streaming Text-to-Speech with Interleaved Data

【速读】: 该论文试图解决传统文本到语音(Text-to-Speech, TTS)系统在流式大语言模型(Large Language Models, LLMs)应用于对话AI时存在的延迟瓶颈问题。传统TTS系统通常基于完整话语进行训练和推理,即使在优化推理速度的情况下,与流式LLM输出结合时仍会引入不可接受的延迟,这在需要低首词延迟的响应式对话代理中尤为关键。论文提出的解决方案是SpeakStream,其关键在于采用仅解码器架构,通过在交错文本-语音数据上使用下一步预测损失进行训练,并在推理过程中逐步生成语音以适应流式输入文本,从而实现低延迟的流式TTS。

链接: https://arxiv.org/abs/2505.19206
作者: Richard He Bai,Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and inferenced on complete utterances, introduce unacceptable delays, even with optimized inference speeds, when coupled with streaming LLM outputs. This is particularly problematic for creating responsive conversational agents where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained using a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents where an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency results in terms of first-token latency while maintaining the quality of non-streaming TTS systems.
zh

[NLP-187] DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding

【速读】: 该论文旨在解决将推测解码(Speculative Decoding, SD)方法应用于视觉语言模型(Vision-Language Models, VLMs)中时存在的效率与对齐问题,从而加速自回归生成过程。其解决方案的关键在于提出DREAM框架,该框架通过三个核心创新实现高效、准确的多模态解码:一是基于交叉注意力机制将目标模型的中间特征注入到草稿模型中以提升对齐效果;二是根据注意力熵进行自适应中间特征选择,以指导草稿模型的高效训练;三是通过视觉标记压缩降低草稿模型的延迟。这些技术共同提升了多模态基准测试中的推理吞吐量和推测草稿接受长度。

链接: https://arxiv.org/abs/2505.19201
作者: Yunhai Hu,Tianhua Xia,Zining Liu,Rahul Raman,Xingyu Liu,Bo Bao,Eric Sather,Vithursan Thangarasa,Sai Qian Zhang
机构: New York University (纽约大学); University of Pennsylvania (宾夕法尼亚大学); Cerebras System (Cerebras系统)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, demonstrate up to 3.6x speedup over conventional decoding and significantly outperform prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: this https URL
zh

[NLP-188] Misleading through Inconsistency: A Benchmark for Political Inconsistencies Detection

【速读】: 该论文试图解决政治声明中不一致性的检测问题(inconsistency detection),这类不一致的言论属于一种虚假信息,会削弱公众信任并给问责带来挑战。解决方案的关键在于提出了一项不一致性检测任务,并构建了一个包含698对人工标注的政治声明及其解释的语料库,以促进自然语言处理(NLP)在该领域的研究。该数据集来源于德国的Wahl-O-Mat和瑞士的Smartvote等投票助手平台,反映了现实中的政治议题,为自动检测不一致性提供了资源支持。

链接: https://arxiv.org/abs/2505.19191
作者: Nursulu Sagimbayeva,Ruveyda Betül Bahçeci,Ingmar Weber
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 6 figures. Accepted for publication in the Proceedings of 1st Workshop on Misinformation Detection in the Era of LLMs (MisD) at ICWSM-2025

点击查看摘要

Abstract:Inconsistent political statements represent a form of misinformation. They erode public trust and pose challenges to accountability, when left unnoticed. Detecting inconsistencies automatically could support journalists in asking clarification questions, thereby helping to keep politicians accountable. We propose the Inconsistency detection task and develop a scale of inconsistency types to prompt NLP-research in this direction. To provide a resource for detecting inconsistencies in a political domain, we present a dataset of 698 human-annotated pairs of political statements with explanations of the annotators’ reasoning for 237 samples. The statements mainly come from voting assistant platforms such as Wahl-O-Mat in Germany and Smartvote in Switzerland, reflecting real-world political issues. We benchmark Large Language Models (LLMs) on our dataset and show that in general, they are as good as humans at detecting inconsistencies, and might be even better than individual humans at predicting the crowd-annotated ground-truth. However, when it comes to identifying fine-grained inconsistency types, none of the model have reached the upper bound of performance (due to natural labeling variation), thus leaving room for improvement. We make our dataset and code publicly available.
zh

[NLP-189] LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在测试时推理过程中存在冗余功能元素导致计算资源消耗过大的问题,这些功能元素包括验证过程、替代解决方案和错误修正等。论文提出的解决方案关键在于引入PIR(Perplexity-based Importance Refinement)框架,该框架通过量化评估每个推理步骤对答案预测置信度的影响,系统地识别并选择性地剪枝低重要性的功能步骤,同时保留核心的渐进式推理路径,从而生成更简洁且高效的推理链。

链接: https://arxiv.org/abs/2505.19187
作者: Yang Xiao,Jiashuo Wang,Ruifeng Yuan,Chunpu Xu,Kaishuai Xu,Wenjie Li,Pengfei Liu
机构: The Hong Kong Polytechnic University (香港理工大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.
zh

[NLP-190] wo LLM s debate both are certain theyve won

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在动态、对抗性辩论场景中是否能够准确调整其置信度的问题。研究的关键在于构建一个包含多轮互动和零和结构的辩论环境,以模拟现实中的不确定性并评估模型在面对对立观点时的信念更新能力。通过组织多轮政策辩论并让模型在每轮后私下评估其获胜信心,研究揭示了LLMs在置信度校准方面的显著缺陷,包括系统性高估、信心升级、相互高估、自我辩论偏差以及私有推理与公开置信度不一致等问题。

链接: https://arxiv.org/abs/2505.19184
作者: Minh Nhat Nguyen,Pradyumna Shyama Prasad
机构: National University of Singapore (新加坡国立大学); Apart Research (Apart Research)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed =75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models’ private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLM outputs are deployed without careful review in assistant roles or agentic settings.
zh

[NLP-191] Assistant-Guided Mitigation of Teacher Preference Bias in LLM -as-a-Judge

【速读】: 该论文试图解决在使用大型语言模型(Large Language Models, LLMs)作为评估者(LLM-as-a-Judge)时,由教师模型生成的评估数据所引入的教师偏好偏差(teacher preference bias)问题。解决方案的关键在于引入一个不偏向教师模型响应的辅助模型,以补充训练数据,并提出AGDe-Judge框架,该框架通过三个阶段对训练数据中的标签和反馈进行去偏处理,从而有效减少教师偏好偏差,同时保持在多个评估基准上的高性能。

链接: https://arxiv.org/abs/2505.19176
作者: Zhuo Liu,Moxin Li,Xun Deng,Qifan Wang,Fuli Feng
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Meta AI (Meta人工智能)
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model’s responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and feedbacks in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at this https URL.
zh

[NLP-192] SpokenNativQA: Multilingual Everyday Spoken Queries for LLM s

【速读】: 该论文试图解决在多语言口语查询背景下对大型语言模型(Large Language Models, LLMs)进行基准测试的问题,这一领域仍处于探索阶段。解决方案的关键在于提出SpokenNativQA,这是首个面向多语言和文化背景的口语问答(Spoken Question-Answering, SQA)数据集,旨在评估LLMs在真实对话场景中的表现。该数据集包含约33,000个自然口语化的问答对,涵盖多种语言,包括低资源语言和方言丰富的语言,通过引入语音变化、口音和语言多样性,弥补了传统文本问答数据集的不足。

链接: https://arxiv.org/abs/2505.19163
作者: Firoj Alam,Md Arid Hasan,Shammur Absar Chowdhury
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Spoken Question Answering, Multilingual LLMs, Speech-based Evaluation, Dialectal Speech, Low-resource Languages, Multimodal Benchmarking, Conversational AI, Speech-to-Text QA, Real-world Interaction, Natural Language Understanding

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various disciplines and tasks. However, benchmarking their capabilities with multilingual spoken queries remains largely unexplored. In this study, we introduce SpokenNativQA, the first multilingual and culturally aligned spoken question-answering (SQA) dataset designed to evaluate LLMs in real-world conversational settings. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages, including low-resource and dialect-rich languages, providing a robust benchmark for assessing LLM performance in speech-based interactions. SpokenNativQA addresses the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. We benchmark different ASR systems and LLMs for SQA and present our findings. We released the data at (this https URL) and the experimental scripts at (this https URL) for the research community.
zh

[NLP-193] Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLM s

【速读】: 该论文旨在解决视频大语言模型(Video-LLMs)在处理长视频序列时因自回归特性导致的推理延迟增加问题。其关键解决方案是提出一种名为Sparse-to-Dense (StD) 的解码策略,该策略结合了稀疏的top-K注意力机制与密集的全注意力机制,通过快速(稀疏)模型推测多个token并由慢速(密集)模型并行验证,从而在不损失模型性能的前提下显著提升视频处理速度。

链接: https://arxiv.org/abs/2505.19155
作者: Xuan Zhang,Cunxiao Du,Sicheng Yu,Jiawei Wu,Fengzhuo Zhang,Wei Gao,Qian Liu
机构: Singapore Management University (新加坡管理大学); Sea AI Lab (Sea人工智能实验室); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94 \times walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
zh

[NLP-194] Shifting AI Efficiency From Model-Centric to Data-Centric Compression

【速读】: 该论文试图解决随着大语言模型(Large Language Models, LLMs)和多模态大语言模型(Multi-modal LLMs, MLLMs)参数规模不断增长所带来的计算瓶颈问题,特别是长序列中自注意力机制的二次复杂度问题。解决方案的关键在于将研究重点从模型压缩转向数据压缩,具体表现为通过令牌压缩(Token Compression)减少模型训练或推理过程中的令牌数量,从而提升AI效率。论文提出令牌压缩作为新的研究前沿,并通过数学框架分析其在处理长上下文任务中的重要性与优势。

链接: https://arxiv.org/abs/2505.19147
作者: Xuyang Liu,Zichen Wen,Shaobo Wang,Junjie Chen,Zhishan Tao,Yubo Wang,Xiangqi Jin,Chang Zou,Yiyu Wang,Chenfei Liao,Xu Zheng,Honggang Chen,Weijia Li,Xuming Hu,Conghui He,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Sichuan University (四川大学); University of Electronic Science & Technology of China (电子科技大学); Shanghai AI Laboratory (上海人工智能实验室); Sun Yat-sen University (中山大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project: \url{ this https URL }

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, \textbfwe argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community’s advancement.
zh

[NLP-195] RetrieveAll: A Multilingual Named Entity Recognition Framework with Large Language Models

【速读】: 该论文试图解决多语言命名实体识别(NER)中低资源和中等资源语言性能不足的问题,以及现有多语言NER方法在多语言适应过程中面临的语言干扰问题,包括不同语言间的特征冲突和高资源语言对低资源语言特征的抑制。解决方案的关键在于提出一种基于动态LoRA的通用多语言NER框架RetrieveAll,该框架通过解耦跨语言的任务特定特征,实现了高效的动态适应性,并引入了一种跨粒度知识增强方法,充分利用数据的内在潜力,从而将传统“提示引导推理”范式提升为“提示驱动学习”。

链接: https://arxiv.org/abs/2505.19128
作者: Jin Zhang,Fan Gao,Linyu Li,Yongbin Yu,Xiangxiang Wang,Nyima Tashi,Gadeng Luosang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of large language models has led to significant performance breakthroughs in named entity recognition (NER) for high-resource languages, yet there remains substantial room for improvement in low- and medium-resource languages. Existing multilingual NER methods face severe language interference during the multi-language adaptation process, manifested in feature conflicts between different languages and the competitive suppression of low-resource language features by high-resource languages. Although training a dedicated model for each language can mitigate such interference, it lacks scalability and incurs excessive computational costs in real-world applications. To address this issue, we propose RetrieveAll, a universal multilingual NER framework based on dynamic LoRA. The framework decouples task-specific features across languages and demonstrates efficient dynamic adaptability. Furthermore, we introduce a cross-granularity knowledge augmented method that fully exploits the intrinsic potential of the data without relying on external resources. By leveraging a hierarchical prompting mechanism to guide knowledge injection, this approach advances the paradigm from “prompt-guided inference” to “prompt-driven learning.” Experimental results show that RetrieveAll outperforms existing baselines; on the PAN-X dataset, it achieves an average F1 improvement of 12.1 percent.
zh

[NLP-196] MMATH: A Multilingual Benchmark for Mathematical Reasoning

【速读】: 该论文试图解决大型语言模型在多语言复杂推理任务中的能力不足问题,尤其是现有研究主要集中在简单任务(如MGSM)上,而对多语言复杂推理的探索较为有限。其解决方案的关键在于引入MMATH基准,这是一个涵盖10种语系多样语言的374个高质量数学问题的多语言复杂推理数据集,并通过实验发现,让模型在英语中进行推理并在目标语言中回答,能够同时提升性能并保持目标语言的一致性。

链接: https://arxiv.org/abs/2505.19126
作者: Wenyang Luo,Wayne Xin Zhao,Jing Sha,Shijin Wang,Ji-Rong Wen
机构: Renmin University of China (中国人民大学); iFLYTEK Research (Central China) (科大讯飞研究(华中)); iFLYTEK Co., Ltd. (科大讯飞公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue-generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data could be found at this https URL.
zh

[NLP-197] Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在伦理偏见的问题,特别是针对全球讨论且可能敏感的话题进行验证和比较。其解决方案的关键在于引入多语言敏感问题与答案数据集(Multilingual Sensitive Questions Answers Dataset, MSQAD),通过收集来自人权观察的17个话题的新闻文章,并生成多语言的敏感问题及对应回答,进而利用统计假设检验分析不同语言和话题下的偏见情况,揭示跨语言差异导致的伦理偏见的普遍性。

链接: https://arxiv.org/abs/2505.19121
作者: Seunguk Yu,Juhwan Choi,Youngbin Kim
机构: Chung-Ang University (忠南大学); AITRICS (AITRICS)
类目: Computation and Language (cs.CL)
备注: ACL 2025 main conference

点击查看摘要

Abstract:Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions Answers Dataset (MSQAD), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. It demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.
zh

[NLP-198] Controlling Language Confusion in Multilingual LLM s

【速读】: 该论文试图解决大型语言模型在低资源场景下出现的语言混淆(language confusion)问题,即模型生成的响应部分或全部为非预期语言,这会严重影响用户体验。解决方案的关键在于通过引入适当的惩罚项来抑制语言混淆的生成,具体而言,采用ORPO(Objective Regularization for Preference Optimization)方法,在标准监督微调(SFT)基础上增加对不期望输出风格的惩罚,从而在不降低整体模型性能的前提下,有效抑制高解码温度下的语言混淆现象。

链接: https://arxiv.org/abs/2505.19116
作者: Nahyun Lee,Yeongseo Woo,Hyunwoo Ko,Guijin Son
机构: Chungang University (忠南大学); OneLineAI (OneLineAI)
类目: Computation and Language (cs.CL)
备注: 4 pages

点击查看摘要

Abstract:Large language models often suffer from language confusion, a phenomenon where responses are partially or entirely generated in unintended languages. This can critically impact user experience in low-resource settings. We hypothesize that conventional supervised fine-tuning exacerbates this issue because the softmax objective focuses probability mass only on the single correct token but does not explicitly penalize cross-lingual mixing. Interestingly, by observing loss trajectories during the pretraining phase, we observe that models fail to learn to distinguish between monolingual and language-confused text. Additionally, we find that ORPO, which adds penalties for unwanted output styles to standard SFT, effectively suppresses language-confused generations even at high decoding temperatures without degrading overall model performance. Our findings suggest that incorporating appropriate penalty terms can mitigate language confusion in low-resource settings with limited data.
zh

[NLP-199] Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在知识密集型多跳推理任务中面临的挑战,尤其是由于缺乏中间指导导致的检索不准确和中间推理错误的问题。其解决方案的关键在于提出一种自评引导的迭代推理方法(Self-Critique Guided Iterative Reasoning, SiGIR),通过端到端训练使模型能够通过问题分解迭代解决复杂问题,并具备对中间推理步骤进行自我评估的能力,从而在迭代过程中进行分支探索并选择有前景的推理路径。

链接: https://arxiv.org/abs/2505.19112
作者: Zheng Chu,Huiming Fan,Jingchang Chen,Qianyu Wang,Mingda Yang,Jiafeng Liang,Zhongjie Wang,Hao Li,Guo Tang,Ming Liu,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect reasoning. To address these, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6% . Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at Github: this https URL.
zh

[NLP-200] CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models ACL2025

【速读】: 该论文试图解决在跨语言(cross-lingual)和跨模态(cross-modal)场景下大型语言模型(Large Language Models, LLMs)中幻觉(hallucination)问题的研究不足问题,当前研究多局限于单一场景,缺乏对两者结合情况的系统探索。解决方案的关键是引入一个新颖的联合跨语言与跨模态幻觉基准(Cross-lingual and Cross-modal Hallucinations benchmark, CCHall),该基准同时涵盖跨语言和跨模态幻觉场景,能够全面评估LLMs在联合跨语言与跨模态任务中的表现。

链接: https://arxiv.org/abs/2505.19108
作者: Yongheng Zhang,Xu Liu,Ruoxi Zhou,Qiguang Chen,Hao Fei,Wenpeng Lu,Libo Qin
机构: Central South University (中南大学); Soochow University (苏州大学); National University of Singapore (新加坡国立大学); Qilu University of Technology (山东理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025 Main Conference

点击查看摘要

Abstract:Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
zh

[NLP-201] WHISTRESS: Enriching Transcriptions with Sentence Stress Detection INTERSPEECH2025

【速读】: 该论文试图解决语音转录系统中句子重音(sentence stress)检测的问题,以更准确地捕捉说话者的意图。解决方案的关键在于提出WHISTRESS,这是一种无需对齐的增强转录系统的方法,并结合了TINYSTRESS-15K,一个通过完全自动化数据集创建过程生成的可扩展合成训练数据集。该方法在不需要额外输入先验信息的情况下表现出色,并且在多种基准测试中展现出强大的零样本泛化能力。

链接: https://arxiv.org/abs/2505.19103
作者: Iddo Yosha,Dorin Shteyman,Yossi Adi
机构: The Hebrew University of Jerusalem, Israel
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech2025

点击查看摘要

Abstract:Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training data for the task of sentence stress detection which resulted from a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: this https URL.
zh

[NLP-202] ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning ACL2025

【速读】: 该论文试图解决传统直接偏好优化(Direct Preference Optimization, DPO)在对齐大语言模型(Large Language Models, LLMs)时存在的问题,即其依赖二元偏好优化,仅对整个响应进行奖励或惩罚,而未考虑细粒度的段落正确性,导致优化结果不理想。解决方案的关键在于提出自适应句级偏好优化(Adaptive Sentence-level Preference Optimization, ASPO),通过动态计算基于模型预测的句级自适应奖励,实现更精确的偏好优化,从而提升多模态模型的对齐效果。

链接: https://arxiv.org/abs/2505.19100
作者: Yeyuan Wang,Dehong Gao,Rujiao Long,Lei Yi,Linbo Jin,Libin Yang,Xiaoyan Cai
机构: Northwestern Polytechnical University, School of Automation, Xi’an, China; Northwestern Polytechnical University, School of Cybersecurity, Xi’an, China; Binjiang Institute of Artificial Intelligence, ZJUT, Hangzhou, China; Alibaba Group, Hangzhou, China
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACL 2025 findings

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
zh

[NLP-203] ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models

【速读】: 该论文试图解决当前大型视觉-语言模型(Large Vision-Language Models, VLMs)在处理文本密集型图像时的阅读理解能力评估不足的问题。现有研究主要关注视觉理解任务,如图表、颜色方案和OCR等,但缺乏对VLMs在文本-rich图像中进行有效阅读与推理能力的系统评估。为填补这一空白,作者提出了ReadBench,这是一个专门设计用于评估VLMs阅读理解能力的多模态基准。ReadBench将传统文本基准中的上下文转换为包含文本的图像,同时保持文本提示和问题不变,从而提供了一种更贴近实际应用场景的评估方式。该解决方案的关键在于通过结构化的方式将文本任务迁移至图像环境中,以更真实地反映VLMs在处理复杂文本图像时的实际表现。

链接: https://arxiv.org/abs/2505.19091
作者: Benjamin Clavié,Florian Brand
机构: Answer.AI (Answer.AI); University of Trier (特里尔大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心(DFKI))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (VLMs), have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks…), there is limited assessment of VLMs’ ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at this https URL .
zh

[NLP-204] Universal Reason er: A Single Composable Plug-and-Play Reason er for Frozen LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在提升推理等特定技能时面临的计算资源消耗大和泛化能力下降的问题。传统参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法虽节省资源,但因架构依赖性需针对每个LLM主干重新训练。论文提出的解决方案是设计一个通用推理模块Universal Reasoner (UniR),其关键在于将奖励分解为独立训练的推理模块,通过预定义奖励进行训练,从而将轨迹级信号转化为分词级指导,并在推理阶段通过简单叠加输出logits与任何冻结的LLM结合,实现模块化组合与复杂推理,同时具备良好的弱到强泛化能力。

链接: https://arxiv.org/abs/2505.19075
作者: Jaemin Kim,Hangeol Chang,Hyunmin Hwang,Choonghan Kim,Jong Chul Ye
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically requires retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms \addexisting baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at this https URL
zh

[NLP-205] owards Harmonized Uncertainty Estimation for Large Language Models ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)生成结果的可靠性量化问题,特别是如何准确估计生成内容的不确定性。现有方法在指示性、平衡性和校准性之间难以取得协调一致的不确定性估计,从而限制了其在实际应用中的准确性。解决方案的关键在于提出CUE(Corrector for Uncertainty Estimation),该方法通过一个与目标LLM性能对齐数据集训练的轻量级模型,对不确定性得分进行调整,从而提升不确定性估计的准确性。

链接: https://arxiv.org/abs/2505.19073
作者: Rui Li,Jing Long,Muge Qi,Heming Xia,Lei Sha,Peiyi Wang,Zhifang Sui
机构: Peking University (北京大学); The Hong Kong Polytechnic University (香港理工大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the pitfalls of these methods to strike a harmonized estimation between indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): A straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM’s performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, which achieves consistent improvements of up to 60% over existing methods.
zh

[NLP-206] UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)输出不确定性量化(Uncertainty Quantification, UQ)中因输出长度引起的偏差问题。现有许多UQ技术依赖于token概率,这导致了与输出长度相关的偏差,尽管部分方法尝试进行补偿,但此类偏差在长度归一化方法中仍然存在。解决方案的关键在于提出UNCERTAINTY-LINE: (Length-INvariant Estimation),这是一种后处理的去偏方法,通过将不确定性得分对输出长度进行回归,并利用残差作为校正后的长度无关估计,从而有效减少长度带来的偏差。

链接: https://arxiv.org/abs/2505.19060
作者: Roman Vashurin,Maiya Goloburda,Preslav Nakov,Maxim Panov
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE: (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE: consistently improves over even nominally length-normalized UQ methods uncertainty estimates across multiple metrics and models.
zh

[NLP-207] An Embarrassingly Simple Defense Against LLM Abliteration Attacks

【速读】: 该论文试图解决生成式 AI (Generative AI) 在面对“abliteration”攻击时,因单一潜在方向被抑制而导致拒绝行为失效的问题,从而生成不道德内容。解决方案的关键在于构建一个扩展的拒绝数据集(extended-refusal dataset),该数据集包含有害提示及其完整拒绝理由,通过在此数据集上微调模型,增强其拒绝能力,从而有效抵御abliteration攻击,同时保持模型的整体性能。

链接: https://arxiv.org/abs/2505.19056
作者: Harethah Abu Shairah,Hasan Abed Al Kader Hammoud,Bernard Ghanem,George Turkiyyah
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: preprint

点击查看摘要

Abstract:Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models’ refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
zh

[NLP-208] Efficient Data Selection at Scale via Influence Distillation

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)高效训练中的数据选择问题,即如何从大量数据中挑选出对模型性能提升最有效的样本。其解决方案的关键在于提出了一种名为影响蒸馏(Influence Distillation)的新框架,该框架利用二阶信息对训练样本进行最优加权,通过蒸馏每个样本对目标分布的影响,为模型分配特定于模型的权重,从而指导微调过程以在目标领域取得优异表现。

链接: https://arxiv.org/abs/2505.19051
作者: Mahdi Nikdan,Vincent Cohen-Addad,Dan Alistarh,Vahab Mirrokni
机构: ISTA & Google Research; Google Research; ISTA & Red Hat AI; Google Research
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically-justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample’s influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a \textitlandmark-based approximation : influence is precisely computed for a small subset of “landmark” samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to 3.5\times faster selection.
zh

[NLP-209] SQUiD: Synthesizing Relational Databases from Unstructured Text

【速读】: 该论文试图解决如何将非结构化文本数据自动转换为关系型数据库的问题(Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents)。解决方案的关键在于提出一种名为SQUiD的神经符号框架,该框架通过四个阶段的分解任务,结合专门的技术手段,实现从原始文本中自动生成数据库模式并填充表数据。实验表明,SQUiD在多个数据集上均优于基线方法。

链接: https://arxiv.org/abs/2505.19025
作者: Mushtari Sadia,Zhenning Yang,Yunming Xiao,Ang Chen,Amrita Roy Chowdhury
机构: University of Michigan (密歇根大学)
类目: Databases (cs.DB); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.
zh

[NLP-210] CrosGrpsABS: Cross-Attention over Syntactic and Semantic Graphs for Aspect-Based Sentiment Analysis in a Low-Resource Language

【速读】: 该论文旨在解决低资源语言(如孟加拉语)在基于方面的情感分析(Aspect-Based Sentiment Analysis, ABSA)任务中面临的挑战,包括缺乏标注数据、预训练模型和优化超参数等问题。其解决方案的关键在于提出一种名为CrosGrpsABS的新型混合框架,该框架通过句法和语义图之间的双向交叉注意力机制,增强方面级情感分类效果。该方法结合了基于Transformer的上下文嵌入与图卷积网络,并依赖于基于规则的句法依存解析和语义相似性计算,从而有效融合局部句法结构与全局语义上下文,提升了在低资源和高资源环境下的情感分类性能。

链接: https://arxiv.org/abs/2505.19018
作者: Md. Mithun Hossain,Md. Shakil Hossain,Sudipto Chaki,Md. Rajib Hossain,Md. Saifur Rahman,A. B. M. Shawkat Ali
机构: Bangladesh University of Business and Technology (孟加拉国商业与技术大学); Chittagong University of Engineering and Technology (吉大港工程与技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aspect-Based Sentiment Analysis (ABSA) is a fundamental task in natural language processing, offering fine-grained insights into opinions expressed in text. While existing research has largely focused on resource-rich languages like English which leveraging large annotated datasets, pre-trained models, and language-specific tools. These resources are often unavailable for low-resource languages such as Bengali. The ABSA task in Bengali remains poorly explored and is further complicated by its unique linguistic characteristics and a lack of annotated data, pre-trained models, and optimized hyperparameters. To address these challenges, this research propose CrosGrpsABS, a novel hybrid framework that leverages bidirectional cross-attention between syntactic and semantic graphs to enhance aspect-level sentiment classification. The CrosGrpsABS combines transformerbased contextual embeddings with graph convolutional networks, built upon rule-based syntactic dependency parsing and semantic similarity computations. By employing bidirectional crossattention, the model effectively fuses local syntactic structure with global semantic context, resulting in improved sentiment classification performance across both low- and high-resource settings. We evaluate CrosGrpsABS on four low-resource Bengali ABSA datasets and the high-resource English SemEval 2014 Task 4 dataset. The CrosGrpsABS consistently outperforms existing approaches, achieving notable improvements, including a 0.93% F1-score increase for the Restaurant domain and a 1.06% gain for the Laptop domain in the SemEval 2014 Task 4 benchmark.
zh

[NLP-211] Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

【速读】: 该论文旨在解决多模态学习中跨模态交互不足以及静态融合策略未能充分利用不同模态互补性的问题(cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities)。其解决方案的关键在于提出一种新型的多模态Co-AttenDWG架构,该架构结合了双路径编码、维度门控的协同注意力机制以及先进的专家融合模块,以实现更精细的跨模态交互和更有效的特征融合。

链接: https://arxiv.org/abs/2505.19010
作者: Md. Mithun Hossain,Md. Shakil Hossain,Sudipto Chaki,M. F. Mridha
机构: Bangladesh University of Business and Technology (孟加拉国商业与技术大学); American International University-Bangladesh (美国国际大学-孟加拉国)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates the feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.
zh

[NLP-212] VerIPO: Cultivating Long Reasoning in Video-LLM s via Verifier-Gudied Iterative Policy Optimization

【速读】: 该论文旨在解决将强化学习(Reinforcement Learning, RL)应用于视频大语言模型(Video-LLMs)时面临的数据准备瓶颈以及长链式思维(long chain-of-thoughts, CoTs)质量不稳定的问题。其解决方案的关键在于提出一种基于验证器引导的迭代策略优化方法(Verifier-guided Iterative Policy Optimization, VerIPO),该方法通过在GRPO和直接偏好优化(Direct Preference Optimization, DPO)训练阶段之间引入一个滚动感知验证器(Rollout-Aware Verifier),构建GRPO-Verifier-DPO训练循环,从而提升模型生成高质量、长周期且上下文一致的CoTs能力。

链接: https://arxiv.org/abs/2505.19000
作者: Yunxin Li,Xinyu Chen,Zitao Li,Zhenyu Liu,Longyue Wang,Wenhan Luo,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳); Alibaba International Group (阿里巴巴国际集团); Division of AMC and Department of ECE, HKUST (HKUST机电工程系与AMC部门)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 9 figures, Project Link: this https URL

点击查看摘要

Abstract:Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream this http URL address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs’ capacity for generating deep, long-term reasoning chains. The core component is Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO’s expansive search and DPO’s targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
zh

[NLP-213] FiLLM – A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM )

【速读】: 该论文旨在解决菲律宾语自然语言处理(Natural Language Processing, NLP)能力不足的问题,通过构建一个针对菲律宾语优化的大规模语言模型FiLLM来提升相关任务的性能。其解决方案的关键在于基于SeaLLM-7B 2.5模型,采用低秩适应(Low-Rank Adaptation, LoRA)微调方法,在保持任务特定性能的同时提高内存效率,并在多样化的菲律宾语数据集上进行训练与评估,以应对命名实体识别(Named Entity Recognition, NER)、词性标注(Part-of-Speech, POS) tagging、依存句法分析和文本摘要等关键NLP任务。

链接: https://arxiv.org/abs/2505.18995
作者: Carlos Jude G. Maminta,Isaiah Job Enriquez,Deandre Nigel Nunez,Michael B. Dela Fuente(Institution College of Computer and Information Sciences, Polytechnic University of the Philippines, Sta. Mesa, Manila)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that Calamancy outperforms FILLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.
zh

[NLP-214] STRICT: Stress Test of Rendering Images Containing Text

【速读】: 该论文旨在解决扩散模型在图像中生成一致且可读文本的局限性,这一问题通常归因于基于扩散生成的局部性偏差,该偏差限制了模型对长程空间依赖性的建模能力。论文提出的解决方案关键在于引入了一个名为STRICT的基准测试,用于系统评估扩散模型在图像中生成连贯且符合指令的文本的能力,涵盖文本最大长度、正确性与可读性以及指令遵循比例等多个维度。

链接: https://arxiv.org/abs/2505.18985
作者: Tianyu Zhang,Xinyu Wang,Zhenghan Tai,Lu Li,Jijun Chi,Jingrui Tian,Hailin He,Suyuchen Wang
机构: DIRO, Université de Montréal 2Mila - Quebec AI Institute 3McGill University 4University of Toronto 5University of Pennsylvania 6University of California, Los Angeles 7Southwestern University of Finance and Economics
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce \textbfSTRICT , a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text, and (3) the ratio of not following instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at this https URL.
zh

[NLP-215] AI4Math: A Native Spanish Benchmark for University-Level Mathematical Reasoning in Large Language Models

【速读】: 该论文试图解决现有数学推理基准测试主要为英语或基于翻译的版本,这可能导致语义偏差并掩盖语言特定的推理错误。其解决方案的关键是提出AI4Math,这是一个由105道原生西班牙语编写的大学水平数学问题组成的基准测试,涵盖七个高级领域,并附有逐步的人工解答。通过在四种配置下对六种大型语言模型进行评估,研究揭示了原生语言基准和领域特定评估的重要性,以发现标准指标未能捕捉到的推理失败。

链接: https://arxiv.org/abs/2505.18978
作者: Miguel Angel Peñaloza Perez(1 and 2 and 5),Bruno Lopez Orozco(1 and 2 and 3),Jesus Tadeo Cruz Soto(1 and 2 and 4),Michelle Bruno Hernandez(1 and 2),Miguel Angel Alvarado Gonzalez(1 and 2),Sandra Malagon(1 and 2) ((1) Carreras con Impacto, (2) Aixo Lab, (3) Facultad de Ciencias UNAM Mexico, (4) Facultad de Matematicas Universidad Veracruzana Mexico, (5) Centro de Investigación Cientifica y de Educacion Superior de Ensenada Baja California Mexico)
机构: Carreras con Impacto (Carreras con Impacto); Aixo Lab (Aixo Lab); Centro de Investigación Científica y de Educación Superior de Ensenada, Baja California, México. (Centro de Investigación Científica y de Educación Superior de Ensenada, Baja California, México); Facultad de Ciencias, UNAM, México. (Facultad de Ciencias, UNAM, México); Facultad de Matemáticas, Universidad Veracruzana, México. (Facultad de Matemáticas, Universidad Veracruzana, México)
类目: Computation and Language (cs.CL)
备注: 36 pages, 5 figures

点击查看摘要

Abstract:Existing mathematical reasoning benchmarks are predominantly English only or translation-based, which can introduce semantic drift and mask languagespecific reasoning errors. To address this, we present AI4Math, a benchmark of 105 original university level math problems natively authored in Spanish. The dataset spans seven advanced domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, and Logic), and each problem is accompanied by a step by step human solution. We evaluate six large language models GPT 4o, GPT 4o mini, o3 mini, LLaMA 3.3 70B, DeepSeek R1 685B, and DeepSeek V3 685B under four configurations: zero shot and chain of thought, each in Spanish and English. The top models (o3 mini, DeepSeek R1 685B, DeepSeek V3 685B) achieve over 70% accuracy, whereas LLaMA 3.3 70B and GPT-4o mini remain below 40%. Most models show no significant performance drop between languages, with GPT 4o even performing better on Spanish problems in the zero shot setting. Geometry, Combinatorics, and Probability questions remain persistently challenging for all models. These results highlight the need for native-language benchmarks and domain-specific evaluations to reveal reasoning failures not captured by standard metrics.
zh

[NLP-216] Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings

【速读】: 该论文试图解决现有语言模型在复杂层次推理任务中对潜在层次结构建模能力不足的问题,特别是传统基于欧几里得嵌入的模型难以捕捉语义层次关系。其解决方案的关键在于提出Hierarchical Mamba (HiM),通过将高效的Mamba2与双曲几何的指数增长和弯曲特性相结合,学习具有层次感知能力的语言嵌入,从而提升对深层语言理解的能力。HiM通过将Mamba2处理的序列映射到Poincaré球或Lorentz流形,并引入可学习的曲率,结合双曲损失进行优化,以有效捕获不同层次间的相对距离,增强长程推理能力。

链接: https://arxiv.org/abs/2505.18973
作者: Sarang Patil,Ashish Parmanand Pandey,Ioannis Koutis,Mengjia Xu
机构: New Jersey Institute of Technology (新泽西理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Selective state-space models have achieved great success in long-sequence modeling. However, their capacity for language representation, especially in complex hierarchical reasoning tasks, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this limitation, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with exponential growth and curved nature of hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincare ball (via tangent-based mapping) or Lorentzian manifold (via cosine and sine-based mapping) with “learnable” curvature, optimized with a combined hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning. This makes it well-suited for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. We evaluated our HiM with four linguistic and medical datasets for mixed-hop prediction and multi-hop inference tasks. Experimental results demonstrated that: 1) Both HiM models effectively capture hierarchical relationships for four ontological datasets, surpassing Euclidean baselines. 2) HiM-Poincare captures fine-grained semantic distinctions with higher h-norms, while HiM-Lorentz provides more stable, compact, and hierarchy-preserving embeddings favoring robustness over detail.
zh

[NLP-217] Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE

【速读】: 该论文旨在解决知识图谱补全(Knowledge Graph Completion)中模型复杂度高且可解释性差的问题,提出了一种可解释且模块化的解决方案RelatE。RelatE的关键在于采用实数值的相位-模值分解方法,通过正弦相位对齐来编码关系模式,如对称性、逆性和复合性,从而在保持架构简洁性的同时实现与基于复数嵌入或深度神经网络的方法相当甚至更优的性能。

链接: https://arxiv.org/abs/2505.18971
作者: Abhijit Chakraborty,Chahana Dahal,Ashutosh Balasubramaniam,Tejas Anvekar,Vivek Gupta
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We revisit the efficacy of simple, real-valued embedding models for knowledge graph completion and introduce RelatE, an interpretable and modular method that efficiently integrates dual representations for entities and relations. RelatE employs a real-valued phase-modulus decomposition, leveraging sinusoidal phase alignments to encode relational patterns such as symmetry, inversion, and composition. In contrast to recent approaches based on complex-valued embeddings or deep neural architectures, RelatE preserves architectural simplicity while achieving competitive or superior performance on standard benchmarks. Empirically, RelatE outperforms prior methods across several datasets: on YAGO3-10, it achieves an MRR of 0.521 and Hit@10 of 0.680, surpassing all baselines. Additionally, RelatE offers significant efficiency gains, reducing training time by 24%, inference latency by 31%, and peak GPU memory usage by 22% compared to RotatE. Perturbation studies demonstrate improved robustness, with MRR degradation reduced by up to 61% relative to TransE and by up to 19% compared to RotatE under structural edits such as edge removals and relation swaps. Formal analysis further establishes the model’s full expressiveness and its capacity to represent essential first-order logical inference patterns. These results position RelatE as a scalable and interpretable alternative to more complex architectures for knowledge graph completion.
zh

[NLP-218] Learning to Explain: Prototype-Based Surrogate Models for LLM Classification

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在决策过程中的透明性不足问题,以及现有解释方法在忠实性和可理解性方面的局限性。其解决方案的关键在于提出一种基于原型的替代框架——ProtoSurE,该框架通过设计可解释的替代模型,并利用句子级原型作为人类可理解的概念,实现了对LLMs的忠实且易于理解的解释。

链接: https://arxiv.org/abs/2505.18970
作者: Bowen Wei,Ziwei Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance on natural language tasks, but their decision-making processes remain largely opaque. Existing explanation methods either suffer from limited faithfulness to the model’s reasoning or produce explanations that humans find difficult to understand. To address these challenges, we propose \textbfProtoSurE, a novel prototype-based surrogate framework that provides faithful and human-understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as human-understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms SOTA explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.
zh

[NLP-219] System-1.5 Reasoning : Traversal in Language and Latent Spaces with Dynamic Shortcuts

【速读】: 该论文旨在解决传统链式思维(Chain-of-thought, CoT)推理在大型语言模型中因冗长的中间输出导致的效率低下问题,以及现有隐空间推理方法在计算资源分配上的不足。其解决方案的关键在于提出System-1.5 Reasoning框架,该框架通过动态分配计算资源实现自适应推理,引入两种动态快捷路径:模型深度快捷路径(DS)通过轻量适配器分支提前退出非关键标记以减少计算,同时允许关键标记继续深入Transformer层;步骤快捷路径(SS)通过复用解码步骤中的隐藏状态跳过冗余步骤,在隐空间中实现横向推理。

链接: https://arxiv.org/abs/2505.18962
作者: Xiaoqiang Wang,Suyuchen Wang,Yun Zhu,Bang Liu
机构: Mila - Quebec AI Institute (Mila-魁北克人工智能研究所); Apple (苹果); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative System-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent this http URL, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning).Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.
zh

[NLP-220] Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk?

【速读】: 该论文试图解决生成式 AI (Generative AI) 在评估投资风险偏好时的可信度问题,特别是在不同地区和人口统计学特征下的表现一致性。其解决方案的关键在于通过构建包含16个与风险相关的特征、覆盖10个国家和两种性别的1,720个用户档案,对主流AI模型(包括GPT-4、Claude 3.7、Gemini 1.5、LLaMA 3.1/3.3、DeepSeek-V3和Mistral-small)进行系统性评估,以揭示模型在评分分布和人口敏感性方面的显著差异,并强调在受监管的金融场景中进行严格、标准化AI系统评估的重要性。

链接: https://arxiv.org/abs/2505.18953
作者: Divij Chawla,Ashita Bhutada,Do Duc Anh,Abhinav Raghunathan,Vinod SP,Cathy Guo,Dar Win Liew,Prannaya Gupta,Rishabh Bhardwaj,Rajat Bhardwaj,Soujanya Poria
机构: Walled AI Labs(墙AI实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We evaluate the credibility of leading AI models in assessing investment risk appetite. Our analysis spans proprietary (GPT-4, Claude 3.7, Gemini 1.5) and open-weight models (LLaMA 3.1/3.3, DeepSeek-V3, Mistral-small), using 1,720 user profiles constructed with 16 risk-relevant features across 10 countries and both genders. We observe significant variance across models in score distributions and demographic sensitivity. For example, GPT-4o assigns higher risk scores to Nigerian and Indonesian profiles, while LLaMA and DeepSeek show opposite gender tendencies in risk classification. While some models (e.g., GPT-4o, LLaMA 3.1) align closely with expected scores in low- and mid-risk ranges, none maintain consistent performance across regions and demographics. Our findings highlight the need for rigorous, standardized evaluations of AI systems in regulated financial contexts to prevent bias, opacity, and inconsistency in real-world deployment.
zh

[NLP-221] BnMMLU: Measuring Massive Multitask Language Understanding in Bengali

【速读】: 该论文试图解决现有MMLU基准在低资源语言如孟加拉语(Bengali)中代表性不足的问题,从而无法全面评估语言模型的多任务语言理解能力。解决方案的关键是构建BnMMLU基准,该基准覆盖23个领域,包含138,949个问题-选项对,并采用多项选择格式以评估事实知识、应用型问题解决和推理能力,同时通过标注三种认知类别(事实知识、程序性应用和推理)深入分析模型在不同认知任务中的表现。

链接: https://arxiv.org/abs/2505.18951
作者: Saman Sarker Joy
机构: University of Malaya (马来亚大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 9 figures, 5 tables; Code dataset available at this https URL

点击查看摘要

Abstract:The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of Bengali in language models. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge and is structured in a multiple-choice format to assess factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories-factual knowledge, procedural application and reasoning-to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
zh

[NLP-222] he Price of Format: Diversity Collapse in LLM s

【速读】: 该论文试图解决指令调优的大语言模型(LLM)在推理过程中因结构化模板导致的多样性崩溃(diversity collapse)问题,即模型对开放式输入生成语义相似的输出,从而削弱创造力和变异性。解决方案的关键在于揭示结构化标记对输出空间的显著约束,并指出格式一致性在结构敏感任务中的重要性,而输出多样性主要受结构化标记存在与否的影响,最小格式化可产生最多样化的输出。研究强调了当前提示设计在促进对齐的同时可能抑制多样性,提出了需要更加注重多样性的提示设计与指令调优方法。

链接: https://arxiv.org/abs/2505.18949
作者: Longfei Yun,Chenyang An,Zilong Wang,Letian Peng,Jingbo Shang
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:Instruction-tuned large language models (LLMs) employ structured templates, such as role markers and special tokens, to enforce format consistency during inference. However, we identify a critical limitation of such formatting: it induces a phenomenon we term diversity collapse, where the model generates semantically similar outputs for open-ended inputs, undermining creativity and variability. We systematically evaluate this effect across tasks like story completion and free-form generation, finding that (1) diversity collapse persists even under high-temperature sampling, and (2) structural tokens in templates significantly constrain the model’s output space. To contextualize these findings, we fine-tune the same model using a range of structured prompts and then evaluate them across three axes: downstream task performance, alignment behavior, and output diversity. Our analysis shows that format consistency between fine-tuning and inference is crucial for structure-sensitive tasks (e.g., GSM8K, IFEval), but has marginal influence on knowledge-heavy tasks (e.g., MMLU, WebQuestions). In contrast, output diversity is primarily governed by the presence or absence of structural tokens, with minimal formatting yielding the most diverse outputs. These findings reveal that current prompting conventions, while beneficial for alignment, may inadvertently suppress output diversity, underscoring the need for diversity-aware prompt design and instruction tuning.
zh

[NLP-223] MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

【速读】: 该论文旨在解决大型语言模型(LLM)在处理人类交流中的模糊性和语境细微差别时的不足,特别是在理解他人未明说的意图、情绪和信念等社会认知任务上的缺陷。其解决方案的关键在于提出MetaMind框架,该框架基于元认知的心理学理论,将社会理解分解为三个协作阶段:Theory of Mind (ToM) Agent生成用户心理状态假设,Domain Agent利用文化规范和伦理约束对假设进行细化,Response Agent生成符合语境的回应并验证与推断意图的一致性。通过这一结构,模型在多个基准测试中实现了最先进的性能,并首次使LLM在关键ToM任务上达到人类水平表现。

链接: https://arxiv.org/abs/2505.18943
作者: Xuanming Zhang,Yuxuan Chen,Min-Hsuan Yeh,Yixuan Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human social interactions depend on the ability to infer others’ unspoken intentions, emotions, and beliefs-a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework’s ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at this https URL.
zh

[NLP-224] Language Models Surface the Unwritten Code of Science and Society

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)如何继承人类偏见,并进一步利用这些偏见揭示社会“不成文规则”(如隐性刻板印象和启发式方法)的问题。其解决方案的关键在于构建一个概念框架,通过案例研究(以科学领域的同行评审为例)来揭示隐藏的规则,即审稿人关注但很少明确陈述的因素。该框架的核心是让LLMs生成自洽的假设,解释为何某篇论文在评分中表现更强,并通过迭代搜索未被现有假设解释的论文对,深入挖掘更深层次的假设。这一过程揭示了LLMs在推理过程中从强调理论严谨性等内在特征向强调外部联系叙述的转变,从而暴露了科学卓越观念中的神话与现实之间的差异。

链接: https://arxiv.org/abs/2505.18942
作者: Honglin Bao,Siyang Wu,Jiwoong Choi,Yingrong Mao,James A. Evans
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:

点击查看摘要

Abstract:This paper calls on the research community not only to investigate how human biases are inherited by large language models (LLMs) but also to explore how these biases in LLMs can be leveraged to make society’s “unwritten code” - such as implicit stereotypes and heuristics - visible and accessible for critique. We introduce a conceptual framework through a case study in science: uncovering hidden rules in peer review - the factors that reviewers care about but rarely state explicitly due to normative scientific expectations. The idea of the framework is to push LLMs to speak out their heuristics through generating self-consistent hypotheses - why one paper appeared stronger in reviewer scoring - among paired papers submitted to 45 computer science conferences, while iteratively searching deeper hypotheses from remaining pairs where existing hypotheses cannot explain. We observed that LLMs’ normative priors about the internal characteristics of good science extracted from their self-talk, e.g. theoretical rigor, were systematically updated toward posteriors that emphasize storytelling about external connections, such as how the work is positioned and connected within and across literatures. This shift reveals the primacy of scientific myths about intrinsic properties driving scientific excellence rather than extrinsic contextualization and storytelling that influence conceptions of relevance and significance. Human reviewers tend to explicitly reward aspects that moderately align with LLMs’ normative priors (correlation = 0.49) but avoid articulating contextualization and storytelling posteriors in their review comments (correlation = -0.14), despite giving implicit reward to them with positive scores. We discuss the broad applicability of the framework, leveraging LLMs as diagnostic tools to surface the tacit codes underlying human society, enabling more precisely targeted responsible AI.
zh

[NLP-225] REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing

【速读】: 该论文试图解决大型语言模型在知识编辑过程中出现的过拟合问题,即事实更新可能超出预期范围,导致编辑目标在语境不适宜的情况下被过度强调。解决方案的关键在于提出REACT(Representation Extraction And Controllable Tuning)框架,该框架包含两个阶段:第一阶段通过定制化刺激提取潜在事实表示,并利用主成分分析结合可学习线性变换计算每个实例的方向性“信念偏移”向量;第二阶段则通过预训练分类器控制扰动幅度,在上下文必要时对隐藏状态进行可控修改,从而实现精确且可控制的知识编辑。

链接: https://arxiv.org/abs/2505.18933
作者: Haitian Zhong,Yuhuan Liu,Ziyang Xu,Guofan Liu,Qiang Liu,Shu Wu,Zhe Zhao,Liang Wang,Tieniu Tan
机构: NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences (国家模式识别重点实验室,自动化研究所,中国科学院); Cuiying Honors College, Lanzhou University (兰州大学萃英学院); Department of Mathematics, The Chinese University of Hong Kong (数学系,香港中文大学); Tencent (腾讯)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it’s contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnbale linear transformation to compute a directional “belief shift” vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE shows that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
zh

[NLP-226] Can Large Language Models Infer Causal Relationships from Real-World Text?

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在真实世界文本中推理因果关系的能力不足的问题。现有研究主要基于合成文本,这些文本包含显性描述的简单因果关系,无法反映现实任务的复杂性。论文的关键解决方案是构建一个来自真实学术文献的基准数据集,该数据集涵盖了不同长度、因果关系复杂度(包括显性与隐性程度、事件数量和因果关系数量)以及领域和子领域的多样化文本,旨在更真实地评估LLMs的因果推理能力。

链接: https://arxiv.org/abs/2505.18931
作者: Ryan Saklad,Aman Chadha,Oleg Pavlov,Raha Moraffah
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work primarily focuses on synthetically generated texts which involve simple causal relationships explicitly mentioned in the text. This fails to reflect the complexities of real-world tasks. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature which includes diverse texts with respect to length, complexity of relationships (different levels of explicitness, number of events, and causal relationships), and domains and sub-domains. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on state-of-the-art LLMs evaluated on our proposed benchmark demonstrate significant challenges, with the best-performing model achieving an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty with implicitly stated information, in distinguishing relevant causal factors from surrounding contextual details, and with connecting causally relevant information spread across lengthy textual passages. By systematically characterizing these deficiencies, our benchmark offers targeted insights for further research into advancing LLM causal reasoning.
zh

[NLP-227] Meta-aware Learning in text-to-SQL Large Language Model

【速读】: 该论文旨在解决文本到SQL(text-to-SQL)任务中由于复杂领域信息和数据库结构带来的挑战。其解决方案的关键在于提出一种元认知学习框架,该框架通过整合领域知识、数据库模式、思维链推理过程以及元数据关系,提升SQL生成的质量。该框架包含四种学习策略:基于模式的学习、思维链(Chain-of-Thought, CoT)学习、知识增强学习和关键信息分词,通过微调大型语言模型(Large Language Models, LLMs),使其更好地理解数据库结构和元数据信息,从而在业务领域内提升SQL生成的准确性与多任务能力,并减少灾难性遗忘问题。

链接: https://arxiv.org/abs/2505.18929
作者: Wenda Zhang
机构: Walmart Global Tech(沃尔玛全球科技)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Keywords: text-to-SQL LLM, fine-tuning, meta-aware leanring, metadata, chain-of-thought, BigQuery SQL, business database

点击查看摘要

Abstract:The advancements of Large language models (LLMs) have provided great opportunities to text-to-SQL tasks to overcome the main challenges to understand complex domain information and complex database structures in business applications. In this paper, we propose a meta-aware learning framework to integrate domain knowledge, database schema, chain-of-thought reasoning processes, and metadata relationships to improve the SQL generation quality. The proposed framework includes four learning strategies: schema-based learning, Chain-of-Thought (CoT) learning, knowledge-enhanced learning, and key information tokenization. This approach provides a comprehensive understanding of database structure and metadata information towards LLM through fine-tuning to improve its performance on SQL generation within business domains. Through two experimental studies, we have demonstrated the superiority of the proposed methods in execution accuracy, multi-task SQL generation capability, and reduction of catastrophic forgetting.
zh

[NLP-228] Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments

【速读】: 该论文旨在解决在线平台评论区中骚扰内容泛滥的问题,此类内容损害用户体验和福祉。研究通过基准测试三种领先的大型语言模型(OpenAI GPT-4.1、Google Gemini 1.5 Pro 和 Anthropic Claude 3 Opus)在包含高骚扰性评论的语料库上的表现,评估其在检测有害消息方面的有效性。解决方案的关键在于构建一个统一的提示框架和确定性设置,以实现模型性能的公平比较,并揭示现有模型在处理讽刺、编码侮辱和混合语言俚语等复杂情境时的局限性。研究还强调了结合互补模型、引入对话上下文以及针对低资源语言和隐性骚扰进行微调的重要性,以提升自动化内容审核系统的整体效果。

链接: https://arxiv.org/abs/2505.18927
作者: Amel Muminovic(International Balkan University)
机构: International Balkan University (国际巴尔干大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. 9 pages, 3 tables, 1 figure. Not yet submitted to a journal. Feedback welcome

点击查看摘要

Abstract:As online platforms grow, comment sections increasingly host harassment that undermines user experience and well-being. This study benchmarks three leading large language models, OpenAI GPT-4.1, Google Gemini 1.5 Pro, and Anthropic Claude 3 Opus, on a corpus of 5,080 YouTube comments sampled from high-abuse threads in gaming, lifestyle, food vlog, and music channels. The dataset comprises 1,334 harmful and 3,746 non-harmful messages in English, Arabic, and Indonesian, annotated independently by two reviewers with substantial agreement (Cohen’s kappa = 0.83). Using a unified prompt and deterministic settings, GPT-4.1 achieved the best overall balance with an F1 score of 0.863, precision of 0.887, and recall of 0.841. Gemini flagged the highest share of harmful posts (recall = 0.875) but its precision fell to 0.767 due to frequent false positives. Claude delivered the highest precision at 0.920 and the lowest false-positive rate of 0.022, yet its recall dropped to 0.720. Qualitative analysis showed that all three models struggle with sarcasm, coded insults, and mixed-language slang. These results underscore the need for moderation pipelines that combine complementary models, incorporate conversational context, and fine-tune for under-represented languages and implicit abuse. A de-identified version of the dataset and full prompts is publicly released to promote reproducibility and further progress in automated content moderation.
zh

[NLP-229] SCRum-9: Multilingual Stance Classification over Rumours on Social Media

【速读】: 该论文试图解决谣言立场分类(Rumour Stance Classification)中的多语言数据不足及标注复杂性问题,其解决方案的关键在于构建了SCRum-9数据集,该数据集包含7,516条来自X的推文-回复对,覆盖9种语言,并链接到2,100个经过事实核查的声明,同时通过多位标注者进行复杂标注以反映标注者内部和跨标注者的一致性差异。

链接: https://arxiv.org/abs/2505.18916
作者: Yue Li,Jake Vasilakes,Zhixue Zhao,Carolina Scarton
机构: University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce SCRum-9, a multilingual dataset for Rumour Stance Classification, containing 7,516 tweet-reply pairs from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages (9), linking examples to more fact-checked claims (2.1k), and including complex annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least three native speakers per language, totalling around 405 hours of annotation and 8,150 dollars in compensation. Experiments on SCRum-9 show that it is a challenging benchmark for both state-of-the-art LLMs (e.g. Deepseek) as well as fine-tuned pre-trained models, motivating future work in this area.
zh

[NLP-230] Federated Retrieval-Augmented Generation: A Systematic Mapping Study

【速读】: 该论文旨在解决在隐私敏感领域中部署大型语言模型时,如何在保护数据隐私的同时提升模型的 factual accuracy(事实准确性)问题。其解决方案的关键在于结合联邦学习(Federated Learning, FL)与检索增强生成(Retrieval-Augmented Generation, RAG),通过联邦学习实现分布式模型训练而不暴露原始数据,同时利用RAG机制将生成结果与外部知识进行对齐,从而在保障隐私的前提下提高自然语言处理任务的知识密集型表现。

链接: https://arxiv.org/abs/2505.18906
作者: Abhijit Chakraborty,Chahana Dahal,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); Westminster University (威斯敏斯特大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Federated Retrieval-Augmented Generation (Federated RAG) combines Federated Learning (FL), which enables distributed model training without exposing raw data, with Retrieval-Augmented Generation (RAG), which improves the factual accuracy of language models by grounding outputs in external knowledge. As large language models are increasingly deployed in privacy-sensitive domains such as healthcare, finance, and personalized assistance, Federated RAG offers a promising framework for secure, knowledge-intensive natural language processing (NLP). To the best of our knowledge, this paper presents the first systematic mapping study of Federated RAG, covering literature published between 2020 and 2025. Following Kitchenham’s guidelines for evidence-based software engineering, we develop a structured classification of research focuses, contribution types, and application domains. We analyze architectural patterns, temporal trends, and key challenges, including privacy-preserving retrieval, cross-client heterogeneity, and evaluation limitations. Our findings synthesize a rapidly evolving body of research, identify recurring design patterns, and surface open questions, providing a foundation for future work at the intersection of RAG and federated systems.
zh

[NLP-231] Building a Functional Machine Translation Corpus for Kpelle

【速读】: 该论文旨在解决低资源语言Kpelle在机器翻译领域的数据稀缺问题,通过构建首个公开可用的英语-Kpelle平行语料库,推动该语言的自然语言处理(Natural Language Processing, NLP)研究。解决方案的关键在于对Meta的No Language Left Behind (NLLB)模型进行微调,并利用数据增强技术,在Kpelle到英语方向上实现了高达30的BLEU分数,验证了数据扩展的有效性。此外,该数据集还为语音识别和语言建模等更广泛的NLP任务提供了支持。

链接: https://arxiv.org/abs/2505.18905
作者: Kweku Andoh Yamoah,Jackson Weako,Emmanuel J. Dorley
机构: University of Florida (佛罗里达大学); Liberian Language Institute (利比里亚语言研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta’s No Language Left Behind(NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle’s potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resourced Mande languages.
zh

[NLP-232] StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos

【速读】: 该论文旨在改进当前的幽默检测计算模型,通过提出一个包含七个语言(英语、法语、西班牙语、意大利语、葡萄牙语、匈牙利语和捷克语)的脱口秀多模态数据集来解决这一问题。该数据集超过330小时,是撰写时最大且最多样化的此类数据集。其关键解决方案在于将幽默检测任务从传统的二元序列分类框架转变为词级序列标注,以考虑序列的全部上下文并捕捉自然对话中连续的笑话标记机制。此外,作者还提出了一种基于语音识别错误增强自动笑声检测的方法。

链接: https://arxiv.org/abs/2505.18903
作者: Valentin Barriere,Nahuel Gomez,Leo Hemamou,Sofia Callejas,Brian Ravenet
机构: Universidad de Chile – DCC (智利大学-计算机科学系); Universidad de Chile – DIE (智利大学-电气工程系); INRIA Chile (法国国家信息与自动化研究所智利分部); Université Paris Saclay – LISN (巴黎萨克雷大学-里斯纳实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aiming towards improving current computational models of humor detection, we propose a new multimodal dataset of stand-up comedies, in seven languages: English, French, Spanish, Italian, Portuguese, Hungarian and Czech. Our dataset of more than 330 hours, is at the time of writing the biggest available for this type of task, and the most diverse. The whole dataset is automatically annotated in laughter (from the audience), and the subpart left for model validation is manually annotated. Contrary to contemporary approaches, we do not frame the task of humor detection as a binary sequence classification, but as word-level sequence labeling, in order to take into account all the context of the sequence and to capture the continuous joke tagging mechanism typically occurring in natural conversations. As par with unimodal baselines results, we propose a method for e propose a method to enhance the automatic laughter detection based on Audio Speech Recognition errors. Our code and data are available online: this https URL
zh

[NLP-233] CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

【速读】: 该论文试图解决当前AI代理在商业场景中性能评估缺乏真实、公开数据的问题,以及现有基准在环境、数据和代理-用户交互方面的保真度不足,难以覆盖多样化的业务场景和行业。其解决方案的关键在于提出CRMArena-Pro,这是一个针对LLM代理在多样化专业场景中进行全面、真实评估的新基准,包含十九个专家验证的任务,涵盖销售、服务及“配置、定价与报价”流程,并引入由多种角色引导的多轮交互和强大的保密意识评估机制。

链接: https://arxiv.org/abs/2505.18878
作者: Kung-Hsiang Huang,Akshara Prabhakar,Onkar Thorat,Divyansh Agarwal,Prafulla Kumar Choubey,Yixin Mao,Silvio Savarese,Caiming Xiong,Chien-Sheng Wu
机构: Salesforce AI Research (Salesforce人工智能研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and ‘configure, price, and quote’ processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.
zh

[NLP-234] Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing ACL2025

【速读】: 该论文旨在解决跨领域科学信息通俗化表述(lay paraphrasing)的问题,传统研究多集中于单一领域如生物医学,而随着交叉学科研究的兴起,亟需一种能够理解和转换多个技术领域知识的方法。其解决方案的关键在于提出Sci-LoRA模型,该模型通过在多个科学领域上微调的LoRA(Low-Rank Adaptation)混合体进行动态权重生成与应用,从而根据输入文本调整不同领域的影响力,无需显式领域标签。此外,Sci-LoRA在数据和模型层面整合信息,实现领域专业知识与泛化能力的平衡,提升了跨领域通俗化表述的适应性与性能。

链接: https://arxiv.org/abs/2505.18867
作者: Ming Cheng,Jiaying Gong,Hoda Eldardiry
机构: Virginia Tech (弗吉尼亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 3 figures, ACL 2025 Findings

点击查看摘要

Abstract:Lay paraphrasing aims to make scientific information accessible to audiences without technical backgrounds. However, most existing studies focus on a single domain, such as biomedicine. With the rise of interdisciplinary research, it is increasingly necessary to comprehend knowledge spanning multiple technical fields. To address this, we propose Sci-LoRA, a model that leverages a mixture of LoRAs fine-tuned on multiple scientific domains. In particular, Sci-LoRA dynamically generates and applies weights for each LoRA, enabling it to adjust the impact of different domains based on the input text, without requiring explicit domain labels. To balance domain-specific knowledge and generalization across various domains, Sci-LoRA integrates information at both the data and model levels. This dynamic fusion enhances the adaptability and performance across various domains. Experimental results across twelve domains on five public datasets show that Sci-LoRA significantly outperforms state-of-the-art large language models and demonstrates flexible generalization and adaptability in cross-domain lay paraphrasing.
zh

[NLP-235] Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework

【速读】: 该论文试图解决语音增强的多模态大语言模型(Multimodal Large Language Models, MLLMs)在语音输入场景下的安全漏洞问题,特别是针对其对对抗性攻击的脆弱性。解决方案的关键在于提出一种基于令牌级别的对抗攻击方法,该方法利用对模型语音分词机制的访问权限,生成能够绕过对齐保护并诱导违规输出的对抗性令牌序列,并将其合成音频提示以实现攻击目标。

链接: https://arxiv.org/abs/2505.18864
作者: Binhao Ma,Hanqing Guo,Zhengping Jay Luo,Rui Duan
机构: University of Missouri-Kansas City (密苏里大学堪萨斯城分校); Rider University (里德大学); University of Hawai’i at Mānoa (夏威夷大学马诺阿分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive, and emotionally responsive interactions that foster deeper connections in real world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech to text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white box scenario. Specifically, we introduce a novel token level attack that leverages access to the model’s speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts, which effectively bypass alignment safeguards and to induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to 89 percent attack success rate across multiple restricted tasks, significantly outperforming existing voice based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and to help guide the development of more robust next-generation MLLMs.
zh

[NLP-236] Writing Like the Best: Exemplar-Based Expository Text Generation ACL2025

【速读】: 该论文试图解决在生成说明性文本时,现有方法依赖大量示例数据、难以适应特定主题内容以及长文本连贯性不足的问题。解决方案的关键在于提出自适应模仿(Adaptive Imitation)的概念,并引入一种新颖的递归计划-适应(RePA)框架,该框架通过细粒度的计划-适应过程利用大语言模型(LLMs)实现有效的自适应模仿,同时通过两种记忆结构增强输入清晰度和输出连贯性。

链接: https://arxiv.org/abs/2505.18859
作者: Yuxiang Liu,Kevin Chen-Chuan Chang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025. Camera-ready version

点击查看摘要

Abstract:We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics–imitativeness, adaptiveness, and adaptive-imitativeness–using LLMs as evaluators. Experimental results across our collected three diverse datasets demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.
zh

[NLP-237] Inference Compute-Optimal Video Vision Language Models ACL

【速读】: 该论文试图解决在固定推理计算预算下,视频视觉语言模型中语言模型规模、帧数以及每帧视觉标记数量这三个关键扩展因素的最优分配问题(optimal allocation of inference compute)。传统方法通常专注于优化模型效率或提升性能,而未考虑资源约束,本文则通过大规模训练扫描和精细的参数建模来确定计算最优边界。解决方案的关键在于通过实验分析任务性能与扩展因素及微调数据量之间的关系,并揭示数据量变化对计算最优边界的动态影响,从而为实际选择扩展因素提供指导。

链接: https://arxiv.org/abs/2505.18855
作者: Peiqi Wang,ShengYun Peng,Xuewen Zhang,Hanchao Yu,Yibo Yang,Lifu Huang,Fujun Liu,Qifan Wang
机构: MIT(麻省理工学院); Georgia Tech(佐治亚理工学院); Meta(元); UC Davis(加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Annual Meeting of the Association for Computational Linguistics (ACL), 2025

点击查看摘要

Abstract:This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior works typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
zh

[NLP-238] Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

【速读】: 该论文试图解决扩散模型在文本生成任务中适应性差的问题,尤其是由于文本的离散性质导致的语义结构保留与词元解码之间的矛盾。解决方案的关键在于提出一种名为Smoothie的新颖扩散方法,该方法通过基于语义相似性的逐步平滑词元嵌入,结合了连续潜在空间和类别单纯形空间的优势,从而在保持自然解码过程的同时实现渐进的信息去除。

链接: https://arxiv.org/abs/2505.18853
作者: Alexander Shabalin,Viacheslav Meshchaninov,Dmitry Vetrov
机构: HSE University (高等经济大学); Constructor University (构造大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 2 figures, 8 tables

点击查看摘要

Abstract:Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic relation between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at this https URL.
zh

[NLP-239] Signal Image or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework

【速读】: 该论文试图解决在生成式心电图语言模型(Electrocardiogram-Language Models, ELMs)中,如何选择最有效的心电图(ECG)输入表示形式的问题。其解决方案的关键在于对三种候选表示形式——原始时间序列信号、渲染图像和离散符号序列进行系统性比较,并通过多数据集和多评估指标的实验验证,确定符号表示在统计显著性方面优于其他两种形式。此外,研究还通过消融实验分析了模型架构、ECG持续时间和令牌预算等因素的影响,以提供下一代ELMs开发的明确指导。

链接: https://arxiv.org/abs/2505.18847
作者: William Han,Chaojing Duan,Zhepeng Cen,Yihang Yao,Xiaoyu Song,Atharva Mhaskar,Dylan Leong,Michael A. Rosenberg,Emerson Liu,Ding Zhao
机构: Carnegie Mellon University (卡内基梅隆大学); Allegheny Health Network (阿勒格尼健康网络); University of Colorado (科罗拉多大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 2 figures, 8 tables

点击查看摘要

Abstract:Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, analyzing waveform morphology, identifying contributing factors, and proposing patient-specific action plans. To realize this potential, researchers are curating instruction-tuning datasets that pair ECGs with textual dialogues and are training ELMs on these resources. Yet before scaling ELMs further, there is a fundamental question yet to be explored: What is the most effective ECG input representation? In recent works, three candidate representations have emerged-raw time-series signals, rendered images, and discretized symbolic sequences. We present the first comprehensive benchmark of these modalities across 6 public datasets and 5 evaluation metrics. We find symbolic representations achieve the greatest number of statistically significant wins over both signal and image inputs. We further ablate the LLM backbone, ECG duration, and token budget, and we evaluate robustness to signal perturbations. We hope that our findings offer clear guidance for selecting input representations when developing the next generation of ELMs.
zh

[NLP-240] Multi-Party Conversational Agents : A Survey

【速读】: 该论文旨在解决多参与者对话代理(Multi-party Conversational Agents, MPCAs)在同时与多个参与者进行对话时所面临的挑战,包括对参与者心理状态的建模、对话内容的理解以及对话流的推理与预测。其解决方案的关键在于引入心智理论(Theory of Mind, ToM),以实现对参与者心理状态的准确建模,并结合传统机器学习、大型语言模型(Large Language Models, LLMs)和多模态系统等方法提升对话理解与生成能力。研究强调了ToM在构建智能MPCAs中的核心作用,并指出多模态理解是未来具有潜力但尚未充分探索的方向。

链接: https://arxiv.org/abs/2505.18845
作者: Sagar Sapkota,Mohammad Saqib Hasan,Mubarak Shah,Santu Karmaker
机构: University of Central Florida (佛罗里达中部大学); Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-party Conversational Agents (MPCAs) are systems designed to engage in dialogue with more than two participants simultaneously. Unlike traditional two-party agents, designing MPCAs faces additional challenges due to the need to interpret both utterance semantics and social dynamics. This survey explores recent progress in MPCAs by addressing three key questions: 1) Can agents model each participants’ mental states? (State of Mind Modeling); 2) Can they properly understand the dialogue content? (Semantic Understanding); and 3) Can they reason about and predict future conversation flow? (Agent Action Modeling). We review methods ranging from classical machine learning to Large Language Models (LLMs) and multi-modal systems. Our analysis underscores Theory of Mind (ToM) as essential for building intelligent MPCAs and highlights multi-modal understanding as a promising yet underexplored direction. Finally, this survey offers guidance to future researchers on developing more capable MPCAs.
zh

[NLP-241] Dont Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中缺乏对视觉信息的动态访问问题,即当前MLLMs通常仅在初始阶段处理视觉输入并依赖内部记忆进行推理,无法在推理过程中根据需要重新访问视觉内容。解决方案的关键在于引入v1,这是一种轻量级扩展,通过简单的“点击与复制”机制,使模型能够在推理过程中动态检索相关的图像区域,从而实现基于模型演化假设的上下文感知视觉标记访问。

链接: https://arxiv.org/abs/2505.18842
作者: Jiwan Chung,Junhyeok Kim,Siyeol Kim,Jaeyoung Lee,Min Soo Kim,Youngjae Yu
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model’s evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks – MathVista, MathVision, and MathVerse – demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.
zh

[NLP-242] On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

【速读】: 该论文试图解决在基于强化学习(Reinforcement Learning, RL)的大型语言模型(Large Language Models, LLMs)训练过程中出现的Lazy Likelihood Displacement (LLD)问题,即正确响应的似然在训练过程中出现轻微增加甚至下降的现象。解决方案的关键在于提出一种名为NTHR的方法,该方法通过降低对导致LLD的token的惩罚权重,从而缓解这一问题。NTHR利用GRPO的分组结构,以正确响应作为锚点来识别具有影响力的token,相较于基于DPO的方法更具针对性和有效性。

链接: https://arxiv.org/abs/2505.18830
作者: Wenlong Deng,Yi Ren,Muchen Li,Danica J. Sutherland,Xiaoxiao Li,Christos Thrampoulidis
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute (向量研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO’s widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO’s learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO’s group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
zh

[NLP-243] AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

【速读】: 该论文旨在解决现代大型推理模型在效率与效果之间难以平衡的问题,特别是在处理简单问题时常常生成冗长的推理链。其解决方案的关键在于提出AdaCtrl框架,该框架通过难度感知的自适应推理预算分配和显式的用户控制来优化推理过程。AdaCtrl能够根据自我评估的问题难度动态调整推理长度,并允许用户手动控制预算以优先考虑效率或效果,其核心实现依赖于两阶段训练流程:初始冷启动微调阶段和难度感知强化学习阶段,从而提升模型的自适应推理能力和难度评估准确性。

链接: https://arxiv.org/abs/2505.18822
作者: Shijue Huang,Hongru Wang,Wanjun Zhong,Zhaochen Su,Jiazhan Feng,Bowen Cao,Yi R. Fung
机构: Hong Kong University of Science and Technology (香港科技大学); The Chinese University of Hong Kong (香港中文大学); Peking University (北京大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget allocation and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase to instill the ability to self-aware difficulty and adjust reasoning budget, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model’s adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts reasoning length based on estimated difficulty, compared to the standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements and simultaneously reduces response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing for tailored responses to meet specific needs.
zh

[NLP-244] ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models

【速读】: 该论文试图解决通用大语言模型(Large Language Models, LLMs)在对齐下游任务时所面临的高昂成本问题,包括构建任务特定的指令对和进行大量训练调整。解决方案的关键在于提出一种名为ALPS(Attention Localization and Pruning Strategy)的高效算法,该算法通过定位最任务敏感的注意力头并限制注意力训练更新到这些头,从而减少对齐成本。实验结果表明,该方法在微调过程中仅激活10%的注意力参数,同时在三个任务上相比基线模型提升了2%的性能,并且识别出的任务特定注意力头具有跨数据集的可迁移性,有助于缓解知识遗忘问题。

链接: https://arxiv.org/abs/2505.18799
作者: Hao Chen,Haoze Li,Zhiqing Xiao,Lirong Gao,Qi Zhang,Xiaomeng Hu,Ningtao Wang,Xing Fu,Junbo Zhao
机构: Zhejiang University(浙江大学); Ant Group(蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures, 14 tables

点击查看摘要

Abstract:Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant costs, including constructing task-specific instruction pairs and extensive training adjustments. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the \textit\textbfAttention \textbfLocalization and \textbfPruning \textbfStrategy (\textbfALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only \textbf10% of attention parameters during fine-tuning while achieving a \textbf2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.
zh

[NLP-245] From Output to Evaluation: Does Raw Instruction-Tuned Code LLM s Output Suffice for Fill-in-the-Middle Code Generation?

【速读】: 该论文试图解决在填充中间(fill-in-the-middle, FIM)代码生成任务中,大型语言模型(Large Language Models, LLMs)输出中频繁出现的冗余代码问题,这一问题影响了自动评估的准确性。解决方案的关键在于通过监督微调(supervised fine-tuning)提升LLMs生成与上下文无缝集成的代码能力,从而减少对后处理的需求。研究结果表明,经过微调的Qwen2.5-Coder模型在HumanEval Infilling和SAFIM基准测试中表现出色,尤其是在中间部分为完整代码行时无需后处理,但在中间部分为随机代码片段时仍需进行后处理。

链接: https://arxiv.org/abs/2505.18789
作者: Wasi Uddin Ahmad,Somshubra Majumdar,Boris Ginsburg
机构: NVIDIA(英伟达)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned \textttQwen2.5-Coder (base and instruct) models on HumanEval Infilling and SAFIM benchmarks demonstrates improved performances without post-processing, especially when the \emphmiddle consist of complete lines. However, post-processing of the LLM outputs remains necessary when the \emphmiddle is a random span of code.
zh

[NLP-246] A generalised editor calculus (Short Paper)

【速读】: 该论文试图解决如何为任意语言构造一个专用的语法导向编辑器的问题,该编辑器能够在允许不完整程序存在的情况下保证语法错误的不存在性。解决方案的关键在于提出一种广义的语法导向编辑演算,并将其编码为一种扩展的简单类型lambda演算,该演算包含元组、布尔值、模式匹配和不动点。

链接: https://arxiv.org/abs/2505.18778
作者: Benjamin Bennetzen,Peter Buus Steffensen,Hans Hüttel,Nikolaj Rossander Kristensen,Andreas Tor Mortensen
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 21 figures

点击查看摘要

Abstract:In this paper, we present a generalization of a syntax-directed editor calculus, which can be used to instantiate a specialized syntax-directed editor for any language, given by some abstract syntax. The editor calculus guarantees the absence of syntactical errors while allowing incomplete programs. The generalized editor calculus is then encoded into a simply typed lambda calculus, extended with pairs, booleans, pattern matching and fixed points
zh

[NLP-247] Disentangling Knowledge Representations for Large Language Model Editing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行知识编辑时,无法有效保持与已编辑知识共享相同主体但关系和对象不同的细粒度无关知识事实的问题。这一挑战源于主体表示中编码了多个属性,导致目标知识与细粒度无关知识在表示空间中发生纠缠,从而在编辑过程中容易受到意外修改。解决方案的关键在于提出DiKE(Disentangles Knowledge representations for LLM Editing),其核心包括两个组件:知识表示解缠模块(Knowledge Representation Disentanglement, KRD)用于将主体表示分解为目标相关和无关部分,以及基于解缠的知识编辑模块(Disentanglement-based Knowledge Edit, DKE)用于仅更新目标相关部分并显式保留无关部分。此外,DiKE通过矩阵理论推导出一种闭式、秩一参数更新方法,以实现高效且侵入性最小的编辑。

链接: https://arxiv.org/abs/2505.18774
作者: Mengqi Zhang,Zisheng Zhou,Xiaotian Ye,Qiang Liu,Zhaochun Ren,Zhumin Chen,Pengjie Ren
机构: Shandong University (山东大学); Beijing University of Posts and Telecommunications (北京邮电大学); NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所模式识别国家重点实验室和多媒体与智能系统研究中心); Leiden University (莱顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Knowledge Editing has emerged as a promising solution for efficiently updating embedded knowledge in large language models (LLMs). While existing approaches demonstrate effectiveness in integrating new knowledge and preserving the original capabilities of LLMs, they fail to maintain fine-grained irrelevant knowledge facts that share the same subject as edited knowledge but differ in relation and object. This challenge arises because subject representations inherently encode multiple attributes, causing the target and fine-grained irrelevant knowledge to become entangled in the representation space, and thus vulnerable to unintended alterations during editing. To address this, we propose DiKE, a novel approach that Disentangles Knowledge representations for LLM Editing (DiKE). DiKE consists of two key components: a Knowledge Representation Disentanglement (KRD) module that decomposes the subject representation into target-knowledgerelated and -unrelated components, and a Disentanglement-based Knowledge Edit (DKE) module that updates only the target-related component while explicitly preserving the unrelated one. We further derive a closed-form, rank-one parameter update based on matrix theory to enable efficient and minimally invasive edits. To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge. Extensive experiments across multiple LLMs demonstrate that DiKE substantially improves fine-grained irrelevant knowledge preservation while maintaining competitive general editing performance.
zh

[NLP-248] owards an automatic method for generating topical vocabulary test forms for specific reading passages

【速读】: 该论文试图解决在STEM等特定领域中,如何有效评估学生背景知识以预测其对特定文本理解能力的问题。现有方法缺乏可实时部署和评分的自动化测量工具。解决方案的关键在于开发K-tool系统,该系统能够自动检测文本主题并生成与主题相关的词汇测试,通过识别高度相关词汇及具有相似特征但关联性较低的词汇,构建背景知识测评形式,从而反映学生的知识状态。

链接: https://arxiv.org/abs/2505.18762
作者: Michael Flor,Zuowei Wang,Paul Deane,Tenaha O’Reilly
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: This manuscript was accepted to be published as an ETS Research Report. Keywords topics; vocabulary; background knowledge; automatic item generation; assessment; reading comprehension

点击查看摘要

Abstract:Background knowledge is typically needed for successful comprehension of topical and domain specific reading passages, such as in the STEM domain. However, there are few automated measures of student knowledge that can be readily deployed and scored in time to make predictions on whether a given student will likely be able to understand a specific content area text. In this paper, we present our effort in developing K-tool, an automated system for generating topical vocabulary tests that measure students’ background knowledge related to a specific text. The system automatically detects the topic of a given text and produces topical vocabulary items based on their relationship with the topic. This information is used to automatically generate background knowledge forms that contain words that are highly related to the topic and words that share similar features but do not share high associations to the topic. Prior research indicates that performance on such tasks can help determine whether a student is likely to understand a particular text based on their knowledge state. The described system is intended for use with middle and high school student population of native speakers of English. It is designed to handle single reading passages and is not dependent on any corpus or text collection. In this paper, we describe the system architecture and present an initial evaluation of the system outputs.
zh

[NLP-249] How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对系统性控制的无关上下文(irrelevant context, IC)时推理鲁棒性不足的问题。解决方案的关键在于构建一个名为Grade School Math with Distracting Context (GSM-DC)的合成基准,通过精确注入干扰项来构造符号推理图,从而实现对模型推理路径选择和算术准确性的严格评估。此外,论文提出了一种基于过程奖励模型引导的分步树搜索方法,以提升模型在分布外场景下的鲁棒性。

链接: https://arxiv.org/abs/2505.18761
作者: Minglai Yang,Ethan Huang,Liang Zhang,Mihai Surdeanu,William Wang,Liangming Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 9 figure, 4 tables

点击查看摘要

Abstract:We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models’ (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
zh

[NLP-250] Few-Shot Optimization for Sensor Data Using Large Language Models : A Case Study on Fatigue Detection

【速读】: 该论文旨在解决传感器基础分类任务中少样本示例选择的质量问题,特别是在面对具有重叠模式和高个体差异的生理信号时,如何提升示例选择的准确性。解决方案的关键在于提出一种基于混合欧几里得距离与大语言模型(Large Language Models, LLMs)的优化方法(HED-LM),通过结合数值相似性与上下文相关性进行候选示例的筛选与重排序,从而提高少样本提示的鲁棒性与性能。

链接: https://arxiv.org/abs/2505.18754
作者: Elsen Ronando,Sozo Inoue
机构: 未知
类目: Computation and Language (cs.CL)
备注: 43 pages, 18 figures. Accepted for publication in MDPI Sensors (2025). Final version before journal publication

点击查看摘要

Abstract:In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 \pm 10.71%, outperforming both random selection (59.30 \pm 10.13%) and distance-only filtering (67.61 \pm 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.
zh

[NLP-251] Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning

【速读】: 该论文试图解决大语言模型中上下文学习(in-context learning, ICL)的内部机制问题,特别是如何通过统一框架将注意力头与任务向量联系起来,以解释隐藏状态在不同层中的演化过程。解决方案的关键在于分析影响分类任务性能的两个几何因素:查询隐藏状态的可分性和对齐性,并通过分层动态分析揭示出一个两阶段机制:早期层中出现可分性,后期层中发展对齐性。此外,消融实验表明,先前标记头驱动可分性,而归纳头和任务向量增强对齐性,从而为ICL的底层机制提供了统一的解释。

链接: https://arxiv.org/abs/2505.18752
作者: Haolin Yang,Hakaze Cho,Yiqiao Zhong,Naoya Inoue
机构: University of Chicago (芝加哥大学); JAIST (日本信息科学技术学院); University of Wisconsin - Madison (威斯康星大学麦迪逊分校); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注: 45 pages, 49 figures

点击查看摘要

Abstract:The unusual properties of in-context learning (ICL) have prompted investigations into the internal mechanisms of large language models. Prior work typically focuses on either special attention heads or task vectors at specific layers, but lacks a unified framework linking these components to the evolution of hidden states across layers that ultimately produce the model’s output. In this paper, we propose such a framework for ICL in classification tasks by analyzing two geometric factors that govern performance: the separability and alignment of query hidden states. A fine-grained analysis of layer-wise dynamics reveals a striking two-stage mechanism: separability emerges in early layers, while alignment develops in later layers. Ablation studies further show that Previous Token Heads drive separability, while Induction Heads and task vectors enhance alignment. Our findings thus bridge the gap between attention heads and task vectors, offering a unified account of ICL’s underlying mechanisms.
zh

[NLP-252] LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges

【速读】: 该论文旨在解决现有文本到SQL(Text-to-SQL)数据集在领域专业知识和复杂数学推理方面的不足,从而推动更强大的推理驱动型文本到SQL系统的发展。其关键解决方案是提出一个新型数据集LogicCat,该数据集专注于复杂推理和思维链分析,涵盖物理、算术、常识和假设推理,并包含4,038个英文问题、对应的唯一SQL查询以及12,114条分步推理注释,覆盖45个不同领域的数据库。

链接: https://arxiv.org/abs/2505.18744
作者: Tao Liu,Hongying Zan,Yifan Li,Dixuan Zhang,Lulu Kong,Haixin Liu,Jiaming Hou,Aoze Zheng,Rui Li,Yiming Qiao,Zewei Luo,Qi Wang,Zhiqiang Zhang,Jiaxi Li,Supeng Liu,Kunli Zhang,Min Peng
机构: 未知
类目: Computation and Language (cs.CL)
备注: 22 pages, 10 figures

点击查看摘要

Abstract:Text-to-SQL is a fundamental task in natural language processing that seeks to translate natural language questions into meaningful and executable SQL queries. While existing datasets are extensive and primarily focus on business scenarios and operational logic, they frequently lack coverage of domain-specific knowledge and complex mathematical reasoning. To address this gap, we present a novel dataset tailored for complex reasoning and chain-of-thought analysis in SQL inference, encompassing physical, arithmetic, commonsense, and hypothetical reasoning. The dataset consists of 4,038 English questions, each paired with a unique SQL query and accompanied by 12,114 step-by-step reasoning annotations, spanning 45 databases across diverse domains. Experimental results demonstrate that LogicCat substantially increases the difficulty for state-of-the-art models, with the highest execution accuracy reaching only 14.96%. Incorporating our chain-of-thought annotations boosts performance to 33.96%. Benchmarking leading public methods on Spider and BIRD further underscores the unique challenges presented by LogicCat, highlighting the significant opportunities for advancing research in robust, reasoning-driven text-to-SQL systems. We have released our dataset code at this https URL.
zh

[NLP-253] Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization ACL2025

【速读】: 该论文试图解决现有直接偏好优化(Direct Preference Optimization, DPO)方法在对齐大语言模型(Large Language Models, LLMs)与人类偏好时存在的不足,即所有令牌被赋予相等的重要性,而人类更关注语义上有意义的部分,导致无关或噪声令牌对DPO损失产生不成比例的影响。解决方案的关键在于提出一种基于最优传输的令牌加权方案(Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization, OTPO),通过强调语义上有意义的令牌对并弱化不相关部分,引入一种上下文感知的令牌加权机制,从而获得更对比性的奖励差异估计。

链接: https://arxiv.org/abs/2505.18720
作者: Meng Li,Guangda Huzhang,Haibo Zhang,Xiting Wang,Anxiang Zeng
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学人工智能学院); LLM Team, Shopee Pte. Ltd. (Shopee 大语言模型团队); Beijing Key Laboratory of Research on Large Models and Intelligent Governance and Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE (大型模型与智能治理北京市重点实验室和下一代智能搜索与推荐工程研究中心,教育部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 11 figures. Accepted by ACL 2025 (main)

点击查看摘要

Abstract:Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose \textbfOptimal \textbfTransport-based token weighting scheme for enhancing direct \textbfPreference \textbfOptimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO’s effectiveness in improving instruction-following ability across various settings\footnoteCode is available at this https URL…
zh

[NLP-254] Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer ACL2025

【速读】: 该论文旨在解决微调模型在特定领域外表现不佳以及模型冗余的问题。其关键解决方案是通过引入一种名为神经参数搜索(Neural Parameter Search, NPS-Pruning)的新方法,利用任务向量机制对微调模型进行预处理,计算其与原始预训练模型的差异,并在低秩子空间中搜索任务向量的神经参数以提高剪枝效率。该方法有效提升了知识迁移、知识融合及压缩模型部署的效果。

链接: https://arxiv.org/abs/2505.18713
作者: Guodong Du,Zitao Fang,Jing Li,Junlin Li,Runhua Jiang,Shuyang Yu,Yifei Guo,Yangneng Chen,Sim Kuan Goh,Ho-Kin Tang,Daojing He,Honghai Liu,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Xiamen University Malaysia (厦门大学马来西亚校区)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by ACL2025 Main

点击查看摘要

Abstract:Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: this https URL.
zh

[NLP-255] Improving Bangla Linguistics: Advanced LSTM Bi-LSTM and Seq2Seq Models for Translating Sylheti to Modern Bangla

【速读】: 该论文试图解决将标准或现代孟加拉语(Bangla)翻译为当地口语的锡尔赫特语(Sylheti)的问题,以促进本地语言的交流与技术发展。解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)技术,构建一个综合系统,通过训练三种模型(LSTM、Bi-LSTM 和 Seq2Seq),其中 LSTM 模型在性能上表现最佳,达到了 89.3% 的准确率。

链接: https://arxiv.org/abs/2505.18709
作者: Sourav Kumar Das,Md. Julkar Naeen,MD. Jahidul Islam,Md. Anisul Haque Sajeeb,Narayan Ranjan Chakraborty,Mayen Uddin Mojumdar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT)

点击查看摘要

Abstract:Bangla or Bengali is the national language of Bangladesh, people from different regions don’t talk in proper Bangla. Every division of Bangladesh has its own local language like Sylheti, Chittagong etc. In recent years some papers were published on Bangla language like sentiment analysis, fake news detection and classifications, but a few of them were on Bangla languages. This research is for the local language and this particular paper is on Sylheti language. It presented a comprehensive system using Natural Language Processing or NLP techniques for translating Pure or Modern Bangla to locally spoken Sylheti Bangla language. Total 1200 data used for training 3 models LSTM, Bi-LSTM and Seq2Seq and LSTM scored the best in performance with 89.3% accuracy. The findings of this research may contribute to the growth of Bangla NLP researchers for future more advanced innovations.
zh

[NLP-256] A General Knowledge Injection Framework for ICD Coding ACL2025

【速读】: 该论文旨在解决医疗文本的ICD编码任务中长期存在的长尾分布问题和代码特定证据标注不足的问题。现有方法通常仅关注单一类型的知识,并设计复杂的专用模块,导致可扩展性和效果受限。论文提出的解决方案关键在于GKI-ICD框架,该框架无需额外设计专用模块,即可集成ICD描述、ICD同义词和ICD层级三种关键知识,充分利用其差异性和互补性,从而有效提升ICD编码性能。

链接: https://arxiv.org/abs/2505.18708
作者: Xu Zhang,Kun Zhang,Wenxin Ma,Rongsheng Wang,Chenxu Wu,Yingtai Li,S. Kevin Zhou
机构: USTC(中国科学技术大学); MIRACLE Center, Suzhou Institute for Advance Research, USTC(多功能智能计算与先进研究苏州中心,中国科学技术大学); Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology(江苏省多模态数字孪生技术重点实验室); State Key Laboratory of Precision and Intelligent Chemistry, USTC(精密与智能化学国家重点实验室,中国科学技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Findings

点击查看摘要

Abstract:ICD Coding aims to assign a wide range of medical codes to a medical text document, which is a popular and challenging task in the healthcare domain. To alleviate the problems of long-tail distribution and the lack of annotations of code-specific evidence, many previous works have proposed incorporating code knowledge to improve coding performance. However, existing methods often focus on a single type of knowledge and design specialized modules that are complex and incompatible with each other, thereby limiting their scalability and effectiveness. To address this issue, we propose GKI-ICD, a novel, general knowledge injection framework that integrates three key types of knowledge, namely ICD Description, ICD Synonym, and ICD Hierarchy, without specialized design of additional modules. The comprehensive utilization of the above knowledge, which exhibits both differences and complementarity, can effectively enhance the ICD coding performance. Extensive experiments on existing popular ICD coding benchmarks demonstrate the effectiveness of GKI-ICD, which achieves the state-of-the-art performance on most evaluation metrics. Code is available at this https URL.
zh

[NLP-257] owards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task

【速读】: 该论文试图解决如何在语义上下文中统一整合观点(opinion)的问题,其核心挑战在于不同表述形式下观点语义表示的差异性。解决方案的关键在于提出统一的观点概念本体(Unified Opinion Concepts, UOC),该本体基于自然语言处理(NLP)中广泛研究的观点维度以及通过符号描述的语义结构,实现了观点的统一表征。此外,论文还提出了统一观点概念抽取(Unified Opinion Concept Extraction, UOCE)任务,并构建了手动扩展和重新标注的评估数据集及定制化的评估指标,以衡量抽取观点与UOC语义的一致性。

链接: https://arxiv.org/abs/2505.18703
作者: Gaurav Negi,Dhairya Dalal,Omnia Zayed,Paul Buitelaar
机构: Insight SFI Research Centre for Data Analytics (Insight SFI 数据分析研究中心); Data Science Institute (数据科学研究所); University of Galway (加利韦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces the Unified Opinion Concepts (UOC) ontology to integrate opinions within their semantic context. The UOC ontology bridges the gap between the semantic representation of opinion across different formulations. It is a unified conceptualisation based on the facets of opinions studied extensively in NLP and semantic structures described through symbolic descriptions. We further propose the Unified Opinion Concept Extraction (UOCE) task of extracting opinions from the text with enhanced expressivity. Additionally, we provide a manually extended and re-annotated evaluation dataset for this task and tailored evaluation metrics to assess the adherence of extracted opinions to UOC semantics. Finally, we establish baseline performance for the UOCE task using state-of-the-art generative models.
zh

[NLP-258] Benchmarking and Rethinking Knowledge Editing for Large Language Models

【速读】: 该论文试图解决现有知识编辑方法在评估目标和实验设置上存在不一致的问题,以及参数修改或外部记忆整合等方法在现实条件下表现不佳的局限性。其解决方案的关键在于引入一种名为Selective Contextual Reasoning (SCR)的简单且直接的基线方法,该方法通过基于上下文的推理实现更稳健的知识更新,实验证明其在多种评估维度和场景下均优于现有方法。

链接: https://arxiv.org/abs/2505.18690
作者: Guoxiu He,Xin Song,Futing Wang,Aixin Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2503.05212

点击查看摘要

Abstract:Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
zh

[NLP-259] Large Language Models in the Task of Automatic Validation of Text Classifier Predictions

【速读】: 该论文试图解决文本分类中机器学习模型训练与验证样本收集过程中依赖人工标注者所带来的劳动密集、成本高及可扩展性差的问题。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)替代人类标注者,以测试分类器预测的正确性,从而确保模型质量并支持高质量的增量学习。

链接: https://arxiv.org/abs/2505.18688
作者: Aleksandr Tsymbalov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model’s entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.
zh

[NLP-260] From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

【速读】: 该论文试图解决健康领域中信息过载(infodemics)和健康虚假信息(health misinformation)的传播问题,特别是生成式 AI (Generative AI) 技术加速了虚假信息的扩散。解决方案的关键在于构建一个大规模多模态虚假信息数据集 MM Health,该数据集包含34,746篇新闻文章,涵盖文本和视觉信息,并区分了人类生成内容(5,776篇)与 AI 生成内容(28,880篇)。此外,作者通过三个任务(可靠性检查、原创性检查和细粒度 AI 检测)验证了现有先进模型在区分信息可靠性和来源方面的不足,从而为多模态层次上的虚假信息检测提供了支持。

链接: https://arxiv.org/abs/2505.18685
作者: Zhihao Zhang,Yiran Zhang,Xiyue Zhou,Liting Huang,Imran Razzak,Preslav Nakov,Usman Naseem
机构: Macquarie University, Australia; University of Sydney, Australia; University of Technology Sydney, Australia; MBZUAI, UAE
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Infodemics and health misinformation have significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat the infodemics, most existing work has focused on developing misinformation datasets from social media and fact checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large scale multimodal misinformation dataset in the health domain consisting of 34,746 news article encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, We benchmarked our dataset against three tasks (reliability checks, originality checks, and fine-grained AI detection) demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human and machine generated content at multimodal levels.
zh

[NLP-261] ULUN: Transparent and Adaptable Low-resource Machine Translation

【速读】: 该论文试图解决低资源语言机器翻译(Machine Translation, MT)系统在专业领域中表现不佳的问题。传统的方法通常需要模型微调,这对非技术用户和小型组织来说不具可行性。解决方案的关键在于提出Tulun,这是一个结合神经机器翻译与基于大型语言模型(Large Language Model, LLM)的术语感知后编辑框架,通过现有术语表和翻译记忆进行引导,从而提升翻译准确性并支持领域专业知识的融入。

链接: https://arxiv.org/abs/2505.18683
作者: Raphaël Merx,Hanna Suominen,Lois Hong,Nick Thieberger,Trevor Cohn,Ekaterina Vylomova
机构: The University of Melbourne (墨尔本大学); The Australian National University (澳大利亚国立大学); University of Turku (图尔库大学); Maluk Timor (帝汶)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.
zh

[NLP-262] PD3F: A Pluggable and Dynamic DoS-Defense Framework Against Resource Consumption Attacks Targeting Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对资源消耗攻击时的安全性问题,此类攻击可通过耗尽计算资源严重降低服务器性能甚至导致崩溃。现有研究缺乏有效的缓解策略,使得实际部署的LLMs存在未解决的安全风险。论文提出的解决方案是可插拔且动态的拒绝服务攻击防御框架(Pluggable and Dynamic DoS-Defense Framework, PD^3F),其关键在于采用双阶段防御机制:在输入侧通过资源指数指导动态请求轮询调度以减少高并发场景下的资源消耗,在输出侧引入自适应末端抑制机制以提前终止过多的恶意生成。实验表明,PD^3F显著缓解了资源消耗攻击,提升了用户访问能力。

链接: https://arxiv.org/abs/2505.18680
作者: Yuanhe Zhang,Xinyue Wang,Haoran Gao,Zhenhong Zhou,Fanyu Meng,Yuyao Zhang,Sen Su
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), due to substantial computational requirements, are vulnerable to resource consumption attacks, which can severely degrade server performance or even cause crashes, as demonstrated by denial-of-service (DoS) attacks designed for LLMs. However, existing works lack mitigation strategies against such threats, resulting in unresolved security risks for real-world LLM deployments. To this end, we propose the Pluggable and Dynamic DoS-Defense Framework ( PD^3F ), which employs a two-stage approach to defend against resource consumption attacks from both the input and output sides. On the input side, we propose the Resource Index to guide Dynamic Request Polling Scheduling, thereby reducing resource usage induced by malicious attacks under high-concurrency scenarios. On the output side, we introduce the Adaptive End-Based Suppression mechanism, which terminates excessive malicious generation early. Experiments across six models demonstrate that PD^3F significantly mitigates resource consumption attacks, improving users’ access capacity by up to 500% during adversarial load. PD^3F represents a step toward the resilient and resource-aware deployment of LLMs against resource consumption attacks.
zh

[NLP-263] Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts

【速读】: 该论文试图解决自然语言处理(Natural Language Processing, NLP)人工制品(如模型、数据集等)的研究框架不明确的问题,这导致了研究与实际应用之间的脱节。其解决方案的关键在于开发一个三组件系统,通过自动提取关键要素(手段、目标、利益相关者),并利用可解释的规则和上下文推理来建立它们之间的联系,从而实现对研究框架的自动化分析。

链接: https://arxiv.org/abs/2505.18677
作者: Eric Chamoun,Nedjma Ousidhoum,Michael Schlichtkrull,Andreas Vlachos
机构: University of Cambridge (剑桥大学); Cardiff University (卡迪夫大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset-achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.
zh

[NLP-264] Can MLLM s Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉理解与空间推理任务中的能力评估不足问题。其解决方案的关键在于引入ReasonMap基准,该基准包含来自13个国家30个城市的高分辨率交通地图以及1,008对涵盖两种题型和三种模板的问答对,并设计了双层级评估流程以准确评估答案的正确性与质量。通过该基准,研究揭示了开源模型与闭源模型在推理能力上的差异,并强调了视觉感知在细粒度视觉推理任务中的必要性。

链接: https://arxiv.org/abs/2505.18675
作者: Sicheng Feng,Song Wang,Shuyi Ouyang,Lingdong Kong,Zikai Song,Jianke Zhu,Huan Wang,Xinchao Wang
机构: Westlake University (西湖大学); National University of Singapore (新加坡国立大学); Zhejiang University (浙江大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have recently achieved significant progress in visual tasks, including semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on complex tasks involving mathematics and logic. However, their capacity for reasoning tasks involving fine-grained visual understanding remains insufficiently evaluated. To address this gap, we introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern: among open-source models, base models outperform reasoning ones, while the opposite trend is observed in closed-source models. Additionally, performance generally degrades when visual inputs are masked, indicating that while MLLMs can leverage prior knowledge to answer some questions, fine-grained visual reasoning tasks still require genuine visual perception for strong performance. Our benchmark study offers new insights into visual reasoning and contributes to investigating the gap between open-source and closed-source models.
zh

[NLP-265] Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在跨语言任务中表现不一致的问题,特别是其在不同语言间的性能差异。解决方案的关键在于提出一种基于束搜索(beam search)和LLM驱动的模拟方法,用于高效生成能够暴露英语与其他目标语言之间性能差异的双语问题对。通过这种方法构建了一个包含16种语言、超过6000对双语数据的新数据集,实验证明该方法能精确且经济地识别出跨语言弱点,并揭示了语言相似性与跨语言弱点之间的关系。

链接: https://arxiv.org/abs/2505.18673
作者: Zixiang Xu,Yanbo Wang,Yue Huang,Xiuying Chen,Jieyu Zhao,Meng Jiang,Xiangliang Zhang
机构: MBZUAI(穆巴达拉人工智能中心); University of Notre Dame(圣母大学); University of Southern California(南加利福尼亚大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025. Code available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at this https URL.
zh

[NLP-266] ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation NEURIPS2025

【速读】: 该论文试图解决大型视觉-语言模型(LVLMs)在理解和生成信息图图表(infographic charts)时面临的挑战,因为这些图表的视觉和结构复杂性超出了传统纯图表训练数据的范围。解决方案的关键在于构建一个大规模的数据集——ChartGalaxy,该数据集通过归纳过程识别出75种图表类型、330种图表变体和68种布局模板,并利用这些元素程序化生成合成信息图图表,从而提升LVLMs在多模态推理与生成任务中的性能。

链接: https://arxiv.org/abs/2505.18668
作者: Zhen Li,Yukai Guo,Duan Li,Xinyuan Guo,Bowen Li,Lanxi Xiao,Shenyu Qiao,Jiashu Chen,Zijian Wu,Hui Zhang,Xinhuan Shu,Shixia Liu
机构: Tsinghua University (清华大学); Newcastle University (纽卡斯尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 63 pages, submitted to NeurIPS 2025 Datasets and Benchmarks Track

点击查看摘要

Abstract:Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 330 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.
zh

[NLP-267] Robustness in Large Language Models : A Survey of Mitigation Strategies and Evaluation Metrics

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的鲁棒性(robustness)问题,即如何确保模型在面对多样化输入时仍能保持稳定和可靠的表现。其解决方案的关键在于系统性地分析非鲁棒性的来源,包括模型内在限制、数据驱动的脆弱性以及外部对抗因素,并综述当前先进的缓解策略,以推动鲁棒性评估方法的发展和实际应用的改进。

链接: https://arxiv.org/abs/2505.18658
作者: Pankaj Kumar,Subhankar Mishra
机构: National Institute of Science Education and Research (国家科学教育与研究学院); Homi Bhabha National Institute (霍米·巴巴国家研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.
zh

[NLP-268] Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change ACL

【速读】: 该论文旨在解决当前自然语言处理模型在气候相关任务中评估标准不统一的问题,提出了一种名为Climate-Eval的综合性基准测试框架。该基准整合了现有数据集与一个专门为此次发布开发的新闻分类数据集,形成了涵盖25项任务、基于13个数据集的评估体系,覆盖了气候话语的关键方面。解决方案的关键在于构建一个标准化的评估套件,以系统性地衡量大语言模型(LLMs)在这些任务中的表现,并通过在零样本和少样本设置下对开源LLMs(参数规模从2B到70B)进行广泛评估,分析其在气候变化领域的优势与局限性。

链接: https://arxiv.org/abs/2505.18653
作者: Murathan Kurfalı,Shorouq Zahra,Joakim Nivre,Gabriele Messori
机构: RISE Research Institutes of Sweden (RISE 瑞典研究机构); Uppsala University (乌普萨拉大学); Swedish Centre for Impacts of Climate Extremes (climes) (瑞典气候极端影响中心)
类目: Computation and Language (cs.CL)
备注: Accepted to ClimateNLP 2025@ACL

点击查看摘要

Abstract:Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.
zh

[NLP-269] On the Emergence of Linear Analogies in Word Embeddings

【速读】: 该论文试图解释词向量模型(如Word2Vec和GloVe)中出现的线性类比结构(linear analogy structure)的理论起源,这一结构表现为词向量之间的线性关系,例如$ W_\text{king} - W_\text{man} + W_\text{woman} \approx W_\text{queen} $。其解决方案的关键在于引入一个生成式模型(generative model),该模型基于词的二元语义属性定义词,并通过属性间的交互推导出共现概率。该模型能够解析地再现线性类比结构,并自然解释此前观察到的多个现象,包括主特征向量中的类比结构、嵌入维度增加时的增强与饱和、对数变换的影响以及在移除特定词对后类比结构的持续存在。

链接: https://arxiv.org/abs/2505.18651
作者: Daniel J. Korchinski,Dhruva Karkada,Yasaman Bahri,Matthieu Wyart
机构: Ecole Polytechnique Fédérale de Lausanne (洛桑联邦理工学院); UC Berkeley (加州大学伯克利分校); Google DeepMind (谷歌深度思维); John Hopkins (约翰霍普金斯大学); EPFL (洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
备注: Main: 12 pages, 3 figures. Appendices: 8 pages, 7 figures

点击查看摘要

Abstract:Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability P(i,j) of words i and j in text corpora. The resulting vectors W_i not only group semantically similar words but also exhibit a striking linear analogy structure – for example, W_\textking - W_\textman + W_\textwoman \approx W_\textqueen – whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix M(i,j) = P(i,j)/P(i)P(j) , (ii) strengthens and then saturates as more eigenvectors of M (i, j) , which controls the dimension of the embeddings, are included, (iii) is enhanced when using \log M(i,j) rather than M(i,j) , and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
zh

[NLP-270] SEW: Self-Evolving Agent ic Workflows for Automated Code Generation

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)在处理复杂编码任务时,依赖人工设计的多智能体系统(multi-agent systems)所带来的适应性不足问题。当前方法在智能体拓扑结构和提示词设计上均需手动干预,限制了其自动适应不同编码问题的能力。论文提出的解决方案关键在于提出一种自进化框架Self-Evolving Workflow (SEW),该框架能够自动生成并优化多智能体工作流,通过自我演化机制提升编码任务的性能,实验结果表明,在LiveCodeBench等基准数据集上,SEW相比仅使用基础LLM提升了高达33%的性能。

链接: https://arxiv.org/abs/2505.18646
作者: Siwei Liu,Jinyuan Fang,Han Zhou,Yingxu Wang,Zaiqiao Meng
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks, assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose \textbfSelf-\textbfEvolving \textbfWorkflow (\textbfSEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.
zh

[NLP-271] Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster

【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)蒸馏过程中,小型语言模型(Small Language Model, SLM)在学习长推理过程时存在的两个问题:一是长推理过程导致训练时核心推理标记的梯度过度平滑,使得SLM难以掌握推理逻辑;二是推理速度慢,因为SLM必须生成完整的推理过程才能得出答案。解决方案的关键在于提出分块训练(Chunk-wise Training, CWT),通过启发式搜索将推理过程划分为语义连贯的块,并在每一轮训练中仅让SLM学习一个块,从而提高核心推理标记的占比并隔离非推理块。基于CWT进一步提出了跳思训练(Skip-thinking Training, STT),使SLM能够自动跳过非推理中间块以加速推理过程,同时保持准确性。

链接: https://arxiv.org/abs/2505.18642
作者: Xiao Chen,Sihang Zhou,Ke Liang,Xiaoyu Sun,Xinwang Liu
机构: National University of Defense Technology (国防科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., the token will directly affect the correctness of subsequent reasoning) over-smoothed as they contribute a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internal semantically coherent chunks and focuses SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve the core reasoning token (e.g., summary and transitional chunks) from the SLM learning for reasoning chunks, making the fraction of the core reasoning token increase in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip non-reasoning medium chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
zh

[NLP-272] Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models

【速读】: 该论文试图解决低资源语言(如Dzongkha)在大型语言模型(Large Language Models, LLMs)中的性能不足问题。其解决方案的关键是构建了一个平行的Dzongkha与英语测试题数据集(DZEN),涵盖多种科学主题,并通过该数据集评估LLMs在不同语言和提示策略下的表现,发现添加英语翻译可提高Dzongkha问题回答的准确性。

链接: https://arxiv.org/abs/2505.18638
作者: Md. Tanzib Hosain,Rajan Das Gupta,Md. Kishor Morol
机构: American International University-Bangladesh (美国国际大学-孟加拉国); Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 20 figures

点击查看摘要

Abstract:In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: this https URL.
zh

[NLP-273] DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM -Based Medical Consultation

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的医疗咨询(Medical Consultation, MC)方法在处理MC任务时无法准确捕捉其双重性质的问题,即症状询问(一个序列决策过程)和疾病诊断(一个分类问题)之间的差异。现有方法因未能有效区分和优化这两个子任务,导致症状询问不充分和疾病诊断不可靠。解决方案的关键在于提出一种名为DDO的新型LLM框架,通过协同多智能体工作流解耦并独立优化这两个子任务,实现双决策优化(Dual-Decision Optimization)。

链接: https://arxiv.org/abs/2505.18630
作者: Zhihao Jia,Mingyi Jia,Junwen Duan,Jianxin Wang
机构: Central South University (中南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 17 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose \textbfDDO, a novel LLM-based framework that performs \textbfDual-\textbfDecision \textbfOptimization by decoupling and independently optimizing the the two sub-tasks through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task.
zh

[NLP-274] MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

【速读】: 该论文试图解决歌词翻译中同时实现语义准确传递与音乐节奏、音节数结构及诗歌风格保持的问题,特别是在动画音乐剧中需与视觉和听觉线索对齐的复杂挑战。解决方案的关键在于提出多语言、多模态的歌词翻译基准MAVL,并基于此构建Syllable-Constrained Audio-Video LLM with Chain-of-Thought(SylAVL-CoT),该方法通过利用音频视频线索并强制音节约束,生成更自然且可唱的歌词。

链接: https://arxiv.org/abs/2505.18614
作者: Woohyun Cho,Youngmin Kim,Sunghyun Lee,Youngjae Yu
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 28 pages, 8 figures

点击查看摘要

Abstract:Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought SylAVL-CoT, which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
zh

[NLP-275] PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLM s

【速读】: 该论文旨在解决长链式思维(long-CoT)大语言模型(LLMs)在使用后训练键值缓存(KV Cache)量化技术时所面临的性能下降问题。具体而言,现有方法在长-CoT场景中会导致累积误差增大和短上下文校准不足,从而影响模型性能。其解决方案的关键在于提出一种渐进式混合精度KV缓存量化(PM-KVQ)方法,通过分块渐进量化策略减少累积误差,并采用基于位置插值的校准策略在不增加额外开销的情况下提升校准长度。

链接: https://arxiv.org/abs/2505.18610
作者: Tengxuan Liu,Shiyao Li,Jiayi Yang,Tianchen Zhao,Feng Zhou,Xiaohui Song,Guohao Dai,Shengen Yan,Huazhong Yang,Yu Wang
机构: Tsinghua University (清华大学); Infinigence-AI (Infinigence-AI); Columbia University (哥伦比亚大学); OPPO AI Center (OPPO人工智能中心); Shanghai Jiaotong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache memory overhead. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation due to the following two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues in two folds: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy with positional interpolation that leverages short calibration data with positional interpolation to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at this https URL.
zh

[NLP-276] RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations INTERSPEECH2025

【速读】: 该论文旨在解决可控且富有表现力的文本到语音(Text-to-Speech, TTS)合成在23种印度语言及英语中的应用问题,特别是如何通过丰富的文本描述来指导语音生成。解决方案的关键在于构建了RASMALAI数据集,该数据集包含13,000小时的语音和2400万条带有细粒度属性(如说话人身份、口音、情感、风格和背景条件)的文本描述注释,并基于此开发了IndicParlerTTS,这是首个开源的、由文本描述引导的印度语言TTS系统。该系统能够高质量地生成特定说话人的语音,准确遵循文本描述并合成指定属性,同时在不同语言间有效迁移表达特征。

链接: https://arxiv.org/abs/2505.18609
作者: Ashwin Sankar,Yoach Lacombe,Sherry Thomas,Praveen Srinivasa Varadhan,Sanchit Gandhi,Mitesh M Khapra
机构: AI4Bharat, WSAI
类目: Computation and Language (cs.CL)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes like speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions and accurately synthesize specified attributes. Additionally, it effectively transfers expressive characteristics both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.
zh

[NLP-277] Flex-Judge: Think Once Judge Anywhere

【速读】: 该论文旨在解决生成式模型与人类偏好对齐过程中,依赖人工生成的奖励信号所带来的高成本和泛化能力不足的问题。传统方法虽然使用大型语言模型(LLM)作为代理评估者(LLM-as-a-Judge)以降低人工标注成本,但通常需要大量特定模态的训练数据且在跨模态任务中泛化效果有限。解决方案的关键在于提出Flex-Judge,一种基于推理引导的多模态评估模型,其通过利用少量文本推理数据实现跨多种模态和评估格式的鲁棒泛化,核心思想是结构化的文本推理解释内含可迁移的决策模式,从而有效支持多模态判断。

链接: https://arxiv.org/abs/2505.18601
作者: Jongwoo Ko,Sungnyun Kim,Sungwoo Cho,Se-Young Yun
机构: KAIST AI (KAIST人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The code is available at this https URL

点击查看摘要

Abstract:Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly fewer text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecule, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
zh

[NLP-278] Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models

【速读】: 该论文试图解决数字平台上虚假信息检测中传统方法的局限性,这些方法主要依赖静态分类,无法捕捉现实世界中事实核查的复杂过程。其解决方案的关键在于引入一种名为Debate-to-Detect (D2D)的多智能体辩论(Multi-Agent Debate, MAD)框架,将虚假信息检测重新定义为结构化的对抗性辩论,并通过五阶段辩论流程(包括开场陈述、反驳、自由辩论、总结陈述和判断)以及多维评估机制,提升检测的准确性和可解释性。

链接: https://arxiv.org/abs/2505.18596
作者: Chen Han,Wenzhen Zheng,Xijin Tang
机构: School of Advanced Interdisciplinary Sciences, UCAS (中国科学院大学高级交叉科学学院); State Key Laboratory of Mathematical Sciences, AMSS, CAS (数学科学国家重点实验室,中科院数学与系统科学研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fakenews datasets demonstrate significant improvements over baseline methods, and the case study highlight D2D’s capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. The code will be open-sourced in a future release.
zh

[NLP-279] Safety Alignment via Constrained Knowledge Unlearning

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在安全对齐方面仍易受到越狱攻击的问题,现有防御机制未能完全删除模型中的有害知识,导致攻击能够绕过安全措施并生成有害输出。解决方案的关键在于提出一种新颖的安全对齐策略——受限知识遗忘(Constrained Knowledge Unlearning, CKU),其核心是通过评分多层感知机(MLP)层中的神经元,识别与有用知识相关的神经元子集U,并在遗忘过程中剪枝该子集的梯度,从而在保留有价值知识的同时有效减少有害内容,提升模型安全性而不影响整体性能。

链接: https://arxiv.org/abs/2505.18588
作者: Zesheng Shi,Yucheng Zhou,Jing Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
zh

[NLP-280] RvLLM : LLM Runtime Verification with Domain Knowledge

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成输出时可能出现的不一致或错误问题,这些问题限制了其在高风险领域中的可靠性和可信度。解决方案的关键在于引入领域知识,通过设计一种通用的规范语言ESL,使领域专家能够以轻量且直观的方式定制领域特定的断言,并结合运行时验证框架RvLLM,对LLM的输出进行验证。该方法有效提升了LLM输出的准确性,为解决LLM在推理过程中因可解释性有限和缺乏形式化保证而导致的低级错误提供了潜在的长期解决方案。

链接: https://arxiv.org/abs/2505.18585
作者: Yedi Zhang,Sun Yi Emma,Annabelle Lee Jia En,Annabelle Lee Jia En,Jin Song Dong
机构: National University of Singapore (新加坡国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) have emerged as a dominant AI paradigm due to their exceptional text understanding and generation capabilities. However, their tendency to generate inconsistent or erroneous outputs challenges their reliability, especially in high-stakes domains requiring accuracy and trustworthiness. Existing research primarily focuses on detecting and mitigating model misbehavior in general-purpose scenarios, often overlooking the potential of integrating domain-specific knowledge. In this work, we advance misbehavior detection by incorporating domain knowledge. The core idea is to design a general specification language that enables domain experts to customize domain-specific predicates in a lightweight and intuitive manner, supporting later runtime verification of LLM outputs. To achieve this, we design a novel specification language, ESL, and introduce a runtime verification framework, RvLLM, to validate LLM output against domain-specific constraints defined in ESL. We evaluate RvLLM on three representative tasks: violation detection against Singapore Rapid Transit Systems Act, numerical comparison, and inequality solving. Experimental results demonstrate that RvLLM effectively detects erroneous outputs across various LLMs in a lightweight and flexible manner. The results reveal that despite their impressive capabilities, LLMs remain prone to low-level errors due to limited interpretability and a lack of formal guarantees during inference, and our framework offers a potential long-term solution by leveraging expert domain knowledge to rigorously and efficiently verify LLM outputs.
zh

[NLP-281] Removal of Hallucination on Hallucination: Debate-Augmented RAG ACL2025

【速读】: 该论文试图解决Retrieval-Augmented Generation (RAG)中因错误或偏见的检索导致生成过程中的幻觉(hallucination)问题,即“幻觉上的幻觉”。解决方案的关键在于提出一种无需训练的框架——Debate-Augmented RAG (DRAG),其核心是将Multi-Agent Debate (MAD)机制融入检索和生成阶段。在检索阶段,通过倡导者、反对者和裁判之间的结构化辩论来提升检索质量与事实可靠性;在生成阶段,引入非对称信息角色和对抗性辩论以增强推理鲁棒性并减少事实不一致性。

链接: https://arxiv.org/abs/2505.18581
作者: Wentao Hu,Wengyu Zhang,Yiyang Jiang,Chen Jason Zhang,Xiaoyong Wei,Qing Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at this https URL.
zh

[NLP-282] Enhancing Efficiency and Exploration in Reinforcement Learning for LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)过程中存在的效率低下和探索能力受限的问题。现有方法在RL过程中对所有问题分配相等的rollout数量,导致对简单问题的训练收益有限,而对复杂问题则需要更多rollout以采样正确答案,从而造成资源浪费。此外,RL虽然提升了响应精度,但限制了模型的探索能力,可能导致性能无法超越未经过RL的基线模型。论文提出的关键解决方案是基于问题难度动态分配rollout预算,以提高训练效率,并引入自适应动态温度调整策略,以维持熵的稳定水平,从而保证足够的探索能力,使LLMs在提升响应精度的同时保持探索潜力。

链接: https://arxiv.org/abs/2505.18573
作者: Mengqi Liao,Xiangyu Xi,Ruinian Chen,Jia Leng,Yangen Hu,Ke Zeng,Shuai Liu,Huaiyu Wan
机构: Beijing Jiaotong University (北京交通大学); Meituan (美团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model’s exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data is available on: this https URL
zh

[NLP-283] From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨文化认知对齐方面存在的偏差问题,特别是其对西方文化(尤其是美国文化)的偏好。解决方案的关键在于提出CultureSteer,这是一种集成文化感知引导机制的方法,旨在将语义表示引导至具有文化特异性的空间,从而提升模型在跨文化语义关联中的表现。

链接: https://arxiv.org/abs/2505.18562
作者: Xunlian Dai,Li Zhou,Benyou Wang,Haizhou Li
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Wuhan University of Technology (武汉理工大学); Shenzhen Research Institute of Big Data (深圳大数据研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through lexical-semantic patterns. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism to guide semantic representations toward culturally specific spaces. Experiments show that current LLMs exhibit significant bias toward Western cultural (notably in American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, surpassing prompt-based methods in capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.
zh

[NLP-284] AG-INSTRUCT: Controlled Instruction Complexity Enhancement through Structure-based Augmentation

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)训练中高质量指令数据生成的问题,特别是现有方法在有效控制指令复杂度方面存在的不足。其解决方案的关键在于提出TAG-INSTRUCT框架,通过结构化语义压缩和可控难度增强来提升指令复杂度,具体表现为将指令压缩到紧凑的标签空间,并通过强化学习引导的标签扩展系统性地增加复杂性。

链接: https://arxiv.org/abs/2505.18557
作者: He Zhu,Zhiwen Ruan,Junyou Su,Xingwei He,Wenjia Zhang,Yun Chen,Guanhua Chen
机构: Southern University of Science and Technology (南方科技大学); Peking University (北京大学); Shanghai University of Finance and Economics (上海财经大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present TAG-INSTRUCT, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, TAG-INSTRUCT compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that TAG-INSTRUCT outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.
zh

[NLP-285] Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中意图感知防护机制在面对恶意操控时的鲁棒性不足问题。现有研究虽已将意图检测应用于增强LLMs的内容审核防护,但其在对抗性攻击下的有效性尚未得到充分探索。论文提出了一种两阶段的基于意图的提示优化框架IntentPrompt,其关键在于通过迭代优化提示并引入反馈机制,将有害查询转化为结构化大纲,并进一步重构为陈述式叙述,从而提升红队测试中的越狱成功率。该方法在多个公开基准和黑盒LLM上均表现出优于现有越狱技术的效果,揭示了LLMs安全机制中的关键弱点。

链接: https://arxiv.org/abs/2505.18556
作者: Jun Zhuang,Haibo Jin,Ye Zhang,Zhengjian Kang,Wenbin Zhang,Gaby G. Dagher,Haohan Wang
机构: Boise State University (博伊西州立大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); University of Pittsburgh (匹兹堡大学); New York University (纽约大学); Florida International University (佛罗里达国际大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint, under review. TL;DR: We propose a new two-stage intent-based prompt-refinement framework, IntentPrompt, that aims to explore the vulnerability of LLMs’ content moderation guardrails by refining prompts into benign-looking declarative forms via intent manipulation for red-teaming purposes

点击查看摘要

Abstract:Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs’ moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our “FSTR+SPIN” variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs’ safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
zh

[NLP-286] Unraveling Misinformation Propagation in LLM Reasoning

【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)的推理过程中,由于用户输入的错误信息(misinformation)导致模型输出准确性下降的问题。其解决方案的关键在于识别错误信息在推理过程中的传播路径,并通过在推理早期阶段应用事实性修正(factual corrections)来有效减少错误信息的扩散。此外,研究还表明,在合成数据上进行微调以包含早期阶段的修正可以显著提升模型推理的准确性。

链接: https://arxiv.org/abs/2505.18555
作者: Yiyang Feng,Yichen Wang,Shaobo Cui,Boi Faltings,Mina Lee,Jiawei Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 14 figures, 4 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning, positioning them as promising tools for supporting human problem-solving. However, what happens when their performance is affected by misinformation, i.e., incorrect inputs introduced by users due to oversights or gaps in knowledge? Such misinformation is prevalent in real-world interactions with LLMs, yet how it propagates within LLMs’ reasoning process remains underexplored. Focusing on mathematical reasoning, we present a comprehensive analysis of how misinformation affects intermediate reasoning steps and final answers. We also examine how effectively LLMs can correct misinformation when explicitly instructed to do so. Even with explicit instructions, LLMs succeed less than half the time in rectifying misinformation, despite possessing correct internal knowledge, leading to significant accuracy drops (10.02% - 72.20%). Further analysis shows that applying factual corrections early in the reasoning process most effectively reduces misinformation propagation, and fine-tuning on synthesized data with early-stage corrections significantly improves reasoning factuality. Our work offers a practical approach to mitigating misinformation propagation.
zh

[NLP-287] MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors

【速读】: 该论文试图解决如何在四个教学维度(错误识别、错误定位、提供指导和可操作性)上对AI导师的回答进行有效评估的问题。解决方案的关键在于采用统一的训练流程,对单一指令微调的语言模型进行多任务微调,而无需进行任何任务特定的架构调整,并引入一种基于分歧感知的集成推理策略,以提高预测的可靠性。

链接: https://arxiv.org/abs/2505.18549
作者: Baraa Hikal,Mohamed Basem,Islam Oshallah,Ali Hamdi
机构: MSA University (MSA大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
zh

[NLP-288] Composable Cross-prompt Essay Scoring by Merging Models

【速读】: 该论文旨在解决跨提示词自动作文评分(cross-prompt automated essay scoring, AES)中,传统方法需要联合训练所有源提示词数据集并可能涉及重新访问源数据集带来的隐私问题与效率低下的问题。其解决方案的关键在于提出一种无需源数据的适应方法,通过选择性地融合独立训练的源模型参数,而非直接使用源数据集,从而避免隐私泄露并提升效率。具体而言,该方法通过任务向量的线性组合模拟联合训练,并引入先验编码的信息最大化(Prior-encoded Information Maximization, PIM)作为无监督目标,以优化参数融合系数,增强模型的评分区分能力。

链接: https://arxiv.org/abs/2505.18548
作者: Sanwoo Lee,Kun Liang,Yunfang Wu
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in cross-prompt automated essay scoring (AES) typically train models jointly on all source prompts, often requiring additional access to unlabeled target prompt essays simultaneously. However, using all sources is suboptimal in our pilot study, and re-accessing source datasets during adaptation raises privacy concerns. We propose a source-free adaptation approach that selectively merges individually trained source models’ parameters instead of datasets. In particular, we simulate joint training through linear combinations of task vectors – the parameter updates from fine-tuning. To optimize the combination’s coefficients, we propose Prior-encoded Information Maximization (PIM), an unsupervised objective which promotes the model’s score discriminability regularized by priors pre-computed from the sources. We employ Bayesian optimization as an efficient optimizer of PIM. Experimental results with LLMs on in-dataset and cross-dataset adaptation show that our method (1) consistently outperforms training jointly on all sources, (2) maintains superior robustness compared to other merging methods, (3) excels under severe distribution shifts where recent leading cross-prompt methods struggle, all while retaining computational efficiency.
zh

[NLP-289] B-score: Detecting biases in large language models using response history ICML2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在回答问题时表现出的偏见问题,特别是在涉及主观、随机、客观等不同类型的问题时。其解决方案的关键在于利用多轮对话中模型自身先前回答的观察,使模型能够自我“去偏”(de-bias),尤其是在需要随机、无偏答案的问题上。此外,研究提出了一种新的度量标准B-score,用于有效检测各类问题中的偏见,并通过该指标显著提升了对LLM答案验证的准确性。

链接: https://arxiv.org/abs/2505.18545
作者: An Vo,Mohammad Reza Taesiri,Daeyoung Kim,Anh Totti Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to ICML 2025 (Main track)

点击查看摘要

Abstract:Large language models (LLMs) often exhibit strong biases, e.g, against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to “de-bias” themselves in a multi-turn conversation in response to questions that seek an Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e, accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: this https URL.
zh

[NLP-290] Business as textitRulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLM s

【速读】: 该论文旨在解决业务流程中嵌入的规则流(rule flows)在现有研究中被忽视的问题,即如何从商业文档中自动提取结构化的业务规则及其依赖关系。其解决方案的关键在于引入了一个名为\textbfBPRF的中文标注数据集,并提出了一种基于大语言模型(LLMs)的框架\textbfExIde,用于自动提取业务规则并识别其依赖关系,从而为更自动化和可解释的业务流程自动化奠定基础。

链接: https://arxiv.org/abs/2505.18542
作者: Chen Yang,Ruping Xu,Ruizhe Li,Bin Cao,Jing Fan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Process mining aims to discover, monitor and optimize the actual behaviors of real processes. While prior work has mainly focused on extracting procedural action flows from instructional texts, rule flows embedded in business documents remain underexplored. To this end, we introduce a novel annotated Chinese dataset, \textbfBPRF, which contains 50 business process documents with 326 explicitly labeled business rules across multiple domains. Each rule is represented as a Condition, Action pair, and we annotate logical dependencies between rules (sequential, conditional, or parallel). We also propose \textbfExIde, a framework for automatic business rule extraction and dependency relationship identification using large language models (LLMs). We evaluate ExIde using 12 state-of-the-art (SOTA) LLMs on the BPRF dataset, benchmarking performance on both rule extraction and dependency classification tasks of current LLMs. Our results demonstrate the effectiveness of ExIde in extracting structured business rules and analyzing their interdependencies for current SOTA LLMs, paving the way for more automated and interpretable business process automation.
zh

[NLP-291] Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

【速读】: 该论文试图解决如何通过强化学习微调(Reinforcement Fine-Tuning, RFT)提升多模态大语言模型(Multimodal Large Language Models, MLLMs)的推理能力问题。其解决方案的关键在于系统性地总结RFT在提升MLLM推理能力方面的五项核心改进:多样化模态、多样化任务与领域、更优的训练算法、丰富的基准测试以及蓬勃发展的工程框架。这些改进为RFT在MLLM中的高效应用提供了理论支持与实践指导。

链接: https://arxiv.org/abs/2505.18536
作者: Haoyuan Sun,Jiaqi Wu,Bo Xia,Yifu Luo,Yifei Zhao,Kai Qin,Xufei Lv,Tiantian Zhang,Yongzhe Chang,Xueqian Wang
机构: Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院,清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at this https URL.
zh

[NLP-292] metaTextGrad: Automatically optimizing language model optimizers

【速读】: 该论文试图解决当前基于大型语言模型(Large Language Models, LLMs)的优化器通常由人工设计且未经过优化的问题,同时这些优化器是通用型的,未能针对特定任务进行定制化。解决方案的关键在于提出一种元优化器——metaTextGrad,其核心包含两个关键组件:元提示优化器和元结构优化器,通过这两者的结合显著提升了多个基准测试中的性能,相对于最佳基线平均绝对性能提升了高达6%。

链接: https://arxiv.org/abs/2505.18524
作者: Guowei Xu,Mert Yuksekgonul,Carlos Guestrin,James Zou
机构: Tsinghua University (清华大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 21 pages, 2 figures

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.
zh

[NLP-293] How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation

【速读】: 该论文试图解决序列建模架构对预训练语言模型基础能力的影响问题,具体关注不同架构如何影响模型的基础能力。其解决方案的关键在于提出一种有限领域预训练设置结合分布外测试,以有效揭示不同架构在基础能力上的差异,并通过分析发现状态序列建模架构相较于Transformer存在显著的能力退化。进一步的研究总结出一个关键设计原则:序列建模架构需具备全序列任意选择能力,以避免基础能力的下降。

链接: https://arxiv.org/abs/2505.18522
作者: Xin Lu,Yanyan Zhao,Si Wei,Shijin Wang,Bing Qin,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); iFLYTEK Co., Ltd (科大讯飞股份有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pre-trained language models represented by the Transformer have been proven to possess strong base capabilities, and the representative self-attention mechanism in the Transformer has become a classic in sequence modeling architectures. Different from the work of proposing sequence modeling architecture to improve the efficiency of attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: How exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed domain pre-training setting commonly adopted in existing architecture design works fails to adequately reveal the differences in base capabilities among various architectures. To address this, we propose a limited domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures, and find that they exhibit significant degradation in base capabilities compared to the Transformer. Then, through a series of architecture component analysis, we summarize a key architecture design principle: A sequence modeling architecture need possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle using an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results demonstrate our proposed sequence modeling architecture design principle and suggest that our work can serve as a valuable reference for future architecture improvements and novel designs.
zh

[NLP-294] AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

【速读】: 该论文试图解决传统基于大语言模型(Large Language Models, LLMs)的列表重排序(listwise reranking)方法在计算资源分配上的不足,即固定大小的小子集进行重排序会忽略查询难度和文档分布,导致效率低下。解决方案的关键在于提出AcuRank框架,该框架通过基于不确定性估计的贝叶斯TrueSkill模型动态调整计算量和计算目标,从而实现对相关性估计的迭代优化,并在达到足够置信度后停止不必要的更新,有效提升了重排序的准确性和效率。

链接: https://arxiv.org/abs/2505.18512
作者: Soyoung Yoon,Gyuwan Kim,Gyu-Hwung Cho,Seung-won Hwang
机构: Seoul National University (首尔国立大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Yonsei University (延世大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 22 pages, 3 figures. The first two authors contributed equally. Author order is randomly determined via coin toss

点击查看摘要

Abstract:Listwise reranking with large language models (LLMs) enhances top-ranked results in retrieval-based applications. Due to the limit in context size and high inference cost of long context, reranking is typically performed over a fixed size of small subsets, with the final ranking aggregated from these partial results. This fixed computation disregards query difficulty and document distribution, leading to inefficiencies. We propose AcuRank, an adaptive reranking framework that dynamically adjusts both the amount and target of computation based on uncertainty estimates over document relevance. Using a Bayesian TrueSkill model, we iteratively refine relevance estimates until reaching sufficient confidence levels, and our explicit modeling of ranking uncertainty enables principled control over reranking behavior and avoids unnecessary updates to confident predictions. Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models.
zh

[NLP-295] Knowledge Grafting of Large Language Models

【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)研究中的跨能力迁移问题,特别是在多任务整合、模型压缩和持续学习中的应用。现有方法主要针对小型同构模型,难以有效适用于大型异构模型,导致知识蒸馏和参数高效微调(PEFT)方法在知识吸收和灾难性遗忘方面存在局限。解决方案的关键在于提出GraftLLM,通过SkillPack格式将源模型能力存储到目标模型中,该方法保留通用能力、减少参数冲突,并支持无遗忘的持续学习与模型融合,同时采用模块感知的自适应压缩策略以确保高效存储并保持任务特定知识。

链接: https://arxiv.org/abs/2505.18502
作者: Guodong Du,Xuanning Zhou,Junlin Li,Zhuo Li,Zesheng Shi,Wanyu Lin,Ho-Kin Tang,Xiucheng Li,Fangming Liu,Wenya Wang,Min Zhang,Jing Li
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳); The Hong Kong Polytechnic University (香港理工大学); Nanyang Technological University (南洋理工大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model’s intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: this https URL.
zh

[NLP-296] he Prag matic Mind of Machines: Tracing the Emergence of Prag matic Competence in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练过程中如何获得社会智能任务中所需的语用理解能力的问题。其解决方案的关键在于引入ALTPRAG数据集,该数据集基于“替代”(alternatives)的语用概念,通过配对语境恰当但语用上不同的续写实例,实现对模型语用解释和对比推理能力的细粒度评估。

链接: https://arxiv.org/abs/2505.18497
作者: Kefan Yu,Qingcheng Zeng,Weihao Xuan,Wanxin Li,Jingyi Wu,Rob Voigt
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution (Sravanthi et al. (2024)) and theory-of-mind reasoning (Shapira et al. (2024)), both of which require substantial pragmatic understanding. However, how LLMs acquire this competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, designed to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two contextually appropriate but pragmatically distinct continuations, enabling fine-grained assessment of both pragmatic interpretation and contrastive reasoning. We systematically evaluate 22 LLMs across key training stages: pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic reasoning. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.
zh

[NLP-297] Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications ACL

【速读】: 该论文旨在解决在移动设备上应用大语言模型(Large Language Models, LLMs)时,如何提升错误纠正能力的问题。其解决方案的关键在于利用LLMs生成高质量的错误纠正对数据集,并通过重加权样本以适应移动应用领域的数据分布,同时结合离线评估数据和小型隐私保护的本地语言模型得分,预测生产环境中的A/B测试指标,从而优化模型性能。此外,论文还提出了将合成数据与其他数据源混合的最佳实践,以进一步提升错误纠正任务在离线评估和生产环境中的表现。

链接: https://arxiv.org/abs/2505.18488
作者: Yanxiang Zhang,Zheng Xu,Shanshan Wu,Yuanbo Zhang,Daniel Ramage
机构: Google(谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL Industry

点击查看摘要

Abstract:Error correction is an important capability when applying large language models (LLMs) to facilitate user typing on mobile devices. In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. We first prompt LLMs with error correction domain knowledge to build a scalable and reliable addition to the existing data synthesis pipeline. We then adapt the synthetic data distribution to match the mobile application domain by reweighting the samples. The reweighting model is learnt by predicting (a handful of) live A/B test metrics when deploying LLMs in production, given the LLM performance on offline evaluation data and scores from a small privacy-preserving on-device language model. Finally, we present best practices for mixing our synthetic data with other data sources to improve model performance on error correction in both offline evaluation and production live A/B testing.
zh

[NLP-298] Investigating AI Rater Effects of Large Language Models : GPT Claude Gemini and DeepSeek

【速读】: 该论文试图解决在低风险评估中使用生成式 AI (Generative AI) 进行自动评分时,如何选择最可靠且产生最少评分者效应的模型问题。其解决方案的关键在于通过对比十种 LLM 与人类专家评分者在两种写作任务中的评分准确性、评分者内部一致性及评分者效应,采用二次加权 Kappa、Cronbach Alpha 和多维 Rasch 模型进行评估,从而确定性能最优的模型。

链接: https://arxiv.org/abs/2505.18486
作者: Hong Jiao,Dan Song,Won-Chan Lee
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely explored for automated scoring in low-stakes assessment to facilitate learning and instruction. Empirical evidence related to which LLM produces the most reliable scores and induces least rater effects needs to be collected before the use of LLMs for automated scoring in practice. This study compared ten LLMs (ChatGPT 3.5, ChatGPT 4, ChatGPT 4o, OpenAI o1, Claude 3.5 Sonnet, Gemini 1.5, Gemini 1.5 Pro, Gemini 2.0, as well as DeepSeek V3, and DeepSeek R1) with human expert raters in scoring two types of writing tasks. The accuracy of the holistic and analytic scores from LLMs compared with human raters was evaluated in terms of Quadratic Weighted Kappa. Intra-rater consistency across prompts was compared in terms of Cronbach Alpha. Rater effects of LLMs were evaluated and compared with human raters using the Many-Facet Rasch model. The results in general supported the use of ChatGPT 4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet with high scoring accuracy, better rater reliability, and less rater effects.
zh

[NLP-299] Pedagogy-R1: Pedagogically-Aligned Reasoning Model with Balanced Educational Benchmark

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在教育场景中缺乏教学连贯性和真实教学行为的问题。其关键解决方案是提出Pedagogy-R1框架,该框架通过三项创新实现:基于知识蒸馏的管道用于筛选和优化模型输出以进行教学调优;Well-balanced Educational Benchmark (WBEB) 评估体系,涵盖学科知识、教学知识、追踪能力、作文评分及教师决策等多个维度;以及Chain-of-Pedagogy (CoP) 提示策略,用于生成和激发教师风格的推理过程。

链接: https://arxiv.org/abs/2505.18467
作者: Unggi Lee,Jaeyong Lee,Jiyeong Bae,Yeil Jeong,Junbo Koh,Gyeonggeon Lee,Gunho Lee,Taekyung Ahn,Hyeoncheol Kim
机构: Enuma, Inc.(Enuma公司); Seoul National University(首尔国立大学); Nanyang Technological University(南洋理工大学); Korea University(高丽大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Recent advances in large reasoning models (LRMs) show strong performance in structured domains such as mathematics and programming; however, they often lack pedagogical coherence and realistic teaching behaviors. To bridge this gap, we introduce Pedagogy-R1, a framework that adapts LRMs for classroom use through three innovations: (1) a distillation-based pipeline that filters and refines model outputs for instruction-tuning, (2) the Well-balanced Educational Benchmark (WBEB), which evaluates performance across subject knowledge, pedagogical knowledge, tracing, essay scoring, and teacher decision-making, and (3) a Chain-of-Pedagogy (CoP) prompting strategy for generating and eliciting teacher-style reasoning. Our mixed-method evaluation combines quantitative metrics with qualitative analysis, providing the first systematic assessment of LRMs’ pedagogical strengths and limitations.
zh

[NLP-300] Measuring South Asian Biases in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在评估过程中忽视交叉性和文化特定偏见的问题,特别是在南亚等多语言地区。其关键解决方案是构建一个基于文化的偏见词典,该词典捕捉了此前未被探索的交叉性维度,如性别、宗教、婚姻状况和子女数量,并利用该词典量化交叉性偏见以及自我去偏策略在开放式生成任务中的有效性。此外,论文还评估了两种自我去偏策略(简单和复杂提示)在减少印欧语系和达罗毗荼语系语言中文化特定偏见方面的效果。

链接: https://arxiv.org/abs/2505.18466
作者: Mamnuya Rinki,Chahat Raj,Anjishnu Mukherjee,Ziwei Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.
zh

[NLP-301] From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data

【速读】: 该论文试图解决在焦虑支持领域中,如何有效利用大型语言模型(Large Language Models, LLMs)提供可扩展且实时的心理健康支持的问题。其解决方案的关键在于通过真实用户生成的关于焦虑的帖子对LLMs(如GPT和Llama)进行提示和微调,以评估其在语言质量、安全性和支持性方面的表现。研究发现,使用自然主义的焦虑相关数据进行微调虽然提升了语言质量,但也增加了毒性与偏见并降低了情感响应能力,从而揭示了在未采取缓解策略的情况下,直接对社交媒体内容进行微调所带来的风险。

链接: https://arxiv.org/abs/2505.18464
作者: Ugur Kursuncu,Trilok Padhi,Gaurav Sinha,Abdulkadir Erol,Jaya Krishna Mandivarapu,Christopher R. Larrison
机构: Microsoft(微软)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The growing demand for accessible mental health support, compounded by workforce shortages and logistical barriers, has led to increased interest in utilizing Large Language Models (LLMs) for scalable and real-time assistance. However, their use in sensitive domains such as anxiety support remains underexamined. This study presents a systematic evaluation of LLMs (GPT and Llama) for their potential utility in anxiety support by using real user-generated posts from the r/Anxiety subreddit for both prompting and fine-tuning. Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness. Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness. While LLMs exhibited limited empathy, GPT was evaluated as more supportive overall. Our findings highlight the risks of fine-tuning LLMs on unprocessed social media content without mitigation strategies.
zh

[NLP-302] A Survey of LLM times DATA

【速读】: 该论文试图解决大型语言模型(LLM)与数据管理(DATA)之间双向关系的整合问题,旨在全面梳理二者在数据处理、存储、服务以及数据管理任务中的协同作用。其解决方案的关键在于通过DATA4LLM和LLM4DATA两个方向实现技术融合:一方面,通过高效的数据处理、存储与服务技术为LLM提供高质量、多样化和实时的数据支持;另一方面,利用LLM的强大能力推动数据管理领域的自动化与智能化,包括数据操作、分析及系统优化等任务。

链接: https://arxiv.org/abs/2505.18458
作者: Xuanhe Zhou,Junxuan He,Wei Zhou,Haodong Chen,Zirui Tang,Haoyu Zhao,Xin Tong,Guoliang Li,Youmin Chen,Jun Zhou,Zhaojun Sun,Binyuan Hui,Shuo Wang,Conghui He,Zhiyuan Liu,Jingren Zhou,Fan Wu
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团); Shanghai AI Laboratory (上海人工智能实验室)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Please refer to the paper list at: this https URL

点击查看摘要

Abstract:The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data Storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
zh

[NLP-303] Anchored Diffusion Language Model

【速读】: 该论文试图解决扩散语言模型(Diffusion Language Models, DLMs)在似然建模和生成文本质量上落后于自回归(autoregressive, AR)模型的问题。其关键在于,重要标记(如关键词或低频词)在前向过程中被过早掩码,限制了上下文信息的准确重建。为此,作者提出了锚定扩散语言模型(Anchored Diffusion Language Model, ADLM),其核心是通过一个锚定网络首先预测重要标记的分布,随后基于这些锚定预测来估计缺失标记的似然,从而提升模型性能。

链接: https://arxiv.org/abs/2505.18456
作者: Litu Rout,Constantine Caramanis,Sanjay Shakkottai
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches
zh

[NLP-304] Hybrid Latent Reasoning via Reinforcement Learning

【速读】: 该论文试图解决传统自回归推理(autoregressive reasoning)在大型语言模型(LLMs)中的局限性,特别是其依赖离散的思维链(CoT)路径导致的信息不充分问题,以及隐式推理(latent reasoning)与LLMs连续生成范式不兼容的问题。解决方案的关键在于提出一种基于强化学习(RL)的混合推理策略优化方法(Hybrid Reasoning Policy Optimization, HRPO),该方法通过可学习的门控机制将先验隐藏状态整合到采样的标记中,并在训练初期主要依赖标记嵌入,逐步引入更多隐藏特征,从而在保持LLMs生成能力的同时,激励使用离散和连续表示进行混合推理。

链接: https://arxiv.org/abs/2505.18454
作者: Zhenrui Yue,Bowen Jin,Huimin Zeng,Honglei Zhuang,Zhen Qin,Jinsung Yoon,Lanyu Shang,Jiawei Han,Dong Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Google(谷歌); LMU(慕尼黑路德维希马克西米利安大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefit from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs’ generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, the hybrid HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offer insights for future work in latent reasoning.
zh

[NLP-305] MedScore: Factuality Evaluation of Free-Form Medical Answers

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成医学答案时存在的事实性不准确问题,尤其是在分解-验证(decompose-then-verify)评估流程中,LLMs生成的内容可能包含幻觉或模糊引用。现有事实性评估系统在医学领域表现不佳,因为它们主要针对客观、以实体为中心的文本进行评估,而医学回答具有条件依赖性、对话性、假设性和句法多样性等特点,导致分解为有效事实变得困难。论文提出的解决方案是MedScore,其关键在于通过条件感知的方式将医学答案分解为有效的事实,能够提取比现有方法多三倍的有效事实,从而减少幻觉和模糊引用,并保留事实的条件依赖性。

链接: https://arxiv.org/abs/2505.18452
作者: Heyuan Huang,Alexandra DeLucia,Vijay Murari Tiyyala,Mark Dredze
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing the generations into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. This differs from condition-dependent, conversational, hypothetical, sentence-structure diverse, and subjective medical answers, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts. The resulting factuality score significantly varies by decomposition method, verification corpus, and used backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation.
zh

[NLP-306] μ-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

【速读】: 该论文旨在解决大型基础模型计算需求高导致的效率问题,提出了一种无需微调的激活感知压缩技术。其解决方案的关键在于通过计算高效的校准过程,实现针对每个提示的自适应激活感知剪枝,从而在推理阶段降低复杂度,该方法被形式化为一种称为μ-MoE(mixture of micro-experts)的结构,能够动态适应任务或提示相关的结构稀疏性。

链接: https://arxiv.org/abs/2505.18451
作者: Toshiaki Koike-Akino,Jing Liu,Ye Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can be executed for every prompt adaptively, yet achieving reduced complexity at inference. We formulate it as a mixture of micro-experts, called \mu -MoE. Several experiments demonstrate that \mu -MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
zh

[NLP-307] BRIT: Bidirectional Retrieval over Unified Image-Text Graph

【速读】: 该论文试图解决在多模态文档(包含文本和图像)中进行检索增强生成(Retrieval-Augmented Generation, RAG)时,如何有效处理跨模态的复杂问答问题。传统方法在文本查询上的改进未能充分适用于多模态场景,尤其是在微调不适用的情况下。论文提出的解决方案是BRIT框架,其关键在于通过构建一个统一的多模态图结构,将文档中的多种文本-图像关联整合,并根据查询提取特定子图,从而实现跨模态的多跳信息检索与生成。

链接: https://arxiv.org/abs/2505.18450
作者: Ainulla Khan,Yamada Moyuru,Srinidhi Akella
机构: Fujitsu Research India (富士通印度研究部)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. While recent advancements have mainly focused on improving RAG for text-based queries, RAG on multi-modal documents containing both texts and images has not been fully explored. Especially when fine-tuning does not work. This paper proposes BRIT, a novel multi-modal RAG framework that effectively unifies various text-image connections in the document into a multi-modal graph and retrieves the texts and images as a query-specific sub-graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieve not only directly query-relevant images and texts but also further relevant contents to answering complex cross-modal multi-hop questions. To evaluate the effectiveness of BRIT, we introduce MM-RAG test set specifically designed for multi-modal question answering tasks that require to understand the text-image relations. Our comprehensive experiments demonstrate the superiority of BRIT, highlighting its ability to handle cross-modal questions on the multi-modal documents.
zh

[NLP-308] Efficient Long CoT Reasoning in Small Language Models

【速读】: 该论文试图解决如何将长链式思维(long chain-of-thought, CoT)推理能力有效地蒸馏到小型语言模型(SLMs)中的问题。由于长CoT通常包含大量冗余内容,这使得SLMs在学习过程中面临挑战,因为它们的容量和泛化能力相对有限。解决方案的关键在于提出一种简单但有效的步骤剪枝方法,以去除长CoT中不必要的步骤,并通过在线策略方法让SLMs自身生成有效且有用的长CoT训练数据,从而实现高效长CoT推理的学习并保持竞争性性能。

链接: https://arxiv.org/abs/2505.18440
作者: Zhaoyang Wang,Jinqi Jiang,Tian Qiu,Hui Liu,Xianfeng Tang,Huaxiu Yao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problems solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small language models (SLMs) to emerge long CoT. Thus, distillation becomes a practical method to enable SLMs for such reasoning ability. However, the long CoT often contains a lot of redundant contents (e.g., overthinking steps) which may make SLMs hard to learn considering their relatively poor capacity and generalization. To address this issue, we propose a simple-yet-effective method to prune unnecessary steps in long CoT, and then employ an on-policy method for the SLM itself to curate valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning and preserve competitive performance at the same time. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs which maintains the competitive performance but significantly reduces generating redundant reasoning steps.
zh

[NLP-309] Voice of a Continent: Mapping Africas Speech Technology Frontier

【速读】: 该论文试图解决非洲地区语言多样性在语音技术中显著不足的问题,这一问题阻碍了数字包容性的发展。解决方案的关键在于系统地映射非洲的语音数据集与技术,并构建了一个新的综合性基准SimbaBench,以此推动下游非洲语音任务的研究。通过SimbaBench,研究者提出了Simba模型系列,在多种非洲语言和语音任务中实现了最先进的性能,同时揭示了资源可用性、数据集质量、领域多样性和语言家族关系对模型性能的影响。

链接: https://arxiv.org/abs/2505.18436
作者: AbdelRahim Elmadany,Sang Yun Kwon,Hawau Olamide Toyin,Alcides Alcoba Inciarte,Hanan Aldarmaki,Muhammad Abdul-Mageed
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.
zh

[NLP-310] Retrieval Augmented Generation-based Large Language Models for Bridging Transportation Cybersecurity Legal Knowledge Gaps

【速读】: 该论文试图解决在连接和自动化交通系统发展过程中,联邦和州级机构面临的安全性和数据隐私法律框架更新问题,以及由此产生的对准确、针对性法律信息的需求。解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的大型语言模型(Large Language Model, LLM)框架,通过使用领域特定的问题集来引导响应生成,从而减少大语言模型中的幻觉现象,并借助检索机制提升输出的准确性和事实基础。

链接: https://arxiv.org/abs/2505.18426
作者: Khandakar Ashrafi Akbar,Md Nahiyan Uddin,Latifur Khan,Trayce Hockstad,Mizanur Rahman,Mashrur Chowdhury,Bhavani Thuraisingham
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at the Transportation Research Board (TRB) Annual Meeting 2025, and subsequently submitted for publication consideration in the Transportation Research Record (TRR)

点击查看摘要

Abstract:As connected and automated transportation systems evolve, there is a growing need for federal and state authorities to revise existing laws and develop new statutes to address emerging cybersecurity and data privacy challenges. This study introduces a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) framework designed to support policymakers by extracting relevant legal content and generating accurate, inquiry-specific responses. The framework focuses on reducing hallucinations in LLMs by using a curated set of domain-specific questions to guide response generation. By incorporating retrieval mechanisms, the system enhances the factual grounding and specificity of its outputs. Our analysis shows that the proposed RAG-based LLM outperforms leading commercial LLMs across four evaluation metrics: AlignScore, ParaScore, BERTScore, and ROUGE, demonstrating its effectiveness in producing reliable and context-aware legal insights. This approach offers a scalable, AI-driven method for legislative analysis, supporting efforts to update legal frameworks in line with advancements in transportation technologies.
zh

[NLP-311] LatentLLM : Attention-Aware Joint Tensor Compression

【速读】: 该论文试图解决现代基础模型(如大型语言模型(Large Language Models, LLMs)和大型多模态模型(Large Multi-modal Models, LMMs))在计算和内存资源消耗过大的问题。其解决方案的关键在于提出一种新的框架,将这些模型转换为低维潜在结构,通过扩展局部激活感知张量分解至全局注意力感知联合张量分解,从而在降低潜在维度的同时显著提升模型精度,实现计算和内存高效的LLMs/LMMs。

链接: https://arxiv.org/abs/2505.18413
作者: Toshiaki Koike-Akino,Xiangyu Chen,Jing Liu,Ye Wang, Pu (Perry)Wang,Matthew Brand
机构: Mitsubishi Electric Research Laboratories (三菱电机研究实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 16 figures

点击查看摘要

Abstract:Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor de-composition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LLMs. We show the benefit on several benchmark including multi-modal reasoning tasks.
zh

[NLP-312] DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

【速读】: 该论文试图解决多模态时间点过程(Temporal Point Process, TPP)建模在大型语言模型(Large Language Models, LLMs)时代进展缓慢的问题,特别是现有数据集主要为单模态,限制了需要联合推理时间、文本和视觉信息的模型发展。解决方案的关键在于提出DanmakuTPPBench基准,其包含两个互补组件:(1) DanmakuTPP-Events,一个从Bilibili视频平台获取的新数据集,用户生成的弹幕自然形成带有精确时间戳、丰富文本内容和对应视频帧的多模态事件;(2) DanmakuTPP-QA,一个通过先进LLMs和多模态LLMs(MLLMs)驱动的多智能体管道构建的挑战性问答数据集,旨在实现复杂的时空视觉推理。

链接: https://arxiv.org/abs/2505.18411
作者: Yue Jiang,Jichu Li,Yang Liu,Dingkang Yang,Feng Zhou,Quyu Kong
机构: Fudan University (复旦大学); Center for Applied Statistics and School of Statistics, Renmin University of China (中国人民大学应用统计中心和统计学院); Alibaba Cloud (阿里云); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (未来区块链与隐私计算北京先进创新中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: this https URL

点击查看摘要

Abstract:We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods’ ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at this https URL
zh

[NLP-313] RaDeR: Reasoning -aware Dense Retrieval Models

【速读】: 该论文试图解决在基于推理的密集检索任务中,如何生成高质量且具有挑战性的样本以提升模型性能的问题。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的推理增强型检索轨迹和自省相关性评估,从而构建出多样且具有硬负样本的推理密集检索数据集。这一方法使得RaDeR模型在数学推理任务中表现出色,并在多个基准测试中优于现有基线模型。

链接: https://arxiv.org/abs/2505.18405
作者: Debrup Das,Sam O’ Nuallain,Razieh Rahimi
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 26 pages

点击查看摘要

Abstract:We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall this http URL, RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR presents the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval to augment reasoning language models. Furthermore, RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work REASONIR, highlighting the quality of our synthesized training data.
zh

[NLP-314] NileChat: Towards Linguistically Diverse and Culturally Aware LLM s for Local Communities

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在低资源语言上的语言能力不足以及文化代表性缺失的问题。现有方法主要依赖于将英语语料库翻译生成合成数据,虽然在语言理解和翻译任务中表现出色,但模型往往偏向源语言文化,未能体现目标社区的文化遗产和价值观。该研究提出了一种方法,用于创建针对特定社区的合成与检索预训练数据,综合考虑其语言、文化遗产和文化价值观。解决方案的关键在于结合本地化语言、文化背景和价值体系,以提升模型对目标社区的适应性和文化契合度。

链接: https://arxiv.org/abs/2505.18383
作者: Abdellah El Mekki,Houdaifa Atou,Omer Nacar,Shady Shehata,Muhammad Abdul-Mageed
机构: The University of British Columbia(不列颠哥伦比亚大学); UM6P(摩洛哥皇家理工学院); PSU(宾夕法尼亚州立大学); MBZUAI(穆罕默德本扎耶德大学人工智能研究院); Invertible AI(可逆人工智能)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.
zh

[NLP-315] ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation

【速读】: 该论文旨在解决小规模架构在命令行接口(CLI)模拟中难以达到与大型预训练语言模型(PLM)相似可信度的问题,其核心挑战在于缺乏丰富的CLI交互数据集。解决方案的关键在于引入Shell Input-Output Environment (ShIOEnv),将命令构建建模为马尔可夫决策过程,通过执行候选命令并返回退出状态、输出和行为目标进展来评估动作效果。为应对组合参数状态-动作空间的不可行性,研究者从手册页中推导出上下文无关语法以屏蔽无效参数,并探索了随机和近端策略优化(PPO)采样方法,以生成四种探索策略,从而显著提升样本效率和数据集质量。

链接: https://arxiv.org/abs/2505.18374
作者: Jarrod Ragsdale,Rajendra Boppana
机构: The University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 11 figures, conference preprint

点击查看摘要

Abstract:Command-line interfaces (CLIs) provide structured textual environments for system administration. Explorations have been performed using pre-trained language models (PLMs) to simulate these environments for safe interaction in high-risk environments. However, their use has been constrained to frozen, large parameter models like GPT. For smaller architectures to reach a similar level of believability, a rich dataset of CLI interactions is required. Existing public datasets focus on mapping natural-language tasks to commands, omitting crucial execution data such as exit codes, outputs, and environmental side effects, limiting their usability for behavioral modeling. We introduce a Shell Input -Output Environment (ShIOEnv), which casts command construction as a Markov Decision Process whose state is the partially built sequence and whose actions append arguments. After each action, ShIOEnv executes the candidate and returns its exit status, output, and progress toward a minimal-length behavioral objective. Due to the intractable nature of the combinatorial argument state-action space, we derive a context-free grammar from man pages to mask invalid arguments from being emitted. We explore random and proximal-policy optimization (PPO)-optimized sampling of unrestricted and grammar-masked action spaces to produce four exploration strategies. We observed that grammar masking and PPO significantly improve sample efficiency to produce a higher quality dataset (maximizing the number of arguments while minimizing redundancies). Policy-generated datasets of shell input-output behavior pairs are used to fine-tune CodeT5, where we observe 85% improvements in BLEU-4 when constraining the action space to grammar productions with an additional 26% improvement when applying PPO. The ShIOEnv environment and curated command behavior datasets are released for use in future research.
zh

[NLP-316] Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems ACL2025

【速读】: 该论文旨在解决企业搜索系统在检索准确且领域特定信息时面临的语义不匹配和术语重叠问题,这些问题会降低下游应用(如知识管理、客户支持和检索增强生成代理)的性能。其解决方案的关键在于提出一种可扩展的硬负样本挖掘框架,通过动态选择语义上具有挑战性但上下文无关的文档来增强已部署的重新排序模型,同时整合多种嵌入模型、执行降维并独特地选择硬负样本,以确保计算效率和语义精度。

链接: https://arxiv.org/abs/2505.18366
作者: Hansa Meghwani,Amit Agarwal,Priyaranjan Pattnayak,Hitesh Laxmichand Patel,Srikant Panda
机构: Oracle AI(Oracle AI)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to ACL 2025

点击查看摘要

Abstract:Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models. Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15% in MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method’s generalizability and readiness for real-world applications. Comments: Accepted to ACL 2025 Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) ACMclasses: H.3.3; I.2.6; I.2.7 Cite as: arXiv:2505.18366 [cs.IR] (or arXiv:2505.18366v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2505.18366 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-317] SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases

【速读】: 该论文旨在解决文本到SQL(Text-to-SQL)系统中模式链接(schema linking)的问题,即如何在自然语言查询与数据库模式之间建立准确的映射关系,以提升生成SQL查询的准确性。其解决方案的关键在于提出一种零样本、无需训练的模式链接方法,首先基于外键关系构建模式图,随后通过单一提示词引导Gemini 2.5 Flash模型提取用户查询中的源表和目标表,再结合经典路径查找算法和后处理步骤,确定最优的表与列连接顺序,从而增强大语言模型(LLM)生成SQL的能力。该方法简单、成本低且可扩展性强,在BIRD基准测试中取得了最先进的性能。

链接: https://arxiv.org/abs/2505.18363
作者: AmirHossein Safdarian,Milad Mohammadi,Ehsan Jahanbakhsh,Mona Shahamat Naderi,Heshaam Faili
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to Gemini 2.5 Flash to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on the BIRD benchmark, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches. We conduct detailed ablation studies to examine the precision-recall trade-off in our framework. Additionally, we evaluate the execution accuracy of our schema filtering method compared to other approaches across various model sizes.
zh

[NLP-318] he Unreason able Effectiveness of Model Merging for Cross-Lingual Transfer in LLM s

【速读】: 该论文试图解决在低资源语言中,大型语言模型(Large Language Models, LLMs)在任务外表现不佳的问题,特别是在缺乏特定任务微调数据的情况下,如何有效实现跨语言迁移。其解决方案的关键在于利用模型参数中与数学推理和多语言能力相关的部分存在非重叠性这一特性,通过模块化框架将数学能力和语言能力的优化分配到不同的模型参数部分,从而提升微调效果。具体方法包括冻结参数或后期模型合并,以实现任务与目标语言参数化的分离优化。

链接: https://arxiv.org/abs/2505.18356
作者: Lucas Bandarkar,Nanyun Peng
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.
zh

[NLP-319] ask Specific Pruning with LLM -Sieve: How Many Parameters Does Your Task Really Need?

【速读】: 该论文试图解决在资源受限环境中部署大型语言模型(Large Language Models, LLMs)时,如何确定任务所需的最小参数量问题。其解决方案的关键在于提出LLM-Sieve框架,该框架通过学习任务感知的联合投影来更准确地近似输出行为,并利用遗传算法为每个矩阵发现差异化的剪枝水平,从而实现20-75%的参数减少,仅导致1-5%的精度下降。

链接: https://arxiv.org/abs/2505.18350
作者: Waleed Reda,Abhinav Jangda,Krishna Chintalapudi
机构: Microsoft Research(微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly being adopted for narrow tasks - such as medical question answering or sentiment analysis - and deployed in resource-constrained settings, a key question arises: how many parameters does a task actually need? In this work, we present LLM-Sieve, the first comprehensive framework for task-specific pruning of LLMs that achieves 20-75% parameter reduction with only 1-5% accuracy degradation across diverse domains. Unlike prior methods that apply uniform pruning or rely on low-rank approximations of weight matrices or inputs in isolation, LLM-Sieve (i) learns task-aware joint projections to better approximate output behavior, and (ii) employs a Genetic Algorithm to discover differentiated pruning levels for each matrix. LLM-Sieve is fully compatible with LoRA fine-tuning and quantization, and uniquely demonstrates strong generalization across datasets within the same task domain. Together, these results establish a practical and robust mechanism to generate smaller performant task-specific models.
zh

[NLP-320] Model Editing with Graph-Based External Memory

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的幻觉问题和过时参数知识的问题,同时克服现有后训练模型编辑方法中存在的过拟合和灾难性遗忘问题。其解决方案的关键在于提出一种基于双曲几何和图神经网络的新型框架HYPE,通过三个核心组件实现精确且稳定的模型编辑:一是利用庞加莱嵌入构建双曲图以保持层次关系并避免父概念编辑对子概念的意外影响;二是采用莫比乌斯变换更新机制,在双曲流形内保持结构一致性;三是结合梯度掩码与周期性图神经网络参数重置的双重稳定策略,防止灾难性遗忘。

链接: https://arxiv.org/abs/2505.18343
作者: Yash Kumar Atri,Ahmed Alaa,Thomas Hartvigsen
机构: University of Virginia (弗吉尼亚大学); UC Berkeley (加州大学伯克利分校); UCSF (加州大学旧金山分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing, yet their practical utility is often limited by persistent issues of hallucinations and outdated parametric knowledge. Although post-training model editing offers a pathway for dynamic updates, existing methods frequently suffer from overfitting and catastrophic forgetting. To tackle these challenges, we propose a novel framework that leverages hyperbolic geometry and graph neural networks for precise and stable model edits. We introduce HYPE (HYperbolic Parameter Editing), which comprises three key components: (i) Hyperbolic Graph Construction, which uses Poincaré embeddings to represent knowledge triples in hyperbolic space, preserving hierarchical relationships and preventing unintended side effects by ensuring that edits to parent concepts do not inadvertently affect child concepts; (ii) Möbius-Transformed Updates, which apply hyperbolic addition to propagate edits while maintaining structural consistency within the hyperbolic manifold, unlike conventional Euclidean updates that distort relational distances; and (iii) Dual Stabilization, which combines gradient masking and periodic GNN parameter resetting to prevent catastrophic forgetting by focusing updates on critical parameters and preserving long-term knowledge. Experiments on CounterFact, CounterFact+, and MQuAKE with GPT-J and GPT2-XL demonstrate that HYPE significantly enhances edit stability, factual accuracy, and multi-hop reasoning.
zh

[NLP-321] PerMedCQA: Benchmarking Large Language Models on Medical Consumer Question Answering in Persian Language

【速读】: 该论文试图解决低资源语言(如波斯语)在医疗消费者问答(Medical Consumer Question Answering, CQA)领域中,缺乏面向消费者和多语言的高质量数据集的问题。解决方案的关键在于构建了PerMedCQA,这是首个针对波斯语的基准数据集,包含68,138个经过精心清洗的问答对,用于评估大语言模型(LLMs)在真实世界消费者生成的医疗问题上的表现。此外,研究还引入了MedJudge,一种基于评分标准的新型评估框架,由LLM评分器驱动,并通过专家人类标注者进行验证,以提高评估的准确性和可靠性。

链接: https://arxiv.org/abs/2505.18331
作者: Naghmeh Jamali,Milad Mohammadi,Danial Baledi,Zahra Rezvani,Hesham Faili
机构: 1School of Computer Science(Institute for Research in Fundamental Sciences)(Tehran)(Iran); 2School of Electrical and Computer Engineering(University of Tehran)(Tehran)(Iran); 3 Department of Computer Science(Faculty of Mathematical Sciences)(Alzahra University)(Tehran)(Iran)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical consumer question answering (CQA) is crucial for empowering patients by providing personalized and reliable health information. Despite recent advances in large language models (LLMs) for medical QA, consumer-oriented and multilingual resources, particularly in low-resource languages like Persian, remain sparse. To bridge this gap, we present PerMedCQA, the first Persian-language benchmark for evaluating LLMs on real-world, consumer-generated medical questions. Curated from a large medical QA forum, PerMedCQA contains 68,138 question-answer pairs, refined through careful data cleaning from an initial set of 87,780 raw entries. We evaluate several state-of-the-art multilingual and instruction-tuned LLMs, utilizing MedJudge, a novel rubric-based evaluation framework driven by an LLM grader, validated against expert human annotators. Our results highlight key challenges in multilingual medical QA and provide valuable insights for developing more accurate and context-aware medical assistance systems. The data is publicly available on this https URL
zh

[NLP-322] Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT -4

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在跨文化情境下可能存在的价值观偏差问题,特别是其对不同文化规范的适应性和公平性。现有研究主要通过直接调查模型或人类的价值观来验证其与西方或北美文化的契合度,但这种方法难以反映模型在真实场景中的表现。为此,论文提出了一种自下而上的解决方案,即让LLMs在不同文化背景的故事中推理文化规范,从而更真实地评估其文化敏感性。该方案的关键在于通过分析模型生成的文化规范,揭示其在文化特定性上的不足以及潜在的刻板印象,进而为开发更加公平、适用于多元用户群体的LLMs提供基础。

链接: https://arxiv.org/abs/2505.18322
作者: Zhuozhuo Joy Liu,Farhan Samir,Mehar Bhatia,Laura K. Nelson,Vered Shwartz
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute for AI (向量人工智能研究所); Department of Linguistics (语言学系); Mila – Quebec AI Institute (Mila – 魁北克人工智能研究所); School of Computer Science (计算机科学学院); Department of Sociology (社会学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.
zh

[NLP-323] hinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在数学任务中通过强化学习(Reinforcement Learning, RL)训练后产生的推理过程过于冗长的问题,这导致了推理成本和延迟的增加。解决方案的关键在于提出一种自适应奖励重塑方法,该方法根据模型性能动态调整准确率与响应长度之间的奖励权衡:当准确率较高时,增加长度惩罚以鼓励快速缩短响应;当准确率下降时,放松惩罚以保证正确性。这种方法在早期阶段加速长度减少,同时避免后期过度压缩,从而在保持较高准确率的同时显著减少推理长度。

链接: https://arxiv.org/abs/2505.18298
作者: Jinyan Su,Claire Cardie
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong reasoning abilities in mathematical tasks, often enhanced through reinforcement learning (RL). However, RL-trained models frequently produce unnecessarily long reasoning traces – even for simple queries – leading to increased inference costs and latency. While recent approaches attempt to control verbosity by adding length penalties to the reward function, these methods rely on fixed penalty terms that are hard to tune and cannot adapt as the model’s reasoning capability evolves, limiting their effectiveness. In this work, we propose an adaptive reward-shaping method that enables LLMs to “think fast and right” – producing concise outputs without sacrificing correctness. Our method dynamically adjusts the reward trade-off between accuracy and response length based on model performance: when accuracy is high, the length penalty increases to encourage faster length reduction; when accuracy drops, the penalty is relaxed to preserve correctness. This adaptive reward accelerates early-stage length reduction while avoiding over-compression in later stages. Experiments across multiple datasets show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy, offering a new direction for cost-efficient adaptive reasoning in large-scale language models.
zh

[NLP-324] AGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在零样本医学推理任务中存在浅层且不稳定的问题,以及微调后的医学LLMs在分布偏移下泛化能力差和对未见临床场景适应性不足的问题。其解决方案的关键在于提出TAGS框架,该框架结合了一个广泛适用的通用模型与一个领域特定的专业模型,通过无需任何模型微调或参数更新的方式提供互补视角。为支持这种通用-专业推理过程,引入了两个辅助模块:一种分层检索机制,通过语义和推理层面的相似性选择多尺度示例;以及一个可靠性评分器,用于评估推理一致性以指导最终答案的聚合。

链接: https://arxiv.org/abs/2505.18283
作者: Jianghao Wu,Feilong Tang,Yulong Li,Ming Hu,Haochen Xue,Shoaib Jameel,Yutong Xie,Imran Razzak
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Monash University (莫纳什大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Southampton (南安普顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 16 pages including references, 2 figures

点击查看摘要

Abstract:Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at this https URL.
zh

[NLP-325] Collaborative Memory: Multi-User Memory Sharing in LLM Agents with Dynamic Access Control

【速读】: 该论文旨在解决多用户、多智能体环境中知识共享的安全性与效率问题,特别是在动态、非对称权限条件下实现跨用户的可追溯和可审计的知识传递。其解决方案的关键在于引入协作记忆(Collaborative Memory)框架,该框架通过双层记忆结构(私有记忆与共享记忆)以及基于二部图编码的非对称访问控制机制,支持细粒度的读写策略,确保内存片段的权限检查、变换视图及更新符合当前用户-代理-资源约束,并实现对记忆操作的全程可审计性。

链接: https://arxiv.org/abs/2505.18279
作者: Alireza Rezazadeh,Zichao Li,Ange Lou,Yuying Zhao,Wei Wei,Yujia Bao
机构: Center for Advanced AI, Accenture (人工智能中心,埃森哲)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Complex tasks are increasingly delegated to ensembles of specialized LLM-based agents that reason, communicate, and coordinate actions-both among themselves and through interactions with external tools, APIs, and databases. While persistent memory has been shown to enhance single-agent performance, most approaches assume a monolithic, single-user context-overlooking the benefits and challenges of knowledge transfer across users under dynamic, asymmetric permissions. We introduce Collaborative Memory, a framework for multi-user, multi-agent environments with asymmetric, time-evolving access controls encoded as bipartite graphs linking users, agents, and resources. Our system maintains two memory tiers: (1) private memory-private fragments visible only to their originating user; and (2) shared memory-selectively shared fragments. Each fragment carries immutable provenance attributes (contributing agents, accessed resources, and timestamps) to support retrospective permission checks. Granular read policies enforce current user-agent-resource constraints and project existing memory fragments into filtered transformed views. Write policies determine fragment retention and sharing, applying context-aware transformations to update the memory. Both policies may be designed conditioned on system, agent, and user-level information. Our framework enables safe, efficient, and interpretable cross-user knowledge sharing, with provable adherence to asymmetric, time-varying policies and full auditability of memory operations.
zh

[NLP-326] MetaGen Blended RAG : Higher Accuracy for Domain-Specific QA Without Fine-Tuning NEURIPS2025

【速读】: 该论文旨在解决企业在特定领域数据集上部署检索增强生成(Retrieval-Augmented Generation, RAG)系统时面临的答案准确性不足的问题。由于企业内部知识库中的语料库通常包含复杂的领域术语,且在预训练阶段未被大型语言模型接触过,导致RAG系统在不同领域(如网络、军事、法律等)或同一领域(如医学)中表现出显著的语义差异,从而影响上下文精度。论文提出的解决方案关键在于通过混合查询索引和元数据增强来提升检索器对领域特定语料库的适应能力,其核心是构建一个基于关键概念、主题和缩写的元数据生成流程,并创建增强型混合索引以提升搜索查询效果,从而避免过拟合并实现跨领域的有效泛化。

链接: https://arxiv.org/abs/2505.18247
作者: Kunal Sawarkar,Shivam R. Solanki,Abhilasha Mangal
机构: IBM(国际商业机器公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Preprint. Paper Submitted NeurIPS 2025- The Thirty-Ninth Annual Conference on Neural Information Processing Systems

点击查看摘要

Abstract:Despite the widespread exploration of Retrieval-Augmented Generation (RAG), its deployment in enterprises for domain-specific datasets remains limited due to poor answer accuracy. These corpora, often shielded behind firewalls in private enterprise knowledge bases, having complex, domain-specific terminology, rarely seen by LLMs during pre-training; exhibit significant semantic variability across domains (like networking, military, or legal, etc.), or even within a single domain like medicine, and thus result in poor context precision for RAG systems. Currently, in such situations, fine-tuning or RAG with fine-tuning is attempted, but these approaches are slow, expensive, and lack generalization for accuracy as the new domain-specific data emerges. We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment. This ‘MetaGen Blended RAG’ method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries. This approach avoids overfitting and generalizes effectively across domains. On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and sets a new benchmark for zero-shot results while outperforming much larger models like GPT3.5. The results are even comparable to the best fine-tuned models on this dataset, and we further demonstrate the robustness and scalability of the approach by evaluating it on other QA datasets like SQuAD, NQ etc.
zh

[NLP-327] Will Large Language Models Transform Clinical Prediction?

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在临床预测场景中的应用问题,探讨其潜在优势与现存挑战。解决方案的关键在于扩展方法学,特别是验证、公平性与偏见评估、生存分析以及监管体系的建立,以实现LLMs在临床预测工作流程中的全面整合。

链接: https://arxiv.org/abs/2505.18246
作者: Yusuf Yildiz,Goran Nenadic,Meghna Jani,David A. Jenkins
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: Submitted to: BMC Diagnostic and Prognostic Research

点击查看摘要

Abstract:Background: Large language models (LLMs) are attracting increasing interest in healthcare. Their ability to summarise large datasets effectively, answer questions accurately, and generate synthesised text is widely recognised. These capabilities are already finding applications in healthcare. Body: This commentary discusses LLMs usage in the clinical prediction context and highlight potential benefits and existing challenges. In these early stages, the focus should be on extending the methodology, specifically on validation, fairness and bias evaluation, survival analysis and development of regulations. Conclusion: We conclude that further work and domain-specific considerations need to be made for full integration into the clinical prediction workflows.
zh

[NLP-328] Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models

【速读】: 该论文试图解决大型Transformer语言模型在文本生成过程中缺乏可解释性的问题,特别是其规划、结构化和实现文本的机制不透明。解决方案的关键在于提出多尺度概率生成理论(Multi_Scale Probabilistic Generation Theory, MSPGT),该理论将生成过程分解为三个语义尺度——全局上下文、中间结构和局部词选择,并将其与Transformer架构中的特定层范围对齐。通过引入注意力跨度阈值和层间互信息峰值两种互补指标,实现了对不同尺度边界的稳定划分,并通过探测任务和因果干预验证了其有效性。

链接: https://arxiv.org/abs/2505.18244
作者: Yukin Zhang,Qi Dong
机构: The Chinese University of Hong Kong (香港中文大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Transformer based language models achieve remarkable performance but remain opaque in how they plan, structure, and realize text. We introduce Multi_Scale Probabilistic Generation Theory (MSPGT), a hierarchical framework that factorizes generation into three semantic scales_global context, intermediate structure, and local word choices and aligns each scale with specific layer ranges in Transformer architectures. To identify scale boundaries, we propose two complementary metrics: attention span thresholds and inter layer mutual information peaks. Across four representative models (GPT-2, BERT, RoBERTa, and T5), these metrics yield stable local/intermediate/global partitions, corroborated by probing tasks and causal interventions. We find that decoder_only models allocate more layers to intermediate and global processing while encoder_only models emphasize local feature extraction. Through targeted interventions, we demonstrate that local scale manipulations primarily influence lexical diversity, intermediate-scale modifications affect sentence structure and length, and global_scale perturbations impact discourse coherence all with statistically significant effects. MSPGT thus offers a unified, architecture-agnostic method for interpreting, diagnosing, and controlling large language models, bridging the gap between mechanistic interpretability and emergent capabilities.
zh

[NLP-329] aming LLM s with Negative Samples: A Reference-Free Framework to Evaluate Presentation Content with Actionable Feedback

【速读】: 该论文试图解决如何自动生成能够有效总结文档并传达概念的演示文稿(presentation slides)的问题,其核心在于评估演示文稿中多模态内容的质量。解决方案的关键是引入了一个基准数据集RefSlides,并提出了REFLEX评估方法,该方法通过生成具有不同度量特定扰动的负样本演示文稿来微调大语言模型(LLM),从而生成评分和可操作的反馈,且无需在推理过程中依赖真实标注的演示文稿。

链接: https://arxiv.org/abs/2505.18240
作者: Ananth Muppidi,Tarak Das,Sambaran Bandyopadhyay,Tripti Shukla,Dharun D A
机构: Adobe Research(Adobe 研究院); IIIT Hyderabad(印度海得拉巴国际信息技术学院); IIT Madras(印度理工学院马德拉斯分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The generation of presentation slides automatically is an important problem in the era of generative AI. This paper focuses on evaluating multimodal content in presentation slides that can effectively summarize a document and convey concepts to a broad audience. We introduce a benchmark dataset, RefSlides, consisting of human-made high-quality presentations that span various topics. Next, we propose a set of metrics to characterize different intrinsic properties of the content of a presentation and present REFLEX, an evaluation approach that generates scores and actionable feedback for these metrics. We achieve this by generating negative presentation samples with different degrees of metric-specific perturbations and use them to fine-tune LLMs. This reference-free evaluation technique does not require ground truth presentations during inference. Our extensive automated and human experiments demonstrate that our evaluation approach outperforms classical heuristic-based and state-of-the-art large language model-based evaluations in generating scores and explanations.
zh

[NLP-330] hink or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在多步骤推理中产生的推理链过长导致效率低下的问题。其关键解决方案是引入基于熵的自适应思考策略(Adaptive Think),通过动态停止推理过程以提高效率,当置信度足够高时终止推理,从而在保持竞争性准确率的同时显著减少token使用量。

链接: https://arxiv.org/abs/2505.18237
作者: Xixian Yong,Xiao Zhou,Yingying Zhang,Jinlin Li,Yefeng Zheng,Xian Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:The recent rise of Large Reasoning Models (LRMs) has significantly improved multi-step reasoning performance, but often at the cost of generating excessively long reasoning chains. This paper revisits the efficiency of such reasoning processes through an information-theoretic lens, revealing a fundamental trade-off between reasoning length and semantic efficiency. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution, respectively. Empirical analyses show that longer reasoning chains tend to exhibit higher information bias and diminishing information gain, especially for incorrect answers. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high, improving efficiency while maintaining competitive accuracy. Compared to the Vanilla Think approach (default mode), our strategy yields a 1.10% improvement in average accuracy and a 50.80% reduction in token usage on QwQ-32B across six benchmark tasks spanning diverse reasoning types and difficulty levels, demonstrating superior efficiency and reasoning performance. These results underscore the promise of entropy-based methods for enhancing both accuracy and cost-effiiciency in large language model deployment.
zh

[NLP-331] ELDeR: Getting Efficient LLM s through Data-Driven Regularized Layer-wise Pruning

【速读】: 该论文旨在解决大型语言模型(Large language models, LLMs)在实际部署中因高计算和内存成本而受到的限制。其关键解决方案是提出一种新的“正则化后剪枝”范式,通过先对模型进行正则化处理再进行剪枝,以减少信息损失并避免性能不可逆下降。该方法的核心在于利用数据驱动的方式迭代学习每一层的权重,并通过对低权重层的输出与输入差异施加正则化,促使信息向剩余层转移,从而在保持模型语言建模能力的同时显著降低恢复微调(Recovery Fine-Tuning, RFT)的计算成本。

链接: https://arxiv.org/abs/2505.18232
作者: Mingkuan Feng,Jinyang Wu,Siyuan Liu,Shuai Zhang,Hongjian Fang,Ruihan Jin,Feihu Che,Pengpeng Shao,Zhengqi Wen,Jianhua Tao
机构: Tsinghua University (清华大学); Peking University (北京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The deployment of Large language models (LLMs) in many fields is largely hindered by their high computational and memory costs. Recent studies suggest that LLMs exhibit sparsity, which can be used for pruning. Previous pruning methods typically follow a prune-then-finetune paradigm. Since the pruned parts still contain valuable information, statically removing them without updating the remaining parameters often results in irreversible performance degradation, requiring costly recovery fine-tuning (RFT) to maintain performance. To address this, we propose a novel paradigm: first apply regularization, then prune. Based on this paradigm, we propose ELDeR: Getting Efficient LLMs through Data-Driven Regularized Layer-wise Pruning. We multiply the output of each transformer layer by an initial weight, then we iteratively learn the weights of each transformer layer by using a small amount of data in a simple way. After that, we apply regularization to the difference between the output and input of the layers with smaller weights, forcing the information to be transferred to the remaining layers. Compared with direct pruning, ELDeR reduces the information loss caused by direct parameter removal, thus better preserving the model’s language modeling ability. Experimental results show that ELDeR achieves superior performance compared with powerful layer-wise structured pruning methods, while greatly reducing RFT computational costs. Since ELDeR is a layer-wise pruning method, its end-to-end acceleration effect is obvious, making it a promising technique for efficient LLMs.
zh

[NLP-332] IDA-Bench: Evaluating LLM s on Interactive Guided Data Analysis

【速读】: 该论文试图解决现有基准测试未能反映数据分析师领域迭代特性的问题,即专家在深入理解数据集后其决策会不断演变,而当前评估方法未充分考虑这一多轮交互的复杂性。解决方案的关键在于引入IDA-Bench,这是一个新型基准测试,旨在评估大型语言模型(Large Language Models, LLMs)在多轮交互场景中的表现,任务通过LLM模拟用户以序列化的自然语言指令形式呈现,并依据代理最终的数值输出与人工基准的对比来评估性能。

链接: https://arxiv.org/abs/2505.18223
作者: Hanyu Li,Haoyu Liu,Tingyu Zhu,Tianyu Guo,Zeyu Zheng,Xiaotie Deng,Michael I. Jordan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts’ decisions evolve with deeper insights of the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs’ multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.
zh

[NLP-333] Evidence-Grounded Multimodal Misinformation Detection with Attention-Based GNNs

【速读】: 该论文试图解决多模态非上下文(Multimodal out-of-context, OOC)虚假信息的检测问题,这类虚假信息通过重新利用真实图像并搭配不相关或具有误导性的文字说明来传播错误信息。传统方法如大语言模型(Large Language Models, LLMs)和图文语言模型(Large Vision-Language Models, LVLMs)在缺乏上下文或参数化知识的情况下容易产生幻觉,无法有效进行上下文建模。该研究的关键解决方案是提出一种基于图的方法,通过构建两个图表示——从在线文本证据中提取的证据图和从文字说明中生成的主张图,并利用图神经网络(Graph Neural Networks, GNNs)对这两个图进行编码与比较,从而评估图像与文字说明的一致性及真实性。

链接: https://arxiv.org/abs/2505.18221
作者: Sharad Duwal,Mir Nafis Sharear Shopnil,Abhishek Tyagi,Adiba Mahbub Proma
机构: University of Rochester(罗切斯特大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Multimodal out-of-context (OOC) misinformation is misinformation that repurposes real images with unrelated or misleading captions. Detecting such misinformation is challenging because it requires resolving the context of the claim before checking for misinformation. Many current methods, including LLMs and LVLMs, do not perform this contextualization step. LLMs hallucinate in absence of context or parametric knowledge. In this work, we propose a graph-based method that evaluates the consistency between the image and the caption by constructing two graph representations: an evidence graph, derived from online textual evidence, and a claim graph, from the claim in the caption. Using graph neural networks (GNNs) to encode and compare these representations, our framework then evaluates the truthfulness of image-caption pairs. We create datasets for our graph-based method, evaluate and compare our baseline model against popular LLMs on the misinformation detection task. Our method scores 93.05% detection accuracy on the evaluation set and outperforms the second-best performing method (an LLM) by 2.82% , making a case for smaller and task-specific methods.
zh

[NLP-334] CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在多智能体语言游戏中难以理解和运用隐喻的问题,这限制了其在隐晦沟通和语义规避方面的表现,而这些能力对于战略通信至关重要。解决方案的关键在于提出CoMet框架,该框架结合了基于假设的隐喻推理器与通过自我反思和知识整合不断优化的隐喻生成器,从而提升智能体对隐喻的解释和应用能力。

链接: https://arxiv.org/abs/2505.18218
作者: Shuhang Xu,Fangwei Zhong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To Appear at ACL 2025 (Main)

点击查看摘要

Abstract:Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents’ ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games - Undercover and Adversarial Taboo - which emphasize Covert Communication and Semantic Evasion. Experimental results demonstrate that CoMet significantly enhances the agents’ ability to communicate strategically using metaphors.
zh

[NLP-335] Do BERT-Like Bidirectional Models Still Perform Better on Text Classification in the Era of LLM s?

【速读】: 该论文试图解决当前研究中过度依赖大语言模型(Large Language Models, LLMs)而忽视传统BERT-like模型在文本分类任务中的潜在优势的问题。其解决方案的关键在于通过系统比较三种分类方法——即BERT-like模型微调、LLM内部状态利用以及零样本推理——并在六个高难度数据集上进行实验,揭示不同模型在不同任务类型中的表现差异,进而提出一种细粒度的任务选择策略TaMAS,以实现更合理的模型选择与应用。

链接: https://arxiv.org/abs/2505.18215
作者: Junyan Zhang,Yiming Huang,Shuliang Liu,Yubo Gao,Xuming Hu
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid adoption of LLMs has overshadowed the potential advantages of traditional BERT-like models in text classification. This study challenges the prevailing “LLM-centric” trend by systematically comparing three category methods, i.e., BERT-like models fine-tuning, LLM internal state utilization, and zero-shot inference across six high-difficulty datasets. Our findings reveal that BERT-like models often outperform LLMs. We further categorize datasets into three types, perform PCA and probing experiments, and identify task-specific model strengths: BERT-like models excel in pattern-driven tasks, while LLMs dominate those requiring deep semantics or world knowledge. Based on this, we propose TaMAS, a fine-grained task selection strategy, advocating for a nuanced, task-driven approach over a one-size-fits-all reliance on LLMs.
zh

[NLP-336] owards medical AI misalignment: a preliminary study

【速读】: 该论文试图解决恶意用户通过角色扮演(role-playing)方式对生成式 AI (Generative AI) 进行越狱(jailbreak)的问题,这种攻击方式能够绕过当前大多数模型的安全机制,导致输出不安全甚至可能有害的临床建议。解决方案的关键在于揭示即使不具备模型内部架构和参数的技术知识,恶意用户也可以构造特定的角色扮演提示,从而诱导模型产生错误的输出,该研究旨在展示这一特定漏洞场景,并为未来提升模型安全性提供参考。

链接: https://arxiv.org/abs/2505.18212
作者: Barbara Puccio,Federico Castagna,Allan Tucker,Pierangelo Veltri
机构: Magna Graecia University of Catanzaro (马格纳格雷西亚卡坦扎罗大学); Brunel University of London (布鲁内尔大学伦敦分校); University of Calabria (卡拉布里亚大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite their staggering capabilities as assistant tools, often exceeding human performances, Large Language Models (LLMs) are still prone to jailbreak attempts from malevolent users. Although red teaming practices have already identified and helped to address several such jailbreak techniques, one particular sturdy approach involving role-playing (which we named `Goofy Game’) seems effective against most of the current LLMs safeguards. This can result in the provision of unsafe content, which, although not harmful per se, might lead to dangerous consequences if delivered in a setting such as the medical domain. In this preliminary and exploratory study, we provide an initial analysis of how, even without technical knowledge of the internal architecture and parameters of generative AI models, a malicious user could construct a role-playing prompt capable of coercing an LLM into producing incorrect (and potentially harmful) clinical suggestions. We aim to illustrate a specific vulnerability scenario, providing insights that can support future advancements in the field.
zh

[NLP-337] Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language NAACL2025

【速读】: 该论文试图解决濒危语言在自然语言处理(Natural Language Processing, NLP)中的数字排斥问题,这一问题限制了语言学研究和语言复兴努力。其解决方案的关键在于通过低成本、社区参与的NLP干预措施,支持语言保护。研究提出了一个手工整理的412个短语的数据集、一种合成数据生成流程,并对GPT-4o和GPT-4o-mini进行了语言识别的实证评估,结果表明在少样本提示下,模型性能显著提升,仅需五个示例即可实现接近完美的准确率。这凸显了针对低资源情境的定向NLP方法的潜力,并强调了可见性是实现包容性的第一步。

链接: https://arxiv.org/abs/2505.18159
作者: Jesus Alvarez C,Daua D. Karajeanes,Ashley Celeste Prado,John Ruttan,Ivory Yang,Sean O’Brien,Vasu Sharma,Kevin Zhu
机构: Algoverse AI Research (Algoverse AI 研究)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 13 figures; published in Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2025) at NAACL 2025, Albuquerque, NM

点击查看摘要

Abstract:The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.
zh

[NLP-338] WalkRetrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks SIGIR2025

【速读】: 该论文旨在解决基于知识图谱(Knowledge Graph, KG)的检索增强生成(Retrieval-Augmented Generation, RAG)方法中存在的三个核心问题:(i) KG与文本表示之间的对齐问题,(ii) 检索准确率与效率的平衡问题,以及(iii) 对动态更新KG的适应性问题。其解决方案的关键在于提出WalkRetrieve框架,该框架通过基于路径的图遍历和知识语义化生成语料,实现了无需领域数据微调的零样本RAG,从而具备对KG更新的无缝适应性、较低的计算开销以及与任意预训练大语言模型(Large Language Model, LLM)的兼容性。

链接: https://arxiv.org/abs/2505.16849
作者: Martin Böckling,Heiko Paulheim,Andreea Iana
机构: University of Mannheim(曼海姆大学); Mannheim(曼海姆); Germany(德国)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted at the Information Retrieval’s Role in RAG Systems (IR-RAG 2025) in conjunction with SIGIR 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have showcased impressive reasoning abilities, but often suffer from hallucinations or outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) remedies these shortcomings by grounding LLM responses in structured external information from a knowledge base. However, many KG-based RAG approaches struggle with (i) aligning KG and textual representations, (ii) balancing retrieval accuracy and efficiency, and (iii) adapting to dynamically updated KGs. In this work, we introduce WalkRetrieve, a simple yet effective KG-based framework that leverages walk-based graph traversal and knowledge verbalization for corpus generation for zero-shot RAG. Built around efficient KG walks, our method does not require fine-tuning on domain-specific data, enabling seamless adaptation to KG updates, reducing computational overhead, and allowing integration with any off-the-shelf backbone LLM. Despite its simplicity, WalkRetrieve performs competitively, often outperforming existing RAG systems in response accuracy and hallucination reduction. Moreover, it demonstrates lower query latency and robust scalability to large KGs, highlighting the potential of lightweight retrieval strategies as strong baselines for future RAG research.
zh

[NLP-339] Suicide Risk Assessment Using Multimodal Speech Features: A Study on the SW1 Challenge Dataset INTERSPEECH2025

【速读】: 该论文试图解决青少年基于语音的自杀风险评估问题(suicide risk assessment),其核心挑战在于如何有效融合多模态信息以提高分类的可靠性。解决方案的关键在于采用加权注意力机制结合混合正则化策略,通过整合自动转录、语言嵌入、音频嵌入以及手工设计的声学特征,实现对语音数据的多模态分析,从而提升模型的泛化能力。

链接: https://arxiv.org/abs/2505.13069
作者: Ambre Marie,Ilias Maoudj,Guillaume Dardenne,Gwenolé Quellec
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to the SpeechWellness Challenge at Interspeech 2025; 5 pages, 2 figures, 2 tables

点击查看摘要

Abstract:The 1st SpeechWellness Challenge conveys the need for speech-based suicide risk assessment in adolescents. This study investigates a multimodal approach for this challenge, integrating automatic transcription with WhisperX, linguistic embeddings from Chinese RoBERTa, and audio embeddings from WavLM. Additionally, handcrafted acoustic features – including MFCCs, spectral contrast, and pitch-related statistics – were incorporated. We explored three fusion strategies: early concatenation, modality-specific processing, and weighted attention with mixup regularization. Results show that weighted attention provided the best generalization, achieving 69% accuracy on the development set, though a performance gap between development and test sets highlights generalization challenges. Our findings, strictly tied to the MINI-KID framework, emphasize the importance of refining embedding representations and fusion mechanisms to enhance classification reliability.
zh

[NLP-340] Accelerating the Low-Rank Decomposed Models

【速读】: 该论文试图解决生成式 AI (Generative AI) 模型压缩过程中因低秩分解(Low Rank Decomposition)技术导致的模型深度增加和推理或训练延迟的问题。解决方案的关键在于对低秩分解技术进行改进,以在保持高精度的同时实现低内存消耗,并加速训练和推理过程。

链接: https://arxiv.org/abs/2407.20266
作者: Habib Hajimolahoseini,Walid Ahmed,Austin Wen,Yang Liu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tensor decomposition is a mathematically supported technique for data compression. It consists of applying some kind of a Low Rank Decomposition technique on the tensors or matrices in order to reduce the redundancy of the data. However, it is not a popular technique for compressing the AI models duo to the high number of new layers added to the architecture after decomposition. Although the number of parameters could shrink significantly, it could result in the model be more than twice deeper which could add some latency to the training or inference. In this paper, we present a comprehensive study about how to modify low rank decomposition technique in AI models so that we could benefit from both high accuracy and low memory consumption as well as speeding up the training and inference
zh

[NLP-341] News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation ECIR2025

【速读】: 该论文旨在解决多语言新闻推荐系统在零样本跨语言迁移(ZS-XLT)中的性能下降问题,以及在少样本推荐和冷启动场景下微调预训练语言模型(LM)的计算成本过高问题。其解决方案的关键在于提出一种新闻适配的句子编码器(NaSE),该编码器基于预训练的大规模多语言句子编码器(SE)进行领域专业化,并利用构建的多语言新闻语料库PolyNews和PolyNewsParallel进行优化。通过冻结NaSE的嵌入并结合后期点击行为融合,该方法在冷启动和少样本新闻推荐中实现了最先进的ZS-XLT性能。

链接: https://arxiv.org/abs/2406.12634
作者: Andreea Iana,Fabian David Schmidt,Goran Glavaš,Heiko Paulheim
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at the 47th European Conference on Information Retrieval (ECIR 2025) Appendix A is provided only in the arXiv version

点击查看摘要

Abstract:Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data is scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we construct and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click-behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation.
zh

[NLP-342] GQKVA: Efficient Pre-training of Transformers by Grouping Queries Keys and Values

【速读】: 该论文旨在解决基于Transformer的大规模模型在预训练过程中存在的计算效率低下和参数过多的问题。其解决方案的关键在于提出一种名为GQKVA的通用方法,该方法通过泛化查询、键和值分组技术,在加速Transformer预训练的同时减少模型规模。实验表明,GQKVA在性能与模型大小之间提供了可定制的权衡,并证明传统多头注意力机制并非最优选择,存在更轻量且高效的替代方案。

链接: https://arxiv.org/abs/2311.03426
作者: Farnoosh Javadi,Walid Ahmed,Habib Hajimolahoseini,Foozhan Ataiefard,Mohammad Hassanpour,Saina Asani,Austin Wen,Omar Mohamed Awad,Kangling Liu,Yang Liu
机构: Huawei Technologies(华为技术); Toronto Research Center(多伦多研究中心); Ascend Team(昇腾团队)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
zh

[NLP-343] Improving Resnet-9 Generalization Trained on Small Datasets

【速读】: 该论文旨在解决在有限时间内(少于10分钟)利用小规模数据集(5000张图像,来自CIFAR-10数据集)实现图像分类任务中尽可能高的准确率问题。其解决方案的关键在于对ResNet-9模型进行多项技术优化,包括锐度感知优化(sharpness aware optimization)、标签平滑(label smoothing)、梯度中心化(gradient centralization)、输入块白化(input patch whitening)以及基于元学习的训练(metalearning based training),从而提升模型的泛化能力。

链接: https://arxiv.org/abs/2309.03965
作者: Omar Mohamed Awad,Habib Hajimolahoseini,Michael Lim,Gurpreet Gosal,Walid Ahmed,Yang Liu,Gordon Deng
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents our proposed approach that won the first prize at the ICLR competition on Hardware Aware Efficient Training. The challenge is to achieve the highest possible accuracy in an image classification task in less than 10 minutes. The training is done on a small dataset of 5000 images picked randomly from CIFAR-10 dataset. The evaluation is performed by the competition organizers on a secret dataset with 1000 images of the same size. Our approach includes applying a series of technique for improving the generalization of ResNet-9 including: sharpness aware optimization, label smoothing, gradient centralization, input patch whitening as well as metalearning based training. Our experiments show that the ResNet-9 can achieve the accuracy of 88% while trained only on a 10% subset of CIFAR-10 dataset in less than 10 minuets
zh

[NLP-344] raining Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization

【速读】: 该论文试图解决低秩分解(Low Rank Decomposition, LRD)在压缩深度学习模型时面临的矛盾问题,即在不显著降低模型精度的前提下,如何有效提升训练和推理速度。传统方法由于在架构中引入大量新层,若分解秩(decomposition ranks)不够小,则难以实现显著的加速效果;而使用过小的秩又会导致模型精度大幅下降。论文提出的解决方案关键在于两种技术:秩优化(rank optimization)和分解层的顺序冻结(sequential freezing),这些方法能够在不依赖小秩分解的情况下,显著提升模型的吞吐量,同时保持接近原始模型的精度。

链接: https://arxiv.org/abs/2309.03824
作者: Habib Hajimolahoseini,Walid Ahmed,Yang Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Low Rank Decomposition (LRD) is a model compression technique applied to the weight tensors of deep learning models in order to reduce the number of trainable parameters and computational complexity. However, due to high number of new layers added to the architecture after applying LRD, it may not lead to a high training/inference acceleration if the decomposition ranks are not small enough. The issue is that using small ranks increases the risk of significant accuracy drop after decomposition. In this paper, we propose two techniques for accelerating low rank decomposed models without requiring to use small ranks for decomposition. These methods include rank optimization and sequential freezing of decomposed layers. We perform experiments on both convolutional and transformer-based models. Experiments show that these techniques can improve the model throughput up to 60% during training and 37% during inference when combined together while preserving the accuracy close to that of the original models
zh

[NLP-345] From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

【速读】: 该论文试图解决音频感知大型语言模型(Audio-aware Large Language Models, ALLMs)在训练过程中存在的两个主要问题:一是灾难性遗忘,导致模型在学习音频数据后丢失重要的文本能力;二是跨模态对齐依赖于大量任务特定的问答对,导致资源消耗大。解决方案的关键在于利用ALLMs的基础语言模型生成通用的图文描述式对齐数据,这一过程称为通过基础语言模型合成数据进行音频-语言对齐的自举(Bootstrapping Audio-Language Alignment via Synthetic Data Generation from Backbone LLMs, BALSa)。在此基础上,提出LISTEN(Learning to Identify Sounds Through Extended Negative Samples)方法,通过对比学习增强模型区分存在与不存在声音的能力,并进一步扩展BALSa以处理多音频场景,从而提升音频-语言对齐效果。

链接: https://arxiv.org/abs/2505.20166
作者: Chun-Yi Kuan,Hung-yi Lee
机构: National Taiwan University (台湾大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Project Website: this https URL

点击查看摘要

Abstract:Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where important textual capabilities such as instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about their reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making the process resource-intensive. To address these issues, we leverage the backbone LLMs from ALLMs to synthesize general-purpose caption-style alignment data. We refer to this process as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Building on BALSa, we introduce LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method designed to improve ALLMs’ ability to distinguish between present and absent sounds. We further extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption that describes them all, thereby enhancing audio-language alignment. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills. Moreover, incorporating multi-audio training further enhances the model’s comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to the development of ALLMs.
zh

[NLP-346] MVP: Multi-source Voice Pathology detection INTERSPEECH2025

【速读】: 该论文旨在解决语音障碍的非侵入式自动化诊断问题,该问题因病理语音数据稀缺及录音来源的多样性而未得到充分研究。其解决方案的关键在于提出了一种名为MVP(Multi-source Voice Pathology detection)的新方法,该方法利用直接作用于原始语音信号的Transformer模型,并通过中间特征融合策略有效结合句子朗读和持续元音录音,从而更好地捕捉两种录音类型的互补特性。

链接: https://arxiv.org/abs/2505.20050
作者: Alkis Koudounas,Moreno La Quatra,Gabriele Ciravegna,Marco Fantini,Erika Crosetti,Giovanni Succo,Tania Cerquitelli,Sabato Marco Siniscalchi,Elena Baralis
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data, and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across the German, Portuguese, and Italian languages shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves up to +13% AUC improvement over single-source methods.
zh

[NLP-347] Multi-modal brain encoding models for multi-modal stimuli ICLR-2025

【速读】: 该论文试图解决多模态Transformer模型在处理多模态刺激时对大脑活动预测的准确性问题,特别是探讨不同类型的多模态模型(跨模态和联合预训练)在参与观看电影时与fMRI脑活动的相关性。其解决方案的关键在于通过对比多种单模态和多模态模型,分析多模态表示中各模态对脑区活动对齐的贡献,并发现除了单模态嵌入外,视觉和语言区域还处理了额外的多模态信息,从而揭示了不同模型结构在脑活动预测中的差异性。

链接: https://arxiv.org/abs/2505.20027
作者: Subba Reddy Oota,Khushbu Pahwa,Mounika Marreddy,Maneesh Singh,Manish Gupta,Bapi S. Raju
机构: Technische Universität Berlin(柏林工业大学); Rice Univ(莱斯大学); Univ of Bonn(波恩大学); Spector Inc(斯派克特公司); Microsoft(微软); IIIT Hyderabad(印度信息技术研究所海得拉巴分校)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注: 26 pages, 15 figures, The Thirteenth International Conference on Learning Representations, ICLR-2025, Singapore. this https URL

点击查看摘要

Abstract:Despite participants engaging in unimodal stimuli, such as watching images or silent videos, recent work has demonstrated that multi-modal Transformer models can predict visual brain activity impressively well, even with incongruent modality representations. This raises the question of how accurately these multi-modal models can predict brain activity when participants are engaged in multi-modal stimuli. As these models grow increasingly popular, their use in studying neural activity provides insights into how our brains respond to such multi-modal naturalistic stimuli, i.e., where it separates and integrates information across modalities through a hierarchy of early sensory regions to higher cognition. We investigate this question by using multiple unimodal and two types of multi-modal models-cross-modal and jointly pretrained-to determine which type of model is more relevant to fMRI brain activity when participants are engaged in watching movies. We observe that both types of multi-modal models show improved alignment in several language and visual regions. This study also helps in identifying which brain regions process unimodal versus multi-modal information. We further investigate the contribution of each modality to multi-modal alignment by carefully removing unimodal features one by one from multi-modal representations, and find that there is additional information beyond the unimodal embeddings that is processed in the visual and language regions. Based on this investigation, we find that while for cross-modal models, their brain alignment is partially attributed to the video modality; for jointly pretrained models, it is partially attributed to both the video and audio modalities. This serves as a strong motivation for the neuroscience community to investigate the interpretability of these models for deepening our understanding of multi-modal information processing in brain.
zh

[NLP-348] Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models INTERSPEECH2025

【速读】: 该论文试图解决语音感知语言模型(Speech-aware Language Models, SLMs)在遵循指令能力上的不足以及灾难性遗忘问题。现有基准测试将语音感知与指令遵循能力混杂,无法准确评估这两项独立技能。论文提出的解决方案是引入Speech-IFeval评估框架,该框架专门用于诊断SLMs的指令遵循能力,并揭示其在面对简单指令时表现不佳、对提示变化敏感等问题,从而为未来研究提供方向。

链接: https://arxiv.org/abs/2505.19037
作者: Ke-Han Lu,Chun-Yi Kuan,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accecpted by Interspeech 2025; this https URL

点击查看摘要

Abstract:We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.
zh

[NLP-349] Evaluating the Usefulness of Non-Diagnostic Speech Data for Developing Parkinsons Disease Classifiers INTERSPEECH2025

【速读】: 该论文试图解决如何利用非诊断目的的语音数据进行帕金森病(Parkinson’s Disease, PD)检测的问题,以探索更广泛可用的数据源。其解决方案的关键在于验证Turn-Taking (TT) 数据集在PD分类任务中的有效性,并通过分析数据集特征(如音频拼接、性别和状态分布平衡)来提升分类性能。研究还揭示了不同数据集间模型泛化能力的差异,表明TT数据集在PD检测中具有与专业诊断数据集相当的潜力。

链接: https://arxiv.org/abs/2505.18722
作者: Terry Yi Zhong,Esther Janse,Cristian Tejedor-Garcia,Louis ten Bosch,Martha Larson
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted for Interspeech 2025 (Camera-Ready)

点击查看摘要

Abstract:Speech-based Parkinson’s disease (PD) detection has gained attention for its automated, cost-effective, and non-intrusive nature. As research studies usually rely on data from diagnostic-oriented speech tasks, this work explores the feasibility of diagnosing PD on the basis of speech data not originally intended for diagnostic purposes, using the Turn-Taking (TT) dataset. Our findings indicate that TT can be as useful as diagnostic-oriented PD datasets like PC-GITA. We also investigate which specific dataset characteristics impact PD classification performance. The results show that concatenating audio recordings and balancing participants’ gender and status distributions can be beneficial. Cross-dataset evaluation reveals that models trained on PC-GITA generalize poorly to TT, whereas models trained on TT perform better on PC-GITA. Furthermore, we provide insights into the high variability across folds, which is mainly due to large differences in individual speaker performance.
zh

[NLP-350] Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving INTERSPEECH2025

【速读】: 该论文试图解决语音语言模型(Speech Language Models, SLLMs)在跨任务泛化能力不足的问题,主要原因是缺乏广泛任务的标注语音数据,导致对齐效率低下。其解决方案的关键在于提出一种名为MTBI(Multi-Task Behavior Imitation)的多任务“行为模仿”方法,该方法仅依赖于配对的语音和文本数据,通过确保语言模型解码器对语音和文本生成等效响应,从而实现更泛化的SLLMs,并通过语音-文本交错进一步提升对齐效率。

链接: https://arxiv.org/abs/2505.18644
作者: Jingran Xie,Xiang Li,Hui Wang,Yue Yu,Yang Xiang,Xixin Wu,Zhiyong Wu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task ‘behavior imitation’ method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.
zh

计算机视觉

[CV-0] GLEAM: Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scenes

【速读】:该论文旨在解决移动机器人在复杂未知环境中实现可泛化的主动建图(active mapping)这一关键问题。现有方法由于训练数据不足和保守的探索策略,在不同布局和复杂连通性的场景中表现出有限的泛化能力。解决方案的关键在于提出GLEAM-Bench,首个针对可泛化主动建图的大规模基准,包含1,152个来自合成和真实扫描数据集的多样化3D场景,并基于此提出GLEAM,一种统一的可泛化探索策略,其优势主要来源于语义表示、长期可导航目标以及随机化策略,从而在未见过的复杂场景中实现了更高的覆盖率和映射精度。

链接: https://arxiv.org/abs/2505.20294
作者: Xiao Chen,Tai Wang,Quanyi Li,Tao Huang,Jiangmiao Pang,Tianfan Xue
机构: The Chinese University of Hong Kong (香港中文大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by insufficient training data and conservative exploration strategies, exhibit limited generalizability across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we introduce GLEAM-Bench, the first large-scale benchmark designed for generalizable active mapping with 1,152 diverse 3D scenes from synthetic and real-scan datasets. Building upon this foundation, we propose GLEAM, a unified generalizable exploration policy for active mapping. Its superior generalizability comes mainly from our semantic representations, long-term navigable goals, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 66.50% coverage (+9.49%) with efficient trajectories and improved mapping accuracy on 128 unseen complex scenes. Project page: this https URL.
zh

[CV-1] OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

【速读】:该论文旨在解决视频生成中如何准确且一致地融入参考内容的问题,以提升视频制作的灵活性。其关键解决方案是构建OpenS2V-Nexus框架,包括细粒度基准OpenS2V-Eval和大规模数据集OpenS2V-5M,通过引入多维度评估指标及高质量数据增强模型对主体一致性、自然性和文本相关性的理解与生成能力。

链接: https://arxiv.org/abs/2505.20292
作者: Shenghai Yuan,Xianyi He,Yufan Deng,Yang Ye,Jinfa Huang,Bin Lin,Chongyang Ma,Jiebo Luo,Li Yuan
机构: Peking University, Shenzhen Graduate School (北京大学深圳研究生院); University of Rochester (罗切斯特大学); Rabbitpre AI (Rabbitpre AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code and Dataset: this https URL

点击查看摘要

Abstract:Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model’s ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
zh

[CV-2] VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

【速读】:该论文试图解决视觉代理在工具增强推理中缺乏动态探索、选择和组合多样化工具能力的问题,现有方法要么依赖无训练提示,要么需要大规模微调,均存在工具多样性受限及人工监督需求高的缺陷。解决方案的关键在于提出VisTA框架,该框架基于端到端强化学习,通过任务结果作为反馈信号,迭代优化特定查询的工具选择策略,并利用组相对策略优化(GRPO)使代理自主发现有效的工具选择路径,从而提升泛化能力和多样工具的自适应利用。

链接: https://arxiv.org/abs/2505.20289
作者: Zeyi Huang,Yuyang Ji,Anirudh Sundara Rajan,Zefan Cai,Wen Xiao,Junjie Hu,Yong Jae Lee
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA’s ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
zh

[CV-3] Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots ICML2025

【速读】:该论文旨在解决传统自回归模型在视觉生成任务中难以有效利用全局上下文信息的问题,尤其是在早期令牌预测时。其解决方案的关键在于提出一种分阶段的层次化自回归模型(Hierarchical Masked Autoregressive models, Hi-MAR),通过低分辨率图像令牌作为中间引导,逐步构建高分辨率图像令牌,从而增强对全局结构的感知能力,并在多尺度图像令牌之间建立深入的层次依赖关系。

链接: https://arxiv.org/abs/2505.20288
作者: Guangting Zheng,Yehao Li,Yingwei Pan,Jiajun Deng,Ting Yao,Yanyong Zhang,Tao Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: ICML 2025. Source code is available at this https URL

点击查看摘要

Abstract:Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de-facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context especially for early tokens prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present a Hierarchical Masked Autoregressive models (Hi-MAR) that pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure, in the first phase. Such pivots act as the additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs. Code is available at this https URL.
zh

[CV-4] MotionPro: A Precise Motion Controller for Image-to-Video Generation CVPR2025

【速读】:该论文旨在解决图像到视频(I2V)生成中运动控制精度不足的问题,特别是现有方法依赖大高斯核扩展运动轨迹作为条件,导致运动控制粗糙且无法区分物体运动与相机运动。其解决方案的关键在于提出MotionPro,该方法通过区域级轨迹和运动掩码实现细粒度运动合成与目标运动类别识别,其中区域级轨迹直接利用局部区域内的轨迹以提高控制精度,而运动掩码则从预测的光流图中提取以捕捉运动区域的整体动态,从而提升视频去噪和自然运动控制的效果。

链接: https://arxiv.org/abs/2505.20287
作者: Zhongwei Zhang,Fuchen Long,Zhaofan Qiu,Yingwei Pan,Wu Liu,Ting Yao,Tao Mei
机构: HiDream.ai Inc. (HiDream.ai 公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as condition without explicitly defining movement region, leading to coarse motion control and failing to disentangle object and camera moving. To alleviate these, we present MotionPro, a precise motion controller that novelly leverages region-wise trajectory and motion mask to regulate fine-grained motion synthesis and identify target motion category (i.e., object or camera moving), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories to simulate inference scenario. Instead of extending flow through large Gaussian kernels, our region-wise trajectory approach enables more precise control by directly utilizing trajectories within local regions, thereby effectively characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both region-wise trajectories and motion mask through feature modulation. More remarkably, we meticulously construct a benchmark, i.e., MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: this https URL.
zh

[CV-5] Category-Agnostic Neural Object Rigging CVPR2025

【速读】:该论文试图解决可变形4D对象运动在低维流形中的建模问题,传统方法依赖于基于启发式的表示(如rigging)来实现对动态对象的直观操控,但这些方法因需要特定类别的专家知识而难以扩展。论文提出的解决方案的关键在于设计一种新的表示方式,将可变形4D对象编码为稀疏的空间定位斑点(blobs)和实例感知特征体素,从而解耦3D形状的姿态与实例信息,使得通过修改斑点参数可以直观地操控3D对象姿态,同时保留丰富的实例特异性信息。

链接: https://arxiv.org/abs/2505.20283
作者: Guangzhao He,Chen Geng,Shangzhe Wu,Jiajun Wu
机构: Stanford University (斯坦福大学); University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025. Project Page: this https URL

点击查看摘要

Abstract:The motion of deformable 4D objects lies in a low-dimensional manifold. To better capture the low dimensionality and enable better controllability, traditional methods have devised several heuristic-based methods, i.e., rigging, for manipulating dynamic objects in an intuitive fashion. However, such representations are not scalable due to the need for expert knowledge of specific categories. Instead, we study the automatic exploration of such low-dimensional structures in a purely data-driven manner. Specifically, we design a novel representation that encodes deformable 4D objects into a sparse set of spatially grounded blobs and an instance-aware feature volume to disentangle the pose and instance information of the 3D shape. With such a representation, we can manipulate the pose of 3D objects intuitively by modifying the parameters of the blobs, while preserving rich instance-specific information. We evaluate the proposed method on a variety of object categories and demonstrate the effectiveness of the proposed framework. Project page: this https URL
zh

[CV-6] ImgEdit: A Unified Image Editing Dataset and Benchmark

【速读】:该论文旨在解决开源图像编辑模型在性能上落后于专有模型的问题,主要原因是高质量数据不足和缺乏充分的基准测试。其解决方案的关键在于构建ImgEdit,一个大规模、高质量的图像编辑数据集,包含120万对精心筛选的编辑样本,涵盖新颖且复杂的单轮编辑及具有挑战性的多轮任务,并通过多阶段流程确保数据质量,包括先进的视觉语言模型、检测模型、分割模型以及特定任务的修复程序和严格的后处理步骤。此外,还提出了ImgEdit-Bench基准,用于全面评估图像编辑模型在指令遵循性、编辑质量和细节保留方面的表现。

链接: https://arxiv.org/abs/2505.20275
作者: Yang Ye,Xianyi He,Zongjian Li,Bin Lin,Shenghai Yuan,Zhiyuan Yan,Bohan Hou,Li Yuan
机构: Peking University, Shenzhen Graduate School (北京大学深圳研究生院); Peng Cheng Laboratory (鹏城实验室); Rabbitpre AI (兔前人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in generative models have enabled high-fidelity text-to-image generation. However, open-source image-editing models still lag behind their proprietary counterparts, primarily due to limited high-quality data and insufficient benchmarks. To overcome these limitations, we introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks. To ensure the data quality, we employ a multi-stage pipeline that integrates a cutting-edge vision-language model, a detection model, a segmentation model, alongside task-specific in-painting procedures and strict post-processing. ImgEdit surpasses existing datasets in both task novelty and data quality. Using ImgEdit, we train ImgEdit-E1, an editing model using Vision Language Model to process the reference image and editing prompt, which outperforms existing open-source models on multiple tasks, highlighting the value of ImgEdit and model design. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance in terms of instruction adherence, editing quality, and detail preservation. It includes a basic testsuite, a challenging single-turn suite, and a dedicated multi-turn suite. We evaluate both open-source and proprietary models, as well as ImgEdit-E1, providing deep analysis and actionable insights into the current behavior of image-editing models. The source data are publicly available on this https URL.
zh

[CV-7] Probabilistic Kernel Function for Fast Angle Testing

【速读】:该论文旨在解决高维欧几里得空间中的角度测试问题,其核心挑战在于如何高效地进行角度比较和角度阈值判断。解决方案的关键在于提出两种基于投影的概率核函数,分别用于角度比较和角度阈值设定,与传统方法依赖高斯分布随机投影向量不同,该方法利用参考角度并采用确定性结构的投影向量,从而避免了对渐近假设(如投影向量数量趋于无穷)的依赖,并在理论和实验上均表现出优于基于高斯分布的核函数的性能。

链接: https://arxiv.org/abs/2505.20274
作者: Kejing Lu,Chuan Xiao,Yoshiharu Ishikawa
机构: Nagoya University (名古屋大学); Osaka University (大阪大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Data Structures and Algorithms (cs.DS)
备注:

点击查看摘要

Abstract:In this paper, we study the angle testing problem in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and employs a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be both theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We further apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5X ~ 3X higher query-per-second (QPS) throughput compared to the state-of-the-art graph-based search algorithm HNSW.
zh

[CV-8] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在推理过程中产生的输出不可靠和可解释性有限的问题。现有方法通常依赖于成本高昂的监督信号,如边界框标注、思维链推理或外部工具调用,从而限制了其可扩展性。论文提出的解决方案——Ground-R1,是一种基于强化学习的框架,其关键在于无需显式证据或推理注释即可实现基于视觉证据的推理,通过接地阶段生成符合格式约束的证据区域滚动样本,并在回答阶段结合答案正确性和格式遵循奖励来生成响应。

链接: https://arxiv.org/abs/2505.20272
作者: Meng Cao,Haoze Zhao,Can Zhang,Xiaojun Chang,Ian Reid,Xiaodan Liang
机构: MBZUAI; Peking University; Sun Yat-sen University; University of Science and Technology of China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, the reasoning processes of LVLMs often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that enforces responses anchored on salient visual evidence regions. However, existing approaches typically rely on costly supervision such as bounding box annotations, chain-of-thought rationale or external tool calls, limiting their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks manifest that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.
zh

[CV-9] In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

【速读】:该论文旨在解决多模态引导的视觉生成中定制化主体插入的高保真度与文本提示意图对齐问题。现有方法在将用户指定对象无缝融入图像时,往往难以保持高质量的细节并准确遵循文本指令。其解决方案的关键在于提出“In-Context Brush”,一个基于上下文学习范式的零样本框架,通过将目标图像的掩码区域作为查询,对象图像和文本提示作为跨模态示例,实现无需模型微调的主体插入。该方法通过双层级潜在空间操作——包括注意力头内的“潜在特征偏移”和跨注意力头的“注意力重加权”,以动态调整注意力输出并增强文本控制能力,从而提升图像质量与文本一致性。

链接: https://arxiv.org/abs/2505.20271
作者: Yu Xu,Fan Tang,You Wu,Lin Gao,Oliver Deussen,Hongbin Yan,Jintao Li,Juan Cao,Tong-Yee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly “brushes” user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user’s intent through textual prompts. In this work, we propose “In-Context Brush”, a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject aligning textual prompts without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head “latent feature shifting” within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head “attention reweighting” across different heads that amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.
zh

[CV-10] ParticleGS: Particle-Based Dynamics Modeling of 3D Gaussians for Prior-free Motion Extrapolation

【速读】:该论文试图解决现有动态3D重建方法在学习底层动力学或依赖手动定义的物理先验时存在的局限性,从而限制了其时间外推能力的问题。解决方案的关键在于提出一种基于粒子动力学系统的无先验运动外推框架,通过学习描述3D高斯分布动力学的微分方程,并在未来帧外推过程中遵循这些方程,从而更有效地建模高斯粒子动力学系统。

链接: https://arxiv.org/abs/2505.20270
作者: Jinsheng Quan,Chunshi Wang,Yawei Luo
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper aims to model the dynamics of 3D Gaussians from visual observations to support temporal extrapolation. Existing dynamic 3D reconstruction methods often struggle to effectively learn underlying dynamics or rely heavily on manually defined physical priors, which limits their extrapolation capabilities. To address this issue, we propose a novel dynamic 3D Gaussian Splatting prior-free motion extrapolation framework based on particle dynamics systems. The core advantage of our method lies in its ability to learn differential equations that describe the dynamics of 3D Gaussians, and follow them during future frame extrapolation. Instead of simply fitting to the observed visual frame sequence, we aim to more effectively model the gaussian particle dynamics system. To this end, we introduce a dynamics latent state vector into the standard Gaussian kernel and design a dynamics latent space encoder to extract initial state. Subsequently, we introduce a Neural ODEs-based dynamics module that models the temporal evolution of Gaussian in dynamics latent space. Finally, a Gaussian kernel space decoder is used to decode latent state at the specific time step into the deformation. Experimental results demonstrate that the proposed method achieves comparable rendering quality with existing approaches in reconstruction tasks, and significantly outperforms them in future frame extrapolation. Our code is available at this https URL.
zh

[CV-11] HaloGS: Loose Coupling of Compact Geometry and Gaussian Splats for 3D Scenes

【速读】:该论文旨在解决高保真三维重建与渲染中几何精度与照片级细节保持之间的矛盾问题,现有方法要么将两者融合为复杂的单一模型,要么采用混合方案导致效率与保真度之间的权衡。其解决方案的关键在于提出HaloGS,一种双表示方法,通过松耦合的粗略三角形几何与高斯基元外观表示,结合轻量级经典几何表示的高效性,实现紧凑且富有表现力的模型,能够在室内外环境中进行照片级渲染,并适应不同复杂度的场景。

链接: https://arxiv.org/abs/2505.20267
作者: Changjian Jiang,Kerui Ren,Linning Xu,Jiong Chen,Jiangmiao Pang,Yu Zhang,Bo Dai,Mulin Yu
机构: Zhejiang University (浙江大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); The Chinese University of Hong Kong (香港中文大学); Inria (法国国家信息与自动化研究所); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High fidelity 3D reconstruction and rendering hinge on capturing precise geometry while preserving photo realistic detail. Most existing methods either fuse these goals into a single cumbersome model or adopt hybrid schemes whose uniform primitives lead to a trade off between efficiency and fidelity. In this paper, we introduce HaloGS, a dual representation that loosely couples coarse triangles for geometry with Gaussian primitives for appearance, motivated by the lightweight classic geometry representations and their proven efficiency in real world applications. Our design yields a compact yet expressive model capable of photo realistic rendering across both indoor and outdoor environments, seamlessly adapting to varying levels of scene complexity. Experiments on multiple benchmark datasets demonstrate that our method yields both compact, accurate geometry and high fidelity renderings, especially in challenging scenarios where robust geometric structure make a clear difference.
zh

[CV-12] Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

【速读】:该论文旨在解决长时序视频-音频推理与细粒度像素理解对多模态模型提出的矛盾需求:密集的时间覆盖需要大量低分辨率帧,而精确的定位则要求高分辨率输入。其解决方案的关键在于采用一种双系统架构,即全局推理系统通过选择信息丰富的关键帧并在低空间成本下重写任务,而细节理解系统则在选定的高分辨率片段上进行像素级定位。为应对关键帧选择和任务重写过程中存在的模糊性和监督困难,作者将其建模为强化学习(Reinforcement Learning, RL)问题,并提出了Omni-R1,一个基于群体相对策略优化的端到端RL框架。

链接: https://arxiv.org/abs/2505.20256
作者: Hao Zhong,Muzhi Zhu,Zongze Du,Zheng Huang,Canyu Zhao,Mingyu Liu,Wen Wang,Hao Chen,Chunhua Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because ``optimal’’ keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universally foundation models. Comments: Project page: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2505.20256 [cs.CV] (or arXiv:2505.20256v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.20256 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-13] AniCrafter: Customizing Realistic Human-Centric Animation via Avatar-Background Conditioning in Video Diffusion Models ICRA

【速读】:该论文旨在解决现有视频扩散模型在开放领域场景中进行角色动画生成时的局限性,尤其是在动态背景或复杂人体姿态下的表现不足。其关键解决方案是提出一种基于扩散的以人类为中心的动画模型AniCrafter,该模型通过引入“虚拟角色-背景”条件机制,将开放领域的人类中心动画问题重新定义为一个恢复任务,从而实现更稳定和多样的动画输出。

链接: https://arxiv.org/abs/2505.20255
作者: Muyao Niu,Mingdeng Cao,Yifan Zhan,Qingtian Zhu,Mingze Ma,Jiancheng Zhao,Yanhong Zeng,Zhihang Zhong,Xiao Sun,Yinqiang Zheng
机构: The University of Tokyo (东京大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Github: this https URL

点击查看摘要

Abstract:Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce \textbfAniCrafter , a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative “avatar-background” conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Codes will be available at this https URL.
zh

[CV-14] Seeing is Believing but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models

【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)中不确定性量化(Uncertainty Quantification)的可靠性问题,特别是模型通过自然语言表达置信度(verbalized confidence)的有效性不足问题。研究发现,当前VLMs在多种任务和场景下普遍存在显著的校准偏差(miscalibration),而视觉推理模型在校准方面表现更优,表明模态特定推理对可靠不确定性估计至关重要。为解决这一问题,作者提出了一种两阶段提示策略——视觉置信感知提示(Visual Confidence-Aware Prompting),旨在提升多模态设置下的置信度对齐。解决方案的关键在于通过引入模态感知的提示机制,增强模型对不确定性的准确表达与校准能力。

链接: https://arxiv.org/abs/2505.20236
作者: Weihao Xuan,Qingcheng Zeng,Heli Qi,Junjue Wang,Naoto Yokoya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.
zh

[CV-15] Multimodal Federated Learning With Missing Modalities through Feature Imputation Network

【速读】:该论文旨在解决医疗领域中多模态联邦学习(multimodal federated learning)因数据缺失而导致的模型训练难题。在实际应用中,由于临床实践差异、成本与可及性限制、回顾性数据收集、隐私顾虑以及技术或人为错误等原因,常常出现模态缺失的问题。传统方法依赖公开的真实数据集或合成数据来弥补缺失模态,但获取针对每种疾病的真实数据集不切实际,且生成模型在高维医学数据上的训练既计算成本高又容易出错。该论文提出了一种轻量级、低维特征翻译器,用于重建缺失模态的瓶颈特征,其关键在于通过低维特征映射实现高效且准确的模态补全,从而提升模型性能。

链接: https://arxiv.org/abs/2505.20232
作者: Pranav Poudel,Aavash Chhetri,Prashnna Gyawali,Georgios Leontidis,Binod Bhattarai
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: MIUA 2025

点击查看摘要

Abstract:Multimodal federated learning holds immense potential for collaboratively training models from multiple sources without sharing raw data, addressing both data scarcity and privacy concerns, two key challenges in healthcare. A major challenge in training multimodal federated models in healthcare is the presence of missing modalities due to multiple reasons, including variations in clinical practice, cost and accessibility constraints, retrospective data collection, privacy concerns, and occasional technical or human errors. Previous methods typically rely on publicly available real datasets or synthetic data to compensate for missing modalities. However, obtaining real datasets for every disease is impractical, and training generative models to synthesize missing modalities is computationally expensive and prone to errors due to the high dimensionality of medical data. In this paper, we propose a novel, lightweight, low-dimensional feature translator to reconstruct bottleneck features of the missing modalities. Our experiments on three different datasets (MIMIC-CXR, NIH Open-I, and CheXpert), in both homogeneous and heterogeneous settings consistently improve the performance of competitive baselines. The code and implementation details are available at: this https URL
zh

[CV-16] PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology

【速读】:该论文旨在解决病理基础模型(Pathology Foundation Models, PFMs)在临床转化过程中面临的挑战,包括不同癌症类型间最优模型的差异性、评估中的潜在数据泄露问题以及缺乏标准化基准。其解决方案的关键在于构建PathBench,这是一个全面的基准测试平台,通过多中心内部数据集、严格的泄漏预防机制、覆盖从诊断到预后全临床谱的评估以及自动化排行榜系统,实现了对PFMs的客观比较和持续评估,从而推动这些模型向实际临床应用的转化。

链接: https://arxiv.org/abs/2505.20202
作者: Jiabo Ma,Yingxue Xu,Fengtao Zhou,Yihui Wang,Cheng Jin,Zhengrui Guo,Jianfeng Wu,On Ki Tang,Huajun Zhou,Xi Wang,Luyang Luo,Zhengyu Zhang,Du Cai,Zizhao Gao,Wei Wang,Yueping Liu,Jiankun He,Jing Cui,Zhenhui Li,Jing Zhang,Feng Gao,Xiuming Zhang,Li Liang,Ronald Cheong Kin Chan,Zhe Wang,Hao Chen
机构: Hong Kong University of Science and Technology (香港科技大学); Chinese University of Hong Kong (香港中文大学); Fourth Military Medical University (第四军医大学); Sun Yat-sen University (中山大学); Harvard University (哈佛大学); Southern Medical University (南方医科大学); University of Science and Technology of China (中国科学技术大学); Hebei Medical University (河北医科大学); Shandong First Medical University (山东第一医科大学); Zhejiang University (浙江大学); Kunming Medical University (昆明医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 9 figures

点击查看摘要

Abstract:The emergence of pathology foundation models has revolutionized computational histopathology, enabling highly accurate, generalized whole-slide image analysis for improved cancer diagnosis, and prognosis assessment. While these models show remarkable potential across cancer diagnostics and prognostics, their clinical translation faces critical challenges including variability in optimal model across cancer types, potential data leakage in evaluation, and lack of standardized benchmarks. Without rigorous, unbiased evaluation, even the most advanced PFMs risk remaining confined to research settings, delaying their life-saving applications. Existing benchmarking efforts remain limited by narrow cancer-type focus, potential pretraining data overlaps, or incomplete task coverage. We present PathBench, the first comprehensive benchmark addressing these gaps through: multi-center in-hourse datasets spanning common cancers with rigorous leakage prevention, evaluation across the full clinical spectrum from diagnosis to prognosis, and an automated leaderboard system for continuous model assessment. Our framework incorporates large-scale data, enabling objective comparison of PFMs while reflecting real-world clinical complexity. All evaluation data comes from private medical providers, with strict exclusion of any pretraining usage to avoid data leakage risks. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks. Currently, our evaluation of 19 PFMs shows that Virchow2 and H-Optimus-1 are the most effective models overall. This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.
zh

[CV-17] Long-Context State-Space Video World Models

【速读】:该论文试图解决视频扩散模型在长期记忆保持方面的局限性,这一问题源于注意力层处理长序列时的高计算成本。解决方案的关键在于提出一种新颖的架构,利用状态空间模型(State-Space Models, SSMs)在不牺牲计算效率的前提下扩展时间记忆能力。该方法充分利用了SSMs在因果序列建模中的固有优势,并通过分块式SSM扫描方案与密集局部注意力机制相结合,实现了时空一致性的有效平衡。

链接: https://arxiv.org/abs/2505.20171
作者: Ryan Po,Yotam Nitzan,Richard Zhang,Berlin Chen,Tri Dao,Eli Shechtman,Gordon Wetzstein,Xun Huang
机构: Stanford University (斯坦福大学); Princeton University (普林斯顿大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
zh

[CV-18] HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

【速读】:该论文旨在解决音频驱动的人类动画生成中的三个关键问题:在生成高度动态视频的同时保持角色一致性、实现角色与音频之间精确的情感对齐,以及支持多角色的音频驱动动画。其解决方案的关键在于提出HunyuanVideo-Avatar模型,该模型基于多模态扩散变换器(MM-DiT),并通过三个核心创新实现突破:首先,设计角色图像注入模块以替代传统的基于添加的角色条件方案,从而消除训练与推理间的条件不匹配;其次,引入音频情感模块(AEM)以提取并传递情感参考图像中的情感线索至目标视频,实现细粒度的情感风格控制;最后,提出面向人脸的音频适配器(FAA)以通过潜在层面的人脸掩码隔离音频驱动角色,并在多角色场景中通过交叉注意力实现独立的音频注入。这些创新使模型在基准数据集和新提出的野外数据集上超越了现有方法。

链接: https://arxiv.org/abs/2505.20156
作者: Yi Chen,Sen Liang,Zixiang Zhou,Ziyao Huang,Yifeng Ma,Junshu Tang,Qin Lin,Yuan Zhou,Qinglin Lu
机构: Tencent Hunyuan(腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures the dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.
zh

[CV-19] FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)依赖自回归(Autoregressive, AR)架构所带来的局限性,例如图像生成中的栅格扫描顺序限制和因果上下文建模中的推理能力受限。其解决方案的关键在于引入FUDOKI,这是一个完全基于离散流匹配(Discrete Flow Matching)的统一多模态模型,通过度量诱导的概率路径与动力学最优速度,实现了迭代优化、自我修正能力和更丰富的双向上下文整合,从而超越了传统的基于掩码的破坏过程。

链接: https://arxiv.org/abs/2505.20147
作者: Jin Wang,Yao Lai,Aoxue Li,Shifeng Zhang,Jiacheng Sun,Ning Kang,Chengyue Wu,Zhenguo Li,Ping Luo
机构: The University of Hong Kong (香港大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 12 figures

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.
zh

[CV-20] Agent ic 3D Scene Generation with Spatially Contextualized VLMs

【速读】:该论文试图解决视觉语言模型(VLMs)在生成和推理结构化三维场景方面的不足,这一问题限制了其在具身人工智能、沉浸式模拟和交互式三维应用等空间基础任务中的实用性。解决方案的关键在于引入一种新的范式,通过注入持续演化的空间上下文,使VLMs能够生成、理解和编辑复杂的三维环境。该空间上下文由三个组件构成:提供高层次语义蓝图的场景肖像、捕捉物体级几何的语义标注点云以及编码丰富空间关系的场景超图,从而为VLMs提供一个结构化且几何感知的工作记忆,结合其固有的多模态推理能力与结构化三维理解,实现有效的空间推理。

链接: https://arxiv.org/abs/2505.20129
作者: Xinhang Liu,Yu-Wing Tai,Chi-Keung Tang
机构: HKUST(香港科技大学); Dartmouth College(达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications.
zh

[CV-21] OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender

【速读】:该论文试图解决从多视角全向图像中进行高保真三维重建时所面临的几何失真问题,特别是由于常见全向表示(如等距柱状投影)在极区的严重形变所带来的挑战。解决方案的关键在于构建一个名为Omnidirectional Blender 3D (OB3D) 的合成数据集,该数据集包含由Blender 3D项目生成的多样化且复杂的三维场景,并提供全面的地面真实数据,包括全向RGB图像、精确的全向相机参数以及像素对齐的等距柱状深度和法线图,旨在为现有方法提供严格的评估环境并推动新方法的发展。

链接: https://arxiv.org/abs/2505.20126
作者: Shintaro Ito,Natsuki Takama,Toshiki Watanabe,Koichi Ito,Hwann-Tzong Chen,Takafumi Aoki
机构: Tohoku University (东北大学); National Tsing Hua University (国立清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in radiance field rendering, exemplified by Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly progressed 3D modeling and reconstruction. The use of multiple 360-degree omnidirectional images for these tasks is increasingly favored due to advantages in data acquisition and comprehensive scene capture. However, the inherent geometric distortions in common omnidirectional representations, such as equirectangular projection (particularly severe in polar regions and varying with latitude), pose substantial challenges to achieving high-fidelity 3D reconstructions. Current datasets, while valuable, often lack the specific focus, scene composition, and ground truth granularity required to systematically benchmark and drive progress in overcoming these omnidirectional-specific challenges. To address this critical gap, we introduce Omnidirectional Blender 3D (OB3D), a new synthetic dataset curated for advancing 3D reconstruction from multiple omnidirectional images. OB3D features diverse and complex 3D scenes generated from Blender 3D projects, with a deliberate emphasis on challenging scenarios. The dataset provides comprehensive ground truth, including omnidirectional RGB images, precise omnidirectional camera parameters, and pixel-aligned equirectangular maps for depth and normals, alongside evaluation metrics. By offering a controlled yet challenging environment, OB3Daims to facilitate the rigorous evaluation of existing methods and prompt the development of new techniques to enhance the accuracy and reliability of 3D reconstruction from omnidirectional images.
zh

[CV-22] UNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos CVPR2025

【速读】:该论文旨在解决现有视频理解基准在处理视频内容时过于侧重特定方面或单独处理时间属性,而忽视了视频整体动态特性的不足。其解决方案的关键在于提出TUNA,一个面向细粒度理解的时序导向基准,包含字幕生成和问答两个互补任务,通过多样化视频场景与动态性以及可解释且稳健的评估标准,推动对视频时间关系的全面理解。

链接: https://arxiv.org/abs/2505.20124
作者: Fanheng Kong,Jingyuan Zhang,Hongzhi Zhang,Shi Feng,Daling Wang,Linhao Yu,Xingguang Ji,Yu Tian,Qi Wang,Fuzheng Zhang
机构: Northeastern University (东北大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Multimedia (cs.MM)
备注: Accepted to CVPR 2025 Main. Project page: this https URL

点击查看摘要

Abstract:Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models. The data and code are available at this https URL.
zh

[CV-23] Understanding Generalization in Diffusion Models via Probability Flow Distance

【速读】:该论文试图解决扩散模型在高维数据下分布泛化能力评估的问题,现有理论指标在实际应用中不可行,而缺乏能够严格衡量泛化的实用指标。解决方案的关键是引入概率流距离(Probability Flow Distance, PFD),这是一种基于理论且计算高效的度量方法,通过比较由概率流常微分方程(ODE)诱导的噪声到数据映射来量化分布间的距离。

链接: https://arxiv.org/abs/2505.20123
作者: Huijie Zhang,Zijian Huang,Siyi Chen,Jinfan Zhou,Zekai Zhang,Peng Wang,Qing Qu
机构: University of Michigan(密歇根大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 41 pages, 14 figures

点击查看摘要

Abstract:Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality samples that generalize beyond the training data. However, evaluating this generalization remains challenging: theoretical metrics are often impractical for high-dimensional data, while no practical metrics rigorously measure generalization. In this work, we bridge this gap by introducing probability flow distance ( \textttPFD ), a theoretically grounded and computationally efficient metric to measure distributional generalization. Specifically, \textttPFD quantifies the distance between distributions by comparing their noise-to-data mappings induced by the probability flow ODE. Moreover, by using \textttPFD under a teacher-student evaluation protocol, we empirically uncover several key generalization behaviors in diffusion models, including: (1) scaling behavior from memorization to generalization, (2) early learning and double descent training dynamics, and (3) bias-variance decomposition. Beyond these insights, our work lays a foundation for future empirical and theoretical studies on generalization in diffusion models.
zh

[CV-24] MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

【速读】:该论文试图解决在儿童词汇学习过程中出现的互斥性(Mutual Exclusivity, ME)偏差问题,其核心在于构建一个更具挑战性和现实性的评估框架。解决方案的关键是提出MEBench基准,该基准不仅包含传统的ME任务,还引入了空间推理以增强任务复杂度,并通过新颖的评估指标来捕捉基于ME的推理关键特征,同时提供了一个灵活且可扩展的数据生成管道,以支持多样化标注场景的构建。

链接: https://arxiv.org/abs/2505.20122
作者: Anh Thai,Stefan Stojanov,Zixuan Huang,Bikram Boote,James M. Rehg
机构: Georgia Institute of Technology (佐治亚理工学院); University of Illinois, Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. We assess the performance of state-of-the-art vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes.
zh

[CV-25] Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

【速读】:该论文旨在解决文本到多视角(Text-to-Multiview, T2MV)生成任务中,使用少步数扩散模型加速生成过程时所面临的图像保真度和视角一致性下降的问题。其解决方案的关键在于提出一种针对少步数T2MV扩散模型的强化学习(Reinforcement Learning, RL)微调框架,通过将多视角去噪过程建模为统一的马尔可夫决策过程,并引入ZMV-Sampling采样技术与MV-ZigAL策略优化方法,以联合优化单视角保真度与跨视角一致性。此外,通过将RL微调重构为受显式联合视角约束的优化问题,实现了更高效且平衡的策略更新。

链接: https://arxiv.org/abs/2505.20107
作者: Ziyi Zhang,Li Shen,Deheng Ye,Yong Luo,Huangxuan Zhao,Lefei Zhang
机构: Wuhan University (武汉大学); Sun Yat-Sen University (中山大学); Tencent (腾讯)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.
zh

[CV-26] From Data to Modeling: Fully Open-vocabulary Scene Graph Generation

【速读】:该论文旨在解决传统闭集模型在场景图生成(Scene Graph Generation, SGG)任务中的局限性,即其对象和关系识别受限于固定词汇表,难以适应现实世界中频繁出现的新概念。解决方案的关键在于提出一种基于Transformer的框架OvSGTR,该框架能够联合预测超出预定义类别的对象及其相互关系,并通过DETR-like架构提取高质量的视觉与语义特征,结合Transformer解码器实现端到端的场景图生成。此外,论文还引入了关系感知的预训练策略和视觉概念保留机制,以增强模型对复杂视觉关系的理解并缓解开放词汇设置下的灾难性遗忘问题。

链接: https://arxiv.org/abs/2505.20106
作者: Zuyao Chen,Jinlin Wu,Zhen Lei,Chang Wen Chen
机构: The Hong Kong Polytechnic University (香港理工大学); CAIR, HKISI-CAS (CAIR,香港智能系统与人工智能联合实验室); Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model’s understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines–scene parser-based, LLM-based, and multimodal LLM-based–to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
zh

[CV-27] AdaTP: Attention-Debiased Token Pruning for Video Large Language Models

【速读】:该论文旨在解决视频大语言模型(Video LLMs)在处理视频理解任务时因多帧视频生成大量视觉标记而导致的计算开销过大的问题。现有视觉标记压缩方法依赖于语言模型中的注意力分数作为指导,但这些分数存在固有的偏差:全局偏差倾向于关注视觉标记序列的两端,局部偏差则导致对不同帧中相同空间位置的过度集中。为了解决注意力偏差问题,作者提出了AdaTP(Attention-Debiased Token Pruning),这是一种针对Video LLMs的新型标记剪枝流程,其关键在于集成两个专门的去偏模块,分别针对全局和局部注意力偏差,在无需额外训练的情况下显著降低计算开销并保持原始模型性能。

链接: https://arxiv.org/abs/2505.20100
作者: Fengyuan Sun,Leqi Shen,Hui Chen,Sicheng Zhao,Jungong Han,Guiguang Ding
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Large Language Models (Video LLMs) have achieved remarkable results in video understanding tasks. However, they often suffer from heavy computational overhead due to the large number of visual tokens generated from multiple video frames. Existing visual token compression methods often rely on attention scores from language models as guidance. However, these scores exhibit inherent biases: global bias reflects a tendency to focus on the two ends of the visual token sequence, while local bias leads to an over-concentration on the same spatial positions across different frames. To address the issue of attention bias, we propose \textbfA ttention- \textbfD ebi \textbfa sed \textbfT oken \textbfP runing for Video Large Language Models ( \textbfAdaTP ), a novel token pruning pipeline for Video LLMs. AdaTP integrates two dedicated debiasing modules into the pipeline, targeting global attention bias and local attention bias, respectively. Without the need for additional training, our method significantly reduces the computational overhead of Video LLMs while retaining the performance of vanilla models. Extensive evaluation shows that AdaTP achieves state-of-the-art performance in various commonly used video understanding benchmarks. In particular, on LLaVA-OneVision-7B, AdaTP maintains performance without degradation while using only up to 27.3% FLOPs compared to the vanilla model. Our code will be released soon.
zh

[CV-28] M3DHMR: Monocular 3D Hand Mesh Recovery

【速读】:该论文旨在解决单目3D手部网格恢复(Monocular 3D Hand Mesh Recovery)中的挑战,包括手部高自由度、2D到3D的歧义性以及自遮挡问题。现有方法在效率或直接预测3D网格顶点位置方面存在不足。论文提出的解决方案关键在于设计了一种新的流水线M3DHMR,其核心是包含多个动态螺旋卷积(Dynamic Spiral Convolution, DSC)层和感兴趣区域(Region of Interest, ROI)层的螺旋解码器。DSC层通过根据顶点位置自适应调整权重,在空间和通道维度上提取顶点特征,而ROI层则利用物理信息分别优化每个预定义手部区域的网格顶点。

链接: https://arxiv.org/abs/2505.20058
作者: Yihong Lin,Xianjia Wu,Xilai Wang,Jianqiao Hu,Songju Lei,Xiandong Li,Wenxiong Kang
机构: South China University of Technology (华南理工大学); Huawei Cloud (华为云); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Monocular 3D hand mesh recovery is challenging due to high degrees of freedom of hands, 2D-to-3D ambiguity and self-occlusion. Most existing methods are either inefficient or less straightforward for predicting the position of 3D mesh vertices. Thus, we propose a new pipeline called Monocular 3D Hand Mesh Recovery (M3DHMR) to directly estimate the positions of hand mesh vertices. M3DHMR provides 2D cues for 3D tasks from a single image and uses a new spiral decoder consist of several Dynamic Spiral Convolution (DSC) Layers and a Region of Interest (ROI) Layer. On the one hand, DSC Layers adaptively adjust the weights based on the vertex positions and extract the vertex features in both spatial and channel dimensions. On the other hand, ROI Layer utilizes the physical information and refines mesh vertices in each predefined hand region separately. Extensive experiments on popular dataset FreiHAND demonstrate that M3DHMR significantly outperforms state-of-the-art real-time methods.
zh

[CV-29] PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation

【速读】:该论文旨在解决基于扩散模型的音乐到舞蹈生成中物理合理性不足的问题,即现有方法难以生成符合物理规律的舞蹈动作。解决方案的关键在于提出了一种名为Plausibility-Aware Motion Diffusion (PAMD)的框架,其核心是可实现运动合理性的Plausible Motion Constraint (PMC),该约束通过Neural Distance Fields (NDFs)建模实际姿态流形,并引导生成的运动趋向于物理上有效的姿态流形。

链接: https://arxiv.org/abs/2505.20056
作者: Hongsong Wang,Yin Zhu,Qiuxia Lai,Yang Zhang,Guo-Sen Xie,Xin Geng
机构: School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University (东南大学); State Key Laboratory of Media Convergence and Communication, Communication University of China (中国传媒大学); School of Computer Science and Software Engineering, National Engineering Laboratory for Big Data System Computing Technology, Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University (深圳大学); School of Computer Science and Engineering, Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This project page is available at: this https URL

点击查看摘要

Abstract:Computational dance generation is crucial in many areas, such as art, human-computer interaction, virtual reality, and digital entertainment, particularly for generating coherent and expressive long dance sequences. Diffusion-based music-to-dance generation has made significant progress, yet existing methods still struggle to produce physically plausible motions. To address this, we propose Plausibility-Aware Motion Diffusion (PAMD), a framework for generating dances that are both musically aligned and physically realistic. The core of PAMD lies in the Plausible Motion Constraint (PMC), which leverages Neural Distance Fields (NDFs) to model the actual pose manifold and guide generated motions toward a physically valid pose manifold. To provide more effective guidance during generation, we incorporate Prior Motion Guidance (PMG), which uses standing poses as auxiliary conditions alongside music features. To further enhance realism for complex movements, we introduce the Motion Refinement with Foot-ground Contact (MRFC) module, which addresses foot-skating artifacts by bridging the gap between the optimization objective in linear joint position space and the data representation in nonlinear rotation space. Extensive experiments show that PAMD significantly improves musical alignment and enhances the physical plausibility of generated motions. This project page is available at: this https URL.
zh

[CV-30] Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay

【速读】:该论文试图解决类别增量手势识别(class-incremental gesture recognition)问题,即在不遗忘已有手势识别能力的前提下,有效处理新出现或未见过的手势。解决方案的关键在于提出一种无需数据的原型引导伪特征重放(Prototype-Guided Pseudo Feature Replay, PGPFR)框架,该框架通过动态生成多样化的伪特征、利用旧类原型和协方差矩阵增强模型鲁棒性、引入截断交叉熵缓解领域差异影响以及设计持续分类器再训练策略来防止过拟合,从而实现对新旧手势的稳定识别。

链接: https://arxiv.org/abs/2505.20049
作者: Hongsong Wang,Ao Sun,Jie Gui,Liang Wang
机构: Southeast University (东南大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is on this https URL

点击查看摘要

Abstract:Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on close-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier’s weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8% and 12.8% in terms of mean global accuracy, respectively. The code is available on this https URL.
zh

[CV-31] DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization

【速读】:该论文旨在解决RGB-D场景解析方法在训练过程中依赖大量手动标注的像素级标签所带来的高成本和低效率问题。其解决方案的关键在于提出一种半监督学习框架DepthMatch,通过引入互补块混合增强策略以充分利用未标记数据中的纹理与空间特征之间的潜在关系,并设计轻量级的空间先验注入器替代传统复杂的融合模块,从而提升异构特征融合的效率,同时引入深度引导的边界损失以增强模型的边界预测能力。

链接: https://arxiv.org/abs/2505.20041
作者: Jianxin Huang,Jiahang Li,Sergey Vityazev,Alexander Dvorkovich,Rui Fan
机构: Tongji University (同济大学); Ryazan State Radio Engineering University (里亚赞州立无线电工程大学); Moscow Institute of Physics and Technology (莫斯科物理技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, accepted by IEEE Signal Processing Letters

点击查看摘要

Abstract:RGB-D scene parsing methods effectively capture both semantic and geometric features of the environment, demonstrating great potential under challenging conditions such as extreme weather and low lighting. However, existing RGB-D scene parsing methods predominantly rely on supervised training strategies, which require a large amount of manually annotated pixel-level labels that are both time-consuming and costly. To overcome these limitations, we introduce DepthMatch, a semi-supervised learning framework that is specifically designed for RGB-D scene parsing. To make full use of unlabeled data, we propose complementary patch mix-up augmentation to explore the latent relationships between texture and spatial features in RGB-D image pairs. We also design a lightweight spatial prior injector to replace traditional complex fusion modules, improving the efficiency of heterogeneous feature fusion. Furthermore, we introduce depth-guided boundary loss to enhance the model’s boundary prediction capabilities. Experimental results demonstrate that DepthMatch exhibits high applicability in both indoor and outdoor scenes, achieving state-of-the-art results on the NYUv2 dataset and ranking first on the KITTI Semantics benchmark.
zh

[CV-32] owards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks CVPR2025

【速读】:该论文旨在解决从视频生成高质量钢琴音频时存在的视觉线索与音乐输出之间同步性不足的问题,尤其是现有评估数据集未能充分捕捉钢琴音乐生成所需的复杂同步性。解决方案的关键在于引入了CoP基准数据集——一个完全开源的多模态基准,专门用于视频引导的钢琴音乐生成。该数据集通过详细的多模态标注、通用且专用的视频到钢琴生成任务评估框架以及完整的数据集、标注和评估协议的开放,实现了视频内容与钢琴音频之间的精确语义和时间对齐。

链接: https://arxiv.org/abs/2505.20038
作者: Chang Liu,Haomin Zhang,Shiyu Xia,Zihao Chen,Chaofan Ding,Xin Yue,Huizhe Chen,Xinhan Di
机构: AI Lab, Giant Network (AI 实验室,巨网络); University of Trento (特伦托大学)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 4 pages, 1 figure, accepted by CVPR 2025 MMFM Workshop

点击查看摘要

Abstract:Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal this http URL, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset-a fully open-sourced, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers several compelling features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at this https URL, with a continuously updated leaderboard to promote ongoing research in this domain.
zh

[CV-33] EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition

【速读】:该论文试图解决当前人机交互中AI在感知和解读人类情绪方面的局限性,尤其是现有视觉和视觉-语言模型基准测试在情感谱系覆盖范围狭窄、无法捕捉细微情绪状态(如苦涩、醉酒)以及难以区分相关情绪(如羞耻与尴尬)的问题。此外,现有数据集常使用未经控制的图像,存在面部遮挡和人口统计学多样性不足的问题,导致显著偏差。解决方案的关键在于引入EmoNet Face,这是一个全面的基准套件,包含一个基于基础研究构建的40类情感分类体系、三个大规模AI生成的数据集(EmoNet HQ、Binary和Big),以及多专家严格标注和高保真评估,同时构建了Empathic Insight Face模型,在其基准上实现了接近人类专家的性能。

链接: https://arxiv.org/abs/2505.20033
作者: Christoph Schuhmann,Robert Kaczmarczyk,Gollam Rabby,Maurice Kraus,Felix Friedrich,Huu Nguyen,Krishna Kalyan,Kourosh Nadi,Kristian Kersting,Sören Auer
机构: LAION e.V.(LAION协会); Technical University of Munich(慕尼黑工业大学); L3S Research Center(L3S研究中心); Leibniz University of Hannover(汉诺威莱布尼茨大学); TU Darmstadt(达姆施塔特工业大学); Hessian.AI(Hessian.AI); Ontocord(Ontocord); Kristian Kersting(TU Darmstadt(达姆施塔特工业大学); Hessian.AI(Hessian.AI); DFKI(德国人工智能研究中心); TIB–Leibniz Information Centre for Science and Technology(TIB-莱布尼茨科技信息中心); L3S Research Center(L3S研究中心); Leibniz University of Hannover(汉诺威莱布尼茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective human-AI interaction relies on AI’s ability to accurately perceive and interpret human emotions. Current benchmarks for vision and vision-language models are severely limited, offering a narrow emotional spectrum that overlooks nuanced states (e.g., bitterness, intoxication) and fails to distinguish subtle differences between related feelings (e.g., shame vs. embarrassment). Existing datasets also often use uncontrolled imagery with occluded faces and lack demographic diversity, risking significant bias. To address these critical gaps, we introduce EmoNet Face, a comprehensive benchmark suite. EmoNet Face features: (1) A novel 40-category emotion taxonomy, meticulously derived from foundational research to capture finer details of human emotional experiences. (2) Three large-scale, AI-generated datasets (EmoNet HQ, Binary, and Big) with explicit, full-face expressions and controlled demographic balance across ethnicity, age, and gender. (3) Rigorous, multi-expert annotations for training and high-fidelity evaluation. (4) We build Empathic Insight Face, a model achieving human-expert-level performance on our benchmark. The publicly released EmoNet Face suite - taxonomy, datasets, and model - provides a robust foundation for developing and evaluating AI systems with a deeper understanding of human emotions.
zh

[CV-34] ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

【速读】:该论文旨在解决多模态感知中视觉与触觉信息融合的挑战,特别是在不依赖预训练视觉-语言模型的情况下实现任务和环境的泛化能力。其关键解决方案是提出ViTaPEs框架,该框架基于Transformer结构,采用新颖的多尺度位置编码方案,以捕捉模态内结构并建模跨模态线索,同时提供了可证明的保证,包括编码的单射性、刚体运动等变性和信息保真性。

链接: https://arxiv.org/abs/2505.20032
作者: Fotios Lygerakis,Ozan Özdenizci,Elmar Rückert
机构: Technical University of Leoben(莱奥本矿业大学); Graz University of Technology(格拉茨技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile input data to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures, while simultaneously modeling cross-modal cues. Unlike prior work, we provide provable guarantees in visuotactile fusion, showing that our encodings are injective, rigid-motion-equivariant, and information-preserving, validating these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: this https URL
zh

[CV-35] Reason Plan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在闭环驾驶系统中的应用问题,特别是其在端到端(E2E)自动驾驶中尚未展现出对主流E2E模仿学习方法的明显优势。解决方案的关键在于提出一种名为ReasonPlan的新型MLLM微调框架,该框架通过结合自监督的下一场景预测任务和监督的决策思维链过程,实现对视觉表示与可操作驾驶情境的对齐,并促进可解释且因果基础的决策生成。

链接: https://arxiv.org/abs/2505.20024
作者: Xueyi Liu,Zuodong Zhong,Yuxin Guo,Yun-Fu Liu,Zhiguo Su,Qichao Zhang,Junli Wang,Yinfeng Gao,Yupeng Zheng,Qiao Lin,Huiyong Chen,Dongbin Zhao
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Fujian (福建); University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 18 pages; 9 figures; this https URL

点击查看摘要

Abstract:Due to the powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority to mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context, while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% L2 and 16.1 driving score on Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases. Code and dataset will be found in this https URL.
zh

[CV-36] Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

【速读】:该论文试图解决当前视觉-语言模型(Vision-Language Models, VLMs)在处理基础二维欧几里得几何中的原子视觉技能(atomic visual skills)时表现不佳的问题,尽管这些任务对成人人类来说是微不足道的。解决方案的关键在于系统地分类并定义这些基本、不可分割的视觉感知技能,并引入原子视觉技能数据集(Atomic Visual Skills Dataset, AVSD)以评估VLMs在这些任务上的表现。通过AVSD的基准测试,研究揭示了现有VLMs在原子视觉任务上的局限性,强调了构建专门数据集以训练和评估VLMs的重要性。

链接: https://arxiv.org/abs/2505.20021
作者: Hyunsik Chae,Seungwoo Yoon,Jaden Park,Chloe Yewon Chun,Yongin Cho,Mu Cai,Yong Jae Lee,Ernest K. Ryu
机构: Seoul National University (首尔大学); UCLA (加利福尼亚大学洛杉矶分校); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 69 pages, 16 figures

点击查看摘要

Abstract:Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, despite being trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
zh

[CV-37] NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-ID

【速读】:该论文旨在解决多模态目标重识别(Multi-modal Object Re-Identification, ReID)中因依赖隐式特征融合结构而导致的细粒度识别策略难以建模的问题。其关键解决方案是提出一种基于属性置信度的多模态标题生成方法,以降低多模态大语言模型(MLLMs)在多模态语义生成中的未知识别率,并提升生成文本的质量。此外,论文还提出了NEXT框架,通过文本调制的语义采样专家(TMSE)和上下文共享的结构感知专家(CSSE)分别捕捉模态特定的外观和内在结构特征,并采用多模态特征聚合(MMFA)实现语义与结构专家输出的统一融合,从而提升最终身份表示的准确性。

链接: https://arxiv.org/abs/2505.20001
作者: Shihao Li,Chenglong Li,Aihua Zheng,Andong Lu,Jin Tang,Jixin Ma
机构: Anhui University (安徽大学); University of Greenwich (格林威治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal object re-identification (ReID) aims to extract identity features across heterogeneous spectral modalities to enable accurate recognition and retrieval in complex real-world scenarios. However, most existing methods rely on implicit feature fusion structures, making it difficult to model fine-grained recognition strategies under varying challenging conditions. Benefiting from the powerful semantic understanding capabilities of Multi-modal Large Language Models (MLLMs), the visual appearance of an object can be effectively translated into descriptive text. In this paper, we propose a reliable multi-modal caption generation method based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs in multi-modal semantic generation and improves the quality of generated text. Additionally, we propose a novel ReID framework NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural expert branches to separately capture modality-specific appearance and intrinsic structure. For semantic recognition, we propose the Text-Modulated Semantic-sampling Experts (TMSE), which leverages randomly sampled high-quality semantic texts to modulate expert-specific sampling of multi-modal features and mining intra-modality fine-grained semantic cues. Then, to recognize coarse-grained structure features, we propose the Context-Shared Structure-aware Experts (CSSE) that focuses on capturing the holistic object structure across modalities and maintains inter-modality structural consistency through a soft routing mechanism. Finally, we propose the Multi-Modal Feature Aggregation (MMFA), which adopts a unified feature fusion strategy to simply and effectively integrate semantic and structural expert outputs into the final identity representations.
zh

[CV-38] Optimizing edge AI models on HPC systems with the edge in the loop

【速读】:该论文旨在解决在边缘设备上部署的生成式 AI (Generative AI) 模型体积小、需在短时间内实现高精度推理的问题。传统方法通常从较大的训练模型出发,通过结构剪枝、知识蒸馏或量化等技术减少其内存和延迟开销,但效果有限。本文提出的解决方案关键在于采用硬件感知的神经网络架构搜索(hardware-aware Neural Architecture Search, NAS),通过将位于比利时的边缘设备与德国的高性能计算系统相结合,快速训练可能的架构候选,并在目标硬件上进行实时延迟测量,从而实现高效且高质量的模型优化。

链接: https://arxiv.org/abs/2505.19995
作者: Marcel Aach,Cyril Blanc,Andreas Lintermann,Kurt De Grave
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, accepted for oral presentation at Computational Aspects of Deep Learning 2025 (at ISC 2025)

点击查看摘要

Abstract:Artificial intelligence and machine learning models deployed on edge devices, e.g., for quality control in Additive Manufacturing (AM), are frequently small in size. Such models usually have to deliver highly accurate results within a short time frame. Methods that are commonly employed in literature start out with larger trained models and try to reduce their memory and latency footprint by structural pruning, knowledge distillation, or quantization. It is, however, also possible to leverage hardware-aware Neural Architecture Search (NAS), an approach that seeks to systematically explore the architecture space to find optimized configurations. In this study, a hardware-aware NAS workflow is introduced that couples an edge device located in Belgium with a powerful High-Performance Computing system in Germany, to train possible architecture candidates as fast as possible while performing real-time latency measurements on the target hardware. The approach is verified on a use case in the AM domain, based on the open RAISE-LPBF dataset, achieving ~8.8 times faster inference speed while simultaneously enhancing model quality by a factor of ~1.35, compared to a human-designed baseline.
zh

[CV-39] Progressive Scaling Visual Object Tracking

【速读】:该论文旨在解决视觉目标跟踪中由于训练数据量、模型规模和输入分辨率等因子单独扩展时存在的优化不充分和迭代优化受限问题。其解决方案的关键在于提出DT-Training框架,该框架通过集成小教师迁移和双分支对齐机制,实现模型潜力的最大化,从而提升跟踪性能并增强方法的泛化能力和迁移性。

链接: https://arxiv.org/abs/2505.19990
作者: Jack Hong,Shilin Yan,Zehao Xiao,Jiayin Cai,Xiaolong Jiang,Yao Hu,Henghui Ding
机构: Xiaohongshu Inc. (小红书公司); AIM Lab, University of Amsterdam (阿姆斯特丹大学AI实验室); Institute of Big Data, Fudan University (复旦大学大数据研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we propose a progressive scaling training strategy for visual object tracking, systematically analyzing the influence of training data volume, model size, and input resolution on tracking performance. Our empirical study reveals that while scaling each factor leads to significant improvements in tracking accuracy, naive training suffers from suboptimal optimization and limited iterative refinement. To address this issue, we introduce DT-Training, a progressive scaling framework that integrates small teacher transfer and dual-branch alignment to maximize model potential. The resulting scaled tracker consistently outperforms state-of-the-art methods across multiple benchmarks, demonstrating strong generalization and transferability of the proposed method. Furthermore, we validate the broader applicability of our approach to additional tasks, underscoring its versatility beyond tracking.
zh

[CV-40] Structured Initialization for Vision Transformers

【速读】:该论文试图解决视觉Transformer (Vision Transformer, ViT) 在小规模数据集上性能不足的问题,同时保持其在大规模数据集上的可扩展性。解决方案的关键在于将卷积神经网络 (Convolutional Neural Networks, CNN) 的归纳偏置通过初始化策略引入ViT,而非通过架构修改。该方法基于实验结果,即随机脉冲滤波器可以在CNN中实现与学习滤波器相当的性能,并改进了当前依赖经验启发式(如使用预训练模型的注意力权重)的ViT初始化策略,从而在多个小到中等规模数据集上显著优于标准ViT初始化,同时在大规模数据集上保持竞争力。

链接: https://arxiv.org/abs/2505.19985
作者: Jianqiao Zheng,Xueqian Li,Hemanth Saratchandran,Simon Lucey
机构: The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method significantly outperforms standard ViT initialization across numerous small and medium-scale benchmarks, including Food-101, CIFAR-10, CIFAR-100, STL-10, Flowers, and Pets, while maintaining comparative performance on large-scale datasets such as ImageNet-1K. Moreover, our initialization strategy can be easily integrated into various transformer-based architectures such as Swin Transformer and MLP-Mixer with consistent improvements in performance.
zh

[CV-41] ICDM: Interference Cancellation Diffusion Models for Wireless Semantic Communications

【速读】:该论文旨在解决无线语义通信系统中干扰抑制的问题,特别是在存在高斯噪声和未知干扰的情况下,如何有效提升信号质量。其解决方案的关键在于将干扰消除问题建模为信号与干扰联合后验概率下的最大后验(MAP)问题,并通过设计一种干扰消除扩散模型(ICDM)来实现。ICDM通过将联合后验分解为独立的信号和干扰先验概率以及信道转移概率,分别学习各分布的对数梯度,并结合先进的数值迭代方法,实现了精确且快速的干扰消除。

链接: https://arxiv.org/abs/2505.19983
作者: Tong Wu,Zhiyong Chen,Dazhi He,Feng Yang,Meixia Tao,Xiaodong Xu,Wenjun Zhang,Ping Zhang
机构: Cooperative Medianet Innovation Center (CMIC), Shanghai Jiao Tong University (上海交通大学); Department of Electronic Engineering, Shanghai Jiao Tong University (上海交通大学); State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications (北京邮电大学); Department of Broadband Communication, PengCheng Laboratory (鹏城实验室)
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to IEEE journal

点击查看摘要

Abstract:Diffusion models (DMs) have recently achieved significant success in wireless communications systems due to their denoising capabilities. The broadcast nature of wireless signals makes them susceptible not only to Gaussian noise, but also to unaware interference. This raises the question of whether DMs can effectively mitigate interference in wireless semantic communication systems. In this paper, we model the interference cancellation problem as a maximum a posteriori (MAP) problem over the joint posterior probability of the signal and interference, and theoretically prove that the solution provides excellent estimates for the signal and interference. To solve this problem, we develop an interference cancellation diffusion model (ICDM), which decomposes the joint posterior into independent prior probabilities of the signal and interference, along with the channel transition probablity. The log-gradients of these distributions at each time step are learned separately by DMs and accurately estimated through deriving. ICDM further integrates these gradients with advanced numerical iteration method, achieving accurate and rapid interference cancellation. Extensive experiments demonstrate that ICDM significantly reduces the mean square error (MSE) and enhances perceptual quality compared to schemes without ICDM. For example, on the CelebA dataset under the Rayleigh fading channel with a signal-to-noise ratio (SNR) of 20 dB and signal to interference plus noise ratio (SINR) of 0 dB, ICDM reduces the MSE by 4.54 dB and improves the learned perceptual image patch similarity (LPIPS) by 2.47 dB.
zh

[CV-42] PHI: Bridging Domain Shift in Long-Term Action Quality Assessment via Progressive Hierarchical Instruction

【速读】:该论文旨在解决长期动作质量评估(Long-term Action Quality Assessment, AQA)中因预训练大规模动作识别主干网络与特定AQA任务之间的领域偏移(domain shift)导致的性能受限问题。其关键解决方案是通过识别两种层次的领域偏移——任务级和特征级,并提出渐进式分层指令(Progressive Hierarchical Instruction, PHI)框架。PHI包含两个核心策略:Gap Minimization Flow (GMF) 通过流匹配逐步学习减少初始特征与目标特征之间领域差距的快速路径,以及时间增强注意力模块以捕捉AQA所需的长程依赖关系;List-wise Contrastive Regularization (LCR) 则通过批量对比较全面地学习细粒度线索并缓解领域偏移。

链接: https://arxiv.org/abs/2505.19972
作者: Kanglei Zhou,Hubert P. H. Shum,Frederick W. B. Li,Xingxing Zhang,Xiaohui Liang
机构: Beihang University (北京航空航天大学); Durham University (杜伦大学); Tsinghua University (清华大学); Zhongguancun Laboratory (中关村实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Long-term Action Quality Assessment (AQA) aims to evaluate the quantitative performance of actions in long videos. However, existing methods face challenges due to domain shifts between the pre-trained large-scale action recognition backbones and the specific AQA task, thereby hindering their performance. This arises since fine-tuning resource-intensive backbones on small AQA datasets is impractical. We address this by identifying two levels of domain shift: task-level, regarding differences in task objectives, and feature-level, regarding differences in important features. For feature-level shifts, which are more detrimental, we propose Progressive Hierarchical Instruction (PHI) with two strategies. First, Gap Minimization Flow (GMF) leverages flow matching to progressively learn a fast flow path that reduces the domain gap between initial and desired features across shallow to deep layers. Additionally, a temporally-enhanced attention module captures long-range dependencies essential for AQA. Second, List-wise Contrastive Regularization (LCR) facilitates coarse-to-fine alignment by comprehensively comparing batch pairs to learn fine-grained cues while mitigating domain shift. Integrating these modules, PHI offers an effective solution. Experiments demonstrate that PHI achieves state-of-the-art performance on three representative long-term AQA datasets, proving its superiority in addressing the domain shift for long-term AQA.
zh

[CV-43] UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space

【速读】:该论文旨在解决视频超分辨率(Video Super-Resolution, VSR)中由于扩散模型固有的随机性和缺乏时间建模能力而导致的生成质量与时间一致性不足的问题。其解决方案的关键在于提出一种名为UltraVSR的框架,该框架通过一个高效的单步扩散空间实现超现实且时间一致的视频超分辨率。核心创新包括Degradation-aware Restoration Schedule(DRS),用于将迭代去噪过程转换为单步重建,从而消除扩散噪声的随机性并加速推理;以及Recurrent Temporal Shift(RTS)模块,通过时序特征传播与对齐提升时间一致性,同时避免依赖显式的时间层结构。

链接: https://arxiv.org/abs/2505.19958
作者: Yong Liu,Jinshan Pan,Yinchuan Li,Qingji Dong,Chao Zhu,Yu Guo,Fei Wang
机构: Xi’an Jiaotong University (西安交通大学); Nanjing University of Science and Technology (南京理工大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review, 10 pages, 7 figures

点击查看摘要

Abstract:Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporal-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Restoration Schedule (DRS), which estimates a degradation factor from the low-resolution input and transforms iterative denoising process into a single-step reconstruction from from low-resolution to high-resolution videos. This design eliminates randomness from diffusion noise and significantly speeds up inference. To ensure temporal consistency, we propose a lightweight yet effective Recurrent Temporal Shift (RTS) module, composed of an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, these two units collaboratively facilitate effective feature propagation, fusion, and alignment across neighboring frames, without relying on explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporal coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step.
zh

[CV-44] Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

【速读】:该论文旨在解决零样本组合图像检索(Zero-Shot Composed Image Retrieval, ZS-CIR)中的问题,即在没有标注训练数据的情况下,根据包含参考图像和修改文本的组合查询检索目标图像。现有方法依赖于大型语言模型生成合成目标文本作为查询与目标图像之间的中间锚点,但这种依赖导致了误差传播,影响了检索性能。该论文提出的解决方案关键在于引入多模态推理代理(Multimodal Reasoning Agent, MRA),通过直接构建包含参考图像、修改文本和目标图像的三元组,消除对文本中介的依赖,从而直接学习组合查询与候选图像之间的关系。

链接: https://arxiv.org/abs/2505.19952
作者: Rong-Cheng Tu,Wenhao Sun,Hanzhe You,Yingjie Wang,Jiaxing Huang,Li Shen,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学); University of Science and Technology of China (中国科学技术大学); Sun Yat-sen University Shenzhen Campus (中山大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text-without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation, as inaccuracies in query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets, reference image, modification text, target image, using only unlabeled image data. By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly. Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of our approach. On the FashionIQ dataset, our method improves Average R@10 by at least 7.5% over existing baselines; on CIRR, it boosts R@1 by 9.6%; and on CIRCO, it increases mAP@5 by 9.5%.
zh

[CV-45] SaSi: A Self-augmented and Self-interpreted Deep Learning Approach for Few-shot Cryo-ET Particle Detection

【速读】:该论文旨在解决在低温电子断层扫描(cryo-ET)中,由于信号噪声比低和缺失楔形伪影导致的三维粒子定位难题,尤其是在标记数据稀缺的情况下。其解决方案的关键在于提出了一种新颖的自增强与自解释(SaSi)深度学习方法,通过自增强技术提升数据利用率,并引入自解释分割策略以减少对标注数据的依赖,从而提高模型的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2505.19948
作者: Gokul Adethya,Bhanu Pratyush Mantha,Tianyang Wang,Xingjian Li,Min Xu
机构: National Institute of Technology, Tiruchirappalli (印度理工学院特里奇拉帕利分校); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cryo-electron tomography (cryo-ET) has emerged as a powerful technique for imaging macromolecular complexes in their near-native states. However, the localization of 3D particles in cellular environments still presents a significant challenge due to low signal-to-noise ratios and missing wedge artifacts. Deep learning approaches have shown great potential, but they need huge amounts of data, which can be a challenge in cryo-ET scenarios where labeled data is often scarce. In this paper, we propose a novel Self-augmented and Self-interpreted (SaSi) deep learning approach towards few-shot particle detection in 3D cryo-ET images. Our method builds upon self-augmentation techniques to further boost data utilization and introduces a self-interpreted segmentation strategy for alleviating dependency on labeled data, hence improving generalization and robustness. As demonstrated by experiments conducted on both simulated and real-world cryo-ET datasets, the SaSi approach significantly outperforms existing state-of-the-art methods for particle localization. This research increases understanding of how to detect particles with very few labels in cryo-ET and thus sets a new benchmark for few-shot learning in structural biology.
zh

[CV-46] Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning

【速读】:该论文旨在解决音频-视觉零样本学习(Audio-visual Zero-Shot Learning, ZSL)中背景场景偏差和运动细节不足的问题。其解决方案的关键在于提出一种新型的双流多时间尺度运动解耦脉冲变压器(MDST++),通过解耦上下文语义信息与稀疏动态运动信息,并引入循环联合学习单元以提取上下文语义信息并捕捉跨模态的联合知识,从而更准确地捕捉运动信息并减少背景场景偏差。此外,通过将RGB图像转换为事件流以及动态调整神经元阈值,进一步提升了脉冲神经网络在提取时序和运动线索方面的鲁棒性。

链接: https://arxiv.org/abs/2505.19938
作者: Wenrui Li,Penghong Wang,Xingtao Wang,Wangmeng Zuo,Xiaopeng Fan,Yonghong Tian
机构: Harbin Institute of Technology(哈尔滨工业大学); Harbin Institute of Technology Suzhou Research Institute(哈尔滨工业大学苏州研究院); School of AI for Science, the Shenzhen Graduate School, Peking University(北京大学深圳研究生院人工智能科学学院); Peng Cheng Laboratory(鹏城实验室); School of Computer Science, Peking University(北京大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TCSVT

点击查看摘要

Abstract:Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from unseen classes during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information and sparse dynamic motion information. The recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across various modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of SNNs in extracting temporal and motion cues, we dynamically adjust the threshold of Leaky Integrate-and-Fire neurons based on global motion and contextual semantic information. Our experiments validate the effectiveness of MDST++, demonstrating their consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information significantly improves HM and ZSL accuracy by 26.2% and 39.9%.
zh

[CV-47] CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge

【速读】:该论文试图解决当前视频活动识别模型计算需求高、难以在消费级和边缘设备上高效运行的问题,特别是在智能家庭和智能医疗应用中对效率和隐私的严格要求。其解决方案的关键在于引入一种结合卷积层与线性复杂度注意力机制的深度学习架构,并提出一种新的量化机制,以在保持模型学习能力和泛化能力的同时降低计算成本。

链接: https://arxiv.org/abs/2505.19928
作者: Gabriele Lagani,Fabrizio Falchi,Claudio Gennaro,Giuseppe Amato
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we introduce a deep learning solution for video activity recognition that leverages an innovative combination of convolutional layers with a linear-complexity attention mechanism. Moreover, we introduce a novel quantization mechanism to further improve the efficiency of our model during both training and inference. Our model maintains a reduced computational cost, while preserving robust learning and generalization capabilities. Our approach addresses the issues related to the high computing requirements of current models, with the goal of achieving competitive accuracy on consumer and edge devices, enabling smart home and smart healthcare applications where efficiency and privacy issues are of concern. We experimentally validate our model on different established and publicly available video activity recognition benchmarks, improving accuracy over alternative models at a competitive computing cost.
zh

[CV-48] A Responsible Face Recognition Approach for Small and Mid-Scale Systems Through Personalized Neural Networks

【速读】:该论文试图解决传统人脸识别系统中基于向量的模板(template)在可解释性、公平性和隐私方面的不足。其解决方案的关键在于提出一种新型的模型-模板(MOTE)方法,将传统的向量表示替换为小型个性化神经网络,从而在每个个体注册时创建专用的二分类器,仅需一个参考样本即可进行训练,并通过合成平衡样本调整个体层面的公平性,从而提升系统的公平性和隐私保护能力。

链接: https://arxiv.org/abs/2505.19920
作者: Sebastian Groß,Stefan Heindorf,Philipp Terhörst
机构: Paderborn University (帕德博恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional face recognition systems rely on extracting fixed face representations, known as templates, to store and verify identities. These representations are typically generated by neural networks that often lack explainability and raise concerns regarding fairness and privacy. In this work, we propose a novel model-template (MOTE) approach that replaces vector-based face templates with small personalized neural networks. This design enables more responsible face recognition for small and medium-scale systems. During enrollment, MOTE creates a dedicated binary classifier for each identity, trained to determine whether an input face matches the enrolled identity. Each classifier is trained using only a single reference sample, along with synthetically balanced samples to allow adjusting fairness at the level of a single individual during enrollment. Extensive experiments across multiple datasets and recognition systems demonstrate substantial improvements in fairness and particularly in privacy. Although the method increases inference time and storage requirements, it presents a strong solution for small- and mid-scale applications where fairness and privacy are critical.
zh

[CV-49] Weather-Magician: Reconstruction and Rendering Framework for 4D Weather Synthesis In Real Time

【速读】:该论文旨在解决传统工业方法在城市数字孪生、VR/AR/游戏场景设计或合成电影制作中,因手动建模和渲染导致的高人力成本、硬件需求及复杂真实场景复现质量不佳的问题。其解决方案的关键在于提出一种基于高斯泼溅(Gaussian Splatting)的框架,能够利用真实场景数据进行重建,并通过高斯建模与渲染技术有效模拟多种常见的天气效果,支持连续动态天气变化和细节控制,同时具备较低的硬件要求和实时渲染性能。

链接: https://arxiv.org/abs/2505.19919
作者: Chen Sang,Yeqiang Qian,Jiale Zhang,Chunxiang Wang,Ming Yang
机构: Shanghai Jiao Tong University (上海交通大学); SJTU Paris Elite Institute of Technology (上海交通大学巴黎卓越工程师学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project homepage: this https URL

点击查看摘要

Abstract:For tasks such as urban digital twins, VR/AR/game scene design, or creating synthetic films, the traditional industrial approach often involves manually modeling scenes and using various rendering engines to complete the rendering process. This approach typically requires high labor costs and hardware demands, and can result in poor quality when replicating complex real-world scenes. A more efficient approach is to use data from captured real-world scenes, then apply reconstruction and rendering algorithms to quickly recreate the authentic scene. However, current algorithms are unable to effectively reconstruct and render real-world weather effects. To address this, we propose a framework based on gaussian splatting, that can reconstruct real scenes and render them under synthesized 4D weather effects. Our work can simulate various common weather effects by applying Gaussians modeling and rendering techniques. It supports continuous dynamic weather changes and can easily control the details of the effects. Additionally, our work has low hardware requirements and achieves real-time rendering performance. The result demos can be accessed on our project homepage: this http URL
zh

[CV-50] Attention! You Vision Language Model Could Be Maliciously Manipulated

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, VLMs)在面对对抗样本时表现出的显著脆弱性问题,特别是图像类对抗样本对模型输出的精确操控能力。其解决方案的关键在于提出一种名为视觉-语言模型操纵攻击(Vision-language model Manipulation Attack, VMA)的新方法,该方法结合了一阶和二阶动量优化技术与可微分变换机制,以有效优化对抗扰动,从而实现对模型输出的精准控制。

链接: https://arxiv.org/abs/2505.19911
作者: Xiaosen Wang,Shaokang Wang,Zhijin Ge,Yuyang Luo,Shudong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) have achieved remarkable success in understanding complex real-world scenarios and supporting data-driven decision-making processes. However, VLMs exhibit significant vulnerability against adversarial examples, either text or image, which can lead to various adversarial outcomes, e.g., jailbreaking, hijacking, and hallucination, etc. In this work, we empirically and theoretically demonstrate that VLMs are particularly susceptible to image-based adversarial examples, where imperceptible perturbations can precisely manipulate each output token. To this end, we propose a novel attack called Vision-language model Manipulation Attack (VMA), which integrates first-order and second-order momentum optimization techniques with a differentiable transformation mechanism to effectively optimize the adversarial perturbation. Notably, VMA can be a double-edged sword: it can be leveraged to implement various attacks, such as jailbreaking, hijacking, privacy breaches, Denial-of-Service, and the generation of sponge examples, etc, while simultaneously enabling the injection of watermarks for copyright protection. Extensive empirical evaluations substantiate the efficacy and generalizability of VMA across diverse scenarios and datasets.
zh

[CV-51] Dynamic-I2V: Exploring Image-to-Video Generaion Models via Multimodal LLM

【速读】:该论文旨在解决图像到视频(I2V)生成中在复杂场景下运动控制性和时间连贯性不足的问题,以及现有I2V基准测试对低动态视频的显著偏好所导致的评估偏差。其解决方案的关键在于提出Dynamic-I2V框架,该框架通过集成多模态大语言模型(MLLMs)联合编码视觉和文本条件,与扩散变换器(DiT)架构相结合,从而提升视频生成的运动可控性和时间一致性,并支持多样化的条件输入。此外,为弥补评估缺陷,作者还提出了DIVE基准,用于更全面地衡量I2V生成的动态质量。

链接: https://arxiv.org/abs/2505.19901
作者: Peng Liu,Xiaoming Ren,Fengkai Liu,Qingsong Xie,Quanlong Zheng,Yanhao Zhang,Haonan Lu,Yujiu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias towards favoring low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE - a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. In conclusion, extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.
zh

[CV-52] Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement

【速读】:该论文旨在解决水下图像增强中的复杂退化问题,如光吸收、散射、颜色偏移和伪影,这些问题严重影响了水下环境中的目标检测、识别和场景理解。现有方法,尤其是基于扩散的模型,由于缺乏真实的水下参考数据,通常依赖合成配对数据集,导致偏差并限制了泛化能力。此外,微调这些模型可能会破坏已学习的先验知识,从而因领域偏移导致不现实的增强效果。该论文提出的解决方案关键在于UDAN-CLIP框架,其在合成水下数据集上预训练,并结合了定制分类器、空间注意力模块和新颖的CLIP-Diffusion损失。该分类器保留了自然空气中先验知识并语义引导扩散过程,空间注意力模块专注于纠正局部退化,如雾霾和低对比度,而CLIP-Diffusion损失则增强了视觉-文本对齐并保持增强过程中的语义一致性。

链接: https://arxiv.org/abs/2505.19895
作者: Afrah Shaahid,Muzammil Behzad
机构: King Fahd University of Petroleum and Minerals (国王法赫德石油与矿产大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater images are often affected by complex degradations such as light absorption, scattering, color casts, and artifacts, making enhancement critical for effective object detection, recognition, and scene understanding in aquatic environments. Existing methods, especially diffusion-based approaches, typically rely on synthetic paired datasets due to the scarcity of real underwater references, introducing bias and limiting generalization. Furthermore, fine-tuning these models can degrade learned priors, resulting in unrealistic enhancements due to domain shifts. To address these challenges, we propose UDAN-CLIP, an image-to-image diffusion framework pre-trained on synthetic underwater datasets and enhanced with a customized classifier based on vision-language model, a spatial attention module, and a novel CLIP-Diffusion loss. The classifier preserves natural in-air priors and semantically guides the diffusion process, while the spatial attention module focuses on correcting localized degradations such as haze and low contrast. The proposed CLIP-Diffusion loss further strengthens visual-textual alignment and helps maintain semantic consistency during enhancement. The proposed contributions empower our UDAN-CLIP model to perform more effective underwater image enhancement, producing results that are not only visually compelling but also more realistic and detail-preserving. These improvements are consistently validated through both quantitative metrics and qualitative visual comparisons, demonstrating the model’s ability to correct distortions and restore natural appearance in challenging underwater conditions.
zh

[CV-53] OmniFall: A Unified Staged-to-Wild Benchmark for Human Fall Detection

【速读】:该论文旨在解决现有基于视频的跌倒检测研究中因数据集规模小、场景受限且存在显著领域偏差(如背景、光照和摄像头设置)而导致的真实世界性能不可预测的问题。其解决方案的关键在于引入OmniFall数据集,该数据集统一了八个公开的跌倒检测数据集(约14小时的记录,约42小时的多视角数据,101名受试者,29个摄像头视角),采用一致的十类分类体系并制定标准化评估协议,从而提供完整的视频分割标签,并实现此前因标注方案不兼容而无法进行的跨数据集公平比较。此外,为进行真实环境评估,研究者还整理了OOPS-Fall数据集,并建立了从受控环境到非受控环境的泛化评估协议。实验表明,使用冻结的预训练主干网络(如I3D或VideoMAE)时,在分布内与真实场景之间存在显著性能差距,突显了构建鲁棒跌倒检测系统的关键挑战。

链接: https://arxiv.org/abs/2505.19889
作者: David Schneider,Zdravko Marinov,Rafael Baur,Zeyun Zhong,Rodi Düger,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current video-based fall detection research mostly relies on small, staged datasets with significant domain biases concerning background, lighting, and camera setup resulting in unknown real-world performance. We introduce OmniFall, unifying eight public fall detection datasets (roughly 14 h of recordings, roughly 42 h of multiview data, 101 subjects, 29 camera views) under a consistent ten-class taxonomy with standardized evaluation protocols. Our benchmark provides complete video segmentation labels and enables fair cross-dataset comparison previously impossible with incompatible annotation schemes. For real-world evaluation we curate OOPS-Fall from genuine accident videos and establish a staged-to-wild protocol measuring generalization from controlled to uncontrolled environments. Experiments with frozen pre-trained backbones such as I3D or VideoMAE reveal significant performance gaps between in-distribution and in-the-wild scenarios, highlighting critical challenges in developing robust fall detection systems. OmniFall Dataset at this https URL , Code at this https URL
zh

[CV-54] ErpGS: Equirectangular Image Rendering enhanced with 3D Gaussian Regularization ICIP2025

【速读】:该论文试图解决基于360度相机获取的等距投影图像(equirectangular images)在三维重建和新视角合成(Novel View Synthesis, NVS)中因投影模型导致的大畸变问题,该问题会使得基于3DGS(3D Gaussian Splatting)的方法生成过大的3D高斯分布,从而影响渲染精度。解决方案的关键在于提出ErpGS,这是一种基于3DGS的全方位高斯方法,通过引入几何正则化、尺度正则化、畸变感知权重以及掩码机制来抑制障碍物对渲染的影响,从而提升新视角合成的准确性。

链接: https://arxiv.org/abs/2505.19883
作者: Shintaro Ito,Natsuki Takama,Koichi Ito,Hwann-Tzong Chen,Takafumi Aoki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP2025

点击查看摘要

Abstract:The use of multi-view images acquired by a 360-degree camera can reconstruct a 3D space with a wide area. There are 3D reconstruction methods from equirectangular images based on NeRF and 3DGS, as well as Novel View Synthesis (NVS) methods. On the other hand, it is necessary to overcome the large distortion caused by the projection model of a 360-degree camera when equirectangular images are used. In 3DGS-based methods, the large distortion of the 360-degree camera model generates extremely large 3D Gaussians, resulting in poor rendering accuracy. We propose ErpGS, which is Omnidirectional GS based on 3DGS to realize NVS addressing the problems. ErpGS introduce some rendering accuracy improvement techniques: geometric regularization, scale regularization, and distortion-aware weights and a mask to suppress the effects of obstacles in equirectangular images. Through experiments on public datasets, we demonstrate that ErpGS can render novel view images more accurately than conventional methods.
zh

[CV-55] Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought

【速读】:该论文试图解决现有基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的视频异常检测(Video Anomaly Detection, VAD)方法在异常描述上仅限于浅层表达而缺乏深度推理的问题。解决方案的关键在于提出了一种新的任务——视频异常推理(Video Anomaly Reasoning, VAR),并设计了Vad-R1框架,其中核心是感知到认知的思维链(Perception-to-Cognition Chain-of-Thought, P2C-CoT),该机制模拟人类识别异常的过程,引导MLLM逐步推理异常,并结合改进的强化学习算法AVA-GRPO,通过有限标注的自验证机制显式激励模型的异常推理能力。

链接: https://arxiv.org/abs/2505.19877
作者: Chao Huang,Benfeng Wang,Jie Wen,Chengliang Liu,Wei Wang,Li Shen,Xiaochun Cao
机构: Shenzhen Campus of Sun Yat-sen University (深圳校区中山大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Recent advancements in reasoning capability of Multimodal Large Language Models (MLLMs) demonstrate its effectiveness in tackling complex visual tasks. However, existing MLLM-based Video Anomaly Detection (VAD) methods remain limited to shallow anomaly descriptions without deep reasoning. In this paper, we propose a new task named Video Anomaly Reasoning (VAR), which aims to enable deep analysis and understanding of anomalies in the video by requiring MLLMs to think explicitly before answering. To this end, we propose Vad-R1, an end-to-end MLLM-based framework for VAR. Specifically, we design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies, guiding the MLLM to reason anomaly step-by-step. Based on the structured P2C-CoT, we construct Vad-Reasoning, a dedicated dataset for VAR. Furthermore, we propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs through a self-verification mechanism with limited annotations. Experimental results demonstrate that Vad-R1 achieves superior performance, outperforming both open-source and proprietary models on VAD and VAR tasks. Codes and datasets will be released at this https URL.
zh

[CV-56] StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

【速读】:该论文旨在解决风格对齐的文本到图像生成问题,该任务在数据获取方面面临显著挑战,因为需要大量包含特定风格的文本-图像-图像三元组数据,而这类数据的获取比传统文本到图像数据更加困难。解决方案的关键在于提出StyleAR方法,该方法结合了专门设计的数据整理策略与自回归(AR)模型,通过参考风格图像和提示合成目标风格化数据,并仅将目标风格化图像作为图像模态用于构建高质量的二元数据。此外,引入了带有感知器重采样的CLIP图像编码器以生成与多模态标记对齐的风格标记,并采用风格增强的标记技术防止内容泄露,同时混合大规模文本-图像数据集中的原始图像与风格化图像以提升模型提取丰富风格特征的能力。

链接: https://arxiv.org/abs/2505.19874
作者: Yi Wu,Lingting Zhu,Shengju Qian,Lei Liu,Wandi Qiao,Lequan Yu,Bin Li
机构: University of Science and Technology of China (中国科学技术大学); The University of Hong Kong (香港大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:In the current research landscape, multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. In analogy to instruction-following tuning for image editing of AR models, style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. However, acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed AR models to effectively utilize text-to-image binary data for style-aligned text-to-image generation. Our method synthesizes target stylized data using a reference style image and prompt, but only incorporates the target stylized image as the image modality to create high-quality binary data. To facilitate binary data training, we introduce a CLIP image encoder with a perceiver resampler that translates the image input into style tokens aligned with multimodal tokens in AR models and implement a style-enhanced token technique to prevent content leakage which is a common issue in previous work. Furthermore, we mix raw images drawn from large-scale text-image datasets with stylized images to enhance StyleAR’s ability to extract richer stylistic features and ensure style consistency. Extensive qualitative and quantitative experiments demonstrate our superior performance.
zh

[CV-57] Deep Spectral Prior

【速读】:该论文试图解决传统深度图像先验(Deep Image Prior, DIP)在图像重建中因依赖像素级损失和早停策略而带来的过拟合问题,以及缺乏对高频噪声的有效抑制。解决方案的关键在于提出深度频谱先验(Deep Spectral Prior, DSP),将图像重建重新定义为频域对齐问题,通过直接匹配网络输出与观测数据的傅里叶系数,引入显式的频谱一致性归纳偏置,从而实现隐式的频谱正则化,有效抑制高频噪声并消除对早停的依赖。

链接: https://arxiv.org/abs/2505.19873
作者: Yanqi Cheng,Tieyong Zeng,Pietro Lio,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
机构: University of Cambridge (剑桥大学); The Chinese University of Hong Kong (香港中文大学); YMSC, Tsinghua University (清华大学数学科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:We introduce Deep Spectral Prior (DSP), a new formulation of Deep Image Prior (DIP) that redefines image reconstruction as a frequency-domain alignment problem. Unlike traditional DIP, which relies on pixel-wise loss and early stopping to mitigate overfitting, DSP directly matches Fourier coefficients between the network output and observed measurements. This shift introduces an explicit inductive bias towards spectral coherence, aligning with the known frequency structure of images and the spectral bias of convolutional neural networks. We provide a rigorous theoretical framework demonstrating that DSP acts as an implicit spectral regulariser, suppressing high-frequency noise by design and eliminating the need for early stopping. Our analysis spans four core dimensions establishing smooth convergence dynamics, local stability, and favourable bias-variance tradeoffs. We further show that DSP naturally projects reconstructions onto a frequency-consistent manifold, enhancing interpretability and robustness. These theoretical guarantees are supported by empirical results across denoising, inpainting, and super-resolution tasks, where DSP consistently outperforms classical DIP and other unsupervised baselines.
zh

[CV-58] Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling

【速读】:该论文试图解决如何将训练无关技术(training-free techniques)有效整合到Score Distillation Sampling (SDS) 中,以提升文本到3D生成的质量问题。其关键在于通过动态调整这些技术的尺度参数,根据时间步或优化迭代步骤进行策略性缩放,从而在纹理细节与表面平滑度之间取得平衡,同时保持输出尺寸并减少几何缺陷的发生。

链接: https://arxiv.org/abs/2505.19868
作者: Junhong Lee,Seungwook Kim,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent studies show that simple training-free techniques can dramatically improve the quality of text-to-2D generation outputs, e.g. Classifier-Free Guidance (CFG) or FreeU. However, these training-free techniques have been underexplored in the lens of Score Distillation Sampling (SDS), which is a popular and effective technique to leverage the power of pretrained text-to-2D diffusion models for various tasks. In this paper, we aim to shed light on the effect such training-free techniques have on SDS, via a particular application of text-to-3D generation via 2D lifting. We present our findings, which show that varying the scales of CFG presents a trade-off between object size and surface smoothness, while varying the scales of FreeU presents a trade-off between texture details and geometric errors. Based on these findings, we provide insights into how we can effectively harness training-free techniques for SDS, via a strategic scaling of such techniques in a dynamic manner with respect to the timestep or optimization iteration step. We show that using our proposed scheme strikes a favorable balance between texture details and surface smoothness in text-to-3D generations, while preserving the size of the output and mitigating the occurrence of geometric defects.
zh

[CV-59] FruitNeRF: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields

【速读】:该论文旨在解决果园中未结构化图像下水果计数的问题,尤其是针对不同种类水果的适应性不足和实际应用困难。其解决方案的关键在于提出一种形状无关的多水果计数框架,通过结合RGB图像、语义数据与视觉基础模型预测的实例掩码,将每种水果的身份编码为实例嵌入,构建神经实例场,并通过体素采样提取包含实例特征的点云,实现无需依赖水果种类的聚类计数。

链接: https://arxiv.org/abs/2505.19863
作者: Lukas Meyer,Andrei-Timotei Ardelean,Tim Weyrich,Marc Stamminger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: for project website, see this https URL

点击查看摘要

Abstract:We introduce FruitNeRF++, a novel fruit-counting approach that combines contrastive learning with neural radiance fields to count fruits from unstructured input photographs of orchards. Our work is based on FruitNeRF, which employs a neural semantic field combined with a fruit-specific clustering approach. The requirement for adaptation for each fruit type limits the applicability of the method, and makes it difficult to use in practice. To lift this limitation, we design a shape-agnostic multi-fruit counting framework, that complements the RGB and semantic data with instance masks predicted by a vision foundation model. The masks are used to encode the identity of each fruit as instance embeddings into a neural instance field. By volumetrically sampling the neural fields, we extract a point cloud embedded with the instance features, which can be clustered in a fruit-agnostic manner to obtain the fruit count. We evaluate our approach using a synthetic dataset containing apples, plums, lemons, pears, peaches, and mangoes, as well as a real-world benchmark apple dataset. Our results demonstrate that FruitNeRF++ is easier to control and compares favorably to other state-of-the-art methods.
zh

[CV-60] A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

【速读】:该论文试图解决视频图像融合中因忽略时间相关性而导致的闪烁和时间不一致问题(flickering and temporal inconsistency)。其解决方案的关键在于提出统一视频融合(Unified Video Fusion, UniVF)框架,该框架通过多帧学习和基于光流的特征变形技术,实现信息丰富且时间一致的视频融合。

链接: https://arxiv.org/abs/2505.19858
作者: Zixiang Zhao,Haowen Bai,Bingxin Ke,Yukun Cui,Lilun Deng,Yulun Zhang,Kai Zhang,Konrad Schindler
机构: ETH Zürich(ETH Zurich); Xi’an Jiaotong University(西安交通大学); Shanghai Jiao Tong University(上海交通大学); Nanjing University(南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel framework for temporally coherent video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: this https URL.
zh

[CV-61] Sparse2DGS: Sparse-View Surface Reconstruction using 2D Gaussian Splatting with Dense Point Cloud ICIP2025

【速读】:该论文旨在解决在仅有少量输入图像(如三张图像)的情况下,传统高斯泼溅(Gaussian Splatting, GS)方法在三维重建中的精度显著下降的问题。其关键解决方案是提出一种新的三维重建方法——Sparse2DGS,该方法通过结合DUSt3R和COLMAP MVS生成高精度且密集的三维点云,从而为二维高斯分布提供更优质的初始化,进而提升在有限图像数量下的重建效果。

链接: https://arxiv.org/abs/2505.19854
作者: Natsuki Takama,Shintaro Ito,Koichi Ito,Hwann-Tzong Chen,Takafumi Aoki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP 2025

点击查看摘要

Abstract:Gaussian Splatting (GS) has gained attention as a fast and effective method for novel view synthesis. It has also been applied to 3D reconstruction using multi-view images and can achieve fast and accurate 3D reconstruction. However, GS assumes that the input contains a large number of multi-view images, and therefore, the reconstruction accuracy significantly decreases when only a limited number of input images are available. One of the main reasons is the insufficient number of 3D points in the sparse point cloud obtained through Structure from Motion (SfM), which results in a poor initialization for optimizing the Gaussian primitives. We propose a new 3D reconstruction method, called Sparse2DGS, to enhance 2DGS in reconstructing objects using only three images. Sparse2DGS employs DUSt3R, a fundamental model for stereo images, along with COLMAP MVS to generate highly accurate and dense 3D point clouds, which are then used to initialize 2D Gaussians. Through experiments on the DTU dataset, we show that Sparse2DGS can accurately reconstruct the 3D shapes of objects using just three images.
zh

[CV-62] wo Causally Related Needles in a Video Haystack

【速读】:该论文试图解决视频语言模型(Video-Language Models, VLMs)在长视频理解能力评估中的不足,特别是现有基准测试未能充分评估模型从长视频中提取两个独立位置信息并联合理解的能力,以及建模人类行为因果关系的能力。解决方案的关键是提出一个名为Causal2Needles的长上下文视频理解基准,其中引入了“2-needle”问题,要求模型从长视频中的人类行为事件及其相关叙述文本的因果和结果部分提取信息,并通过两种互补的问题格式防止文本偏差,从而更全面地评估模型的视觉定位能力。

链接: https://arxiv.org/abs/2505.19853
作者: Miaoyu Li,Qin Chao,Boyang Li
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating the video understanding capabilities of Video-Language Models (VLMs) remains a significant challenge. We propose a long-context video understanding benchmark, Causal2Needles, that assesses two crucial abilities insufficiently evaluated by existing benchmarks: (1) the ability to extract information from two separate locations in a long video and understand them jointly, and (2) the ability to model the world in terms of cause and effect in human behaviors. Specifically, Causal2Needles introduces 2-needle questions, which require extracting information from both the cause and effect human-behavior events in a long video and the associated narration text. To prevent textual bias, these questions comprise two complementary formats: one asking to identify the video clip containing the answer, and one asking for the textual description of an unrelated visual detail from that video clip. Our experiments reveal that models excelling in pre-existing benchmarks struggle with 2-needle visual grounding, and the model performance is negatively correlated with the distance between the two needles. These findings highlight critical limitations in current VLMs.
zh

[CV-63] Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation ICIP2025

【速读】:该论文旨在解决语义分割任务中因标注数据成本高昂而导致的训练难题,其解决方案的关键在于利用零样本注释生成伪标签,并通过统一双流扰动方法(UniMatch)提升伪标签的准确性,从而作为增强标签用于语义分割模型的训练。

链接: https://arxiv.org/abs/2505.19846
作者: Nagito Saito,Shintaro Ito,Koichi Ito,Takafumi Aoki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP 2025

点击查看摘要

Abstract:Semantic segmentation is a fundamental task in medical image analysis and autonomous driving and has a problem with the high cost of annotating the labels required in training. To address this problem, semantic segmentation methods based on semi-supervised learning with a small number of labeled data have been proposed. For example, one approach is to train a semantic segmentation model using images with annotated labels and pseudo labels. In this approach, the accuracy of the semantic segmentation model depends on the quality of the pseudo labels, and the quality of the pseudo labels depends on the performance of the model to be trained and the amount of data with annotated labels. In this paper, we generate pseudo labels using zero-shot annotation with the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through the experiments using the public datasets: PASCAL and MS COCO.
zh

[CV-64] GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis CVPR2025

【速读】:该论文旨在解决在输入视图数量有限的情况下,通用神经辐射场(Neural Radiance Fields, NeRF)模型在渲染质量上显著下降的问题。其关键解决方案是提出一种基于全局与局部特征融合的神经渲染Transformer——GoLF-NRT,该方法通过引入带有高效稀疏注意力机制的3D Transformer来捕捉全局场景上下文,并结合沿极线提取的局部几何特征,从而实现从1到3个输入视图中高质量重建场景。此外,还提出了基于注意力权重和核回归的自适应采样策略,以提升基于Transformer的神经渲染精度。

链接: https://arxiv.org/abs/2505.19813
作者: You Wang,Li Fang,Hao Zhu,Fei Hu,Long Ye,Zhan Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have transformed novel view synthesis by modeling scene-specific volumetric representations directly from images. While generalizable NeRF models can generate novel views across unknown scenes by learning latent ray representations, their performance heavily depends on a large number of multi-view observations. However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. GoLF-NRT enhances generalizable neural rendering from few input views by leveraging a 3D transformer with efficient sparse attention to capture global scene context. In parallel, it integrates local geometric features extracted along the epipolar line, enabling high-quality scene reconstruction from as few as 1 to 3 input views. Furthermore, we introduce an adaptive sampling strategy based on attention weights and kernel regression, improving the accuracy of transformer-based neural rendering. Extensive experiments on public datasets show that GoLF-NRT achieves state-of-the-art performance across varying numbers of input views, highlighting the effectiveness and superiority of our approach. Code is available at this https URL.
zh

[CV-65] Efficient Multi-modal Long Context Learning for Training-free Adaptation ICML2025

【速读】:该论文旨在解决传统多模态大语言模型(Multi-Modal Large Language Models, MLLMs)在适应新任务时依赖微调所带来的效率低下、灵活性不足及可扩展性差的问题。其解决方案的关键在于提出一种无需训练的新型方法——高效多模态长上下文学习(Efficient Multi-Modal Long Context Learning, EMLoC),通过将示范示例直接嵌入模型输入,实现任务适配。EMLoC的核心创新在于引入了分块压缩机制与层间自适应剪枝技术,以降低长上下文多模态输入的计算和内存开销,同时保持模型性能,从而实现了压缩与剪枝技术在多模态长上下文学习中的无缝集成。

链接: https://arxiv.org/abs/2505.19812
作者: Zehong Ma,Shiliang Zhang,Longhui Wei,Qi Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML2025

点击查看摘要

Abstract:Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. Codes are publicly available at this https URL.
zh

[CV-66] ranslation-Equivariance of Normalization Layers and Aliasing in Convolutional Neural Networks

【速读】:该论文试图解决如何设计在连续平移下严格等变的卷积神经网络架构的问题,特别是关注归一化层(normalization layers)在离散位移和连续平移下的等变性。解决方案的关键在于提出一种新的理论框架,用于分析归一化层在不同维度上的等变性条件,并确定其必要且充分的条件。通过使用ResNet-18和ImageNet的真实特征图进行实证测试,验证了这些理论结果的有效性。

链接: https://arxiv.org/abs/2505.19805
作者: Jérémy Scanvic,Quentin Barthélemy,Julián Tachella
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The design of convolutional neural architectures that are exactly equivariant to continuous translations is an active field of research. It promises to benefit scientific computing, notably by making existing imaging systems more physically accurate. Most efforts focus on the design of downsampling/pooling layers, upsampling layers and activation functions, but little attention is dedicated to normalization layers. In this work, we present a novel theoretical framework for understanding the equivariance of normalization layers to discrete shifts and continuous translations. We also determine necessary and sufficient conditions for normalization layers to be equivariant in terms of the dimensions they operate on. Using real feature maps from ResNet-18 and ImageNet, we test those theoretical results empirically and find that they are consistent with our predictions.
zh

[CV-67] GraphAU-Pain: Graph-based Action Unit Representation for Pain Intensity Estimation

【速读】:该论文旨在解决通过面部表情检测疼痛的问题,特别是在无法用语言沟通的患者中实现有效的监测、辅助诊断和治疗计划。现有基于数据驱动的方法在可解释性和疼痛严重程度量化方面存在局限。其解决方案的关键在于提出GraphAU-Pain框架,该框架利用图结构建模面部动作单元(Action Units, AUs)及其相互关系,通过关系图神经网络提升模型的可解释性与性能,从而更准确地估计疼痛强度。

链接: https://arxiv.org/abs/2505.19802
作者: Zhiyu Wang,Yang Liu,Hatice Gunes
机构: University of Cambridge(剑桥大学); University of Oulu(奥卢大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding pain-related facial behaviors is essential for digital healthcare in terms of effective monitoring, assisted diagnostics, and treatment planning, particularly for patients unable to communicate verbally. Existing data-driven methods of detecting pain from facial expressions are limited due to interpretability and severity quantification. To this end, we propose GraphAU-Pain, leveraging a graph-based framework to model facial Action Units (AUs) and their interrelationships for pain intensity estimation. AUs are represented as graph nodes, with co-occurrence relationships as edges, enabling a more expressive depiction of pain-related facial behaviors. By utilizing a relational graph neural network, our framework offers improved interpretability and significant performance gains. Experiments conducted on the publicly available UNBC dataset demonstrate the effectiveness of the GraphAU-Pain, achieving an F1-score of 66.21% and accuracy of 87.61% in pain intensity estimation.
zh

[CV-68] A Regularization-Guided Equivariant Approach for Image Restoration

【速读】:该论文旨在解决传统等变(equivariant)和不变(invariant)深度学习模型在图像恢复任务中因表示精度有限及严格对称性假设不成立而带来的性能瓶颈问题。其解决方案的关键在于提出一种旋转等变正则化策略(EQ-Reg),通过自监督学习与特征图的空间旋转及循环通道移位,在保持网络表示能力的同时,自适应地施加合适的对称性约束,从而实现非严格等变网络的构建,为图像恢复任务提供灵活且高效的等变性调整机制。

链接: https://arxiv.org/abs/2505.19799
作者: Yulu Bai,Jiahong Fu,Qi Xie,Deyu Meng
机构: Xi’an Jiaotong University (西安交通大学); Macau University of Science and Technology (澳门科技大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Equivariant and invariant deep learning models have been developed to exploit intrinsic symmetries in data, demonstrating significant effectiveness in certain scenarios. However, these methods often suffer from limited representation accuracy and rely on strict symmetry assumptions that may not hold in practice. These limitations pose a significant drawback for image restoration tasks, which demands high accuracy and precise symmetry representation. To address these challenges, we propose a rotation-equivariant regularization strategy that adaptively enforces the appropriate symmetry constraints on the data while preserving the network’s representational accuracy. Specifically, we introduce EQ-Reg, a regularizer designed to enhance rotation equivariance, which innovatively extends the insights of data-augmentation-based and equivariant-based methodologies. This is achieved through self-supervised learning and the spatial rotation and cyclic channel shift of feature maps deduce in the equivariant framework. Our approach firstly enables a non-strictly equivariant network suitable for image restoration, providing a simple and adaptive mechanism for adjusting equivariance based on task. Extensive experiments across three low-level tasks demonstrate the superior accuracy and generalization capability of our method, outperforming state-of-the-art approaches.
zh

[CV-69] he Missing Point in Vision Transformers for Universal Image Segmentation

【速读】:该论文旨在解决图像分割中掩码生成与分类的挑战,特别是在存在模糊边界和类别分布不平衡的情况下,如何实现高精度分类的问题。其解决方案的关键在于提出ViT-P框架,该框架采用两阶段设计,将掩码生成与分类过程解耦,第一阶段通过提案生成器生成与类别无关的掩码提案,第二阶段则利用基于Vision Transformer (ViT) 的点分类模型,通过关注掩码中心点来优化预测结果。此外,ViT-P作为无需预训练的适配器,可无缝集成多种预训练ViT模型,提升密集预测任务的适应性,并通过粗粒度和边界框标注提升分类性能,降低标注成本。

链接: https://arxiv.org/abs/2505.19795
作者: Sajjad Shahabodini,Mobina Mansoori,Farnoush Bayatmakou,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: this https URLthis https URL.
zh

[CV-70] Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction CVPR2025

【速读】:该论文旨在解决高分辨率图像渲染计算成本过高的问题,特别是在需要对所有光线进行密集采样的情况下。其解决方案的关键在于提出了一种深度引导的束采样策略(depth-guided bundle sampling),通过将相邻光线分组并共同采样,生成共享表示以解码组内所有光线,从而提升渲染效率。此外,还引入了自适应采样策略,根据深度置信度动态分配样本,优化复杂区域与平滑区域的采样密度。

链接: https://arxiv.org/abs/2505.19793
作者: Li Fang,Hao Zhu,Longlong Chen,Fei Hu,Long Ye,Zhan Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Recent advancements in generalizable novel view synthesis have achieved impressive quality through interpolation between nearby views. However, rendering high-resolution images remains computationally intensive due to the need for dense sampling of all rays. Recognizing that natural scenes are typically piecewise smooth and sampling all rays is often redundant, we propose a novel depth-guided bundle sampling strategy to accelerate rendering. By grouping adjacent rays into a bundle and sampling them collectively, a shared representation is generated for decoding all rays within the bundle. To further optimize efficiency, our adaptive sampling strategy dynamically allocates samples based on depth confidence, concentrating more samples in complex regions while reducing them in smoother areas. When applied to ENeRF, our method achieves up to a 1.27 dB PSNR improvement and a 47% increase in FPS on the DTU dataset. Extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art rendering quality and up to 2x faster rendering compared to existing generalizable methods. Code is available at this https URL.
zh

[CV-71] SAIL: Self-supervised Albedo Estimation from Real Images with a Latent Diffusion Model

【速读】:该论文旨在解决从真实世界图像中进行固有图像分解(intrinsic image decomposition)的问题,即分离图像中的基底颜色(albedo)和阴影(shading)成分,以实现如虚拟重新照明和场景编辑等下游应用。由于真实世界图像缺乏标注的真值数据,现有方法多依赖合成数据进行监督训练,限制了其在真实场景中的泛化能力;而自监督方法生成的albedo图常包含反射且在不同光照条件下不一致。该论文提出的SAIL方法的关键在于利用潜在扩散模型的先验知识作为albedo估计的替代目标,并在潜在空间中完全形式化固有图像分解,同时引入正则化项约束光照相关与无关成分,从而在仅使用无标签多光照数据的情况下,预测出在不同光照条件下稳定的albedo。

链接: https://arxiv.org/abs/2505.19751
作者: Hala Djeghim,Nathan Piasco,Luis Roldão,Moussab Bennehar,Dzmitry Tsishkou,Céline Loscos,Désiré Sidibé
机构: Noah’s Ark, Huawei Paris Research Center, France (诺亚方舟,华为巴黎研究中心,法国); IBISC, Université Paris-Saclay, Univ Evry, France (IBISC,巴黎-萨克雷大学,埃夫里大学,法国); L Research, France (L 研究,法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Intrinsic image decomposition aims at separating an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains a significant challenging task due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data as supervised setups, limiting their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo maps that contain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach designed to estimate albedo-like representations from single-view real-world images. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for albedo estimation. To extract the albedo, we introduce a novel intrinsic image decomposition fully formulated in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and independent components of our latent image decomposition. SAIL predicts stable albedo under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.
zh

[CV-72] SuperAD: A Training-free Anomaly Classification and Segmentation Method for CVPR 2025 VAND 3.0 Workshop Challenge Track 1: Adapt Detect

【速读】:该论文旨在解决工业场景中真实世界异常检测的挑战,特别是针对具有物理复杂性的异常类型,如透明或反射表面、遮挡和低对比度污染。为应对MVTec AD 2数据集带来的挑战,如复杂的光照条件和大规模差异的真实异常,该研究提出了一种无需训练的异常检测与分割方法——SuperAD,其关键在于利用DINOv2模型的强大表征能力,通过少量正常参考图像构建记忆库,并通过测试图像特征与记忆库之间的最近邻匹配实现异常分割。

链接: https://arxiv.org/abs/2505.19750
作者: Huaiyuan Zhang,Hang Chen,Yu Cheng,Shunyi Wu,Linghao Sun,Linao Han,Zeyu Shi,Lei Qi
机构: School of Computer Science and Engineering, Southeast University, China (计算机科学与工程学院,东南大学,中国); School of Computer Science and Engineering, Nanjing University of Science and Technology, China (计算机科学与工程学院,南京理工大学,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this technical report, we present our solution to the CVPR 2025 Visual Anomaly and Novelty Detection (VAND) 3.0 Workshop Challenge Track 1: Adapt Detect: Robust Anomaly Detection in Real-World Applications. In real-world industrial anomaly detection, it is crucial to accurately identify anomalies with physical complexity, such as transparent or reflective surfaces, occlusions, and low-contrast contaminations. The recently proposed MVTec AD 2 dataset significantly narrows the gap between publicly available benchmarks and anomalies found in real-world industrial environments. To address the challenges posed by this dataset–such as complex and varying lighting conditions and real anomalies with large scale differences–we propose a fully training-free anomaly detection and segmentation method based on feature extraction using the DINOv2 model named SuperAD. Our method carefully selects a small number of normal reference images and constructs a memory bank by leveraging the strong representational power of DINOv2. Anomalies are then segmented by performing nearest neighbor matching between test image features and the memory bank. Our method achieves competitive results on both test sets of the MVTec AD 2 dataset.
zh

[CV-73] Improving Heart Rejection Detection in XPCI Images Using Synthetic Data Augmentation

【速读】:该论文旨在解决心脏移植患者心内膜活检中急性细胞性排斥反应(ACR)的准确识别问题,尤其是高分级排斥反应(3R)病例稀少导致的深度学习模型训练困难问题。其解决方案的关键在于利用StyleGAN生成合成数据以增强有限的真实3R图像数量,通过直方图均衡化标准化图像外观,提升组织表征的一致性,进而结合真实0R样本训练ResNet-18分类器,实现二分类排斥反应检测。

链接: https://arxiv.org/abs/2505.19746
作者: Jakov Samardžija,Donik Vršnak,Sven Lončarić
机构: University of Zagreb Faculty of Electrical Engineering and Computing (萨格勒布大学电气工程与计算学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate identification of acute cellular rejection (ACR) in endomyocardial biopsies is essential for effective management of heart transplant patients. However, the rarity of high-grade rejection cases (3R) presents a significant challenge for training robust deep learning models. This work addresses the class imbalance problem by leveraging synthetic data generation using StyleGAN to augment the limited number of real 3R images. Prior to GAN training, histogram equalization was applied to standardize image appearance and improve the consistency of tissue representation. StyleGAN was trained on available 3R biopsy patches and subsequently used to generate 10,000 realistic synthetic images. These were combined with real 0R samples, that is samples without rejection, in various configurations to train ResNet-18 classifiers for binary rejection classification. Three classifier variants were evaluated: one trained on real 0R and synthetic 3R images, another using both synthetic and additional real samples, and a third trained solely on real data. All models were tested on an independent set of real biopsy images. Results demonstrate that synthetic data improves classification performance, particularly when used in combination with real samples. The highest-performing model, which used both real and synthetic images, achieved strong precision and recall for both classes. These findings underscore the value of hybrid training strategies and highlight the potential of GAN-based data augmentation in biomedical image analysis, especially in domains constrained by limited annotated datasets. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2505.19746 [cs.CV] (or arXiv:2505.19746v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.19746 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-74] HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance

【速读】:该论文旨在解决人体中心图像在传输过程中面临的严重通用退化问题,尤其是人类运动模糊(Human Motion Blur, HMB)与通用噪声共存所带来的图像修复难题。现有研究对此类问题的关注不足,而实际应用中两者常常同时存在。为了解决这一问题,作者设计了一个退化流程,模拟HMB与通用噪声的共存情况,并生成合成退化数据以训练提出的HAODiff模型。该模型的核心创新在于提出了一种三分支双提示引导(Triple-Branch Dual-Prompt Guidance, DPG)机制,利用高质量图像、残差噪声以及HMB分割掩码作为训练目标,生成用于无分类器指导(Classifier-Free Guidance, CFG)的正负提示对,从而提升模型对多种退化的鲁棒性。

链接: https://arxiv.org/abs/2505.19742
作者: Jue Gong,Tingyu Yang,Jingkai Wang,Zheng Chen,Xing Liu,Hong Gu,Yulun Zhang,Xiaokang Yang
机构: Shanghai Jiao Tong University (上海交通大学); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures. The code and model will be available at this https URL

点击查看摘要

Abstract:Human-centered images often suffer from severe generic degradation during transmission and are prone to human motion blur (HMB), making restoration challenging. Existing research lacks sufficient focus on these issues, as both problems often coexist in practice. To address this, we design a degradation pipeline that simulates the coexistence of HMB and generic noise, generating synthetic degraded data to train our proposed HAODiff, a human-aware one-step diffusion. Specifically, we propose a triple-branch dual-prompt guidance (DPG), which leverages high-quality images, residual noise (LQ minus HQ), and HMB segmentation masks as training targets. It produces a positive-negative prompt pair for classifier-free guidance (CFG) in a single diffusion step. The resulting adaptive dual prompts let HAODiff exploit CFG more effectively, boosting robustness against diverse degradations. For fair evaluation, we introduce MPII-Test, a benchmark rich in combined noise and HMB cases. Extensive experiments show that our HAODiff surpasses existing state-of-the-art (SOTA) methods in terms of both quantitative metrics and visual quality on synthetic and real-world datasets, including our introduced MPII-Test. Code is available at: this https URL.
zh

[CV-75] Cross-Sequence Semi-Supervised Learning for Multi-Parametric MRI-Based Visual Pathway Delineation

【速读】:该论文旨在解决多参数磁共振成像(multi-parametric MR imaging)数据中视觉通路(visual pathway, VP)分割的挑战,特别是现有方法难以有效建模不同MRI序列间的复杂交叉关系,以及依赖大量标注数据导致的劳动密集型和耗时问题。其解决方案的关键在于提出一种半监督的多参数特征分解框架,其中核心组件包括相关性约束的特征分解(correlation-constrained feature decomposition, CFD),用于捕捉各MRI序列的独特特征并简化多参数信息融合过程,以及基于一致性的样本增强(consistency-based sample enhancement, CSE)模块,通过从无标签数据中生成并强化有意义的边缘信息来缓解标注数据不足的问题。

链接: https://arxiv.org/abs/2505.19733
作者: Alou Diakite,Cheng Li,Lei Xie,Yuanjing Feng,Ruoyou Wu,Jianzhong He,Hairong Zheng,Shanshan Wang
机构: Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(保罗·劳特布尔生物医学成像研究中心,深圳先进技术研究院,中国科学院); University of Chinese Academy of Sciences(中国科学院大学); Zhejiang University of Technology(浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Accurately delineating the visual pathway (VP) is crucial for understanding the human visual system and diagnosing related disorders. Exploring multi-parametric MR imaging data has been identified as an important way to delineate VP. However, due to the complex cross-sequence relationships, existing methods cannot effectively model the complementary information from different MRI sequences. In addition, these existing methods heavily rely on large training data with labels, which is labor-intensive and time-consuming to obtain. In this work, we propose a novel semi-supervised multi-parametric feature decomposition framework for VP delineation. Specifically, a correlation-constrained feature decomposition (CFD) is designed to handle the complex cross-sequence relationships by capturing the unique characteristics of each MRI sequence and easing the multi-parametric information fusion process. Furthermore, a consistency-based sample enhancement (CSE) module is developed to address the limited labeled data issue, by generating and promoting meaningful edge information from unlabeled data. We validate our framework using two public datasets, and one in-house Multi-Shell Diffusion MRI (MDM) dataset. Experimental results demonstrate the superiority of our approach in terms of delineation performance when compared to seven state-of-the-art approaches.
zh

[CV-76] MLLM -Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval

【速读】:该论文旨在解决现有零样本组合图像检索(Zero-Shot Composed Image Retrieval, ZS-CIR)方法在生成查询表示时无法充分捕捉组合意图或与目标语义对齐的问题,从而限制了检索性能,尤其是在涉及细粒度或复杂视觉变换的情况下。其解决方案的关键在于提出一种基于多模态大语言模型(MLLM)的视觉语言模型微调方法(MLLM-Guided VLM Fine-Tuning with Joint Inference, MVFT-JI),通过利用未标注图像构建两个互补的训练任务——目标文本检索任务和文本到图像检索任务,并联合优化这些任务,使视觉语言模型具备更强的组合检索能力。

链接: https://arxiv.org/abs/2505.19707
作者: Rong-Cheng Tu,Zhao Jin,Jingyi Liao,Xiao Luo,Yingjie Wang,Li Shen,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学); University of California, Los Angeles (加利福尼亚大学洛杉矶分校); Sun Yat-sen University Shenzhen Campus (中山大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Existing Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens, which are concatenated with the modifying text and processed by frozen text encoders in pretrained VLMs or LLMs. While this design leverages the strengths of large pretrained models, it only supervises the adapter to produce encoder-compatible tokens that loosely preserve visual semantics. Crucially, it does not directly optimize the composed query representation to capture the full intent of the composition or to align with the target semantics, thereby limiting retrieval performance, particularly in cases involving fine-grained or complex visual transformations. To address this problem, we propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI), a novel approach that leverages a pretrained multimodal large language model (MLLM) to construct two complementary training tasks using only unlabeled images: target text retrieval taskand text-to-image retrieval task. By jointly optimizing these tasks, our method enables the VLM to inherently acquire robust compositional retrieval capabilities, supported by the provided theoretical justifications and empirical validation. Furthermore, during inference, we further prompt the MLLM to generate target texts from composed queries and compute retrieval scores by integrating similarities between (i) the composed query and candidate images, and (ii) the MLLM-generated target text and candidate images. This strategy effectively combines the VLM’s semantic alignment strengths with the MLLM’s reasoning capabilities.
zh

[CV-77] Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

【速读】:该论文旨在解决将基于文本的Chain-of-Thought (CoT)推理方法扩展到视觉语言任务中的挑战,特别是在视觉文档理解中存在视觉幻觉和多模态整合不足的问题。其解决方案的关键在于提出Point-RFT框架,该框架通过两个阶段进行优化:首先利用包含71K个多样化视觉推理问题的数据集进行格式微调,每个问题都带有与相应视觉元素明确关联的详细分步推理;其次通过强化学习微调专注于视觉文档理解。该方法在ChartQA数据集上的实验结果表明,其准确率从70.88%提升至90.04%,优于仅依赖文本CoT的强化学习微调方法,验证了其在多模态推理中的有效性。

链接: https://arxiv.org/abs/2505.19702
作者: Minheng Ni,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Kevin Lin,Wangmeng Zuo,Lijuan Wang
机构: Hong Kong Polytechnic University (香港理工大学); Harbin Institute of Technology (哈尔滨工业大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations in text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: First, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT. The result shows that our grounded CoT is more effective for multimodal reasoning compared with the text-only CoT. Moreover, Point-RFT exhibits superior generalization capability across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, TabMWP, etc., and highlights its potential in complex real-world scenarios.
zh

[CV-78] Modeling Beyond MOS: Quality Assessment Models Must Integrate Context Reasoning and Multimodality

【速读】:该论文试图解决当前多媒体质量评估模型依赖单一标量指标Mean Opinion Score (MOS)所带来的局限性,即MOS将复杂的、上下文敏感的人类判断简化为一个标量值,从而忽略了语义失败、用户意图和质量决策的依据。解决方案的关键在于构建具备三种相互依赖能力的质量评估模型:(1)上下文感知能力,以适应特定任务目标和观看条件;(2)推理能力,以生成可解释、基于证据的质量判断理由;(3)多模态能力,通过视觉-语言模型对齐感知和语义线索。论文还提出改进基准数据集和评估指标的路线图,以实现更稳健、符合人类意图且可信的评估系统。

链接: https://arxiv.org/abs/2505.19696
作者: Mohamed Amine Kerkouri,Marouane Tliba,Aladine Chetouani,Nour Aburaed,Alessandro Bruno
机构: F-Initiatives( F-Initiatives); Université d’Orleans(奥尔良大学); Université Sorbonne Paris Nord(索邦巴黎北大学); University of Dubai(迪拜大学); IULM University(伊尔莫大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: Under review

点击查看摘要

Abstract:This position paper argues that Mean Opinion Score (MOS), while historically foundational, is no longer sufficient as the sole supervisory signal for multimedia quality assessment models. MOS reduces rich, context-sensitive human judgments to a single scalar, obscuring semantic failures, user intent, and the rationale behind quality decisions. We contend that modern quality assessment models must integrate three interdependent capabilities: (1) context-awareness, to adapt evaluations to task-specific goals and viewing conditions; (2) reasoning, to produce interpretable, evidence-grounded justifications for quality judgments; and (3) multimodality, to align perceptual and semantic cues using vision-language models. We critique the limitations of current MOS-centric benchmarks and propose a roadmap for reform: richer datasets with contextual metadata and expert rationales, and new evaluation metrics that assess semantic alignment, reasoning fidelity, and contextual sensitivity. By reframing quality assessment as a contextual, explainable, and multimodal modeling task, we aim to catalyze a shift toward more robust, human-aligned, and trustworthy evaluation systems.
zh

[CV-79] Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition CVPR2025

【速读】:该论文旨在解决视觉情感识别(Visual Emotion Recognition, VER)模型在跨领域泛化能力上的不足,特别是在无监督条件下将模型从源域(如真实图像)迁移到低资源目标域(如贴纸)的问题。其关键解决方案是提出一种基于知识对齐的反事实增强扩散感知框架(Knowledge-aligned Counterfactual-enhancement Diffusion Perception, KCDP),通过视觉语言模型对齐情感表征,并利用反事实增强的语言-图像情感对齐方法生成高质量伪标签,从而提升模型在目标域中的感知能力和泛化性能。

链接: https://arxiv.org/abs/2505.19694
作者: Wen Yin,Yong Wang,Guiduo Duan,Dongyang Zhang,Xin Hu,Yuan-Fang Li,Tao He
机构: UESTC(电子科技大学); Sichuan Province(四川省); Monash University(莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025

点击查看摘要

Abstract:Visual Emotion Recognition (VER) is a critical yet challenging task aimed at inferring emotional states of individuals based on visual cues. However, existing works focus on single domains, e.g., realistic images or stickers, limiting VER models’ cross-domain generalizability. To fill this gap, we introduce an Unsupervised Cross-Domain Visual Emotion Recognition (UCDVER) task, which aims to generalize visual emotion recognition from the source domain (e.g., realistic images) to the low-resource target domain (e.g., stickers) in an unsupervised manner. Compared to the conventional unsupervised domain adaptation problems, UCDVER presents two key challenges: a significant emotional expression variability and an affective distribution shift. To mitigate these issues, we propose the Knowledge-aligned Counterfactual-enhancement Diffusion Perception (KCDP) framework. Specifically, KCDP leverages a VLM to align emotional representations in a shared knowledge space and guides diffusion models for improved visual affective perception. Furthermore, a Counterfactual-Enhanced Language-image Emotional Alignment (CLIEA) method generates high-quality pseudo-labels for the target domain. Extensive experiments demonstrate that our model surpasses SOTA models in both perceptibility and generalization, e.g., gaining 12% improvements over the SOTA VER model TGCA-PVT. The project page is at this https URL.
zh

[CV-80] DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving

【速读】:该论文旨在解决现有相机传感器仿真方法在生成多视角视频序列时受限于固定摄像机视角和视频频率的问题,从而限制了其下游应用的灵活性。其解决方案的关键在于提出了一种通用的相机仿真框架DriveCamSim,其核心创新是显式相机建模(Explicit Camera Modeling, ECM)机制,该机制通过建立多视角和多帧维度上的像素级对应关系,解耦模型对训练数据中特定摄像机配置(如内参/外参、视角数量)和时间采样率的过拟合,从而提升模型的泛化能力和可控性。

链接: https://arxiv.org/abs/2505.19692
作者: Wenchao Sun,Xuewu Lin,Keyu Chen,Zixiang Pei,Yining Shi,Chuang Zhang,Sifa Zheng
机构: Tsinghua University (清华大学); Horizon (霍尔顿); Horizon Continental Technology (霍尔顿大陆科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camera sensor simulation serves as a critical role for autonomous driving (AD), e.g. evaluating vision-based AD algorithms. While existing approaches have leveraged generative models for controllable image/video generation, they remain constrained to generating multi-view video sequences with fixed camera viewpoints and video frequency, significantly limiting their downstream applications. To address this, we present a generalizable camera simulation framework DriveCamSim, whose core innovation lies in the proposed Explicit Camera Modeling (ECM) mechanism. Instead of implicit interaction through vanilla attention, ECM establishes explicit pixel-wise correspondences across multi-view and multi-frame dimensions, decoupling the model from overfitting to the specific camera configurations (intrinsic/extrinsic parameters, number of views) and temporal sampling rates presented in the training data. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines, proposing an information-preserving control mechanism. This control mechanism not only improves conditional controllability, but also can be extended to be identity-aware to enhance temporal consistency in foreground object rendering. With above designs, our model demonstrates superior performance in both visual quality and controllability, as well as generalization capability across spatial-level (camera parameters variations) and temporal-level (video frame rate variations), enabling flexible user-customizable camera simulation tailored to diverse application scenarios. Code will be avaliable at this https URL for facilitating future research.
zh

[CV-81] VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在提升视觉推理能力过程中引入的新安全风险问题,特别是针对其可能遭受的越狱攻击(jailbreak attacks)的脆弱性。论文提出的解决方案之关键在于设计了一种名为VisCRA(Visual Chain Reasoning Attack)的新型越狱框架,该框架通过结合目标视觉注意力掩码与两阶段推理诱导策略,精准控制有害输出,从而有效利用模型的视觉推理链绕过安全机制。

链接: https://arxiv.org/abs/2505.19684
作者: Bingrui Sima,Linhua Cong,Wenxuan Wang,Kun He
机构: Huazhong University of Science and Technology (华中科技大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The emergence of Multimodal Large Language Models (MLRMs) has enabled sophisticated visual reasoning capabilities by integrating reinforcement learning and Chain-of-Thought (CoT) supervision. However, while these enhanced reasoning capabilities improve performance, they also introduce new and underexplored safety risks. In this work, we systematically investigate the security implications of advanced visual reasoning in MLRMs. Our analysis reveals a fundamental trade-off: as visual reasoning improves, models become more vulnerable to jailbreak attacks. Motivated by this critical finding, we introduce VisCRA (Visual Chain Reasoning Attack), a novel jailbreak framework that exploits the visual reasoning chains to bypass safety mechanisms. VisCRA combines targeted visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Extensive experiments demonstrate VisCRA’s significant effectiveness, achieving high attack success rates on leading closed-source MLRMs: 76.48% on Gemini 2.0 Flash Thinking, 68.56% on QvQ-Max, and 56.60% on GPT-4o. Our findings highlight a critical insight: the very capability that empowers MLRMs – their visual reasoning – can also serve as an attack vector, posing significant security risks.
zh

[CV-82] Burst Image Super-Resolution via Multi-Cross Attention Encoding and Multi-Scan State-Space Decoding

【速读】:该论文旨在解决多图像超分辨率(Multi-image Super-resolution, MISR)任务中由于固定且狭窄的注意力窗口限制了对局部以外特征的感知,从而影响对齐和特征聚合的问题。解决方案的关键在于提出一种新型特征提取器,该提取器结合了两种新设计的注意力机制:重叠跨窗注意力和跨帧注意力,以更精确和高效地提取多帧间的子像素信息,并引入带有跨帧注意力机制的多扫描状态空间模块来增强特征聚合能力。

链接: https://arxiv.org/abs/2505.19668
作者: Tengda Huang,Yu Zhang,Tianren Li,Yufu Qu,Fulin Liu,Zhenzhong Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 13 figures, submitted to ‘Image and Vision Computing’

点击查看摘要

Abstract:Multi-image super-resolution (MISR) can achieve higher image quality than single-image super-resolution (SISR) by aggregating sub-pixel information from multiple spatially shifted frames. Among MISR tasks, burst super-resolution (BurstSR) has gained significant attention due to its wide range of applications. Recent methods have increasingly adopted Transformers over convolutional neural networks (CNNs) in super-resolution tasks, due to their superior ability to capture both local and global context. However, most existing approaches still rely on fixed and narrow attention windows that restrict the perception of features beyond the local field. This limitation hampers alignment and feature aggregation, both of which are crucial for high-quality super-resolution. To address these limitations, we propose a novel feature extractor that incorporates two newly designed attention mechanisms: overlapping cross-window attention and cross-frame attention, enabling more precise and efficient extraction of sub-pixel information across multiple frames. Furthermore, we introduce a Multi-scan State-Space Module with the cross-frame attention mechanism to enhance feature aggregation. Extensive experiments on both synthetic and real-world benchmarks demonstrate the superiority of our approach. Additional evaluations on ISO 12233 resolution test charts further confirm its enhanced super-resolution performance.
zh

[CV-83] FieldWorkArena: Agent ic AI Benchmark for Real Field Work Tasks

【速读】:该论文试图解决当前agentic AI在真实工作环境中评估不足的问题,尤其是在安全与健康事件、制造相关事件的监测与报告方面的能力评估。现有基准主要针对网络任务,无法有效评估复杂现实工作环境中的智能体表现。解决方案的关键在于定义适用于真实工作环境的新型动作空间,并改进评估函数以更准确地衡量agentic AI在多样化现实任务中的性能,同时考虑多模态大语言模型(Multimodal LLM, MLLM)的特点。

链接: https://arxiv.org/abs/2505.19662
作者: Atsunori Moteki,Shoichi Masui,Fan Yang,Yueqi Song,Yonatan Bisk,Graham Neubig,Ikuo Kusajima,Yasuto Watanabe,Hiroyuki Ishida,Jun Takahashi,Shan Jiang
机构: Fujitsu Limited (富士通有限公司); Fujitsu Research of America (富士通美国研究院); Carnegie Mellon University (卡内基梅隆大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 figures, 4 tables

点击查看摘要

Abstract:This paper proposes FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, they are required to monitor and report safety and health incidents, as well as manufacturing-related incidents, that may occur in real-world work environments. Existing agentic AI benchmarks have been limited to evaluating web tasks and are insufficient for evaluating agents in real-world work environments, where complexity increases significantly. In this paper, we define a new action space that agentic AI should possess for real world work environment benchmarks and improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. The dataset consists of videos captured on-site and documents actually used in factories and warehouses, and tasks were created based on interviews with on-site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of Multimodal LLM (MLLM) such as GPT-4o is feasible. Additionally, the effectiveness and limitations of the proposed new evaluation method were identified. The complete dataset (HuggingFace) and evaluation program (GitHub) can be downloaded from the following website: this https URL.
zh

[CV-84] LangDAug: Langevin Data Augmentation for Multi-Source Domain Generalization in Medical Image Segmentation ICML2025

【速读】:该论文旨在解决医学图像分割模型在不同领域间泛化能力不足的问题,其解决方案的关键在于提出一种名为LangDAug的新型数据增强方法,该方法基于能量模型(Energy-Based Models, EBMs)并通过对比发散训练,利用朗之万动力学在源领域之间生成中间样本,从而增强模型表示。理论分析表明,LangDAug具有正则化效果,并通过数据流形的内在维度对广义线性模型(Generalized Linear Models, GLMs)的Rademacher复杂度进行上界约束。

链接: https://arxiv.org/abs/2505.19659
作者: Piyush Tiwary,Kinjawl Bhattacharyya,Prathosh A.P
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2025

点击查看摘要

Abstract:Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DAug). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DAug methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel \textbfLang evin \textbfD ata \textbfAug mentation for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches. The codebase for our method is available at this https URL.
zh

[CV-85] ReDDiT: Rehashing Noise for Discrete Visual Generation

【速读】:该论文试图解决离散扩散模型在视觉生成领域中效率和兼容性虽好但性能仍落后于连续扩散模型的问题,其根源在于噪声(吸收态)设计和采样启发式方法的局限性。解决方案的关键是提出一种名为ReDDiT的重哈希噪声框架,通过引入随机多索引破坏来扩展吸收态并增强模型的表达能力,同时利用推导出的重哈希采样器确保生成过程的多样性和低差异性,从而提升生成质量并减少对复杂随机性调优的依赖。

链接: https://arxiv.org/abs/2505.19656
作者: Tianren Ma,Xiaosong Zhang,Boyu Yang,Junlan Feng,Qixiang Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, under development

点击查看摘要

Abstract:Discrete diffusion models are gaining traction in the visual generative area for their efficiency and compatibility. However, the pioneered attempts still fall behind the continuous counterparts, which we attribute to the noise (absorbing state) design and sampling heuristics. In this study, we propose the rehashing noise framework for discrete diffusion transformer, termed ReDDiT, to extend absorbing states and improve expressive capacity of discrete diffusion models. ReDDiT enriches the potential paths that latent variables can traverse during training with randomized multi-index corruption. The derived rehash sampler, which reverses the randomized absorbing paths, guarantees the diversity and low discrepancy of the generation process. These reformulations lead to more consistent and competitive generation quality, mitigating the need for heavily tuned randomness. Experiments show that ReDDiT significantly outperforms the baseline (reducing gFID from 6.18 to 1.61) and is on par with the continuous counterparts with higher efficiency.
zh

[CV-86] Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval

【速读】:该论文旨在解决多模态信息检索(Multimodal Information Retrieval, MIR)中由于数据源异构性和跨模态对齐复杂性所带来的固有挑战。其解决方案的关键在于提出UNITE框架,该框架通过两个关键但未被充分探索的方面——数据整理(data curation)和模态感知训练配置(modality-aware training configurations)——来应对这些挑战。此外,论文还提出了模态感知掩码对比学习(Modal-Aware Masked Contrastive Learning, MAMCL)方法,以缓解不同模态实例之间的竞争关系,从而提升跨模态表示学习的鲁棒性。

链接: https://arxiv.org/abs/2505.19650
作者: Fanheng Kong,Jingyuan Zhang,Yahui Liu,Hongzhi Zhang,Shi Feng,Xiaocui Yang,Daling Wang,Yu Tian,Qi Wang,Fuzheng Zhang,Guorui Zhou
机构: Northeastern University (东北大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: 26 pages, project page: this https URL

点击查看摘要

Abstract:Multimodal information retrieval (MIR) faces inherent challenges due to the heterogeneity of data sources and the complexity of cross-modal alignment. While previous studies have identified modal gaps in feature spaces, a systematic approach to address these challenges remains unexplored. In this work, we introduce UNITE, a universal framework that tackles these challenges through two critical yet underexplored aspects: data curation and modality-aware training configurations. Our work provides the first comprehensive analysis of how modality-specific data properties influence downstream task performance across diverse scenarios. Moreover, we propose Modal-Aware Masked Contrastive Learning (MAMCL) to mitigate the competitive relationships among the instances of different modalities. Our framework achieves state-of-the-art results on multiple multimodal retrieval benchmarks, outperforming existing methods by notable margins. Through extensive experiments, we demonstrate that strategic modality curation and tailored training protocols are pivotal for robust cross-modal representation learning. This work not only advances MIR performance but also provides a foundational blueprint for future research in multimodal systems. Our project is available at this https URL.
zh

[CV-87] HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

【速读】:该论文旨在解决虚拟试衣技术中在不同人体姿态下保持图像一致性的问题,具体包括几何失真导致的空间不一致、服装结构与纹理在不同姿态下的语义不一致以及细节丢失影响视觉保真度等问题。其解决方案的关键在于提出HF-VTON框架,该框架包含三个核心模块:Appearance-Preserving Warp Alignment Module (APWAM) 用于对齐服装与人体姿态以解决几何变形并确保空间一致性;Semantic Representation and Comprehension Module (SRCM) 用于捕捉细粒度服装属性和多姿态数据以增强语义表示;Multimodal Prior-Guided Appearance Generation Module (MPAGM) 通过整合多模态特征和预训练模型的先验知识优化外观生成,确保语义和几何一致性。

链接: https://arxiv.org/abs/2505.19638
作者: Ming Meng,Qi Dong,Jiajie Li,Zhe Zhu,Xingyu Wang,Zhaoxin Fan,Wei Zhao,Wenjun Wu
机构: Communication University of China (中国传媒大学); Samsung Research America (三星美国研究院); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.
zh

[CV-88] Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat

【速读】:该论文旨在解决眼科领域中多模态视觉问答(VQA)模型评估缺乏标准化基准的问题,提出了一种双语多模态VQA基准。解决方案的关键在于从微信公众号收集真实临床图像及对应图文描述,并利用GPT-4o-mini生成中英文问答对,构建涵盖9个眼科亚专科、548种疾病、29种影像模态及68种模态组合的大型数据集,从而为视觉语言模型(VLM)提供具有临床真实性和多语言支持的评估平台。

链接: https://arxiv.org/abs/2505.19624
作者: Pusheng Xu,Xia Gong,Xiaolan Chen,Weiyi Zhang,Jiancheng Yang,Bingjie Yan,Meng Yuan,Yalin Zheng,Mingguang He,Danli Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating VLMs in ophthalmology. Methods: Ophthalmic image posts and associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate the performance of three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct. Results: The final OphthalWeChat dataset included 3,469 images and 30,120 QA pairs across 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P 0.001) and Qwen2.5-VL-72B-Instruct (0.514, P 0.001). It also led in both Chinese (0.546) and English subsets (0.550). Subset-specific performance showed Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making scenarios and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.
zh

[CV-89] Rotation-Equivariant Self-Supervised Method in Image Denoising CVPR2025

【速读】:该论文旨在解决自监督图像去噪方法中对图像先验信息利用不足的问题,其核心是通过引入旋转等变性(rotation equivariance)这一重要图像先验来提升模型性能。解决方案的关键在于将高精度的旋转等变卷积应用于自监督图像去噪网络,并通过严格的理论分析证明了仅替换卷积层为旋转等变卷积层即可使网络具备旋转等变性。此外,还设计了一种新的掩码机制以融合旋转等变网络与传统卷积神经网络(CNN)的输出,构建了一个自适应的旋转等变框架,从而进一步提升了去噪效果。

链接: https://arxiv.org/abs/2505.19618
作者: Hanze Liu,Jiahong Fu,Qi Xie,Deyu Meng
机构: Xi’an Jiaotong University (西安交通大学); Pengcheng Laboratory (鹏城实验室); Macau University of Science and Technology (澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Self-supervised image denoising methods have garnered significant research attention in recent years, for this kind of method reduces the requirement of large training datasets. Compared to supervised methods, self-supervised methods rely more on the prior embedded in deep networks themselves. As a result, most of the self-supervised methods are designed with Convolution Neural Networks (CNNs) architectures, which well capture one of the most important image prior, translation equivariant prior. Inspired by the great success achieved by the introduction of translational equivariance, in this paper, we explore the way to further incorporate another important image prior. Specifically, we first apply high-accuracy rotation equivariant convolution to self-supervised image denoising. Through rigorous theoretical analysis, we have proved that simply replacing all the convolution layers with rotation equivariant convolution layers would modify the network into its rotation equivariant version. To the best of our knowledge, this is the first time that rotation equivariant image prior is introduced to self-supervised image denoising at the network architecture level with a comprehensive theoretical analysis of equivariance errors, which offers a new perspective to the field of self-supervised image denoising. Moreover, to further improve the performance, we design a new mask mechanism to fusion the output of rotation equivariant network and vanilla CNN-based network, and construct an adaptive rotation equivariant framework. Through extensive experiments on three typical methods, we have demonstrated the effectiveness of the proposed method.
zh

[CV-90] Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理任务时难以区分相关与不相关模态信号的问题,特别是在视觉问答(VQA)等任务中容易受到误导性或虚假输入的影响,这种问题被定义为跨模态能力不足(Cross-Modality Competency Problem)。其中,模态干扰(Modality Interference)是该问题的具体表现,即无关模态的虚假信息导致模型在单一模态任务中的性能下降。该论文的关键解决方案是设计一种基于扰动的微调框架,包括结合启发式扰动和通过投影梯度下降(PGD)生成的对抗扰动的数据增强方法,以及对原始和扰动输入模型输出的一致性正则化策略,以提升模型在单模态推理和多模态任务中的鲁棒性与跨模态能力。

链接: https://arxiv.org/abs/2505.19616
作者: Rui Cai,Bangzheng Li,Xiaofei Wen,Muhao Chen,Zhe Zhao
机构: University of California, Davis (加利福尼亚大学戴维斯分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks, yet they often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA), which can lead to susceptibility to misleading or spurious inputs. We refer to this broader limitation as the Cross-Modality Competency Problem: the model’s inability to fairly evaluate all modalities. This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. In such tasks, spurious information from irrelevant modalities often leads to significant performance degradation. We refer to this failure as Modality Interference, which serves as a concrete and measurable instance of the cross-modality competency problem. We further design a perturbation-based causal diagnostic experiment to verify and quantify this problem. To mitigate modality interference, we propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentations with both heuristic perturbations and adversarial perturbations via Projected Gradient Descent (PGD), and a consistency regularization strategy applied to model outputs with original and perturbed inputs. Experiments on multiple benchmark datasets (image-heavy, text-heavy, and VQA tasks) and multiple model families with different scales demonstrate significant improvements in robustness and cross-modality competency, indicating our method’s effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks.
zh

[CV-91] Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning

【速读】:该论文试图解决多模态学习中普遍存在的“多重性”(multiplicity)问题,即现实世界中多模态关系本质上是多对多的,而非传统方法所假设的确定性一对一映射。这一问题源于语义抽象、表征不对称性和任务相关模糊性,导致训练不确定性、评估不可靠和数据集质量低下。解决方案的关键在于构建新的多模态感知学习框架和考虑多重性的数据集构建协议,以应对多模态学习全链条中的多重性挑战。

链接: https://arxiv.org/abs/2505.19614
作者: Sanghyuk Chun
机构: NAVER AI Lab(纳维)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal learning has seen remarkable progress, particularly with the emergence of large-scale pre-training across various modalities. However, most current approaches are built on the assumption of a deterministic, one-to-one alignment between modalities. This oversimplifies real-world multimodal relationships, where their nature is inherently many-to-many. This phenomenon, named multiplicity, is not a side-effect of noise or annotation error, but an inevitable outcome of semantic abstraction, representational asymmetry, and task-dependent ambiguity in multimodal tasks. This position paper argues that multiplicity is a fundamental bottleneck that manifests across all stages of the multimodal learning pipeline: from data construction to training and evaluation. This paper examines the causes and consequences of multiplicity, and highlights how multiplicity introduces training uncertainty, unreliable evaluation, and low dataset quality. This position calls for new research directions on multimodal learning: novel multiplicity-aware learning frameworks and dataset construction protocols considering multiplicity.
zh

[CV-92] ESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization

【速读】:该论文旨在解决深度神经网络在对抗样本迁移性方面的挑战,尤其是在安全关键应用中,对抗样本的迁移性使得黑盒攻击成为可能,从而对实际场景中的对抗威胁评估构成重大风险。论文提出的解决方案是TESSER框架,其关键在于两个策略:一是基于中间特征激活计算的逐标记重要性进行梯度调制的特征敏感梯度缩放(Feature-Sensitive Gradient Scaling, FSGS),二是通过可微分高斯先验抑制扰动中高频噪声的频谱平滑正则化(Spectral Smoothness Regularization, SSR)。这两个策略协同工作,生成语义有意义且频谱平滑的扰动,显著提升了攻击成功率和对防御模型的鲁棒性。

链接: https://arxiv.org/abs/2505.19613
作者: Amira Guesmi,Bassem Ouni,Muhammad Shafique
机构: NYU Abu Dhabi (纽约大学阿布扎比分校); Technology Innovation Institute (技术创新研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial transferability remains a critical challenge in evaluating the robustness of deep neural networks. In security-critical applications, transferability enables black-box attacks without access to model internals, making it a key concern for real-world adversarial threat assessment. While Vision Transformers (ViTs) have demonstrated strong adversarial performance, existing attacks often fail to transfer effectively across architectures, especially from ViTs to Convolutional Neural Networks (CNNs) or hybrid models. In this paper, we introduce \textbfTESSER – a novel adversarial attack framework that enhances transferability via two key strategies: (1) \textitFeature-Sensitive Gradient Scaling (FSGS), which modulates gradients based on token-wise importance derived from intermediate feature activations, and (2) \textitSpectral Smoothness Regularization (SSR), which suppresses high-frequency noise in perturbations using a differentiable Gaussian prior. These components work in tandem to generate perturbations that are both semantically meaningful and spectrally smooth. Extensive experiments on ImageNet across 12 diverse architectures demonstrate that TESSER achieves +10.9% higher attack succes rate (ASR) on CNNs and +7.2% on ViTs compared to the state-of-the-art Adaptive Token Tuning (ATT) method. Moreover, TESSER significantly improves robustness against defended models, achieving 53.55% ASR on adversarially trained CNNs. Qualitative analysis shows strong alignment between TESSER’s perturbations and salient visual regions identified via Grad-CAM, while frequency-domain analysis reveals a 12% reduction in high-frequency energy, confirming the effectiveness of spectral regularization.
zh

[CV-93] Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning

【速读】:该论文试图解决多模态模型在识别与背景视觉相似的隐藏物体时,与人类视觉系统存在显著偏差的问题(即多模态模型无法有效区分隐藏物体)。其解决方案的关键在于构建一种模拟人类视觉伪装感知的视觉系统,通过渐进式引导机制(refocus)使模型能够通过分步推理逻辑定位图像中的隐藏物体,并实现层次化注意力转移及先验认知知识的动态调整。该方法基于策略优化算法提出了一种视觉 refocus 强化框架,以提升模型在回答前的思考与重聚焦能力,从而实现优于人类伪装感知系统的推理性能。

链接: https://arxiv.org/abs/2505.19611
作者: Ruolin Shen,Xiaozhong Ji,Kai WU,Jiangning Zhang,Yijun He,HaiHua Yang,Xiaobin Hu,Xiaoyu Sun
机构: Technische Universität München 2ByteDance  3Zhejiang University  4Australian National University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Website: \url{ this https URL }

点击查看摘要

Abstract:Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimicks human visual camouflaged perception to progressively and iteratively `refocus’ visual concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in visual images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior cognitive knowledge. In this paper, we propose a visual refocus reinforcement framework via the policy optimization algorithm to encourage multi-modal models to think and refocus more before answering, and achieve excellent reasoning abilities to align and even surpass human camouflaged perception systems. Our extensive experiments on camouflaged perception successfully demonstrate the emergence of refocus visual phenomena, characterized by multiple reasoning tokens and dynamic adjustment of the detection box. Besides, experimental results on both camouflaged object classification and detection tasks exhibit significantly superior performance compared to Supervised Fine-Tuning (SFT) baselines.
zh

[CV-94] JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在安全性和鲁棒性方面存在的漏洞问题,特别是针对其容易受到越狱攻击(jailbreak attacks)的缺陷。现有方法在攻击目标定义不明确的情况下,往往依赖于易陷入局部最优的基于梯度的策略,并且通常将视觉与文本模态解耦,忽略了关键的跨模态交互。论文提出的解决方案关键在于利用VLM内部融合层表示中的安全相关信息,通过两个阶段的框架——安全边界探测(Safety Boundary Probing)和安全边界跨越(Safety Boundary Crossing)——来实现对模型行为的引导,从而有效提升攻击成功率并揭示VLM中被忽视的安全风险。

链接: https://arxiv.org/abs/2505.19610
作者: Jiaxin Song,Yixu Wang,Jie Li,Rui Yu,Yan Teng,Xingjun Ma,Yingchun Wang
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); NSFOCUS (网神信息)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks. However, lacking well-defined attack objectives, existing jailbreak methods often struggle with gradient-based strategies prone to local optima and lacking precise directional guidance, and typically decouple visual and textual modalities, thereby limiting their effectiveness by neglecting crucial cross-modal interactions. Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space. This motivates exploiting boundary to steer model behavior. Accordingly, we propose JailBound, a novel latent space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating decision boundary within fusion layer’s latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs. This latter stage employs an innovative mechanism to steer the model’s internal state towards policy-violating outputs while maintaining cross-modal semantic consistency. Extensive experiments on six diverse VLMs demonstrate JailBound’s efficacy, achieves 94.32% white-box and 67.28% black-box attack success averagely, which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose a overlooked safety risk in VLMs and highlight the urgent need for more robust defenses. Warning: This paper contains potentially sensitive, harmful and offensive content.
zh

[CV-95] Rep3D: Re-parameterize Large 3D Kernels with Low-Rank Receptive Modeling for Medical Imaging

【速读】:该论文旨在解决在高分辨率3D体积设置中,传统大核卷积因简单增大核尺寸而导致的优化不稳定和性能退化问题。其解决方案的关键在于引入Rep3D框架,该框架通过结合可学习的空间先验与大核卷积训练,利用轻量级两阶段调制网络生成具有感受野偏差的缩放掩码,自适应地重新加权核更新,从而实现局部到全局的收敛行为。此外,Rep3D采用简单的编码器设计,避免了多分支结构的复杂性,有效提升了3D医学图像分析的可解释性与可扩展性。

链接: https://arxiv.org/abs/2505.19603
作者: Ho Hin Lee,Quan Liu,Shunxing Bao,Yuankai Huo,Bennett A. Landman
机构: Vanderbilt University (范德比尔特大学); Accenture (埃森哲)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages

点击查看摘要

Abstract:In contrast to vision transformers, which model long-range dependencies through global self-attention, large kernel convolutions provide a more efficient and scalable alternative, particularly in high-resolution 3D volumetric settings. However, naively increasing kernel size often leads to optimization instability and degradation in performance. Motivated by the spatial bias observed in effective receptive fields (ERFs), we hypothesize that different kernel elements converge at variable rates during training. To support this, we derive a theoretical connection between element-wise gradients and first-order optimization, showing that structurally re-parameterized convolution blocks inherently induce spatially varying learning rates. Building on this insight, we introduce Rep3D, a 3D convolutional framework that incorporates a learnable spatial prior into large kernel training. A lightweight two-stage modulation network generates a receptive-biased scaling mask, adaptively re-weighting kernel updates and enabling local-to-global convergence behavior. Rep3D adopts a plain encoder design with large depthwise convolutions, avoiding the architectural complexity of multi-branch compositions. We evaluate Rep3D on five challenging 3D segmentation benchmarks and demonstrate consistent improvements over state-of-the-art baselines, including transformer-based and fixed-prior re-parameterization methods. By unifying spatial inductive bias with optimization-aware learning, Rep3D offers an interpretable, and scalable solution for 3D medical image analysis. The source code is publicly available at this https URL.
zh

[CV-96] WQLCP: Weighted Adaptive Conformal Prediction for Robust Uncertainty Quantification Under Distribution Shifts

【速读】:该论文旨在解决在分布偏移(distribution shifts)情况下,传统共形预测(Conformal Prediction, CP)方法因数据不可交换性(non-exchangeable data)导致的覆盖率不可靠和预测集过大的问题。其解决方案的关键在于提出一种基于加权分位数损失的共形预测方法(Weighted Quantile Loss-scaled Conformal Prediction, WQLCP),该方法通过引入与校准和测试损失比值相关的权重,动态调整校准分位数阈值,从而在分布偏移条件下提升预测集的覆盖率并减小其规模。

链接: https://arxiv.org/abs/2505.19587
作者: Shadi Alijani,Homayoun Najjaran
机构: University of Victoria (维多利亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conformal prediction (CP) provides a framework for constructing prediction sets with guaranteed coverage, assuming exchangeable data. However, real-world scenarios often involve distribution shifts that violate exchangeability, leading to unreliable coverage and inflated prediction sets. To address this challenge, we first introduce Reconstruction Loss-Scaled Conformal Prediction (RLSCP), which utilizes reconstruction losses derived from a Variational Autoencoder (VAE) as an uncertainty metric to scale score functions. While RLSCP demonstrates performance improvements, mainly resulting in better coverage, it quantifies quantiles based on a fixed calibration dataset without considering the discrepancies between test and train datasets in an unexchangeable setting. In the next step, we propose Weighted Quantile Loss-scaled Conformal Prediction (WQLCP), which refines RLSCP by incorporating a weighted notion of exchangeability, adjusting the calibration quantile threshold based on weights with respect to the ratio of calibration and test loss values. This approach improves the CP-generated prediction set outputs in the presence of distribution shifts. Experiments on large-scale datasets, including ImageNet variants, demonstrate that WQLCP outperforms existing baselines by consistently maintaining coverage while reducing prediction set sizes, providing a robust solution for CP under distribution shifts.
zh

[CV-97] Beyond Segmentation: Confidence-Aware and Debiased Estimation of Ratio-based Biomarkers

【速读】:该论文试图解决比率型生物标志物(ratio-based biomarkers)在临床决策中缺乏不确定性度量的问题,现有方法仅提供点估计,无法反映预测结果的可靠性。其解决方案的关键在于提出一种统一的自信-aware框架,通过系统分析从分割到生物标志物的误差传播,识别出模型校准不足是主要的不确定性来源,并引入一个轻量级的后处理校准模块,利用内部医院数据进行校准而无需重新训练,同时通过可调参数Q控制置信水平,从而生成统计上合理的置信区间,提升预测生物标志物在临床流程中的可信度。

链接: https://arxiv.org/abs/2505.19585
作者: Jiameng Li,Teodora Popordanoska,Sebastian G. Gruber,Frederik Maes,Matthew B. Blaschko
机构: ESAT-PSI, KU Leuven
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Ratio-based biomarkers – such as the proportion of necrotic tissue within a tumor – are widely used in clinical practice to support diagnosis, prognosis and treatment planning. These biomarkers are typically estimated from soft segmentation outputs by computing region-wise ratios. Despite the high-stakes nature of clinical decision making, existing methods provide only point estimates, offering no measure of uncertainty. In this work, we propose a unified \textitconfidence-aware framework for estimating ratio-based biomarkers. We conduct a systematic analysis of error propagation in the segmentation-to-biomarker pipeline and identify model miscalibration as the dominant source of uncertainty. To mitigate this, we incorporate a lightweight, post-hoc calibration module that can be applied using internal hospital data without retraining. We leverage a tunable parameter Q to control the confidence level of the derived bounds, allowing adaptation towards clinical practice. Extensive experiments show that our method produces statistically sound confidence intervals, with tunable confidence levels, enabling more trustworthy application of predictive biomarkers in clinical workflows.
zh

[CV-98] Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

【速读】:该论文旨在解决深度伪造(deepfake)攻击对个人身份安全的威胁,特别是针对名人和政治人物等“VIP个体”的面部信息容易被获取和滥用的问题。现有方法多聚焦于通用场景,忽视了已知身份的先验知识。论文提出的解决方案关键在于构建一个统一的多模态框架——\textbfVIPGuard,通过微调多模态大语言模型(MLLM)学习细粒度的面部属性,进行身份级的判别学习以区分高度相似的面孔,并引入用户特定定制化机制,结合语义推理实现个性化和可解释的深度伪造检测。

链接: https://arxiv.org/abs/2505.19582
作者: Kaiqing Lin,Zhiyuan Yan,Ke-Yue Zhang,Li Hao,Yue Zhou,Yuzhen Lin,Weixiang Li,Taiping Yao,Shouhong Ding,Bin Li
机构: Shenzhen University (深圳大学); Peking Univerisity (北京大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., “VIP individuals” whose authentic facial data are already available. In this paper, we propose \textbfVIPGuard, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works, where traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we built a comprehensive identity-aware benchmark called \textbfVIPBench for personalized deepfake detection, involving the latest 7 face-swapping and 7 entire face synthesis techniques for generation.
zh

[CV-99] VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models

【速读】:该论文旨在解决虚拟试穿(Virtual Try-On)模型在真实场景下评估不足的问题,具体包括现有评价指标无法准确反映人类感知、现有测试集多局限于室内场景且缺乏复杂性,以及缺乏指导未来虚拟试穿生成发展的理想系统。其解决方案的关键在于提出VTBench,一个分层的基准套件,通过系统地将虚拟图像试穿分解为层次化、解耦的维度,并为每个维度配备定制的测试集和评估标准,从而实现多维评估框架、与人类感知对齐以及提供有价值的性能分析。

链接: https://arxiv.org/abs/2505.19571
作者: Hu Xiaobin,Liang Yujie,Luo Donghao,Peng Xu,Zhang Jiangning,Zhu Junwei,Wang Chengjie,Fu Yanwei
机构: Tencent(腾讯); Fudan University(复旦大学); Xiamen University(厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Websit: \url{ this https URL }

点击查看摘要

Abstract:While virtual try-on has achieved significant progress, evaluating these models towards real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons:(1) Current metrics inadequately reflect human perception, particularly in unpaired try-on settings;(2)Most existing test sets are limited to indoor scenarios, lacking complexity for real-world evaluation; and (3) An ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes virtual image try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages:1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (e.g., overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics of corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios.2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark’s alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To foster the field of virtual try-on towards challenging real-world scenario, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.
zh

[CV-100] What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

【速读】:该论文旨在解决开放词汇图像分割中区域分割与目标概念对齐不足的问题,即现有方法通常先进行无类别的区域分割再进行类别匹配,这与人类基于语义概念识别物体的视觉系统过程不符。其解决方案的关键在于提出一种受认知启发的框架,该框架模拟人类视觉识别过程:首先形成对象的概念理解,然后感知其空间范围。该框架包含三个核心组件:生成式视觉-语言模型(Generative Vision-Language Model, G-VLM)提供语义引导,概念感知视觉增强模块融合文本概念特征与全局视觉表示,以及受认知启发的解码器整合局部实例特征与语义提示,实现对相关类别的选择性分类。

链接: https://arxiv.org/abs/2505.19569
作者: Jianghang Lin,Yue Hu,Jiangtao Shen,Yunhang Shen,Liujuan Cao,Shengchuan Zhang,Rongrong Ji
机构: Xiamen University (厦门大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system’s process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching 27.2 PQ, 17.0 mAP, and 35.3 mIoU on A-150. It further attains 56.2 , 28.2 , 15.4 , 59.2 , 18.7 , and 95.8 mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.
zh

[CV-101] Few-Shot Class-Incremental Learning For Efficient SAR Automatic Target Recognition

【速读】:该论文旨在解决合成孔径雷达自动目标识别(SAR-ATR)系统在数据稀缺环境下面临的识别挑战,传统方法难以有效应对此类问题。其解决方案的关键在于提出一种基于双分支架构的少样本类增量学习(FSCIL)框架,该框架通过局部特征提取、离散傅里叶变换和全局滤波器捕捉长期空间依赖性,并引入轻量级交叉注意力机制,以融合领域特定特征与全局依赖关系,同时保持计算效率。此外,框架结合了焦点损失和中心损失,以增强类别区分能力和紧凑的类内分布。

链接: https://arxiv.org/abs/2505.19565
作者: George Karantaidis,Athanasios Pantsios,Ioannis Kompatsiaris,Symeon Papadopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic aperture radar automatic target recognition (SAR-ATR) systems have rapidly evolved to tackle incremental recognition challenges in operational settings. Data scarcity remains a major hurdle that conventional SAR-ATR techniques struggle to address. To cope with this challenge, we propose a few-shot class-incremental learning (FSCIL) framework based on a dual-branch architecture that focuses on local feature extraction and leverages the discrete Fourier transform and global filters to capture long-term spatial dependencies. This incorporates a lightweight cross-attention mechanism that fuses domain-specific features with global dependencies to ensure robust feature interaction, while maintaining computational efficiency by introducing minimal scale-shift parameters. The framework combines focal loss for class distinction under imbalance and center loss for compact intra-class distributions to enhance class separation boundaries. Experimental results on the MSTAR benchmark dataset demonstrate that the proposed framework consistently outperforms state-of-the-art methods in FSCIL SAR-ATR, attesting to its effectiveness in real-world scenarios.
zh

[CV-102] K-Buffers: A Plug-in Method for Enhancing Neural Fields with Multiple Buffers IJCAI2025

【速读】:该论文试图解决神经场(Neural Fields)在渲染过程中的性能提升问题,现有方法主要关注场景表示的优化,而较少研究渲染过程的改进。解决方案的关键在于提出一种名为K-Buffers的插件方法,该方法通过生成多个缓冲区并构建像素级特征图,再利用K-Feature Fusion Network(KFN)进行特征融合,最后通过特征解码器生成渲染图像,同时引入加速策略以提高渲染速度和质量。

链接: https://arxiv.org/abs/2505.19564
作者: Haofan Ren,Zunjie Zhu,Xiang Chen,Ming Lu,Rongfeng Lu,Chenggang Yan
机构: Hangzhou Dianzi University (杭州电子科技大学); Intel Labs China (英特尔中国实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 9 figures, IJCAI 2025

点击查看摘要

Abstract:Neural fields are now the central focus of research in 3D vision and computer graphics. Existing methods mainly focus on various scene representations, such as neural points and 3D Gaussians. However, few works have studied the rendering process to enhance the neural fields. In this work, we propose a plug-in method named K-Buffers that leverages multiple buffers to improve the rendering performance. Our method first renders K buffers from scene representations and constructs K pixel-wise feature maps. Then, We introduce a K-Feature Fusion Network (KFN) to merge the K pixel-wise feature maps. Finally, we adopt a feature decoder to generate the rendering image. We also introduce an acceleration strategy to improve rendering speed and quality. We apply our method to well-known radiance field baselines, including neural point fields and 3D Gaussian Splatting (3DGS). Extensive experiments demonstrate that our method effectively enhances the rendering performance of neural point fields and 3DGS.
zh

[CV-103] Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation

【速读】:该论文试图解决自动化布局生成中的时间消耗大、手动设计复杂以及现有基于图的布局生成方法生成能力有限导致输出不合理和不兼容的问题,同时解决基于视觉的生成模型忽视原始结构信息导致组件交叉与重叠的问题。解决方案的关键在于提出一种聚合结构表示(Aggregation Structural Representation, ASR)模块,该模块将图网络与大语言模型(LLM)相结合,在保留结构信息的同时增强生成能力,通过图特征作为层次化先验知识替代多模态大语言模型(MLLM)中的传统视觉Transformer(ViT)模块,首次实现了全布局信息的预测,并利用可编辑的中间图矩阵支持人机协同的渐进式设计生成。

链接: https://arxiv.org/abs/2505.19554
作者: Jiongchao Jin,Shengchu Zhao,Dajun Chen,Wei Jiang,Yong Li
机构: Ant Group(蚂蚁集团); Agency for Science, Technology and Research(科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Time consumption and the complexity of manual layout design make automated layout generation a critical task, especially for multiple applications across different mobile devices. Existing graph-based layout generation approaches suffer from limited generative capability, often resulting in unreasonable and incompatible outputs. Meanwhile, vision based generative models tend to overlook the original structural information, leading to component intersections and overlaps. To address these challenges, we propose an Aggregation Structural Representation (ASR) module that integrates graph networks with large language models (LLMs) to preserve structural information while enhancing generative capability. This novel pipeline utilizes graph features as hierarchical prior knowledge, replacing the traditional Vision Transformer (ViT) module in multimodal large language models (MLLM) to predict full layout information for the first time. Moreover, the intermediate graph matrix used as input for the LLM is human editable, enabling progressive, human centric design generation. A comprehensive evaluation on the RICO dataset demonstrates the strong performance of ASR, both quantitatively using mean Intersection over Union (mIoU), and qualitatively through a crowdsourced user study. Additionally, sampling on relational features ensures diverse layout generation, further enhancing the adaptability and creativity of the proposed approach.
zh

[CV-104] SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Clouds

【速读】:该论文旨在解决3D点云分类中因分布偏移(distribution shifts)导致的模型性能下降问题。现有方法通常依赖于计算成本较高的反向传播机制进行适应,限制了其在实时场景中的应用。论文提出的解决方案关键在于SMART-PC框架,该框架通过利用3D点云的几何结构,在预训练阶段预测骨骼表示,从而提取对噪声和损坏不敏感的鲁棒几何特征,提升模型对测试时分布偏移的适应能力。与以往方法不同,SMART-PC通过消除反向传播并仅更新BatchNorm统计信息实现实时适应,从而构建了一个轻量且高效的框架。

链接: https://arxiv.org/abs/2505.19546
作者: Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Mehrdad Noori,Gustavo Adolfo Vargas Hakim,David Osowiechi,Farzad Beizaee,Ismail Ben Ayed,Christian Desrosiers
机构: LIVIA, ÉTS Montréal, Canada; International Laboratory on Learning Systems (ILLS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Test-Time Training (TTT) has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frame-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency. The implementation is available at: this https URL.
zh

[CV-105] DVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs

【速读】:该论文旨在解决文本驱动视频编辑(text-driven video editing)中缺乏专门的视频质量评估(Video Quality Assessment, VQA)模型的问题,这一问题限制了对编辑质量细微差别的准确评估。解决方案的关键在于构建一个大规模基准数据集TDVE-DB,并基于此提出一种新的VQA模型TDVE-Assessor。TDVE-DB包含3,857个由12种不同模型生成的编辑视频及其在三个关键维度上的173,565条人工主观评分,为评估提供了详实的数据基础;而TDVE-Assessor通过将空间和时间视频特征融合到大语言模型(LLM)中,实现了对视频内容的全面语义理解,从而提升了评估的准确性。

链接: https://arxiv.org/abs/2505.19535
作者: Juntong Wang,Jiarui Wang,Huiyu Duan,Guangtao Zhai,Xiongkuo Min
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 14 figures, 8 tables

点击查看摘要

Abstract:Text-driven video editing is rapidly advancing, yet its rigorous evaluation remains challenging due to the absence of dedicated video quality assessment (VQA) models capable of discerning the nuances of editing quality. To address this critical gap, we introduce TDVE-DB, a large-scale benchmark dataset for text-driven video editing. TDVE-DB consists of 3,857 edited videos generated from 12 diverse models across 8 editing categories, and is annotated with 173,565 human subjective ratings along three crucial dimensions, i.e., edited video quality, editing alignment, and structural consistency. Based on TDVE-DB, we first conduct a comprehensive evaluation for the 12 state-of-the-art editing models revealing the strengths and weaknesses of current video techniques, and then benchmark existing VQA methods in the context of text-driven video editing evaluation. Building on these insights, we propose TDVE-Assessor, a novel VQA model specifically designed for text-driven video editing assessment. TDVE-Assessor integrates both spatial and temporal video features into a large language model (LLM) for rich contextual understanding to provide comprehensive quality assessment. Extensive experiments demonstrate that TDVE-Assessor substantially outperforms existing VQA models on TDVE-DB across all three evaluation dimensions, setting a new state-of-the-art. Both TDVE-DB and TDVE-Assessor will be released upon the publication.
zh

[CV-106] Applications and Effect Evaluation of Generative Adversarial Networks in Semi-Supervised Learning

【速读】:该论文试图解决深度学习模型在实际场景中应用受限的问题,尤其是由于高质量标注数据不足导致的图像分类性能瓶颈。其解决方案的关键在于构建一种基于生成对抗网络(Generative Adversarial Networks, GANs)的半监督图像分类模型,并引入生成器、判别器和分类器的协同训练机制,从而有效利用有限的标注数据和大量未标注数据,提升图像生成质量和分类准确率。

链接: https://arxiv.org/abs/2505.19522
作者: Jiyu Hu,Haijiang Zeng,Zhen Tian
机构: Huaiyin Institute of Technology, China; Walmart Inc., USA; University of Glasgow, U.K.
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, image classification, as a core task in computer vision, relies on high-quality labelled data, which restricts the wide application of deep learning models in practical scenarios. To alleviate the problem of insufficient labelled samples, semi-supervised learning has gradually become a research hotspot. In this paper, we construct a semi-supervised image classification model based on Generative Adversarial Networks (GANs), and through the introduction of the collaborative training mechanism of generators, discriminators and classifiers, we achieve the effective use of limited labelled data and a large amount of unlabelled data, improve the quality of image generation and classification accuracy, and provide an effective solution for the task of image recognition in complex environments.
zh

[CV-107] Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift

【速读】:该论文试图解决在使用文本到图像扩散模型进行个性化时,如何在仅用少量图像示例适应新主体的同时,保持模型生成多样化且连贯输出的能力。其核心挑战在于防止遗忘,即模型在学习新概念时避免输出分布偏离原始预训练模型的分布。论文指出标准训练目标与个性化目标之间存在不匹配,并提出了一种基于Lipschitz有界公式的新型训练目标,该方法通过显式约束对预训练分布的偏离来有效控制分布漂移,从而在数据稀缺情况下仍能取得良好性能。

链接: https://arxiv.org/abs/2505.19519
作者: Gihoon Kim,Hyungjin Park,Taesup Kim
机构: Seoul National University (首尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. This task presents a fundamental challenge, as the model must not only learn the new subject effectively but also preserve its ability to generate diverse and coherent outputs across a wide range of prompts. In other words, successful personalization requires integrating new concepts without forgetting previously learned generative capabilities. Forgetting denotes unintended distributional drift, where the model’s output distribution deviates from that of the original pretrained model. In this paper, we provide an analysis of this issue and identify a mismatch between standard training objectives and the goals of personalization. To address this, we propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution. Our method provides improved control over distributional drift and performs well even in data-scarce scenarios. Experimental results demonstrate that our approach consistently outperforms existing personalization methods, achieving higher CLIP-T, CLIP-I, and DINO scores.
zh

[CV-108] oward Patient-specific Partial Point Cloud to Surface Completion for Pre- to Intra-operative Registration in Image-guided Liver Interventions

【速读】:该论文试图解决术中图像引导手术中由于缺乏表层下信息而导致的关键区域(如血管和肿瘤)定位困难的问题,以及由于术中点云部分可见性导致的图像-物理配准过程失效的问题。解决方案的关键在于提出一种患者特定的点云补全方法,利用VN-OccNet网络从部分术中点云生成完整的肝脏表面,并通过将补全后的表面集成到Go-ICP配准算法中,以提升初始刚性配准的效果。

链接: https://arxiv.org/abs/2505.19518
作者: Nakul Poudel,Zixin Yang,Kelly Merrell,Richard Simon,Cristian A. Linte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Intra-operative data captured during image-guided surgery lacks sub-surface information, where key regions of interest, such as vessels and tumors, reside. Image-to-physical registration enables the fusion of pre-operative information and intra-operative data, typically represented as a point cloud. However, this registration process struggles due to partial visibility of the intra-operative point cloud. In this research, we propose a patient-specific point cloud completion approach to assist with the registration process. Specifically, we leverage VN-OccNet to generate a complete liver surface from a partial intra-operative point cloud. The network is trained in a patient-specific manner, where simulated deformations from the pre-operative model are used to train the model. First, we conduct an in-depth analysis of VN-OccNet’s rotation-equivariant property and its effectiveness in recovering complete surfaces from partial intra-operative surfaces. Next, we integrate the completed intra-operative surface into the Go-ICP registration algorithm to demonstrate its utility in improving initial rigid registration outcomes. Our results highlight the promise of this patient-specific completion approach in mitigating the challenges posed by partial intra-operative visibility. The rotation equivariant and surface generation capabilities of VN-OccNet hold strong promise for developing robust registration frameworks for variations of the intra-operative point cloud.
zh

[CV-109] Multimodal Machine Translation with Visual Scene Graph Pruning

【速读】:该论文旨在解决多模态机器翻译(Multimodal Machine Translation, MMT)中视觉信息冗余的问题,这一问题限制了现有方法在处理语言多义性和歧义性时的性能。论文提出的解决方案关键在于引入一种基于视觉场景图剪枝(Visual Scene Graph Pruning, PSG)的方法,通过语言场景图信息指导对视觉场景图中冗余节点的剪枝,从而降低下游翻译任务中的噪声,提升翻译质量。

链接: https://arxiv.org/abs/2505.19507
作者: Chenyu Lu,Shiliang Sun,Jing Zhao,Nan Zhang,Tengfei Song,Hao Yang
机构: East China Normal University (华东师范大学); Shanghai Jiao Tong University (上海交通大学); Wenzhou University (温州大学); Huawei Technologies Ltd (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal machine translation (MMT) seeks to address the challenges posed by linguistic polysemy and ambiguity in translation tasks by incorporating visual information. A key bottleneck in current MMT research is the effective utilization of visual data. Previous approaches have focused on extracting global or region-level image features and using attention or gating mechanisms for multimodal information fusion. However, these methods have not adequately tackled the issue of visual information redundancy in MMT, nor have they proposed effective solutions. In this paper, we introduce a novel approach–multimodal machine translation with visual Scene Graph Pruning (PSG), which leverages language scene graph information to guide the pruning of redundant nodes in visual scene graphs, thereby reducing noise in downstream translation tasks. Through extensive comparative experiments with state-of-the-art methods and ablation studies, we demonstrate the effectiveness of the PSG model. Our results also highlight the promising potential of visual information pruning in advancing the field of MMT.
zh

[CV-110] Locality-Aware Zero-Shot Human-Object Interaction Detection CVPR2025

【速读】:该论文旨在解决零样本Human-Object Interaction (HOI)检测中,现有方法难以有效利用CLIP(Contrastive Language-Image Pretraining)模型对人类-物体对的细粒度信息进行建模的问题。其关键解决方案是提出LAIN框架,通过增强CLIP表示的局部性意识和交互意识,从而更好地捕捉人类与物体之间的详细互动信息。局部性意识通过聚合相邻区域块的信息和空间先验来实现,而交互意识则通过捕捉人与物体之间的交互模式来实现。

链接: https://arxiv.org/abs/2505.19503
作者: Sanghyun Kim,Deunsol Jung,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2025; Code is available at: this https URL

点击查看摘要

Abstract:Recent methods for zero-shot Human-Object Interaction (HOI) detection typically leverage the generalization ability of large Vision-Language Model (VLM), i.e., CLIP, on unseen categories, showing impressive results on various zero-shot settings. However, existing methods struggle to adapt CLIP representations for human-object pairs, as CLIP tends to overlook fine-grained information necessary for distinguishing interactions. To address this issue, we devise, LAIN, a novel zero-shot HOI detection framework enhancing the locality and interaction awareness of CLIP representations. The locality awareness, which involves capturing fine-grained details and the spatial structure of individual objects, is achieved by aggregating the information and spatial priors of adjacent neighborhood patches. The interaction awareness, which involves identifying whether and how a human is interacting with an object, is achieved by capturing the interaction pattern between the human and the object. By infusing locality and interaction awareness into CLIP representation, LAIN captures detailed information about the human-object pairs. Our extensive experiments on existing benchmarks show that LAIN outperforms previous methods on various zero-shot settings, demonstrating the importance of locality and interaction awareness for effective zero-shot HOI detection.
zh

[CV-111] Objective Absolute and Hue-aware Metrics for Intrinsic Image Decomposition on Real-World Scenes: A Proof of Concept

【速读】:该论文旨在解决真实场景中固有图像分解(Intrinsic Image Decomposition, IID)质量难以定量评估的问题,因为缺乏真实的地面真值数据。现有方法依赖于人工标注的相对反射强度,但存在主观性、相对评价和色相未评估等挑战。为了解决这些问题,论文提出了一种基于高光谱成像和光探测与测距(LiDAR)强度计算得到的定量评估方法,并引入了基于光谱相似性的可选反照率增强技术。其关键在于利用物理测量数据实现客观、绝对且考虑色相的评估体系。

链接: https://arxiv.org/abs/2505.19500
作者: Shogo Sato,Masaru Tsuchida,Mariko Yamaguchi,Takuhiro Kaneko,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Intrinsic image decomposition (IID) is the task of separating an image into albedo and shade. In real-world scenes, it is difficult to quantitatively assess IID quality due to the unavailability of ground truth. The existing method provides the relative reflection intensities based on human-judged annotations. However, these annotations have challenges in subjectivity, relative evaluation, and hue non-assessment. To address these, we propose a concept of quantitative evaluation with a calculated albedo from a hyperspectral imaging and light detection and ranging (LiDAR) intensity. Additionally, we introduce an optional albedo densification approach based on spectral similarity. This paper conducted a concept verification in a laboratory environment, and suggested the feasibility of an objective, absolute, and hue-aware assessment. (This paper is accepted by IEEE ICIP 2025. )
zh

[CV-112] Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在生成文本时出现的幻觉(hallucination)问题,即生成的文本虽然符合上下文逻辑,但与视觉输入不一致。解决方案的关键在于增强文本生成对视觉内容的依赖性,而非仅通过调整单一模态(视觉或文本)的特征或输出来缓解幻觉。论文从贝叶斯视角全面分析了影响视觉依赖性的因素,并提出从三个层面减轻幻觉:去除冗余视觉标记、修正不恰当的先验信息以及在后验分布退化时停止文本生成。

链接: https://arxiv.org/abs/2505.19498
作者: Nanxing Hu,Xiaoyue Duan,Jinchao Zhang,Guoliang Kang
机构: Beihang University (北京航空航天大学); Tencent WXG (腾讯微信事业群)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) usually generate texts which satisfy context coherence but don’t match the visual input. Such a hallucination issue hinders LVLMs’ applicability in the real world. The key to solving hallucination in LVLM is to make the text generation rely more on the visual content. Most previous works choose to enhance/adjust the features/output of a specific modality (i.e., visual or textual) to alleviate hallucinations in LVLM, which do not explicitly or systematically enhance the visual reliance. In this paper, we comprehensively investigate the factors which may degenerate the visual reliance in text generation of LVLM from a Bayesian perspective. Based on our observations, we propose to mitigate hallucination in LVLM from three aspects. Firstly, we observe that not all visual tokens are informative in generating meaningful texts. We propose to evaluate and remove redundant visual tokens to avoid their disturbance. Secondly, LVLM may encode inappropriate prior information, making it lean toward generating unexpected words. We propose a simple yet effective way to rectify the prior from a Bayesian perspective. Thirdly, we observe that starting from certain steps, the posterior of next-token prediction conditioned on visual tokens may collapse to a prior distribution which does not depend on any informative visual tokens at all. Thus, we propose to stop further text generation to avoid hallucination. Extensive experiments on three benchmarks including POPE, CHAIR, and MME demonstrate that our method can consistently mitigate the hallucination issue of LVLM and performs favorably against previous state-of-the-arts.
zh

[CV-113] he Role of Video Generation in Enhancing Data-Limited Action Understanding IJCAI2025

【速读】:该论文旨在解决真实场景中视频动作理解任务面临的数据限制问题(data limitations)。其解决方案的关键在于通过文本到视频扩散变压器(text-to-video diffusion transformer)生成带注释的数据,以弥补数据稀缺性。该方法能够在无需人工干预的情况下生成无限规模的逼真带注释数据,并结合信息增强策略和基于不确定性的标签平滑策略,以提升生成样本的信息含量并减少低质量样本对模型训练的负面影响。

链接: https://arxiv.org/abs/2505.19495
作者: Wei Li,Dezhao Luo,Dongbao Yang,Zhenhang Li,Weiping Wang,Yu Zhou
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); VCIP & TMCC & DISSec, College of Computer Science, Nankai University (南开大学计算机学院VCIP&TMCC&DISSec); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IJCAI2025

点击查看摘要

Abstract:Video action understanding tasks in real-world scenarios always suffer data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We proposed the information enhancement strategy and the uncertainty-based label smoothing tailored to generate sample training. Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy is proposed to enhance the informative content of the generated samples from two aspects: the environments and the characters. Furthermore, we observed that some low-quality generated samples might negatively affect model training. To address this, we devised the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
zh

[CV-114] ViewCraft3D: High-Fidelity and View-Consistent 3D Vector Graphics Synthesis

【速读】:该论文旨在解决现有生成3D矢量图形方法在处理时间较长以及难以保持视图一致性的问题。其解决方案的关键在于提出ViewCraft3D(VC3D),该方法利用3D先验知识,通过3D物体分析、几何提取算法拟合底层结构,并应用视图一致性的优化过程来提升视觉质量,从而在保持高效计算的同时实现更高质量的3D矢量图形生成。

链接: https://arxiv.org/abs/2505.19492
作者: Chuang Wang,Haitao Zhou,Ling Luo,Qian Yu
机构: Beihang University (北京航空航天大学); Ningbo University (宁波大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D vector graphics play a crucial role in various applications including 3D shape retrieval, conceptual design, and virtual reality interactions due to their ability to capture essential structural information with minimal representation. While recent approaches have shown promise in generating 3D vector graphics, they often suffer from lengthy processing times and struggle to maintain view consistency. To address these limitations, we propose ViewCraft3D (VC3D), an efficient method that leverages 3D priors to generate 3D vector graphics. Specifically, our approach begins with 3D object analysis, employs a geometric extraction algorithm to fit 3D vector graphics to the underlying structure, and applies view-consistent refinement process to enhance visual quality. Our comprehensive experiments demonstrate that VC3D outperforms previous methods in both qualitative and quantitative evaluations, while significantly reducing computational overhead. The resulting 3D sketches maintain view consistency and effectively capture the essential characteristics of the original objects.
zh

[CV-115] SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

【速读】:该论文旨在解决传统帧基相机在快速变化场景中进行立体深度估计的困难。其解决方案的关键在于提出SpikeStereoNet,这是一个受脑启发的框架,首次直接从原始脉冲流(spike streams)中估计立体深度。该模型通过融合来自两个视角的原始脉冲流,并利用循环脉冲神经网络(RSNN)更新模块迭代优化深度估计,从而有效捕捉纹理缺失表面和极端光照条件下的细微边缘和强度变化。

链接: https://arxiv.org/abs/2505.19487
作者: Zhuoheng Gao,Yihao Li,Jiyao Zhang,Rui Zhao,Tong Wu,Hao Tang,Zhaofei Yu,Hao Dong,Guozhang Chen,Tiejun Huang
机构: Peking University (北京大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack specialized stereo algorithms and benchmarks tailored to the spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines depth estimation through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams’ ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.
zh

[CV-116] Revolutionizing Wildfire Detection with Convolutional Neural Networks: A VGG16 Model Approach

【速读】:该论文旨在解决 wildfires(野火)检测的准确性问题,以提高预警系统的效率并减少灾害性后果。研究的关键解决方案是利用基于 VGG16 架构的卷积神经网络(Convolutional Neural Network, CNN)对 D-FIRE 数据集进行训练,通过数据增强技术丰富数据集并优化模型以实现二分类任务,从而有效降低误报率,提升早期野火识别的可靠性。

链接: https://arxiv.org/abs/2505.19479
作者: Lakshmi Aishwarya Malladi,Navarun Gupta,Ahmed El-Sayed,Xingguo Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Conference at ASEE 2025

点击查看摘要

Abstract:Over 8,024 wildfire incidents have been documented in 2024 alone, affecting thousands of fatalities and significant damage to infrastructure and ecosystems. Wildfires in the United States have inflicted devastating losses. Wildfires are becoming more frequent and intense, which highlights how urgently efficient warning systems are needed to avoid disastrous outcomes. The goal of this study is to enhance the accuracy of wildfire detection by using Convolutional Neural Network (CNN) built on the VGG16 architecture. The D-FIRE dataset, which includes several kinds of wildfire and non-wildfire images, was employed in the study. Low-resolution images, dataset imbalance, and the necessity for real-time applicability are some of the main challenges. These problems were resolved by enriching the dataset using data augmentation techniques and optimizing the VGG16 model for binary classification. The model produced a low false negative rate, which is essential for reducing unexplored fires, despite dataset boundaries. In order to help authorities execute fast responses, this work shows that deep learning models such as VGG16 can offer a reliable, automated approach for early wildfire recognition. For the purpose of reducing the impact of wildfires, our future work will concentrate on connecting to systems with real-time surveillance networks and enlarging the dataset to cover more varied fire situations.
zh

[CV-117] Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory ICIP2025

【速读】:该论文试图解决生成式数据蒸馏中蒸馏数据集分布不够多样化,导致下游验证准确率下降的问题。其解决方案的关键在于提出一种基于扩散模型的多样性驱动的数据蒸馏方法,通过引入自适应记忆机制来对齐蒸馏数据与真实数据的分布,并以此引导扩散模型生成更具多样性的数据集。

链接: https://arxiv.org/abs/2505.19469
作者: Mingzhuo Li,Guang Li,Jiafeng Mao,Takahiro Ogawa,Miki Haseyama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICIP 2025

点击查看摘要

Abstract:Dataset distillation enables the training of deep neural networks with comparable performance in significantly reduced time by compressing large datasets into small and representative ones. Although the introduction of generative models has made great achievements in this field, the distributions of their distilled datasets are not diverse enough to represent the original ones, leading to a decrease in downstream validation accuracy. In this paper, we present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem. We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness. The degree of alignment leads the diffusion model to generate more diverse datasets during the distillation process. Extensive experiments show that our method outperforms existing state-of-the-art methods in most situations, proving its ability to tackle dataset distillation tasks.
zh

[CV-118] MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering

【速读】:该论文旨在解决持续视觉问答(CVQA)中由于采用跨模态提示隔离导致的模态不平衡问题,该问题会随着时间推移导致性能下降。其解决方案的关键在于提出MM-Prompt框架,该框架通过跨模态提示查询和跨模态提示恢复机制实现模态间的平衡交互与联合重建,从而有效缓解表示漂移并提升模型在持续学习中的准确性和知识保留能力。

链接: https://arxiv.org/abs/2505.19455
作者: Xu Li,Fan Lyu
机构: Khoury College of Computer Sciences, Northeastern University (东北大学计算机科学学院); New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所模式识别实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Continual Visual Question Answering (CVQA) based on pre-trained models(PTMs) has achieved promising progress by leveraging prompt tuning to enable continual multi-modal learning. However, most existing methods adopt cross-modal prompt isolation, constructing visual and textual prompts separately, which exacerbates modality imbalance and leads to degraded performance over time. To tackle this issue, we propose MM-Prompt, a novel framework incorporating cross-modal prompt query and cross-modal prompt recovery. The former enables balanced prompt selection by incorporating cross-modal signals during query formation, while the latter promotes joint prompt reconstruction through iterative cross-modal interactions, guided by an alignment loss to prevent representational drift. Extensive experiments show that MM-Prompt surpasses prior approaches in accuracy and knowledge retention, while maintaining balanced modality engagement throughout continual learning.
zh

[CV-119] CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features ICML25

【速读】:该论文旨在解决RGB-X跟踪器设计中如何高效建模和利用时空特征的问题,现有方法通常采用两个并行分支分别处理RGB和X输入流,导致模型结构复杂且计算过程繁琐,同时在各模态内部的空间建模会带来较大的计算开销,限制了跨模态和时间建模的资源。解决方案的关键在于提出一种新的跟踪器CSTrack,其核心是通过紧凑的时空特征建模实现简单而有效的跟踪,具体包括引入一种创新的Spatial Compact Module,将RGB-X双输入流整合为紧凑的空间特征以实现全面的模态内和跨模态空间建模,以及设计一个高效的Temporal Compact Module,通过构建优化的目标分布热图来紧凑表示时间特征。

链接: https://arxiv.org/abs/2505.19434
作者: X. Feng,D. Zhang,S. Hu,X. Li,M. Wu,J. Zhang,X. Chen,K. Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML25!

点击查看摘要

Abstract:Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (\eg, depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: this https URL.
zh

[CV-120] Structure Disruption: Subverting Malicious Diffusion-Based Inpainting via Self-Attention Query Perturbation

【速读】:该论文旨在解决扩散模型在图像修复和编辑过程中可能被恶意利用,导致生成误导性或有害内容的问题。其解决方案的关键在于提出一种名为结构破坏攻击(Structure Disruption Attack, SDA)的保护框架,该框架通过在初始去噪步骤中干扰自注意力机制中的查询,破坏轮廓生成过程,从而直接削弱扩散模型的结构生成能力,有效防止其生成连贯图像。

链接: https://arxiv.org/abs/2505.19425
作者: Yuhao He,Jinyu Tian,Haiwei Wu,Jianqing Li
机构: Macau University of Science and Technology (澳门科技大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid advancement of diffusion models has enhanced their image inpainting and editing capabilities but also introduced significant societal risks. Adversaries can exploit user images from social media to generate misleading or harmful content. While adversarial perturbations can disrupt inpainting, global perturbation-based methods fail in mask-guided editing tasks due to spatial constraints. To address these challenges, we propose Structure Disruption Attack (SDA), a powerful protection framework for safeguarding sensitive image regions against inpainting-based editing. Building upon the contour-focused nature of self-attention mechanisms of diffusion models, SDA optimizes perturbations by disrupting queries in self-attention during the initial denoising step to destroy the contour generation process. This targeted interference directly disrupts the structural generation capability of diffusion models, effectively preventing them from producing coherent images. We validate our motivation through visualization techniques and extensive experiments on public datasets, demonstrating that SDA achieves state-of-the-art (SOTA) protection performance while maintaining strong robustness.
zh

[CV-121] LlamaSeg: Image Segmentation via Autoregressive Mask Generation

【速读】:该论文试图解决多任务图像分割问题,旨在通过自然语言指令统一多种图像分割任务。其解决方案的关键在于将图像分割重新定义为视觉生成问题,将掩码表示为“视觉”令牌,并采用类似LLaMA的Transformer架构直接从图像输入中预测这些令牌,从而将分割任务自然地集成到自回归架构中。

链接: https://arxiv.org/abs/2505.19422
作者: Jiru Deng,Tengjin Weng,Tianyu Yang,Wenhan Luo,Zhiheng Li,Wenhao Jiang
机构: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Tsinghua University; Shenzhen University; Alibaba DAMO Academy; HKUST (Hong Kong University of Science and Technology)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as “visual” tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
zh

[CV-122] Certainty and Uncertainty Guided Active Domain Adaptation ICIP2025

【速读】:该论文旨在解决主动域适应(Active Domain Adaptation, ADA)中现有方法仅关注不确定样本而忽略高置信度样本的问题,这些高置信度样本通常与真实标签一致。解决方案的关键在于提出一种协同框架,该框架在标记不确定样本的同时,将高置信度预测视为真实标签,从而缩小搜索空间并提升域适应效果。该方法结合了基于高斯过程的主动采样(GPAS)和基于伪标签的确定性采样(PLCS),通过PLCS优化搜索空间,GPAS减小域差距,进而提高置信样本的比例。

链接: https://arxiv.org/abs/2505.19421
作者: Bardia Safaei,Vibashan VS,Vishal M. Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE ICIP 2025

点击查看摘要

Abstract:Active Domain Adaptation (ADA) adapts models to target domains by selectively labeling a few target samples. Existing ADA methods prioritize uncertain samples but overlook confident ones, which often match ground-truth. We find that incorporating confident predictions into the labeled set before active sampling reduces the search space and improves adaptation. To address this, we propose a collaborative framework that labels uncertain samples while treating highly confident predictions as ground truth. Our method combines Gaussian Process-based Active Sampling (GPAS) for identifying uncertain samples and Pseudo-Label-based Certain Sampling (PLCS) for confident ones, progressively enhancing adaptation. PLCS refines the search space, and GPAS reduces the domain gap, boosting the proportion of confident samples. Extensive experiments on Office-Home and DomainNet show that our approach outperforms state-of-the-art ADA methods.
zh

[CV-123] ADD-SLAM: Adaptive Dynamic Dense SLAM with Gaussian Splatting

【速读】:该论文旨在解决动态物体对神经辐射场(Neural Radiance Fields, NeRF)和基于3D高斯的同步定位与建图(SLAM)方法造成的场景一致性破坏问题,从而导致的跟踪漂移和地图伪影问题。现有方法依赖于预定义语义类别先验进行动态物体识别与过滤,但忽略了对机器人应用至关重要的动态场景信息。论文提出的ADD-SLAM框架通过基于场景一致性的自适应动态识别机制,结合几何与纹理差异分析,无需预定义语义类别先验即可自适应发现场景动态,并采用动态-静态分离映射策略构建时间高斯模型,实现在线增量动态建模,从而有效提升定位与建图的精度与鲁棒性。

链接: https://arxiv.org/abs/2505.19420
作者: Wenhua Wu,Chenpeng Su,Siting Zhu,Tianchen Deng,Zhe Liu,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in Neural Radiance Fields (NeRF) and 3D Gaussian-based Simultaneous Localization and Mapping (SLAM) methods have demonstrated exceptional localization precision and remarkable dense mapping performance. However, dynamic objects introduce critical challenges by disrupting scene consistency, leading to tracking drift and mapping artifacts. Existing methods that employ semantic segmentation or object detection for dynamic identification and filtering typically rely on predefined categorical priors, while discarding dynamic scene information crucial for robotic applications such as dynamic obstacle avoidance and environmental interaction. To overcome these challenges, we propose ADD-SLAM: an Adaptive Dynamic Dense SLAM framework based on Gaussian splitting. We design an adaptive dynamic identification mechanism grounded in scene consistency analysis, comparing geometric and textural discrepancies between real-time observations and historical maps. Ours requires no predefined semantic category priors and adaptively discovers scene dynamics. Precise dynamic object recognition effectively mitigates interference from moving targets during localization. Furthermore, we propose a dynamic-static separation mapping strategy that constructs a temporal Gaussian model to achieve online incremental dynamic modeling. Experiments conducted on multiple dynamic datasets demonstrate our method’s flexible and accurate dynamic segmentation capabilities, along with state-of-the-art performance in both localization and mapping.
zh

[CV-124] MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

【速读】:该论文试图解决当前多模态图像生成模型评估体系不统一的问题,现有工具包存在文本到图像(T2I)基准缺乏多模态条件,而定制化图像生成基准则忽视了组合语义和常识知识。解决方案的关键是提出MMIG-Bench,一个综合性多模态图像生成基准,通过将4,850个丰富标注的文本提示与跨380个主题的1,750张多视角参考图像进行配对,统一了相关任务,并配备了三级评估框架,包括低级视觉伪影和物体身份保留度指标、基于视觉问答(VQA)的中层Aspect Matching Score(AMS)以及高级美学和人类偏好指标,从而实现对模型性能的全面评估。

链接: https://arxiv.org/abs/2505.19415
作者: Hang Hua,Ziyun Zeng,Yizhi Song,Yunlong Tang,Liu He,Daniel Aliaga,Wei Xiong,Jiebo Luo
机构: University of Rochester (罗彻斯特大学); Purdue University (普渡大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lacks multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) novel Aspect Matching Score (AMS): a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.
zh

[CV-125] Exploring the Possibility of TypiClust for Low-Budget Federated Active Learning

【速读】:该论文试图解决在联邦学习(Federated Learning)环境下,由于获取真实标签成本较高而带来的标注负担问题,特别是在低预算设置下如何有效进行主动学习(Active Learning, AL)。其解决方案的关键在于评估TypiClust这一已在低预算AL中表现优异的策略在低预算联邦主动学习(Federated Active Learning, FAL)场景下的有效性。研究结果表明,尽管FAL环境面临数据异质性等额外挑战,TypiClust仍表现出优于其他方法的性能,并且对典型性分布偏移具有较强的鲁棒性。

链接: https://arxiv.org/abs/2505.19404
作者: Yuta Ono,Hiroshi Nakamura,Hideki Takase
机构: The University of Tokyo (东京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages. Accepted at COMPSAC 2025

点击查看摘要

Abstract:Federated Active Learning (FAL) seeks to reduce the burden of annotation under the realistic constraints of federated learning by leveraging Active Learning (AL). As FAL settings make it more expensive to obtain ground truth labels, FAL strategies that work well in low-budget regimes, where the amount of annotation is very limited, are needed. In this work, we investigate the effectiveness of TypiClust, a successful low-budget AL strategy, in low-budget FAL settings. Our empirical results show that TypiClust works well even in low-budget FAL settings contrasted with relatively low performances of other methods, although these settings present additional challenges, such as data heterogeneity, compared to AL. In addition, we show that FAL settings cause distribution shifts in terms of typicality, but TypiClust is not very vulnerable to the shifts. We also analyze the sensitivity of TypiClust to feature extraction methods, and it suggests a way to perform FAL even in limited data situations.
zh

[CV-126] Erasing Concepts Steering Generations: A Comprehensive Survey of Concept Suppression

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在无约束生成敏感、受版权保护或有害图像时所面临的伦理、法律和安全问题。其核心解决方案是概念擦除(concept erasure)方法,即在保留模型整体性能的前提下,有选择性地从生成模型中移除特定语义概念。该方法的关键在于通过干预层级、优化结构和语义范围三个维度对现有技术进行系统分类,从而实现对概念擦除特异性、泛化能力和计算复杂性之间的权衡分析。

链接: https://arxiv.org/abs/2505.19398
作者: Yiwei Xie,Ping Liu,Zheng Zhang
机构: Huazhong University of Science and Technology (华中科技大学); University of Nevada, Reno (内华达大学雷诺分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-Image (T2I) models have demonstrated impressive capabilities in generating high-quality and diverse visual content from natural language prompts. However, uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. To address these concerns, the concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models while preserving their overall utility. This survey provides a comprehensive overview and in-depth synthesis of concept erasure techniques in T2I diffusion models. We systematically categorize existing approaches along three key dimensions: intervention level, which identifies specific model components targeted for concept removal; optimization structure, referring to the algorithmic strategies employed to achieve suppression; and semantic scope, concerning the complexity and nature of the concepts addressed. This multi-dimensional taxonomy enables clear, structured comparisons across diverse methodologies, highlighting fundamental trade-offs between erasure specificity, generalization, and computational complexity. We further discuss current evaluation benchmarks, standardized metrics, and practical datasets, emphasizing gaps that limit comprehensive assessment, particularly regarding robustness and practical effectiveness. Finally, we outline major challenges and promising future directions, including disentanglement of concept representations, adaptive and incremental erasure strategies, adversarial robustness, and new generative architectures. This survey aims to guide researchers toward safer, more ethically aligned generative models, providing foundational knowledge and actionable recommendations to advance responsible development in generative AI.
zh

[CV-127] Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

【速读】:该论文试图解决视频生成模型在物理交互模拟方面的不足,特别是如何使视频对现实世界中的物理力进行真实响应的问题。其解决方案的关键在于引入力提示(force prompts),通过局部点力和全局风力场等物理控制信号,使视频能够模拟真实的物理交互。该方法利用预训练模型中的视觉和运动先验,无需依赖3D资产或物理模拟器即可实现对物理控制信号的响应,从而提升了视频生成的物理真实性。

链接: https://arxiv.org/abs/2505.19386
作者: Nate Gillman,Charles Herrmann,Michael Freeman,Daksh Aggarwal,Evan Luo,Deqing Sun,Chen Sun
机构: Brown University (布朗大学); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.
zh

[CV-128] Advancing Limited-Angle CT Reconstruction Through Diffusion-Based Sinogram Completion

【速读】:该论文旨在解决有限角度计算机断层扫描(Limited Angle Computed Tomography, LACT)中由于缺失角度信息而导致的重建质量下降问题。其解决方案的关键在于提出一种基于sinogram inpainting(投影数据补全)的方法,利用均值回复随机微分方程(MR-SDEs)在投影层面填充缺失的角数据,并通过知识蒸馏与伪逆矩阵约束相结合的方式加速扩散过程,实现高效且精确的sinogram补全。随后,通过后处理模块将补全后的sinogram反投影至图像域并进一步优化重建结果,有效抑制伪影同时保留关键结构细节。

链接: https://arxiv.org/abs/2505.19385
作者: Jiaqi Guo,Santiago Lopez-Tapia,Aggelos K. Katsaggelos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at the 2025 IEEE International Conference on Image Processing (Oral)

点击查看摘要

Abstract:Limited Angle Computed Tomography (LACT) often faces significant challenges due to missing angular information. Unlike previous methods that operate in the image domain, we propose a new method that focuses on sinogram inpainting. We leverage MR-SDEs, a variant of diffusion models that characterize the diffusion process with mean-reverting stochastic differential equations, to fill in missing angular data at the projection level. Furthermore, by combining distillation with constraining the output of the model using the pseudo-inverse of the inpainting matrix, the diffusion process is accelerated and done in a step, enabling efficient and accurate sinogram completion. A subsequent post-processing module back-projects the inpainted sinogram into the image domain and further refines the reconstruction, effectively suppressing artifacts while preserving critical structural details. Quantitative experimental results demonstrate that the proposed method achieves state-of-the-art performance in both perceptual and fidelity quality, offering a promising solution for LACT reconstruction in scientific and clinical applications.
zh

[CV-129] DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

【速读】:该论文旨在解决端到端自动驾驶中存在的多个挑战,包括昂贵的BEV(鸟瞰图)计算、动作多样性以及复杂现实场景中的次优决策问题。其解决方案的关键在于提出一种基于视觉-语言模型(VLM)的新型混合稀疏-密集扩散策略,即Diff-VLA,通过稀疏扩散表示实现高效的多模态驾驶行为建模,并通过代理、地图实例与VLM输出之间的深度交互提升轨迹生成的指导效果。

链接: https://arxiv.org/abs/2505.19381
作者: Anqing Jiang,Yu Gao,Zhigang Sun,Yiru Wang,Jijun Wang,Jinghao Chai,Qian Cao,Yuweng Heng,Hao Jiang,Zongzheng Zhang,Xianda Guo,Hao Sun,Hao Zhao
机构: RIX, Bosch(博世); AIR, Tsinghua University(清华大学); Shanghai Tongji University(同济大学); Shanghai University(上海大学); Shanghai Jiao Tong University(上海交通大学); Southeast University(东南大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 4pages

点击查看摘要

Abstract:Research interest in end-to-end autonomous driving has surged owing to its fully differentiable design integrating modular tasks, i.e. perception, prediction and planing, which enables optimization in pursuit of the ultimate goal. Despite the great potential of the end-to-end paradigm, existing methods suffer from several aspects including expensive BEV (bird’s eye view) computation, action diversity, and sub-optimal decision in complex real-world scenarios. To address these challenges, we propose a novel hybrid sparse-dense diffusion policy, empowered by a Vision-Language Model (VLM), called Diff-VLA. We explore the sparse diffusion representation for efficient multi-modal driving behavior. Moreover, we rethink the effectiveness of VLM driving decision and improve the trajectory generation guidance through deep interaction across agent, map instances and VLM output. Our method shows superior performance in Autonomous Grand Challenge 2025 which contains challenging real and reactive synthetic scenarios. Our methods achieves 45.0 PDMS.
zh

[CV-130] Absolute Coordinates Make Motion Generation Easy

【速读】:该论文试图解决当前文本到动作生成模型中存在的运动表示局限性问题,具体表现为基于骨盆和前一帧的局部相对运动表示(kinematic-aware, local-relative motion representation)虽然简化了早期生成模型的训练,但对扩散模型造成了关键限制,并阻碍了下游任务的应用。解决方案的关键在于重新审视运动表示,提出一种被长期忽视的绝对关节坐标(absolute joint coordinates)方法,该方法在全局空间中编码运动,从而实现了更高的运动保真度、更好的文本对齐以及更强的可扩展性,同时无需复杂的辅助运动学损失或额外的任务特定重构。

链接: https://arxiv.org/abs/2505.19377
作者: Zichong Meng,Zeyu Han,Xiaogang Peng,Yiming Xie,Huaizu Jiang
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:State-of-the-art text-to-motion generation models rely on the kinematic-aware, local-relative motion representation popularized by HumanML3D, which encodes motion relative to the pelvis and to the previous frame with built-in redundancy. While this design simplifies training for earlier generation models, it introduces critical limitations for diffusion models and hinders applicability to downstream tasks. In this work, we revisit the motion representation and propose a radically simplified and long-abandoned alternative for text-to-motion generation: absolute joint coordinates in global space. Through systematic analysis of design choices, we show that this formulation achieves significantly higher motion fidelity, improved text alignment, and strong scalability, even with a simple Transformer backbone and no auxiliary kinematic-aware losses. Moreover, our formulation naturally supports downstream tasks such as text-driven motion control and temporal/spatial editing without additional task-specific reengineering and costly classifier guidance generation from control signals. Finally, we demonstrate promising generalization to directly generate SMPL-H mesh vertices in motion from text, laying a strong foundation for future research and motion-related applications.
zh

[CV-131] DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models KDD2025 KDD

【速读】:该论文旨在解决现有提示学习方法在适应视觉-语言模型(如CLIP)到下游任务时,容易过拟合已见数据,导致在新类别或未见过的领域上性能显著下降的问题。其解决方案的关键在于提出DiSa框架,该框架通过集成两种互补的正则化策略来增强模型的泛化能力:一是跨交互正则化(CIR),通过促进提示编码器与冻结编码器之间的协作学习,结合显著性感知的掩码策略引导图像编码器关注语义关键区域;二是方向正则化策略,通过在方向上对齐视觉嵌入与类别原型特征,优先保证特征方向的一致性而非严格的邻近性。

链接: https://arxiv.org/abs/2505.19373
作者: Niloufar Alipour Talemi,Hossein Kashiani,Hossein R. Nowdeh,Fatemeh Afghah
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025)

点击查看摘要

Abstract:Prompt learning has emerged as a powerful paradigm for adapting vision-language models such as CLIP to downstream tasks. However, existing methods often overfit to seen data, leading to significant performance degradation when generalizing to novel classes or unseen domains. To address this limitation, we propose DiSa, a Directional Saliency-Aware Prompt Learning framework that integrates two complementary regularization strategies to enhance generalization. First, our Cross-Interactive Regularization (CIR) fosters cross-modal alignment by enabling cooperative learning between prompted and frozen encoders. Within CIR, a saliency-aware masking strategy guides the image encoder to prioritize semantically critical image regions, reducing reliance on less informative patches. Second, we introduce a directional regularization strategy that aligns visual embeddings with class-wise prototype features in a directional manner to prioritize consistency in feature orientation over strict proximity. This approach ensures robust generalization by leveraging stable prototype directions derived from class-mean statistics. Extensive evaluations on 11 diverse image classification benchmarks demonstrate that DiSa consistently outperforms state-of-the-art prompt learning methods across various settings, including base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot learning.
zh

[CV-132] Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments

【速读】:该论文试图解决预训练感知模型在新环境中部署时因分布偏移导致性能下降的问题,以及现有元认知方法在提高精度时往往牺牲召回率的局限性。解决方案的关键在于利用多个预训练模型的预测结果,将其建模为基于一致性的溯因(consistency-based abduction)问题,通过逻辑程序编码输入预测和模型学习到的错误检测规则,并寻找一个最大化预测覆盖度且逻辑不一致性率低于设定阈值的预测子集。该方法通过整数规划和启发式搜索算法实现知识表示,有效提升了模型在复杂分布偏移场景下的鲁棒性与性能。

链接: https://arxiv.org/abs/2505.19361
作者: Mario Leiva,Noel Ngu,Joshua Shay Kricheli,Aditya Taparia,Ransalu Senanayake,Paulo Shakarian,Nathaniel Bastian,John Corcoran,Gerardo Simari
机构: Universidad Nacional del Sur and Institute for Computer Science and Engineering, Bahía Blanca, Argentina; Arizona State University, Tempe, AZ USA; Syracuse University, Syracuse, NY USA; United States Military Academy, West Point, NY USA; U.S. Department of Defense, Arlington, VA USA
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:The deployment of pre-trained perception models in novel environments often leads to performance degradation due to distributional shifts. Although recent artificial intelligence approaches for metacognition use logical rules to characterize and filter model errors, improving precision often comes at the cost of reduced recall. This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. The input predictions and the learned error detection rules derived from each model are encoded in a logic program. We then seek an abductive explanation–a subset of model predictions–that maximizes prediction coverage while ensuring the rate of logical inconsistencies (derived from domain constraints) remains below a specified threshold. We propose two algorithms for this knowledge representation task: an exact method based on Integer Programming (IP) and an efficient Heuristic Search (HS). Through extensive experiments on a simulated aerial imagery dataset featuring controlled, complex distributional shifts, we demonstrate that our abduction-based framework outperforms individual models and standard ensemble baselines, achieving, for instance, average relative improvements of approximately 13.6% in F1-score and 16.6% in accuracy across 15 diverse test datasets when compared to the best individual model. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect reasoners in challenging, novel scenarios.
zh

[CV-133] Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions

【速读】:该论文试图解决当前文本驱动图像编辑方法在数据依赖性与编辑能力之间的平衡问题,即依赖大规模高质量编辑配对数据集以提升编辑精度和多样性,或采用无数据集的替代技术但面临指令理解有限和编辑能力受限的挑战。解决方案的关键在于提出一种新的指令驱动图像编辑范式,该范式利用广泛可用的大规模文本-图像对,而非依赖专门的编辑配对数据集,并引入多尺度可学习区域以定位和引导编辑过程,通过将图像与其文本描述对齐作为监督信号,学习生成任务特定的编辑区域,从而实现高保真、精确且符合指令的图像编辑。

链接: https://arxiv.org/abs/2505.19352
作者: Chenrui Ma,Xi Xiao,Tianyang Wang,Yanning Shen
机构: University of California, Irvine (加州大学欧文分校); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current text-driven image editing methods typically follow one of two directions: relying on large-scale, high-quality editing pair datasets to improve editing precision and diversity, or exploring alternative dataset-free techniques. However, constructing large-scale editing datasets requires carefully designed pipelines, is time-consuming, and often results in unrealistic samples or unwanted artifacts. Meanwhile, dataset-free methods may suffer from limited instruction comprehension and restricted editing capabilities. Faced with these challenges, the present work develops a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs, instead of relying on editing pair datasets. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing. Extensive experiments demonstrate that the proposed approach attains state-of-the-art performance across various tasks and benchmarks, while exhibiting strong adaptability to various types of generative models.
zh

[CV-134] BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Behavioural Change

【速读】:该论文试图解决在数字行为改变干预中识别与矛盾和犹豫(Ambivalence/Hesitancy, A/H)相关复杂情绪的问题,这一问题对于提升干预的个性化和有效性具有重要意义。当前缺乏可用于训练机器学习模型识别A/H的数据集,且依赖专家识别成本高、效果有限。该研究提出的关键解决方案是构建首个基于行为的多模态A/H(Behavioral Ambivalence/Hesitancy, BAH)数据集,涵盖来自加拿大9个省份的224名参与者视频,包含面部、语音及身体语言等多模态信息,并提供时间戳标注、帧级与视频级注释以及元数据,以支持多模态识别、零样本预测和无监督域适应等任务。

链接: https://arxiv.org/abs/2505.19328
作者: Manuela González-González,Soufiane Belharbi,Muhammad Osama Zeeshan,Masoumeh Sharafi,Muhammad Haseeb Aslam,Marco Pedersoli,Alessandro Lameiras Koerich,Simon L Bacon,Eric Granger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 41 pages, 13 figures, under review

点击查看摘要

Abstract:Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users, and operate seamlessly within real-time, and resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H. This paper introduces a first Behavioural Ambivalence/Hesitancy (BAH) dataset collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants captured across 9 provinces in Canada, with different age, and ethnicity. Through our web platform, we recruited participants to answer 7 questions, some of which were designed to elicit A/H while recording themselves via webcam with microphone. BAH amounts to 1,118 videos for a total duration of 8.26 hours with 1.5 hours of A/H. Our behavioural team annotated timestamp segments to indicate where A/H occurs, and provide frame- and video-level annotations with the A/H cues. Video transcripts and their timestamps are also included, along with cropped and aligned faces in each frame, and a variety of participants meta-data. We include results baselines for BAH at frame- and video-level recognition in multi-modal setups, in addition to zero-shot prediction, and for personalization using unsupervised domain adaptation. The limited performance of baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are available.
zh

[CV-135] Holistic White-light Polyp Classification via Alignment-free Dense Distillation of Auxiliary Optical Chromoendoscopy MICCAI2025

【速读】:该论文旨在解决在资源有限的临床环境中,基于白光成像(White Light Imaging, WLI)的息肉分类方法性能不足的问题,尤其是在缺乏窄带成像(Narrow Band Imaging, NBI)辅助的情况下。现有方法通过全局特征对齐将NBI的知识迁移至WLI,但通常依赖于裁剪后的病灶区域,易受检测误差影响并忽略上下文和细微诊断线索。该论文提出了一种新的整体分类框架,其关键创新在于无对齐密集蒸馏(Alignment-free Dense Distillation, ADD)模块,该模块无需显式图像对齐即可实现细粒度跨域知识蒸馏,通过像素级跨域关联建立特征图间的对应关系,并结合类别激活映射(Class Activation Mapping, CAM)过滤跨域关联,确保蒸馏路径仅连接语义一致且对息肉诊断有同等贡献的区域。

链接: https://arxiv.org/abs/2505.19319
作者: Qiang Hu,Qimei Wang,Jia Chen,Xuantao Ji,Qiang Li,Zhiwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early Accepted by MICCAI 2025. Code and models: this https URL

点击查看摘要

Abstract:White Light Imaging (WLI) and Narrow Band Imaging (NBI) are the two main colonoscopic modalities for polyp classification. While NBI, as optical chromoendoscopy, offers valuable vascular details, WLI remains the most common and often the only available modality in resource-limited settings. However, WLI-based methods typically underperform, limiting their clinical applicability. Existing approaches transfer knowledge from NBI to WLI through global feature alignment but often rely on cropped lesion regions, which are susceptible to detection errors and neglect contextual and subtle diagnostic cues. To address this, this paper proposes a novel holistic classification framework that leverages full-image diagnosis without requiring polyp localization. The key innovation lies in the Alignment-free Dense Distillation (ADD) module, which enables fine-grained cross-domain knowledge distillation regardless of misalignment between WLI and NBI images. Without resorting to explicit image alignment, ADD learns pixel-wise cross-domain affinities to establish correspondences between feature maps, guiding the distillation along the most relevant pixel connections. To further enhance distillation reliability, ADD incorporates Class Activation Mapping (CAM) to filter cross-domain affinities, ensuring the distillation path connects only those semantically consistent regions with equal contributions to polyp diagnosis. Extensive results on public and in-house datasets show that our method achieves state-of-the-art performance, relatively outperforming the other approaches by at least 2.5% and 16.2% in AUC, respectively. Code is available at: this https URL.
zh

[CV-136] From Single Images to Motion Policies via Video-Generation Environment Representations

【速读】:该论文试图解决从单张RGB图像中生成与环境几何结构一致的无碰撞运动策略的问题(collision-free motion generation)。其解决方案的关键在于提出了一种名为Video-Generation Environment Representation (VGER)的框架,该框架利用大规模视频生成模型生成基于输入图像的移动相机视频,并通过预训练的3D基础模型生成密集点云,进而采用多尺度噪声方法训练环境结构的隐式表示,从而构建符合几何结构的运动生成模型。

链接: https://arxiv.org/abs/2505.19306
作者: Weiming Zhi,Ziyong Ma,Tianyi Zhang,Matthew Johnson-Roberson
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
zh

[CV-137] Alchemist: Turning Public Text-to-Image Data into Generative Gold

【速读】:该论文旨在解决文本到图像(T2I)模型在预训练后仍难以达到高审美质量和对齐度的问题,以及现有监督微调(SFT)数据集普遍局限于特定领域且难以构建高质量通用SFT数据集的挑战。其解决方案的关键在于利用预训练生成式AI (Generative AI) 模型作为高影响训练样本的估计器,从而高效地构建通用SFT数据集。

链接: https://arxiv.org/abs/2505.19297
作者: Valerii Startsev,Alexander Ustyuzhanin,Alexey Kirillov,Dmitry Baranchuk,Sergey Kastryulin
机构: Yandex Research (雅库兹亚研究); HSE (高等经济大学); Yandex (雅库兹亚); MSU (莫斯科国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models’ weights to the public.
zh

[CV-138] xtDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

【速读】:该论文旨在解决文本嵌入图像生成中资源消耗大、在CPU和GPU上运行效率低的问题。现有方法如TextDiffuser-2虽然在生成带有文本的图像方面表现出色,但其计算过程较为复杂,限制了实际应用的效率。该研究提出了一种新的两阶段流水线,其关键在于结合强化学习(Reinforcement Learning, RL)以实现快速且优化的文本布局生成,并与基于扩散模型的图像合成模型相结合,从而显著加速边界框预测步骤并减少重叠,使系统能够在CPU和GPU上高效运行。

链接: https://arxiv.org/abs/2505.19291
作者: Kazi Mahathir Rahman,Showrin Rahman,Sharmin Sultana Srishty
机构: BRAC University (BRAC大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 26 figures. Submitted to arXiv for dissemination. Intended for future submission to a Generative AI conference

点击查看摘要

Abstract:Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2’s quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2’s quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% more faster and requiring only 2MB of memory to run.
zh

[CV-139] Improving Novel view synthesis of 360circ Scenes in Extremely Sparse Views by Jointly Training Hemisphere Sampled Synthetic Images

【速读】:该论文旨在解决在极稀疏输入视角下进行360°场景的新视角合成问题(novel view synthesis),这是虚拟现实和增强现实应用中的关键挑战。其解决方案的关键在于利用DUSt3R估计相机位姿并生成密集点云,随后通过从场景上半球空间中密集采样额外视角,并结合点云渲染合成图像,从而扩展3D空间的场景覆盖范围,缓解因输入有限导致的过拟合问题。此外,通过在所构建数据集上微调基于扩散的图像增强模型,进一步提升了点云渲染图像的质量。

链接: https://arxiv.org/abs/2505.19264
作者: Guangan Chen,Anh Minh Truong,Hanhe Lin,Michiel Vlaminck,Wilfried Philips,Hiep Luong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis in 360 ^\circ scenes from extremely sparse input views is essential for applications like virtual reality and augmented reality. This paper presents a novel framework for novel view synthesis in extremely sparse-view cases. As typical structure-from-motion methods are unable to estimate camera poses in extremely sparse-view cases, we apply DUSt3R to estimate camera poses and generate a dense point cloud. Using the poses of estimated cameras, we densely sample additional views from the upper hemisphere space of the scenes, from which we render synthetic images together with the point cloud. Training 3D Gaussian Splatting model on a combination of reference images from sparse views and densely sampled synthetic images allows a larger scene coverage in 3D space, addressing the overfitting challenge due to the limited input in sparse-view cases. Retraining a diffusion-based image enhancement model on our created dataset, we further improve the quality of the point-cloud-rendered images by removing artifacts. We compare our framework with benchmark methods in cases of only four input views, demonstrating significant improvement in novel view synthesis under extremely sparse-view conditions for 360 ^\circ scenes.
zh

[CV-140] Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

【速读】:该论文试图解决当前文本到图像扩散生成中由于完整文本条件(complete-text conditioning)导致的语义理解缺陷问题,即扩散变压器(DiT)在处理复杂语法的完整文本描述时,容易忽略关键语义细节或产生语义混淆。解决方案的关键在于提出一种名为DiT-ST的新型分割文本条件框架,该框架将完整文本描述转换为由简化句子组成的分割文本描述(split-text caption),以显式表达不同语义原始类型及其相互关系,并通过分层和渐进的方式将这些信息注入到DiT-ST的不同去噪阶段,从而提升特定语义原始类型的表征学习效果。

链接: https://arxiv.org/abs/2505.19261
作者: Yu Zhang,Jialei Zhou,Xinchen Li,Qi Zhang,Zhongwei Wan,Tianyu Wang,Duoqian Miao,Changwei Wang,Longbing Cao
机构: Tongji University (同济大学); The Ohio State University (俄亥俄州立大学); Georgia Institute of Technology (佐治亚理工学院); Macquarie University (麦考瑞大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.
zh

[CV-141] PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms

【速读】:该论文旨在解决在介入手术环境中,如何从有限的2D X-ray图像中确定患者的3D姿态问题。传统术前体积成像(如CT和MRI)虽然能提供精确的3D解剖目标定位,但无法在手术过程中获取,而手术中通常使用快速的2D X-ray成像。为将体积引导整合到术中操作中,本文提出PolyPose方法,其关键在于将复杂的3D形变场参数化为刚性变换的组合,利用骨骼在典型运动中不会弯曲的生物约束,从而在欠定情况下实现更符合解剖学先验的配准,避免了对昂贵形变正则化器和患者及手术特定超参数优化的依赖。

链接: https://arxiv.org/abs/2505.19256
作者: Vivek Gopalakrishnan,Neel Dey,Polina Golland
机构: MIT(麻省理工学院); Harvard Medical School(哈佛医学院); MIT(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient’s preoperative volume to as few as two X-ray images, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail.
zh

[CV-142] Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model

【速读】:该论文旨在解决引用图像分割(referring image segmentation)问题,即通过自然语言表达来定位图像中的特定对象,这需要有效融合视觉与语言信息。其解决方案的关键在于提出SegVLM模型,该模型通过引入squeeze-and-excitation(SE)块进行动态特征重新校准、可变形卷积实现几何适应性以及残差连接促进深度特征学习,同时设计了一种新型的引用感知融合(referring-aware fusion, RAF)损失函数,以平衡区域对齐、边界精度和类别不平衡问题。

链接: https://arxiv.org/abs/2505.19242
作者: Alaa Dalaq,Muzammil Behzad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes to consistent performance improvements. SegVLM also shows strong generalization across diverse datasets and referring expression scenarios.
zh

[CV-143] DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

【速读】:该论文旨在解决自动驾驶中任务特定模型在分布外场景下的性能瓶颈问题,其核心挑战在于模型优化目标过于狭窄以及对昂贵标注数据的依赖。解决方案的关键在于提出DriveX,一个自监督的世界模型,通过大规模驾驶视频学习可泛化的场景动态和整体表示(几何、语义和运动)。DriveX的核心创新包括Omni Scene Modeling (OSM)模块,用于统一多模态监督以捕捉场景演化,以及解耦潜在世界建模策略,将世界表示学习与未来状态解码分离,并结合动态感知射线采样提升运动建模能力。

链接: https://arxiv.org/abs/2505.19239
作者: Chen Shi,Shaoshuai Shi,Kehua Sheng,Bo Zhang,Li Jiang
机构: The Chinese University of Hong Kong, Shenzhen (深圳大学); Voyager Research, Didi Chuxing (滴滴出行)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision-3D point cloud forecasting, 2D semantic representation, and image generation-to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX’s predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX’s effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX’s capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
zh

[CV-144] CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models ICML2025

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在推理过程中存在的高时间和内存开销问题。其解决方案的关键在于揭示了核心神经元(Core Neurons)与核心标记(Core Tokens)之间的匹配机制,并基于此提出了CoreMatching框架,通过协同利用标记稀疏性和神经元稀疏性来提升推理效率。

链接: https://arxiv.org/abs/2505.19235
作者: Qinsi Wang,Hancheng Ye,Ming-Yu Chung,Yudong Liu,Yueqian Lin,Martin Kuo,Mingyuan Ma,Jianyi Zhang,Yiran Chen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025

点击查看摘要

Abstract:Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at this https URL.
zh

[CV-145] RAISE: Realness Assessment for Image Synthesis and Evaluation

【速读】:该论文试图解决AI生成视觉内容在感知真实度评估上的难题,即如何可靠地衡量AI生成图像与真实图像之间的主观真实感差异。解决方案的关键在于通过大规模的人类实验构建了一个新的数据集RAISE,该数据集包含带有主观真实度评分的图像对,并基于此训练多种模型以建立真实度预测的基准。研究发现,来自深度基础视觉模型的特征能够有效捕捉主观真实感,为开发鲁棒且客观的感知真实度评估模型提供了重要资源。

链接: https://arxiv.org/abs/2505.19233
作者: Aniruddha Mukherjee,Spriha Dubey,Somdyuti Paul
机构: Indian Institute of Technology Kharagpur (印度理工学院卡哈格普尔分校); Kalinga Institute of Industrial Technology (卡林加工业技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The rapid advancement of generative AI has enabled the creation of highly photorealistic visual content, offering practical substitutes for real images and videos in scenarios where acquiring real data is difficult or expensive. However, reliably substituting real visual content with AI-generated counterparts requires robust assessment of the perceived realness of AI-generated visual content, a challenging task due to its inherent subjective nature. To address this, we conducted a comprehensive human study evaluating the perceptual realness of both real and AI-generated images, resulting in a new dataset, containing images paired with subjective realness scores, introduced as RAISE in this paper. Further, we develop and train multiple models on RAISE to establish baselines for realness prediction. Our experimental results demonstrate that features derived from deep foundation vision models can effectively capture the subjective realness. RAISE thus provides a valuable resource for developing robust, objective models of perceptual realness assessment.
zh

[CV-146] Advancing Video Self-Supervised Learning via Image Foundation Models

【速读】:该论文试图解决视频自监督表示学习中训练开销过高的问题,特别是针对图像基础模型(Image Foundation Models, IFMs)在视频任务中直接应用时的效率不足。解决方案的关键在于引入时间建模模块(ResNet3D)到预训练的IFMs中,构建视频表示模型,并通过播放速率感知的自监督学习方法对时间模块进行训练,同时冻结IFMs组件,从而显著降低训练时间和GPU内存消耗。

链接: https://arxiv.org/abs/2505.19218
作者: Jingwei Wu,Zhewei Huang,Chang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by 3.4\times and GPU memory usage by 8.2\times . This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at this https URL.
zh

[CV-147] owards Understanding the Mechanisms of Classifier-Free Guidance

【速读】:该论文试图解决生成式 AI (Generative AI) 中 Classifier-free guidance (CFG) 技术的内在机制不明确的问题。其解决方案的关键在于通过分析简化线性扩散模型中的 CFG 行为,揭示其在非线性实际模型中的有效性。研究发现,线性 CFG 通过三个关键组件提升生成质量:均值偏移项、正对比主成分(CPC)项和负 CPC 项,分别用于引导样本方向、增强类别特定特征和抑制通用特征。这些发现为理解非线性扩散模型中 CFG 的工作机制提供了重要启示。

链接: https://arxiv.org/abs/2505.19210
作者: Xiang Li,Rongrong Wang,Qing Qu
机构: University of Michigan (密歇根大学); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify that these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG’s mechanism in the nonlinear regime.
zh

[CV-148] Domain and Task-Focused Example Selection for Data-Efficient Contrastive Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中因依赖大量手动标注数据而导致的训练成本高、效率低的问题。其核心挑战在于如何在有限标注数据的情况下实现有效的分割模型训练。解决方案的关键在于提出一种新颖的自监督对比学习框架PolyCL,该框架通过利用图像间的内在关系,在无需像素级标注或不合理数据增强的情况下,学习并迁移对分割任务有帮助的上下文感知判别特征。此外,该方法创新性地将Segment Anything Model(SAM)集成到框架中,分别作为后处理优化模块和传播机制,以提升分割精度和生成体积分割能力。

链接: https://arxiv.org/abs/2505.19208
作者: Tyler Ward,Aaron Moseley,Abdullah-Al-Zubaer Imran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation is one of the most important tasks in the medical imaging pipeline as it influences a number of image-based decisions. To be effective, fully supervised segmentation approaches require large amounts of manually annotated training data. However, the pixel-level annotation process is expensive, time-consuming, and error-prone, hindering progress and making it challenging to perform effective segmentations. Therefore, models must learn efficiently from limited labeled data. Self-supervised learning (SSL), particularly contrastive learning via pre-training on unlabeled data and fine-tuning on limited annotations, can facilitate such limited labeled image segmentation. To this end, we propose a novel self-supervised contrastive learning framework for medical image segmentation, leveraging inherent relationships of different images, dubbed PolyCL. Without requiring any pixel-level annotations or unreasonable data augmentations, our PolyCL learns and transfers context-aware discriminant features useful for segmentation from an innovative surrogate, in a task-related manner. Additionally, we integrate the Segment Anything Model (SAM) into our framework in two novel ways: as a post-processing refinement module that improves the accuracy of predicted masks using bounding box prompts derived from coarse outputs, and as a propagation mechanism via SAM 2 that generates volumetric segmentations from a single annotated 2D slice. Experimental evaluations on three public computed tomography (CT) datasets demonstrate that PolyCL outperforms fully-supervised and self-supervised baselines in both low-data and cross-domain scenarios. Our code is available at this https URL.
zh

[CV-149] Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning

【速读】:该论文试图解决文本到图像(text-to-image, T2I)扩散模型微调中由于奖励稀疏性导致的步骤级归因不精确和训练效率低下的问题。现有方法将去噪过程重新表述为马尔可夫决策过程,但仅在生成轨迹结束时提供单一延迟奖励,难以准确评估每一步去噪操作的贡献。解决方案的关键在于提出一种动态分配密集奖励的信用分配框架,通过跟踪中间图像与最终图像之间的余弦相似度变化,量化每一步对减少与最终图像距离的贡献,从而提升样本效率和泛化能力。

链接: https://arxiv.org/abs/2505.19196
作者: Xinyao Liao,Wei Wei,Xiaoye Qu,Yu Cheng
机构: Huazhong University of Science and Technology (华中科技大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions. The existing approaches reformulate denoising as a Markov decision process for RL-driven optimization. However, they suffer from reward sparsity, receiving only a single delayed reward per generated trajectory. This flaw hinders precise step-level attribution of denoising actions, undermines training efficiency. To address this, we propose a simple yet effective credit assignment framework that dynamically distributes dense rewards across denoising steps. Specifically, we track changes in cosine similarity between intermediate and final images to quantify each step’s contribution on progressively reducing the distance to the final image. Our approach avoids additional auxiliary neural networks for step-level preference modeling and instead uses reward shaping to highlight denoising phases that have a greater impact on image quality. Our method achieves 1.25 to 2 times higher sample efficiency and better generalization across four human preference reward functions, without compromising the original optimal policy.
zh

[CV-150] CardioCoT: Hierarchical Reasoning for Multimodal Survival Analysis

【速读】:该论文旨在解决急性心肌梗死患者主要不良心血管事件(MACE)复发风险预测中模型可解释性不足与中间推理能力薄弱的问题。现有方法过于关注风险分层能力,而忽视了临床实践中对模型可解释性和稳健推理的需求。为弥补这一差距,论文提出的解决方案关键在于 CardioCoT 框架,其核心是通过两阶段的层次化推理增强生存分析:第一阶段利用证据增强的自我精炼机制生成基于影像学发现的稳健层次化推理轨迹;第二阶段将推理轨迹与影像数据结合进行风险建模与预测,从而在提升预测性能的同时保证模型的可解释性。

链接: https://arxiv.org/abs/2505.19195
作者: Shaohao Rui,Haoyang Su,Jinyi Xiang,Lian-Ming Wu,Xiaosong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate prediction of major adverse cardiovascular events recurrence risk in acute myocardial infarction patients based on postoperative cardiac MRI and associated clinical notes is crucial for precision treatment and personalized intervention. Existing methods primarily focus on risk stratification capability while overlooking the need for intermediate robust reasoning and model interpretability in clinical practice. Moreover, end-to-end risk prediction using LLM/VLM faces significant challenges due to data limitations and modeling complexity. To bridge this gap, we propose CardioCoT, a novel two-stage hierarchical reasoning-enhanced survival analysis framework designed to enhance both model interpretability and predictive performance. In the first stage, we employ an evidence-augmented self-refinement mechanism to guide LLM/VLMs in generating robust hierarchical reasoning trajectories based on associated radiological findings. In the second stage, we integrate the reasoning trajectories with imaging data for risk model training and prediction. CardioCoT demonstrates superior performance in MACE recurrence risk prediction while providing interpretable reasoning processes, offering valuable insights for clinical decision-making.
zh

[CV-151] I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts ICML2025

【速读】:该论文旨在解决多模态学习中模态融合的两个核心问题:一是传统融合方法无法有效处理不同模态之间的异构交互,二是缺乏对多模态交互机制的可解释性。其解决方案的关键在于提出I2MoE(Interpretable Multimodal Interaction-aware Mixture of Experts),该框架通过显式建模多样化的多模态交互,并结合弱监督的交互损失函数来学习数据驱动的交互模式,同时引入重加权模型为每个交互专家的输出分配重要性评分,从而实现样本级和数据集级的解释性。

链接: https://arxiv.org/abs/2505.19190
作者: Jiayi Xin,Sukwon Yun,Jie Peng,Inyoung Choi,Jenna L. Ballard,Tianlong Chen,Qi Long
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2025 Poster

点击查看摘要

Abstract:Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at this https URL.
zh

[CV-152] PosePilot: An Edge-AI Solution for Posture Correction in Physical Exercises

【速读】:该论文旨在解决AI驱动的健身系统中自动化姿态校正这一重大挑战,尽管在活动识别领域已有大量研究。其解决方案的关键在于提出PosePilot系统,该系统将姿态识别与实时个性化纠正反馈相结合,克服了传统健身解决方案的局限性。PosePilot通过使用Vanilla LSTM捕捉姿态识别中的时间依赖性,并采用带有多头注意力机制的BiLSTM来增强运动上下文的处理能力,从而实现对关键肢体角度的精准检测,同时保持计算效率。此外,PosePilot能够在每个动作阶段提供即时纠正反馈,确保整个锻炼过程中姿势的精确调整。

链接: https://arxiv.org/abs/2505.19186
作者: Rushiraj Gadhvi,Priyansh Desai,Siddharth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication at IBPRIA 2025 Conference in Coimbra, Portugal

点击查看摘要

Abstract:Automated pose correction remains a significant challenge in AI-driven fitness systems, despite extensive research in activity recognition. This work presents PosePilot, a novel system that integrates pose recognition with real-time personalized corrective feedback, overcoming the limitations of traditional fitness solutions. Using Yoga, a discipline requiring precise spatio-temporal alignment as a case study, we demonstrate PosePilot’s ability to analyze complex physical movements. Designed for deployment on edge devices, PosePilot can be extended to various at-home and outdoor exercises. We employ a Vanilla LSTM, allowing the system to capture temporal dependencies for pose recognition. Additionally, a BiLSTM with multi-head Attention enhances the model’s ability to process motion contexts, selectively focusing on key limb angles for accurate error detection while maintaining computational efficiency. As part of this work, we introduce a high-quality video dataset used for evaluating our models. Most importantly, PosePilot provides instant corrective feedback at every stage of a movement, ensuring precise posture adjustments throughout the exercise routine. The proposed approach 1) performs automatic human posture recognition, 2) provides personalized posture correction feedback at each instant which is crucial in Yoga, and 3) offers a lightweight and robust posture correction model feasible for deploying on edge devices in real-world environments.
zh

[CV-153] Saliency-guided Emotion Modeling: Predicting Viewer Reactions from Video Stimuli

【速读】:该论文试图解决传统情感计算方法在视频情感预测中忽视视觉显著性(saliency)问题,从而导致情感建模不够准确的问题。其解决方案的关键在于利用深度学习技术,通过提取视频中的显著区域面积和显著区域数量两个关键特征,建立基于显著性的视频情感预测模型,并结合HD2S显著性模型和OpenFace面部动作单元分析,揭示视频显著性与观众情感之间的关系。

链接: https://arxiv.org/abs/2505.19178
作者: Akhila Yaragoppa,Siddharth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at IBPRIA 2025 Conference in Coimbra, Portugal

点击查看摘要

Abstract:Understanding the emotional impact of videos is crucial for applications in content creation, advertising, and Human-Computer Interaction (HCI). Traditional affective computing methods rely on self-reported emotions, facial expression analysis, and biosensing data, yet they often overlook the role of visual saliency – the naturally attention-grabbing regions within a video. In this study, we utilize deep learning to introduce a novel saliency-based approach to emotion prediction by extracting two key features: saliency area and number of salient regions. Using the HD2S saliency model and OpenFace facial action unit analysis, we examine the relationship between video saliency and viewer emotions. Our findings reveal three key insights: (1) Videos with multiple salient regions tend to elicit high-valence, low-arousal emotions, (2) Videos with a single dominant salient region are more likely to induce low-valence, high-arousal responses, and (3) Self-reported emotions often misalign with facial expression-based emotion detection, suggesting limitations in subjective reporting. By leveraging saliency-driven insights, this work provides a computationally efficient and interpretable alternative for emotion modeling, with implications for content creation, personalized media experiences, and affective computing research.
zh

[CV-154] riangle Splatting for Real-Time Radiance Field Rendering

【速读】:该论文试图解决传统三角形(triangle)在计算机图形学中被神经辐射场(Neural Radiance Fields)和3D高斯泼溅(3D Gaussian Splatting)等基于独立基元的表示方法取代后,如何重新实现其在高质量新视角合成中的效率与效果问题。解决方案的关键在于开发一种可微分渲染器,通过端到端梯度直接优化三角形,将三角形的高效性与基于独立基元表示的自适应密度相结合,从而在视觉保真度、收敛速度和渲染吞吐量方面优于现有方法。

链接: https://arxiv.org/abs/2505.19175
作者: Jan Held,Renaud Vandeghen,Adrien Deliege,Abdullah Hamdi,Silvio Giancola,Anthony Cioppa,Andrea Vedaldi,Bernard Ghanem,Andrea Tagliasacchi,Marc Van Droogenbroeck
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 13 figures, 10 tables

点击查看摘要

Abstract:The field of computer graphics was revolutionized by models such as Neural Radiance Fields and 3D Gaussian Splatting, displacing triangles as the dominant representation for photogrammetry. In this paper, we argue for a triangle comeback. We develop a differentiable renderer that directly optimizes triangles via end-to-end gradients. We achieve this by rendering each triangle as differentiable splats, combining the efficiency of triangles with the adaptive density of representations based on independent primitives. Compared to popular 2D and 3D Gaussian Splatting methods, our approach achieves higher visual fidelity, faster convergence, and increased rendering throughput. On the Mip-NeRF360 dataset, our method outperforms concurrent non-volumetric primitives in visual fidelity and achieves higher perceptual quality than the state-of-the-art Zip-NeRF on indoor scenes. Triangles are simple, compatible with standard graphics stacks and GPU hardware, and highly efficient: for the \textitGarden scene, we achieve over 2,400 FPS at 1280x720 resolution using an off-the-shelf mesh renderer. These results highlight the efficiency and effectiveness of triangle-based representations for high-quality novel view synthesis. Triangles bring us closer to mesh-based optimization by combining classical computer graphics with modern differentiable rendering frameworks. The project page is this https URL
zh

[CV-155] EventEgoHands: Event-based Egocentric 3D Hand Mesh Reconstruction

【速读】:该论文旨在解决在低光照环境和运动模糊条件下,基于RGB和深度相机进行3D手部网格重建的挑战。其解决方案的关键在于引入事件相机(event camera),利用其高动态范围和高时间分辨率的优势,同时提出EventEgoHands方法,通过Hand Segmentation Module有效提取手部区域,从而减轻动态背景事件的影响。

链接: https://arxiv.org/abs/2505.19169
作者: Ryosei Hara,Wataru Ikeda,Masashi Hatano,Mariko Isogawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Conference on Image Processing 2025

点击查看摘要

Abstract:Reconstructing 3D hand mesh is challenging but an important task for human-computer interaction and AR/VR applications. In particular, RGB and/or depth cameras have been widely used in this task. However, methods using these conventional cameras face challenges in low-light environments and during motion blur. Thus, to address these limitations, event cameras have been attracting attention in recent years for their high dynamic range and high temporal resolution. Despite their advantages, event cameras are sensitive to background noise or camera motion, which has limited existing studies to static backgrounds and fixed cameras. In this study, we propose EventEgoHands, a novel method for event-based 3D hand mesh reconstruction in an egocentric view. Our approach introduces a Hand Segmentation Module that extracts hand regions, effectively mitigating the influence of dynamic background events. We evaluated our approach and demonstrated its effectiveness on the N-HOT3D dataset, improving MPJPE by approximately more than 4.5 cm (43%).
zh

[CV-156] JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models

【速读】:该论文旨在解决扩散模型在测试阶段的主体分离与组合对齐问题,即在不进行重新训练或外部监督的情况下提升模型生成结果的语义解耦性和提示对齐度。解决方案的关键在于提出JEDI方法,通过基于Jensen-Shannon散度的新目标函数最小化注意力图中的语义纠缠,并利用对抗优化提高效率,从而减少更新步骤数量。该方法具有模型无关性,适用于如Stable Diffusion 1.5和3.5等架构,并提供了一种轻量级、无需CLIP的解耦分数,作为测试阶段组合对齐的基准。

链接: https://arxiv.org/abs/2505.19166
作者: Eric Tillmann Bill,Enis Simsar,Thomas Hofmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel Jensen-Shannon divergence based objective. To improve efficiency, we leverage adversarial optimization, reducing the number of updating steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. We will publicly release the implementation of our method. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2505.19166 [cs.CV] (or arXiv:2505.19166v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.19166 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-157] Benchmarking Laparoscopic Surgical Image Restoration and Beyond

【速读】:该论文旨在解决腹腔镜手术中由于烟雾、镜头起雾和血/组织液飞溅等引起的视觉退化问题,这些问题严重影响手术视野的清晰度,进而影响手术流程和患者安全。解决方案的关键在于引入一个名为SurgClean的真实世界开源手术图像恢复数据集,该数据集涵盖了多种退化类型,并提供了对应的配对参考标签,从而为多类型图像恢复任务(如去烟、去雾和去溅)提供了系统性的研究基础。此外,基于该数据集建立了标准化评估基准,并对22种代表性图像恢复方法进行了性能分析,揭示了现有算法与临床需求之间的显著差距,为智能手术恢复算法的发展提供了重要方向。

链接: https://arxiv.org/abs/2505.19161
作者: Jialun Pei,Diandian Guo,Donghui Yang,Zhixi Li,Yuxin Feng,Long Ma,Bo Du,Pheng-Ann Heng
机构: The Chinese University of Hong Kong (香港中文大学); Dalian University of Technology (大连理工大学); Southern Medical University (南方医科大学); Sun Yat-sen University (中山大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In laparoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impair visual clarity. These degenerations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks, e.g., desmoking, defogging, and desplashing. SurgClean comprises 1,020 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and provide performance for 22 representative generic task-specific image restoration approaches, including 12 generic and 10 task-specific image restoration approaches. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower the capabilities of restoration algorithms to increase surgical environments and improve the efficiency of clinical procedures.
zh

[CV-158] A Joint Learning Framework with Feature Reconstruction and Prediction for Incomplete Satellite Image Time Series in Agricultural Semantic Segmentation

【速读】:该论文旨在解决卫星遥感影像时间序列(Satellite Image Time Series, SITS)中云覆盖导致的时间间隙问题,该问题会破坏时间依赖性并引起特征偏移,从而降低模型在完整SITS上训练后的性能。现有方法通常通过重建整个SITS或使用数据增强来模拟缺失数据,但前者可能引入噪声和冗余,后者仅能处理有限的缺失模式,导致泛化能力差。本文提出的解决方案关键在于构建一个联合学习框架,结合特征重建与预测任务,在训练过程中利用时间掩码模拟数据缺失,并通过真实标签和基于完整SITS训练的教师模型进行指导,使模型能够选择性地重构与教师模型时间特征表示对齐的关键特征,从而减少不必要的重构和噪声传播,提升模型在不同缺失模式和传感器数据上的泛化能力。

链接: https://arxiv.org/abs/2505.19159
作者: Yuze Wang,Mariana Belgiu,Haiyang Wu,Dandan Zhong,Yangyang Cao,Chao Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Satellite Image Time Series (SITS) is crucial for agricultural semantic segmentation. However, Cloud contamination introduces time gaps in SITS, disrupting temporal dependencies and causing feature shifts, leading to degraded performance of models trained on complete SITS. Existing methods typically address this by reconstructing the entire SITS before prediction or using data augmentation to simulate missing data. Yet, full reconstruction may introduce noise and redundancy, while the data-augmented model can only handle limited missing patterns, leading to poor generalization. We propose a joint learning framework with feature reconstruction and prediction to address incomplete SITS more effectively. During training, we simulate data-missing scenarios using temporal masks. The two tasks are guided by both ground-truth labels and the teacher model trained on complete SITS. The prediction task constrains the model from selectively reconstructing critical features from masked inputs that align with the teacher’s temporal feature representations. It reduces unnecessary reconstruction and limits noise propagation. By integrating reconstructed features into the prediction task, the model avoids learning shortcuts and maintains its ability to handle varied missing patterns and complete SITS. Experiments on SITS from Hunan Province, Western France, and Catalonia show that our method improves mean F1-scores by 6.93% in cropland extraction and 7.09% in crop classification over baselines. It also generalizes well across satellite sensors, including Sentinel-2 and PlanetScope, under varying temporal missing rates and model backbones.
zh

[CV-159] FHGS: Feature-Homogenized Gaussian Splatting

【速读】:该论文旨在解决基于3D Gaussian Splatting (3DGS) 的场景理解中,高斯基元的各向异性颜色表示与语义特征各向同性需求之间的固有矛盾,从而导致跨视角特征一致性不足的问题。其解决方案的关键在于提出一种名为 \textit{FHGS}(Feature-Homogenized Gaussian Splatting)的新型3D特征融合框架,该框架通过引入通用特征融合架构、非可微特征融合机制以及受电势场启发的双驱动优化策略,实现了从预训练模型中高精度地映射任意2D特征到3D场景,同时保持了3DGS的实时渲染效率。

链接: https://arxiv.org/abs/2505.19154
作者: Q. G. Duan,Benyun Zhao,Mingqiao Han Yijun Huang,Ben M. Chen
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene understanding based on 3D Gaussian Splatting (3DGS) has recently achieved notable advances. Although 3DGS related methods have efficient rendering capabilities, they fail to address the inherent contradiction between the anisotropic color representation of gaussian primitives and the isotropic requirements of semantic features, leading to insufficient cross-view feature consistency. To overcome the limitation, we proposes \textitFHGS (Feature-Homogenized Gaussian Splatting), a novel 3D feature fusion framework inspired by physical models, which can achieve high-precision mapping of arbitrary 2D features from pre-trained models to 3D scenes while preserving the real-time rendering efficiency of 3DGS. Specifically, our \textitFHGS introduces the following innovations: Firstly, a universal feature fusion architecture is proposed, enabling robust embedding of large-scale pre-trained models’ semantic features (e.g., SAM, CLIP) into sparse 3D structures. Secondly, a non-differentiable feature fusion mechanism is introduced, which enables semantic features to exhibit viewpoint independent isotropic distributions. This fundamentally balances the anisotropic rendering of gaussian primitives and the isotropic expression of features; Thirdly, a dual-driven optimization strategy inspired by electric potential fields is proposed, which combines external supervision from semantic feature fields with internal primitive clustering guidance. This mechanism enables synergistic optimization of global semantic alignment and local structural consistency. More interactive results can be accessed on: this https URL.
zh

[CV-160] SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

【速读】:该论文旨在解决基于扩散模型的视频生成在高分辨率和长时长视频生成任务中的计算成本过高的问题。现有方法通过跳过部分计算来加速推理,但通常会导致严重的质量下降。论文提出的解决方案关键在于SRDiffusion框架,该框架通过大模型与小模型的协作实现高效推理:大模型负责处理高噪声步骤以保证语义和运动的一致性(Sketching),而小模型则在低噪声步骤中细化视觉细节(Rendering)。这种分工策略有效降低了推理成本,同时保持了视频生成的质量。

链接: https://arxiv.org/abs/2505.19151
作者: Shenggan Cheng,Yuanxin Wei,Lansong Diao,Yong Liu,Bujiao Chen,Lianghua Huang,Yu Liu,Wenyuan Yu,Jiangsu Du,Wei Lin,Yang You
机构: National University of Singapore (新加坡国立大学); Sun Yat-sen University (中山大学); Alibaba Group (阿里巴巴集团)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, over 3 \times speedup for Wan with nearly no quality loss for VBench, and 2 \times speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
zh

[CV-161] MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection

【速读】:该论文旨在解决现有图像编辑方法在复杂场景下难以实现高精度和语义准确性的难题。其关键解决方案是提出MIND-Edit框架,该框架将预训练扩散模型与多模态大语言模型(MLLM)相结合,并引入两种互补策略:一是基于MLLM语义推理优化文本指令的策略,二是利用MLLM内在视觉理解能力驱动编辑的策略,通过联合训练方法有效整合两者,从而提升指令解析的准确性与视觉一致性。

链接: https://arxiv.org/abs/2505.19149
作者: Shuyu Wang,Weiqi Li,Qian Wang,Shijie Zhao,Jian Zhang
机构: Peking University (北京大学); ByteDance Inc. (字节跳动公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in AI-generated content (AIGC) have significantly accelerated image editing techniques, driving increasing demand for diverse and fine-grained edits. Despite these advances, existing image editing methods still face challenges in achieving high precision and semantic accuracy in complex scenarios. Recent studies address this issue by incorporating multimodal large language models (MLLMs) into image editing pipelines. However, current MLLM-based methods mainly rely on interpreting textual instructions, leaving the intrinsic visual understanding of large models largely unexplored, thus resulting in insufficient alignment between textual semantics and visual outcomes. To overcome these limitations, we propose MIND-Edit, an end-to-end image-editing framework integrating pretrained diffusion model with MLLM. MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent and guide the diffusion process via generated visual embeddings. Furthermore, we propose a joint training approach to effectively integrate both strategies, allowing them to reinforce each other for more accurate instruction interpretation and visually coherent edits aligned with user intent. Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.
zh

[CV-162] DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing

【速读】:该论文旨在解决红外成像中密集簇状小目标的分辨问题,特别是由于信号重叠导致的目标数量、亚像素位置和辐射强度难以精确确定的问题。其解决方案的关键在于提出一种名为动态迭代收缩阈值网络(DISTA-Net)的深度学习模型,该模型在动态框架内重新定义了传统的稀疏重构方法,通过自适应生成卷积权重和阈值参数,实现实时定制化的重建过程。

链接: https://arxiv.org/abs/2505.19148
作者: Shengdong Han,Shangdong Yang,Xin Zhang,Yuxuan Li,Xiang Li,Jian Yang,Ming-Ming Cheng,Yimian Dai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Resolving closely-spaced small targets in dense clusters presents a significant challenge in infrared imaging, as the overlapping signals hinder precise determination of their quantity, sub-pixel positions, and radiation intensities. While deep learning has advanced the field of infrared small target detection, its application to closely-spaced infrared small targets has not yet been explored. This gap exists primarily due to the complexity of separating superimposed characteristics and the lack of an open-source infrastructure. In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. DISTA-Net adaptively generates convolution weights and thresholding parameters to tailor the reconstruction process in real time. To the best of our knowledge, DISTA-Net is the first deep learning model designed specifically for the unmixing of closely-spaced infrared small targets, achieving superior sub-pixel detection accuracy. Moreover, we have established the first open-source ecosystem to foster further research in this field. This ecosystem comprises three key components: (1) CSIST-100K, a publicly available benchmark dataset; (2) CSO-mAP, a custom evaluation metric for sub-pixel detection; and (3) GrokCSO, an open-source toolkit featuring DISTA-Net and other models. Our code and dataset are available at this https URL.
zh

[CV-163] he Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agent ic Framework

【速读】:该论文试图解决视觉-语言模型(VLM)代理框架中图像隐私属性分析(image private attribute profiling)的隐私风险问题,即从个人图像集中推断出敏感属性(如年龄、健康信息)甚至抽象属性(如性格和社会特质)。其解决方案的关键在于构建了PAPI数据集,这是目前最大的用于研究个人图像隐私属性分析的数据集,并提出了HolmesEye框架,该框架结合了VLM和大语言模型(LLM),通过VLM提取图像内的和图像间的特征,以及通过LLM引导推理过程并进行结果整合,从而提升隐私推断的准确性。实验表明,HolmesEye在平均准确率上优于现有最先进基线10.8%,在抽象属性预测上超越人类水平15.0%。

链接: https://arxiv.org/abs/2505.19139
作者: Feiran Liu,Yuzhe Zhang,Xinyi Huang,Yinan Peng,Xinfeng Li,Lixu Wang,Yutong Shen,Ranjie Duan,Simeng Qin,Xiaojun Jia,Qingsong Wen,Wei Dong
机构: Nanyang Technological University (南洋理工大学); Beijing University of Technology (北京工业大学); Hengxin Tech (恒信科技); Alibaba Group (阿里巴巴集团); Squirrel Ai Learning (松鼠AI学习)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Our research reveals a new privacy risk associated with the vision-language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term “image private attribute profiling.” This threat is particularly severe given that modern apps can easily access users’ photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this work, we construct PAPI, the largest dataset for studying private attribute profiling in personal images, comprising 2,510 images from 251 individuals with 3,012 annotated privacy attributes. We also propose HolmesEye, a hybrid agentic framework that combines VLMs and LLMs to enhance privacy inference. HolmesEye uses VLMs to extract both intra-image and inter-image information and LLMs to guide the inference process as well as consolidate the results through forensic analysis, overcoming existing limitations in long-context visual reasoning. Experiments reveal that HolmesEye achieves a 10.8% improvement in average accuracy over state-of-the-art baselines and surpasses human-level performance by 15.0% in predicting abstract attributes. This work highlights the urgency of addressing privacy risks in image-based profiling and offers both a new dataset and an advanced framework to guide future research in this area.
zh

[CV-164] Veta-GS: View-dependent deformable 3D Gaussian Splatting for thermal infrared Novel-view Synthesis

【速读】:该论文试图解决基于热红外(Thermal Infrared, TIR)图像的新型视角合成中由于传输效应、发射率差异和低分辨率导致的浮动物体和模糊效果问题。其解决方案的关键在于引入Veta-GS,该方法通过视图依赖的形变场和热特征提取器(Thermal Feature Extractor, TFE)来精确捕捉细微的热变化并保持鲁棒性,同时结合MonoSSIM损失函数以考虑外观、边缘和频率信息。

链接: https://arxiv.org/abs/2505.19138
作者: Myeongseok Nam,Wongi Park,Minsol Kim,Hyejin Hur,Soomok Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3D-GS) based on Thermal Infrared (TIR) imaging has gained attention in novel-view synthesis, showing real-time rendering. However, novel-view synthesis with thermal infrared images suffers from transmission effects, emissivity, and low resolution, leading to floaters and blur effects in rendered images. To address these problems, we introduce Veta-GS, which leverages a view-dependent deformation field and a Thermal Feature Extractor (TFE) to precisely capture subtle thermal variations and maintain robustness. Specifically, we design view-dependent deformation field that leverages camera position and viewing direction, which capture thermal variations. Furthermore, we introduce the Thermal Feature Extractor (TFE) and MonoSSIM loss, which consider appearance, edge, and frequency to maintain robustness. Extensive experiments on the TI-NSD benchmark show that our method achieves better performance over existing methods.
zh

[CV-165] RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models

【速读】:该论文旨在解决当前视频-语言基准测试在评估大型多模态模型(Large Multi-modal Models, LMMs)的原子时间事件理解能力方面存在的不足,因为这些基准测试通常可以通过图像-语言模型有效解决。其解决方案的关键在于提出RTime-QA,一个专门设计用于评估LMMs原子时间事件理解能力的新基准,包含822个高质量、精心标注的视频-文本问题,并引入RTime-IT指令微调数据集,通过类似的标注流程进一步提升LMMs的时间事件理解能力。实验结果表明,RTime-QA对LMMs构成了显著挑战,而RTime-IT能够有效增强模型在该任务上的表现。

链接: https://arxiv.org/abs/2505.19125
作者: Yuqi Liu,Qin Jin,Tianyuan Qu,Xuan Liu,Yang Du,Bei Yu,Jiaya Jia
机构: CUHK(香港中文大学); HKUST(香港科技大学); RUC(中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding accurate atomic temporal event is essential for video comprehension. However, current video-language benchmarks often fall short to evaluate Large Multi-modal Models’ (LMMs) temporal event understanding capabilities, as they can be effectively addressed using image-language models. In this paper, we introduce RTime-QA, a novel benchmark specifically designed to assess the atomic temporal event understanding ability of LMMs. RTime-QA comprises 822 high-quality, carefully-curated video-text questions, each meticulously annotated by human experts. Each question features a video depicting an atomic temporal event, paired with both correct answers and temporal negative descriptions, specifically designed to evaluate temporal understanding. To advance LMMs’ temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs a similar annotation process as RTime-QA. Extensive experimental analysis demonstrates that RTime-QA presents a significant challenge for LMMs: the state-of-the-art model Qwen2-VL achieves only 34.6 on strict-ACC metric, substantially lagging behind human performance. Furthermore, our experiments reveal that RTime-IT effectively enhance LMMs’ capacity in temporal understanding. By fine-tuning on RTime-IT, our Qwen2-VL achieves 65.9 on RTime-QA.
zh

[CV-166] Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers

【速读】:该论文旨在解决扩散模型在训练过程中因高方差梯度估计导致的收敛缓慢问题,特别是针对Diffusion Transformer (DiT)架构的训练稳定性问题。其解决方案的关键在于提出一种保持激活幅度(magnitude-preserving)的设计,该设计通过避免使用归一化层来稳定训练过程,并引入旋转调制(rotation modulation)作为一种新的条件生成方法,利用学习到的旋转代替传统的缩放或平移操作。实验结果表明,该方法显著提升了模型性能,同时在参数效率上优于AdaLN。

链接: https://arxiv.org/abs/2505.19122
作者: Eric Tillman Bill,Cristian Perez Jensen,Sotiris Anagnostidis,Dimitri von Rütte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps with stabilizing training in the U-net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, which is a novel conditioning method using learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably reducing FID scores by \sim 12.8%. Further, we show that rotation modulation combined with scaling is competitive with AdaLN, while requiring \sim 5.4% fewer parameters. This work provides insights into conditioning strategies and magnitude control. We will publicly release the implementation of our method.
zh

[CV-167] Freqformer: Image-Demoiréing Transformer via Efficient Frequency Decomposition

【速读】:该论文旨在解决图像去摩尔纹(demoiréing)问题,该问题由于摩尔纹引起的纹理退化与色彩失真之间的复杂相互作用而具有挑战性。现有方法在直接图像到图像恢复中难以有效解耦这些交织的伪影,而基于小波的频域感知方法虽有潜力但尚未被充分探索。本文提出的解决方案关键在于Freqformer框架,其通过针对性的频域分离实现有效的频率分解,将摩尔纹分为高频空间局部纹理和低频尺度鲁棒色彩失真,并采用双分支结构分别处理;同时引入可学习的频率组合变换(Frequency Composition Transform, FCT)模块以自适应融合频域特定输出,从而实现一致且高保真的重建。

链接: https://arxiv.org/abs/2505.19120
作者: Xiaoyang Liu,Bolin Qiu,Jiezhang Cao,Zheng Chen,Yulun Zhang,Xiaokang Yang
机构: Shanghai Jiao Tong University (上海交通大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image demoiréing remains a challenging task due to the complex interplay between texture corruption and color distortions caused by moiré patterns. Existing methods, especially those relying on direct image-to-image restoration, often fail to disentangle these intertwined artifacts effectively. While wavelet-based frequency-aware approaches offer a promising direction, their potential remains underexplored. In this paper, we present Freqformer, a Transformer-based framework specifically designed for image demoiréing through targeted frequency separation. Our method performs an effective frequency decomposition that explicitly splits moiré patterns into high-frequency spatially-localized textures and low-frequency scale-robust color distortions, which are then handled by a dual-branch architecture tailored to their distinct characteristics. We further propose a learnable Frequency Composition Transform (FCT) module to adaptively fuse the frequency-specific outputs, enabling consistent and high-fidelity reconstruction. To better aggregate the spatial dependencies and the inter-channel complementary information, we introduce a Spatial-Aware Channel Attention (SA-CA) module that refines moiré-sensitive regions without incurring high computational cost. Extensive experiments on various demoiréing benchmarks demonstrate that Freqformer achieves state-of-the-art performance with a compact model size. The code is publicly available at this https URL.
zh

[CV-168] CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

【速读】:该论文旨在解决复杂图形设计场景中多条件控制不足的问题,即现有方法在处理由多种异构用户输入元素(如图像、布局和文本)共同指定的设计意图时,难以实现细粒度的子条件控制并保持整体构图和谐。其解决方案的关键在于提出CreatiDesign,该方案包含统一的多条件驱动架构与多模态注意力掩码机制,前者实现了对异构设计元素的灵活精准整合,后者确保每个条件精确控制指定图像区域并减少条件间的干扰。

链接: https://arxiv.org/abs/2505.19114
作者: Hui Zhang,Dexiang Hong,Maoke Yang,Yutao Chen,Zhao Zhang,Jie Shao,Xinglong Wu,Zuxuan Wu,Yu-Gang Jiang
机构: Fudan University (复旦大学); Bytedance Intelligent Creation (字节跳动智能创作)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (\eg images, layouts, and texts), which pose multi-condition control challenges for existing methods. Specifically, previous single-condition control models demonstrate effectiveness only within their specialized domains but fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.
zh

[CV-169] Remote Sensing Image Classification with Decoupled Knowledge Distillation

【速读】:该论文旨在解决现有遥感图像分类模型参数量过大导致在资源受限设备上部署困难的问题。其解决方案的关键在于采用G-GhostNet作为主干网络,通过特征复用减少冗余参数并显著提升推理效率,同时引入解耦知识蒸馏策略,将目标类与非目标类分离以有效提升分类精度。

链接: https://arxiv.org/abs/2505.19111
作者: Yaping He,Jianfeng Cai,Qicong Hu,Peiqing Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7

点击查看摘要

Abstract:To address the challenges posed by the large number of parameters in existing remote sensing image classification models, which hinder deployment on resource-constrained devices, this paper proposes a lightweight classification method based on knowledge distillation. Specifically, G-GhostNet is adopted as the backbone network, leveraging feature reuse to reduce redundant parameters and significantly improve inference efficiency. In addition, a decoupled knowledge distillation strategy is employed, which separates target and non-target classes to effectively enhance classification accuracy. Experimental results on the RSOD and AID datasets demonstrate that, compared with the high-parameter VGG-16 model, the proposed method achieves nearly equivalent Top-1 accuracy while reducing the number of parameters by 6.24 times. This approach strikes an excellent balance between model size and classification performance, offering an efficient solution for deployment on resource-limited devices.
zh

[CV-170] An Interpretable Representation Learning Approach for Diffusion Tensor Imaging

【速读】:该论文旨在解决扩散张量成像(Diffusion Tensor Imaging, DTI)轨迹图在深度学习模型中有效表示和解释的挑战。其解决方案的关键在于提出一种新颖的2D表示方法,将轨迹级别的部分各向异性(fractional anisotropy, FA)值编码为9x9灰度图像,并通过带有空间广播解码器的Beta-Total Correlation变分自编码器学习一个解耦且可解释的潜在嵌入。

链接: https://arxiv.org/abs/2505.19110
作者: Vishwa Mohan Singh,Alberto Gaston Villagran Asiares,Luisa Sophie Schuhmacher,Kate Rendall,Simon Weißbrod,David Rügamer,Inga Körte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at MIDL 2025

点击查看摘要

Abstract:Diffusion Tensor Imaging (DTI) tractography offers detailed insights into the structural connectivity of the brain, but presents challenges in effective representation and interpretation in deep learning models. In this work, we propose a novel 2D representation of DTI tractography that encodes tract-level fractional anisotropy (FA) values into a 9x9 grayscale image. This representation is processed through a Beta-Total Correlation Variational Autoencoder with a Spatial Broadcast Decoder to learn a disentangled and interpretable latent embedding. We evaluate the quality of this embedding using supervised and unsupervised representation learning strategies, including auxiliary classification, triplet loss, and SimCLR-based contrastive learning. Compared to the 1D Group deep neural network (DNN) baselines, our approach improves the F1 score in a downstream sex classification task by 15.74% and shows a better disentanglement than the 3D representation.
zh

[CV-171] SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards

【速读】:该论文试图解决在视觉问答(VQA)任务中,直接应用类似文本领域中的生成式 AI (Generative AI) 的强化学习(RL)方法所面临的两个关键问题:一是扩展的推理链会分散对任务关键区域的视觉关注,导致答案准确性下降;二是不可验证的中间步骤会增加策略梯度方差和计算成本。解决方案的关键在于提出 SATORI(Spatially Anchored Task Optimization with Reinforcement Learning),该方法将 VQA 任务分解为三个可验证的阶段,包括全局图像描述、区域定位和答案预测,每个阶段均提供明确的奖励信号,从而提升模型的准确性和稳定性。

链接: https://arxiv.org/abs/2505.19094
作者: Chuming Shen,Wei Wei,Xiaoye Qu,Yu Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, in the multimodal domain, works have begun to directly apply RL to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks share an intrinsically different nature from textual tasks, which heavily rely on the understanding of the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational costs overhead. To address these issues, in this paper, we introduce SATORI ( \textbfSpatially \textbfAnchored \textbfTask \textbfOptimization with \textbfRe\textbfInforcement Learning), which decomposes VQA into three verifiable stages, including global image captioning, region localization, and answer prediction, each supplying explicit reward signals. Furthermore, we also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding-boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to 15.7% improvement in accuracy in accuracy compared to the R1-like baseline. Our analysis of the attention map confirms enhanced focus on critical regions, which brings improvements in accuracy. Our code is available at this https URL.
zh

[CV-172] Plug-and-Play Context Feature Reuse for Efficient Masked Generation

【速读】:该论文旨在解决掩码生成模型(Masked Generative Models, MGMs)在图像合成任务中因需要大量迭代解码步骤而导致的高推理成本问题。其核心挑战在于,虽然通过每步解码更多标记可以减少总步骤数以加速生成,但会导致模型无法捕捉标记间的依赖关系,从而显著降低生成质量。该论文提出的解决方案的关键是引入ReCAP(Reused Context-Aware Prediction)模块,该模块通过复用之前解码上下文标记的特征嵌入来构建低成本的解码步骤,从而在保持细粒度、迭代生成优势的同时大幅降低计算量。

链接: https://arxiv.org/abs/2505.19089
作者: Xuejie Liu,Anji Liu,Guy Van den Broeck,Yitao Liang
机构: Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院); Computer Science Department, University of California, Los Angeles (加利福尼亚大学洛杉矶分校); School of Intelligence Science and Technology, Peking University (北京大学智能科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Masked generative models (MGMs) have emerged as a powerful framework for image synthesis, combining parallel decoding with strong bidirectional context modeling. However, generating high-quality samples typically requires many iterative decoding steps, resulting in high inference costs. A straightforward way to speed up generation is by decoding more tokens in each step, thereby reducing the total number of steps. However, when many tokens are decoded simultaneously, the model can only estimate the univariate marginal distributions independently, failing to capture the dependency among them. As a result, reducing the number of steps significantly compromises generation fidelity. In this work, we introduce ReCAP (Reused Context-Aware Prediction), a plug-and-play module that accelerates inference in MGMs by constructing low-cost steps via reusing feature embeddings from previously decoded context tokens. ReCAP interleaves standard full evaluations with lightweight steps that cache and reuse context features, substantially reducing computation while preserving the benefits of fine-grained, iterative generation. We demonstrate its effectiveness on top of three representative MGMs (MaskGIT, MAGE, and MAR), including both discrete and continuous token spaces and covering diverse architectural designs. In particular, on ImageNet256 class-conditional generation, ReCAP achieves up to 2.4x faster inference than the base model with minimal performance drop, and consistently delivers better efficiency-fidelity trade-offs under various generation settings.
zh

[CV-173] Jodi: Unification of Visual Generation and Understanding via Joint Modeling

【速读】:该论文试图解决视觉生成与理解在传统机器学习中被当作独立任务处理的问题,而实际上它们是人类智能中紧密关联的两个方面。解决方案的关键在于提出Jodi框架,该框架通过联合建模图像域和多个标签域来统一视觉生成与理解。Jodi基于线性扩散Transformer和角色切换机制,能够执行三种特定任务:联合生成、可控生成和图像感知。

链接: https://arxiv.org/abs/2505.19084
作者: Yifeng Xu,Zhenliang He,Meina Kan,Shiguang Shan,Xilin Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code: this https URL

点击查看摘要

Abstract:Visual generation and understanding are two deeply interconnected aspects of human intelligence, yet they have been traditionally treated as separate tasks in machine learning. In this paper, we propose Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Specifically, Jodi is built upon a linear diffusion transformer along with a role switch mechanism, which enables it to perform three particular types of tasks: (1) joint generation, where the model simultaneously generates images and multiple labels; (2) controllable generation, where images are generated conditioned on any combination of labels; and (3) image perception, where multiple labels can be predicted at once from a given image. Furthermore, we present the Joint-1.6M dataset, which contains 200,000 high-quality images collected from public sources, automatic labels for 7 visual domains, and LLM-generated captions. Extensive experiments demonstrate that Jodi excels in both generation and understanding tasks and exhibits strong extensibility to a wider range of visual domains. Code is available at this https URL.
zh

[CV-174] owards Generalized Proactive Defense against Face Swappingwith Contour-Hybrid Watermark

【速读】:该论文旨在解决面部交换(face swapping)带来的隐私和安全问题,特别是针对生成式AI(Generative AI)技术发展下,真实与交换后面部之间差异变得细微所带来的检测难题。其解决方案的关键在于主动嵌入复杂的水印(watermark)以对抗未知的面部交换技术,通过在面部轮廓区域嵌入轮廓纹理和面部身份信息,实现渐进式的图像判定,从而提升水印的鲁棒性并保持图像质量的平衡。该方法无需在训练过程中依赖具体的交换技术或预先存储大规模信息,具有良好的泛化能力。

链接: https://arxiv.org/abs/2505.19081
作者: Ruiyang Xia,Dawei Zhou,Decheng Liu,Lin Yuan,Jie Li,Nannan Wang,Xinbo Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 11 figures, under review

点击查看摘要

Abstract:Face swapping, recognized as a privacy and security concern, has prompted considerable defensive research. With the advancements in AI-generated content, the discrepancies between the real and swapped faces have become nuanced. Considering the difficulty of forged traces detection, we shift the focus to the face swapping purpose and proactively embed elaborate watermarks against unknown face swapping techniques. Given that the constant purpose is to swap the original face identity while preserving the background, we concentrate on the regions surrounding the face to ensure robust watermark generation, while embedding the contour texture and face identity information to achieve progressive image determination. The watermark is located in the facial contour and contains hybrid messages, dubbed the contour-hybrid watermark (CMark). Our approach generalizes face swapping detection without requiring any swapping techniques during training and the storage of large-scale messages in advance. Experiments conducted across 8 face swapping techniques demonstrate the superiority of our approach compared with state-of-the-art passive and proactive detectors while achieving a favorable balance between the image quality and watermark robustness.
zh

[CV-175] ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

【速读】:该论文试图解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动图表理解任务中面临的挑战,尤其是由于视觉理解错误导致的推理过程难以修正的问题。解决方案的关键在于提出ChartSketcher,这是一种基于多模态反馈的逐步推理方法,通过Sketch-CoT技术使模型能够使用程序化绘图库将中间推理步骤直接标注到图表上,并将这些视觉注释迭代反馈至推理过程中,从而实现对推理的视觉支撑与逐步优化。

链接: https://arxiv.org/abs/2505.19076
作者: Muye Huang,Lingling Zhang,Jie Ma,Han Lai,Fangzhi Xu,Yifei Li,Wenjun Wu,Yaqiang Wu,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Lenovo Research (联想研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 9 figures

点击查看摘要

Abstract:Charts are high-density visualization carriers for complex data, serving as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) due to the need for precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, enabling MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library, iteratively feeding these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.
zh

[CV-176] MMP-2K: A Benchmark Multi-Labeled Macro Photography Image Quality Assessment Database ICIP2025

【速读】:该论文试图解决宏观摄影图像质量评估(Macro Photography Image Quality Assessment, MPIQA)数据不足的问题,这限制了MPIQA度量的发展。解决方案的关键在于构建一个大规模、多标签的MPIQA数据库MMP-2k,通过从三个公共图像网站收集的15,700张宏观摄影图像中采样2,000张图像,并进行实验室研究获取每张图像的17个质量评分及详细的失真程度、类型和位置报告,从而确保内容和质量的多样性。

链接: https://arxiv.org/abs/2505.19065
作者: Jiashuo Chang,Zhengyi Li,Jianxun Lou,Zhen Qiu,Hanhe Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE International Conference on Image Processing, IEEE ICIP 2025

点击查看摘要

Abstract:Macro photography (MP) is a specialized field of photography that captures objects at an extremely close range, revealing tiny details. Although an accurate macro photography image quality assessment (MPIQA) metric can benefit macro photograph capturing, which is vital in some domains such as scientific research and medical applications, the lack of MPIQA data limits the development of MPIQA metrics. To address this limitation, we conducted a large-scale MPIQA study. Specifically, to ensure diversity both in content and quality, we sampled 2,000 MP images from 15,700 MP images, collected from three public image websites. For each MP image, 17 (out of 21 after outlier removal) quality ratings and a detailed quality report of distortion magnitudes, types, and positions are gathered by a lab study. The images, quality ratings, and quality reports form our novel multi-labeled MPIQA database, MMP-2k. Experimental results showed that the state-of-the-art generic IQA metrics underperform on MP images. The database and supplementary materials are available at this https URL.
zh

[CV-177] raining-free Stylized Text-to-Image Generation with Fast Inference

【速读】:该论文旨在解决基于扩散模型的风格化图像生成方法中存在的时间消耗大、依赖文本反转或风格图像微调的问题,从而提升大规模扩散模型的实际应用性。其解决方案的关键在于提出了一种无需微调或额外优化的新型风格化图像生成方法——OmniPainter,该方法利用潜在一致性模型的自洽性特性,从参考风格图像中提取代表性的风格统计信息,并引入自注意力的范数混合机制,以查询与中间输出内容特征最相关的风格模式,从而确保生成结果与参考风格图像分布高度一致。

链接: https://arxiv.org/abs/2505.19063
作者: Xin Ma,Yaohui Wang,Xinyuan Chen,Tien-Tsin Wong,Cunjian Chen
机构: Monash University (莫纳什大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Although diffusion models exhibit impressive generative capabilities, existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a novel stylized image generation method leveraging a pre-trained large-scale diffusion model without requiring fine-tuning or any additional optimization, termed as OmniPainter. Specifically, we exploit the self-consistency property of latent consistency models to extract the representative style statistics from reference style images to guide the stylization process. Additionally, we then introduce the norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate output content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images. Our qualitative and quantitative experimental results demonstrate that the proposed method outperforms state-of-the-art approaches.
zh

[CV-178] Less is More: Efficient Point Cloud Reconstruction via Multi-Head Decoders

【速读】:该论文试图解决深度解码器架构在点云重建中并不总是能带来更好性能的问题,挑战了“更深的解码器结构必然提升性能”的常见假设。其关键解决方案是提出一种多头解码器架构(multi-head decoder architecture),通过从点云中不同子集的点独立重建完整形状,利用点云本身的冗余性,最终通过拼接各头的预测结果来增强输出的多样性和保真度。

链接: https://arxiv.org/abs/2505.19057
作者: Pedro Alonso,Tianrui Li,Chongshou Li
机构: Southwest Jiaotong University (西南交通大学); Ministry of Education (教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We challenge the common assumption that deeper decoder architectures always yield better performance in point cloud reconstruction. Our analysis reveals that, beyond a certain depth, increasing decoder complexity leads to overfitting and degraded generalization. Additionally, we propose a novel multi-head decoder architecture that exploits the inherent redundancy in point clouds by reconstructing complete shapes from multiple independent heads, each operating on a distinct subset of points. The final output is obtained by concatenating the predictions from all heads, enhancing both diversity and fidelity. Extensive experiments on ModelNet40 and ShapeNetPart demonstrate that our approach achieves consistent improvements across key metrics–including Chamfer Distance (CD), Hausdorff Distance (HD), Earth Mover’s Distance (EMD), and F1-score–outperforming standard single-head baselines. Our findings highlight that output diversity and architectural design can be more critical than depth alone for effective and efficient point cloud reconstruction.
zh

[CV-179] Disentangled Human Body Representation Based on Unsupervised Semantic-Aware Learning

【速读】:该论文试图解决现有3D人体表征方法在语义控制性和表征精度方面受限的问题,主要由于大量手动定义的人体约束复杂性以及缺乏监督数据。其解决方案的关键在于提出一种在无监督学习框架下具有可控细粒度语义和高重建精度的人体表征方法,通过设计全感知骨架分组解耦策略,学习人体几何语义测量与潜在编码之间的对应关系,从而实现对人体形状和姿态的可控调整。此外,引入基于模板的残差学习方案和部分感知解码器,进一步提升了模型在复杂人体形状和姿态空间中的学习能力。

链接: https://arxiv.org/abs/2505.19049
作者: Lu Wang,Xishuai Peng,S. Kevin Zhou
机构: Siemens Shanghai Medical Equipment Ltd.(西门子上海医疗设备有限公司); University of Science and Technology of China(中国科学技术大学); School of Biomedical Engineering(生物医学工程学院); Suzhou Institute for Advanced Research(苏州先进研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:In recent years, more and more attention has been paid to the learning of 3D human representation. However, the complexity of lots of hand-defined human body constraints and the absence of supervision data limit that the existing works controllably and accurately represent the human body in views of semantics and representation ability. In this paper, we propose a human body representation with controllable fine-grained semantics and high precison of reconstruction in an unsupervised learning framework. In particularly, we design a whole-aware skeleton-grouped disentangle strategy to learn a correspondence between geometric semantical measurement of body and latent codes, which facilitates the control of shape and posture of human body by modifying latent coding paramerers. With the help of skeleton-grouped whole-aware encoder and unsupervised disentanglement losses, our representation model is learned by an unsupervised manner. Besides, a based-template residual learning scheme is injected into the encoder to ease of learning human body latent parameter in complicated body shape and pose spaces. Because of the geometrically meaningful latent codes, it can be used in a wide range of applications, from human body pose transfer to bilinear latent code interpolation. Further more, a part-aware decoder is utlized to promote the learning of controllable fine-grained semantics. The experimental results on public 3D human datasets show that the method has the ability of precise reconstruction.
zh

[CV-180] Medical Large Vision Language Models with Multi-Image Visual Ability

【速读】:该论文旨在解决当前医疗视觉语言模型(Medical Vision-Language Models, VLMs)在处理多图像临床场景时能力不足的问题,特别是其在时间推理、跨模态分析等复杂视觉理解任务上的表现有限。解决方案的关键在于构建了Med-MIM指令数据集,包含83.2K个涵盖四种多图像视觉能力(时间理解、推理、比较、共指)的医疗多图像问答对,并基于此数据集对现有模型进行微调,从而得到两个专门优化于多图像分析的医疗VLMs:MIM-LLaVA-Med和Med-Mantis。此外,还开发了Med-MIM基准测试以全面评估模型的多图像理解能力。实验结果表明,所提出的数据集有效提升了VLMs在医疗领域的多图像理解性能。

链接: https://arxiv.org/abs/2505.19031
作者: Xikai Yang,Juzheng Miao,Yuchen Yuan,Jiaze Wang,Qi Dou,Jinpeng Li,Pheng-Ann Heng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single image based tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we develop the Med-MIM benchmark to comprehensively evaluate the medical multi-image understanding capabilities of LVLMs. We assess eight popular LVLMs, including our two models, on the Med-MIM benchmark. Experimental results show that both Med-Mantis and MIM-LLaVA-Med achieve superior performance on the held-in and held-out subsets of the Med-MIM benchmark, demonstrating that the Med-MIM instruction dataset effectively enhances LVLMs’ multi-image understanding capabilities in the medical domain.
zh

[CV-181] InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解带有设计驱动视觉元素(如图标、图示)的信息图表时存在的挑战,这些问题涉及视觉识别与推理能力的不足。现有视觉问答基准在评估MLLMs的这些能力方面存在局限性,因为缺乏配对的普通图表和基于视觉元素的问题。解决方案的关键在于引入InfoChartQA基准,该基准包含5,642对信息图表和普通图表,它们共享相同的数据但呈现方式不同,并设计了基于视觉元素的问题以捕捉其独特的视觉设计和传达意图,从而更全面地评估MLLMs在信息图表理解方面的性能。

链接: https://arxiv.org/abs/2505.19028
作者: Minzhi Lin,Tianchi Xie,Mengchen Liu,Yilin Ye,Changjian Chen,Shixia Liu
机构: Tsinghua University (清华大学); Meta (元); Hong Kong University of Science and Technology (香港科技大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question answering benchmarks fall short in evaluating these capabilities of MLLMs due to the lack of paired plain charts and visual-element-based questions. To bridge this gap, we introduce InfoChartQA, a benchmark for evaluating MLLMs on infographic chart understanding. It includes 5,642 pairs of infographic and plain charts, each sharing the same underlying data but differing in visual presentations. We further design visual-element-based questions to capture their unique visual designs and communicative intent. Evaluation of 20 MLLMs reveals a substantial performance decline on infographic charts, particularly for visual-element-based questions related to metaphors. The paired infographic and plain charts enable fine-grained error analysis and ablation studies, which highlight new opportunities for advancing MLLMs in infographic chart understanding. We release InfoChartQA at this https URL.
zh

[CV-182] A Smart Healthcare System for Monkeypox Skin Lesion Detection and Tracking

【速读】:该论文旨在解决猴痘(Monkeypox)诊断的迫切需求,特别是在全球爆发背景下,亟需可扩展、易获取且准确的诊断解决方案。其关键解决方案是开发了一个名为ITMAINN的智能AI驱动医疗系统,该系统利用先进的深度学习技术,通过皮肤病变图像检测猴痘。系统的核心包括:基于迁移学习在公开皮肤病变数据集上训练并评估多个预训练模型,其中MobileViT等模型在二分类和多分类任务中均表现出高准确率和F1分数;一个跨平台的智能手机应用,支持图像分析、症状跟踪及附近医疗中心推荐;以及一个实时监控仪表盘,供卫生部门追踪病例、分析症状趋势并采取主动干预措施。

链接: https://arxiv.org/abs/2505.19023
作者: Huda Alghoraibi,Nuha Alqurashi,Sarah Alotaibi,Renad Alkhudaydi,Bdoor Aldajani,Lubna Alqurashi,Jood Batweel,Maha A. Thafar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:Monkeypox is a viral disease characterized by distinctive skin lesions and has been reported in many countries. The recent global outbreak has emphasized the urgent need for scalable, accessible, and accurate diagnostic solutions to support public health responses. In this study, we developed ITMAINN, an intelligent, AI-driven healthcare system specifically designed to detect Monkeypox from skin lesion images using advanced deep learning techniques. Our system consists of three main components. First, we trained and evaluated several pretrained models using transfer learning on publicly available skin lesion datasets to identify the most effective models. For binary classification (Monkeypox vs. non-Monkeypox), the Vision Transformer, MobileViT, Transformer-in-Transformer, and VGG16 achieved the highest performance, each with an accuracy and F1-score of 97.8%. For multiclass classification, which contains images of patients with Monkeypox and five other classes (chickenpox, measles, hand-foot-mouth disease, cowpox, and healthy), ResNetViT and ViT Hybrid models achieved 92% accuracy, with F1 scores of 92.24% and 92.19%, respectively. The best-performing and most lightweight model, MobileViT, was deployed within the mobile application. The second component is a cross-platform smartphone application that enables users to detect Monkeypox through image analysis, track symptoms, and receive recommendations for nearby healthcare centers based on their location. The third component is a real-time monitoring dashboard designed for health authorities to support them in tracking cases, analyzing symptom trends, guiding public health interventions, and taking proactive measures. This system is fundamental in developing responsive healthcare infrastructure within smart cities. Our solution, ITMAINN, is part of revolutionizing public health management. Comments: 23 pages, 5 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG) Cite as: arXiv:2505.19023 [cs.CV] (or arXiv:2505.19023v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.19023 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-183] Rethinking Metrics and Benchmarks of Video Anomaly Detection

【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)评估协议中存在的三个关键问题:现有评估指标受单一标注偏差显著影响、当前指标未能奖励早期异常检测以及可用基准缺乏评估场景过拟合的能力。其解决方案的关键在于提出三种新的评估方法:首先,通过多轮标注计算平均AUC/AP指标以缓解单一标注偏差;其次,设计一种考虑时延的平均精度(Latency-aware Average Precision, LaAP)指标以鼓励早期且准确的异常检测;最后,引入两个针对场景过拟合评估的硬性正常基准(UCF-HN, MSAD-HN)。

链接: https://arxiv.org/abs/2505.19022
作者: Zihao Liu,Xiaoyu Wu,Wenna Li,Linlin Yang
机构: Communication University of China (中国传媒大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation protocols through comprehensive experimental analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting. To address these limitations, we propose three novel evaluation methods: first, we establish averaged AUC/AP metrics over multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development.
zh

[CV-184] WorldEval: World Model as Real-World Robot Policies Evaluator

【速读】:该论文旨在解决在真实世界场景中评估通用机器人操作策略的耗时与挑战问题,尤其是在任务数量增加和环境条件变化的情况下。其关键解决方案是提出Policy2Vec方法,将视频生成模型转化为遵循潜在动作的世界模拟器,从而生成准确反映机器人行为的策略视频,进而通过WorldEval自动化流水线实现对真实世界机器人策略的高效、可靠评估。

链接: https://arxiv.org/abs/2505.19017
作者: Yaxuan Li,Yichen Zhu,Junjie Wen,Chaomin Shen,Yi Xu
机构: Midea Group(美的集团); East China Normal University(华东师范大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: The project page is available at this https URL

点击查看摘要

Abstract:The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent action to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks various robot policies and individual checkpoints within a single policy, and functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and real-world scenarios. Furthermore, our method significantly outperforms popular methods such as real-to-sim approach.
zh

[CV-185] Can Multimodal Large Language Models Understand Spatial Relations?

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在空间关系推理任务中的不足,这些问题包括依赖边界框、忽略视角替换以及仅依靠模型先验知识即可回答问题而无需图像理解。解决方案的关键在于引入了一个由人类标注的空间关系推理基准SpatialMQA,该基准基于COCO2017数据集,通过精心设计的标注流程确保数据质量,从而促使MLLMs更专注于客观世界中的图像理解。

链接: https://arxiv.org/abs/2505.19015
作者: Jingping Liu,Ziyan Liu,Zhedong Cen,Yan Zhou,Yinan Zou,Weiyan Zhang,Haiyun Jiang,Tong Ruan
机构: East China University of Science and Technology (华东理工大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 13 pages, 19 figures

点击查看摘要

Abstract:Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model’s prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting the future research directions. The benchmark and codes are available at this https URL.
zh

[CV-186] VPGS-SLAM: Voxel-based Progressive 3D Gaussian SLAM in Large-Scale Scenes

【速读】:该论文旨在解决现有基于3D Gaussian Splatting (3DGS) 的SLAM方法在大规模场景和长序列中面临内存爆炸和场景表示不准确的问题。其解决方案的关键在于提出一种基于体素的渐进式3D高斯映射方法,通过多子地图实现紧凑且精确的场景表示,从而支持任意场景的扩展并提高鲁棒性;同时结合2D-3D融合的相机跟踪方法和2D-3D高斯回环闭合方法,以提升跟踪精度和消除位姿漂移,最终通过在线蒸馏的子地图融合方法实现大规模场景下的全局一致性。

链接: https://arxiv.org/abs/2505.18992
作者: Tianchen Deng,Wenhua Wu,Junjie He,Yue Pan,Xirui Jiang,Shenghai Yuan,Danwei Wang,Hesheng Wang,Weidong Chen
机构: Shanghai Jiao Tong University (上海交通大学); HKUST (香港科技大学); University of Bonn (波恩大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting has recently shown promising results in dense visual SLAM. However, existing 3DGS-based SLAM methods are all constrained to small-room scenarios and struggle with memory explosion in large-scale scenes and long sequences. To this end, we propose VPGS-SLAM, the first 3DGS-based large-scale RGBD SLAM framework for both indoor and outdoor scenarios. We design a novel voxel-based progressive 3D Gaussian mapping method with multiple submaps for compact and accurate scene representation in large-scale and long-sequence scenes. This allows us to scale up to arbitrary scenes and improves robustness (even under pose drifts). In addition, we propose a 2D-3D fusion camera tracking method to achieve robust and accurate camera tracking in both indoor and outdoor large-scale scenes. Furthermore, we design a 2D-3D Gaussian loop closure method to eliminate pose drift. We further propose a submap fusion method with online distillation to achieve global consistency in large-scale scenes when detecting a loop. Experiments on various indoor and outdoor datasets demonstrate the superiority and generalizability of the proposed framework. The code will be open source on this https URL.
zh

[CV-187] Kernel Space Diffusion Model for Efficient Remote Sensing Pansharpening

【速读】:该论文旨在解决遥感图像中全色锐化(pansharpening)任务中存在的全局先验信息捕捉不足以及基于扩散模型的推理延迟问题。其解决方案的关键在于提出一种名为核空间扩散模型(Kernel Space Diffusion Model, KSDiff)的新方法,该方法通过在潜在空间中进行扩散过程生成富含全局上下文信息的卷积核,从而在提升全色锐化质量的同时实现更快的推理速度。KSDiff通过低秩核心张量生成器与统一因子生成器的集成,并结合结构感知的多头注意力机制,实现了对核空间的有效建模。

链接: https://arxiv.org/abs/2505.18991
作者: Hancong Jin,Zihan Cao,Liangjian Deng
机构: University of Electronic Science and Technology of China (中国电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pansharpening is a fundamental task in remote sensing that integrates high-resolution panchromatic imagery (PAN) with low-resolution multispectral imagery (LRMS) to produce an enhanced image with both high spatial and spectral resolution. Despite significant progress in deep learning-based approaches, existing methods often fail to capture the global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from significant inference latency, which limits their practical applicability. In this work, we propose the Kernel Space Diffusion Model (KSDiff), a novel approach that leverages diffusion processes in a latent space to generate convolutional kernels enriched with global contextual information, thereby improving pansharpening quality while enabling faster inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, enabling KSDiff to serve as a framework for enhancing existing pansharpening architectures. Experiments on three widely used datasets, including WorldView-3, GaoFen-2, and QuickBird, demonstrate the superior performance of KSDiff both qualitatively and quantitatively. Code will be released upon possible acceptance.
zh

[CV-188] SPARS: Self-Play Adversarial Reinforcement Learning for Segmentation of Liver Tumours

【速读】:该论文旨在解决癌症肿瘤分割中依赖大量成本高且主观性强的3D体素级标注数据的问题,从而影响模型的泛化能力。其解决方案的关键在于提出一种名为SPARS(Self-Play Adversarial Reinforcement Learning for Segmentation)的弱监督语义分割框架,该框架利用少量图像级二分类癌症存在标签训练对象存在分类器,以在CT扫描中定位癌变区域,从而实现更客观的癌症定位。

链接: https://arxiv.org/abs/2505.18989
作者: Catalina Tan,Yipeng Hu,Shaheer U. Saeed
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Medical Image Understanding and Analysis (MIUA) 2025

点击查看摘要

Abstract:Accurate tumour segmentation is vital for various targeted diagnostic and therapeutic procedures for cancer, e.g., planning biopsies or tumour ablations. Manual delineation is extremely labour-intensive, requiring substantial expert time. Fully-supervised machine learning models aim to automate such localisation tasks, but require a large number of costly and often subjective 3D voxel-level labels for training. The high-variance and subjectivity in such labels impacts model generalisability, even when large datasets are available. Histopathology labels may offer more objective labels but the infeasibility of acquiring pixel-level annotations to develop tumour localisation methods based on histology remains challenging in-vivo. In this work, we propose a novel weakly-supervised semantic segmentation framework called SPARS (Self-Play Adversarial Reinforcement Learning for Segmentation), which utilises an object presence classifier, trained on a small number of image-level binary cancer presence labels, to localise cancerous regions on CT scans. Such binary labels of patient-level cancer presence can be sourced more feasibly from biopsies and histopathology reports, enabling a more objective cancer localisation on medical images. Evaluating with real patient data, we observed that SPARS yielded a mean dice score of 77.3 \pm 9.4 , which outperformed other weakly-supervised methods by large margins. This performance was comparable with recent fully-supervised methods that require voxel-level annotations. Our results demonstrate the potential of using SPARS to reduce the need for extensive human-annotated labels to detect cancer in real-world healthcare settings.
zh

[CV-189] NTIRE 2025 Challenge on Video Quality Enhancement for Video Conferencing: Datasets Methods and Results

【速读】:该论文旨在解决视频会议场景中的视频质量增强(Video Quality Enhancement, VQE)问题,具体包括改善光照、增强色彩、降低噪声以及提升清晰度,以实现专业演播室般的视觉效果。解决方案的关键在于利用提供的可微分视频质量评估(Differentiable Video Quality Assessment, VQA)模型,结合训练和测试视频数据,优化视频处理算法以达到提升视频质量的目标。

链接: https://arxiv.org/abs/2505.18988
作者: Varun Jain,Zongwei Wu,Quan Zou,Louis Florentin,Henrik Turbell,Sandeep Siddhartha,Radu Timofte,others
机构: Microsoft(微软); University of Würzburg(维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive review of the 1st Challenge on Video Quality Enhancement for Video Conferencing held at the NTIRE workshop at CVPR 2025, and highlights the problem statement, datasets, proposed solutions, and results. The aim of this challenge was to design a Video Quality Enhancement (VQE) model to enhance video quality in video conferencing scenarios by (a) improving lighting, (b) enhancing colors, © reducing noise, and (d) enhancing sharpness - giving a professional studio-like effect. Participants were given a differentiable Video Quality Assessment (VQA) model, training, and test videos. A total of 91 participants registered for the challenge. We received 10 valid submissions that were evaluated in a crowdsourced framework.
zh

[CV-190] VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

【速读】:该论文旨在解决开放世界环境下未见过物体的检测问题,即在无需人类提供类别输入的情况下,识别和定位新出现的物体。现有方法在开放集感知(open-set perception)中依赖人工定义的类别输入,而在开放末端感知(open-ended perception)中则面临性能不足的问题。论文提出的解决方案关键在于构建VL-SAM-V2框架,通过融合开放集与开放末端模型的查询,并引入通用与特定查询融合模块,使不同查询能够相互作用,从而提升模型对未见过物体的发现能力。此外,通过引入排序可学习查询和去噪点训练策略,进一步增强了模型的多样性和训练效果。

链接: https://arxiv.org/abs/2505.18986
作者: Zhiwei Lin,Yongtao Wang
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable performance. To achieve this, we combine queries from open-set and open-ended models and propose a general and specific query fusion module to allow different queries to interact. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. In addition, to learn more diverse queries, we introduce ranked learnable queries to match queries with proposals from open-ended models by sorting. Moreover, we design a denoising point training strategy to facilitate the training process. Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.
zh

[CV-191] AmorLIP: Efficient Language-Image Pretraining via Amortization

【速读】:该论文旨在解决现有对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)方法在训练过程中需要极大批次大小以实现鲁棒表示学习所带来的计算资源消耗过高和可扩展性受限的问题。其解决方案的关键在于提出AmorLIP框架,通过轻量级神经网络对对比学习中的高成本计算进行摊销(amortization),从而显著提升训练效率和性能,同时保持或提升下游任务的零样本分类与检索能力。

链接: https://arxiv.org/abs/2505.18983
作者: Haotian Sun,Yitong Li,Yuchen Zhuang,Niao He,Hanjun Dai,Bo Dai
机构: Georgia Institute of Technology (佐治亚理工学院); Swiss Federal Institute of Technology (瑞士联邦理工学院); Precur.ai (Precur.ai)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.
zh

[CV-192] MGD3: Mode-Guided Dataset Distillation using Diffusion Models

【速读】:该论文试图解决数据集蒸馏中样本多样性不足的问题,现有方法依赖于通过蒸馏损失进行模型微调以促进多样性,但无法保证样本的多样性,从而限制了性能。其解决方案的关键在于提出一种基于模式引导的扩散模型,利用预训练扩散模型而不需通过蒸馏损失进行微调,通过三个阶段——模式发现、模式引导和停止引导——来提升数据集的多样性并减少合成样本中的伪影,从而显著降低计算成本并提升性能。

链接: https://arxiv.org/abs/2505.18963
作者: Jeffrey A. Chan-Santiago,Praveen Tirupattur,Gaurav Kumar Nayak,Gaowen Liu,Mubarak Shah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model leveraging a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. Our approach outperforms state-of-the-art methods, achieving accuracy gains of 4.4%, 2.9%, 1.6%, and 1.6% on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, respectively. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs. Our code is available on the project webpage: this https URL
zh

[CV-193] CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation

【速读】:该论文旨在解决医学影像分割中因数据集部分标注不全而导致模型难以学习跨数据集共享解剖结构表示的问题,以及视觉单一框架在捕捉复杂解剖关系和任务特定差异方面的不足。其解决方案的关键在于提出了一种结合自监督视觉变压器、基于CLIP的文本嵌入和任务特定文本提示的分割网络(CDPDNet),通过融合细粒度与全局视觉特征,并利用文本嵌入增强器官与肿瘤间复杂关系的建模能力,同时引入文本任务提示生成模块以提升任务间的判别性。

链接: https://arxiv.org/abs/2505.18958
作者: Jiong Wu,Yang Xing,Boxiao Yu,Wei Shao,Kuang Gong
机构: University of Florida(佛罗里达大学); J. Crayton Pruitt Family Department of Biomedical Engineering(佛罗里达大学生物医学工程系); Department of Medicine(佛罗里达大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most publicly available medical segmentation datasets are only partially labeled, with annotations provided for a subset of anatomical structures. When multiple datasets are combined for training, this incomplete annotation poses challenges, as it limits the model’s ability to learn shared anatomical representations among datasets. Furthermore, vision-only frameworks often fail to capture complex anatomical relationships and task-specific distinctions, leading to reduced segmentation accuracy and poor generalizability to unseen datasets. In this study, we proposed a novel CLIP-DINO Prompt-Driven Segmentation Network (CDPDNet), which combined a self-supervised vision transformer with CLIP-based text embedding and introduced task-specific text prompts to tackle these challenges. Specifically, the framework was constructed upon a convolutional neural network (CNN) and incorporated DINOv2 to extract both fine-grained and global visual features, which were then fused using a multi-head cross-attention module to overcome the limited long-range modeling capability of CNNs. In addition, CLIP-derived text embeddings were projected into the visual space to help model complex relationships among organs and tumors. To further address the partial label challenge and enhance inter-task discriminative capability, a Text-based Task Prompt Generation (TTPG) module that generated task-specific prompts was designed to guide the segmentation. Extensive experiments on multiple medical imaging datasets demonstrated that CDPDNet consistently outperformed existing state-of-the-art segmentation methods. Code and pretrained model are available at: this https URL.
zh

[CV-194] How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation ICML

【速读】:该论文旨在解决LiDAR-based 3D panoptic segmentation中由于LiDAR数据固有的稀疏性导致的远距离或小目标识别困难问题。其关键解决方案是提出一种名为Image-Assists-LiDAR (IAL)的多模态3D全景分割框架,该框架通过引入模态同步的数据增强策略PieAug、基于Transformer的解码器、几何引导的Token融合模块(GTF)以及基于先验的查询生成模块(PQG),有效提升了多模态特征融合与实例分割的准确性。

链接: https://arxiv.org/abs/2505.18956
作者: Yining Pan,Qiongjie Cui,Xulei Yang,Na Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted at the 2025 International Conference on Machine Learning (ICML)

点击查看摘要

Abstract:LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder’s ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at this https URL.
zh

[CV-195] OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model

【速读】:该论文旨在解决开放世界中真实三维手-物体交互(3D HOI)的泛化问题,传统方法在封闭集物体和预定义任务上表现良好,但难以处理未见过的物体或开放式语言指令。其解决方案的关键在于提出OpenHOI框架,该框架集成了一种针对协同属性定位与语义任务分解微调的三维多模态大语言模型(3D MLLM),并结合基于属性驱动的扩散模型与无训练物理优化阶段,以生成符合物理规律且能响应自由形式语言命令的长时序操作序列。

链接: https://arxiv.org/abs/2505.18947
作者: Zhenhao Zhang,Ye Shi,Lingxiao Yang,Suting Ni,Qi Ye,Jingya Wang
机构: ShanghaiTech University (上海科技大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding and synthesizing realistic 3D hand-object interactions (HOI) is critical for applications ranging from immersive AR/VR to dexterous robotics. Existing methods struggle with generalization, performing well on closed-set objects and predefined tasks but failing to handle unseen objects or open-vocabulary instructions. We introduce OpenHOI, the first framework for open-world HOI synthesis, capable of generating long-horizon manipulation sequences for novel objects guided by free-form language commands. Our approach integrates a 3D Multimodal Large Language Model (MLLM) fine-tuned for joint affordance grounding and semantic task decomposition, enabling precise localization of interaction regions (e.g., handles, buttons) and breakdown of complex instructions (e.g., “Find a water bottle and take a sip”) into executable sub-tasks. To synthesize physically plausible interactions, we propose an affordance-driven diffusion model paired with a training-free physics refinement stage that minimizes penetration and optimizes affordance alignment. Evaluations across diverse scenarios demonstrate OpenHOI’s superiority over state-of-the-art methods in generalizing to novel object categories, multi-stage tasks, and complex language instructions. Our project page at \hrefthis https URL
zh

[CV-196] Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back

【速读】:该论文旨在解决现代端到端自动驾驶系统中规划模块缺乏时间一致性机制的问题,即预测轨迹与动态场景演变之间无法保持同步,导致早期预测误差随时间累积并引发灾难性后果。其解决方案的关键在于提出了一种自校正框架——Echo Planning,通过建立当前-未来-当前(CFC)的闭环循环,使轨迹预测与场景一致性相协调。该方法的核心思想是:合理的未来轨迹必须具备双向一致性,即不仅能够从当前观测生成,还能够重构当前观测。通过在BEV场景表示上进行未来轨迹预测并逆向映射回当前状态,利用循环损失强制原始与重构BEV表示的一致性,从而内在地惩罚物理上不可行或对齐错误的轨迹。

链接: https://arxiv.org/abs/2505.18945
作者: Jintao Sun,Hu Zhang,Gangyi Ding,Zhedong Zheng
机构: Beijing Institute of Technology (北京理工大学); CSIRO DATA61 (澳大利亚联邦科学与工业研究组织数据61); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Modern end-to-end autonomous driving systems suffer from a critical limitation: their planners lack mechanisms to enforce temporal consistency between predicted trajectories and evolving scene dynamics. This absence of self-supervision allows early prediction errors to compound catastrophically over time. We introduce Echo Planning, a novel self-correcting framework that establishes a closed-loop Current - Future - Current (CFC) cycle to harmonize trajectory prediction with scene coherence. Our key insight is that plausible future trajectories must be bi-directionally consistent, ie, not only generated from current observations but also capable of reconstructing them. The CFC mechanism first predicts future trajectories from the Bird’s-Eye-View (BEV) scene representation, then inversely maps these trajectories back to estimate the current BEV state. By enforcing consistency between the original and reconstructed BEV representations through a cycle loss, the framework intrinsically penalizes physically implausible or misaligned trajectories. Experiments on nuScenes demonstrate state-of-the-art performance, reducing L2 error by 0.04 m and collision rate by 0.12% compared to one-shot planners. Crucially, our method requires no additional supervision, leveraging the CFC cycle as an inductive bias for robust planning. This work offers a deployable solution for safety-critical autonomous systems.
zh

[CV-197] Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency CVPR2025

【速读】:该论文试图解决在线视频新视角合成中的视图一致性和时间一致性问题,传统方法虽然能够从密集多视角相机设置中生成高质量结果,但计算资源需求大;而选择性输入方法虽降低了成本,但常导致多视角和时间不一致的问题,如闪烁伪影。解决方案的关键在于利用全局几何信息引导基于图像的渲染流程,通过时间上渐进式地优化深度图,并结合截断有符号距离场在合成视角的图像空间中累积深度信息,从而获得具有视图和时间一致性的深度表示,进而指导预训练融合网络融合多个前向渲染的输入视角图像,实现高效且高质量的视频合成。

链接: https://arxiv.org/abs/2505.18932
作者: Hyunho Ha,Lei Xiao,Christian Richardt,Thu Nguyen-Phuoc,Changil Kim,Min H. Kim,Douglas Lanman,Numair Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025. Project website: this https URL

点击查看摘要

Abstract:We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields in the synthesized view’s image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.
zh

[CV-198] WeedNet: A Foundation Model-Based Global-to-Local AI Approach for Real-Time Weed Species Identification and Classification

【速读】:该论文试图解决 weeds(杂草)早期识别的难题,以实现有效的管理和控制,但传统方法在专家验证数据有限、形态特征复杂多变等方面存在挑战。其解决方案的关键在于提出 WeedNet,这是一个基于自监督学习、微调和增强可信度策略的端到端实时杂草识别系统,能够在全球范围内识别大量杂草物种,包括有害和入侵性植物,并通过全局到局部(Global-to-Local)策略实现区域特异性适应,从而显著提升了模型的准确性和泛化能力。

链接: https://arxiv.org/abs/2505.18930
作者: Yanben Shen,Timilehin T. Ayanlade,Venkata Naresh Boddepalli,Mojdeh Saadati,Ashlyn Rairdin,Zi K. Deng,Muhammad Arbab Arshad,Aditya Balu,Daren Mueller,Asheesh K Singh,Wesley Everman,Nirav Merchant,Baskar Ganapathysubramanian,Meaghan Anderson,Soumik Sarkar,Arti Singh
机构: Iowa State University (爱荷华州立大学); University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early identification of weeds is essential for effective management and control, and there is growing interest in automating the process using computer vision techniques coupled with AI methods. However, challenges associated with training AI-based weed identification models, such as limited expert-verified data and complexity and variability in morphological features, have hindered progress. To address these issues, we present WeedNet, the first global-scale weed identification model capable of recognizing an extensive set of weed species, including noxious and invasive plant species. WeedNet is an end-to-end real-time weed identification pipeline and uses self-supervised learning, fine-tuning, and enhanced trustworthiness strategies. WeedNet achieved 91.02% accuracy across 1,593 weed species, with 41% species achieving 100% accuracy. Using a fine-tuning strategy and a Global-to-Local approach, the local Iowa WeedNet model achieved an overall accuracy of 97.38% for 85 Iowa weeds, most classes exceeded a 90% mean accuracy per class. Testing across intra-species dissimilarity (developmental stages) and inter-species similarity (look-alike species) suggests that diversity in the images collected, spanning all the growth stages and distinguishable plant characteristics, is crucial in driving model performance. The generalizability and adaptability of the Global WeedNet model enable it to function as a foundational model, with the Global-to-Local strategy allowing fine-tuning for region-specific weed communities. Additional validation of drone- and ground-rover-based images highlights the potential of WeedNet for integration into robotic platforms. Furthermore, integration with AI for conversational use provides intelligent agricultural and ecological conservation consulting tools for farmers, agronomists, researchers, land managers, and government agencies across diverse landscapes.
zh

[CV-199] Words as Geometric Features: Estimating Homography using Optical Character Recognition as Compressed Image Representation

【速读】:该论文旨在解决文档对齐与配准问题,特别是在无法获取原始文档图像的情况下,如何实现高效的文档对齐。传统方法依赖于基于图像的特征(如关键点、边缘和纹理)来估计几何变换(如单应性变换),但这些方法通常需要访问原始图像数据,而这一需求可能因隐私、存储或传输限制而无法满足。论文提出的解决方案的关键在于利用光学字符识别(OCR)输出作为特征进行单应性估计,通过结合OCR检测到的文本的空间位置和内容信息,实现无需像素级图像数据的文档对齐,同时具备对OCR噪声的鲁棒性,通过RANSAC处理异常值和不准确性。

链接: https://arxiv.org/abs/2505.18925
作者: Ross Greer,Alisha Ukani,Katherine Izhikevich,Earlence Fernandes,Stefan Savage,Alex C. Snoeren
机构: University of California, San Diego(加州大学圣地亚哥分校); University of California, Merced(加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. However, these approaches often require access to the original document images, which may not always be available due to privacy, storage, or transmission constraints. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation. By utilizing the spatial positions and textual content of OCR-detected words, our method enables document alignment without relying on pixel-level image data. This technique is particularly valuable in scenarios where only OCR outputs are accessible. Furthermore, the method is robust to OCR noise, incorporating RANSAC to handle outliers and inaccuracies in the OCR data. On a set of test documents, we demonstrate that our OCR-based approach even performs more accurately than traditional image-based methods, offering a more efficient and scalable solution for document registration tasks. The proposed method facilitates applications in document processing, all while reducing reliance on high-dimensional image data.
zh

[CV-200] LLM -Guided Taxonomy and Hierarchical Uncertainty for 3D Point CLoud Active Learning

【速读】:该论文旨在解决3D点云语义分割中在极低标注预算下如何高效且准确地进行样本选择的问题。传统方法将标签视为平级且独立的,忽略了点云数据中固有的语义结构。其解决方案的关键在于首次将大语言模型(Large Language Models, LLMs)引入主动学习框架,通过LLM提示生成多层级语义分类体系,并结合递归不确定性投影机制,实现跨层次的不确定性传播,从而在保持语义结构的同时,选择具有空间多样性和标签感知性的样本。

链接: https://arxiv.org/abs/2505.18924
作者: Chenxi Li,Nuo Chen,Fengyun Tan,Yantong Chen,Bochun Yuan,Tianrui Li,Chongshou Li
机构: Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a novel active learning framework for 3D point cloud semantic segmentation that, for the first time, integrates large language models (LLMs) to construct hierarchical label structures and guide uncertainty-based sample selection. Unlike prior methods that treat labels as flat and independent, our approach leverages LLM prompting to automatically generate multi-level semantic taxonomies and introduces a recursive uncertainty projection mechanism that propagates uncertainty across hierarchy levels. This enables spatially diverse, label-aware point selection that respects the inherent semantic structure of 3D scenes. Experiments on S3DIS and ScanNet v2 show that our method achieves up to 4% mIoU improvement under extremely low annotation budgets (e.g., 0.02%), substantially outperforming existing baselines. Our results highlight the untapped potential of LLMs as knowledge priors in 3D vision and establish hierarchical uncertainty modeling as a powerful paradigm for efficient point cloud annotation.
zh

[CV-201] Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering NEURIPS2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 在3D临床诊断中的适用性问题,特别是针对腹部肿瘤的CT影像分析。现有视觉-语言模型(Vision-Language Models, VLMs)在2D视觉任务中表现良好,但在3D医学影像诊断中仍面临识别精度、推理能力和领域知识等方面的挑战。为系统评估这些能力,研究提出了DeepTumorVQA,一个面向腹部肿瘤CT扫描的诊断视觉问答(VQA)基准数据集。其关键解决方案在于通过大规模多模态预训练提升模型性能,并强调图像预处理和视觉模块设计对3D感知的重要性。

链接: https://arxiv.org/abs/2505.18915
作者: Yixiong Chen,Wenjie Xiao,Pedro R. A. S. Bassi,Xinze Zhou,Sezgin Er,Ibrahim Ethem Hamamci,Zongwei Zhou,Alan Yuille
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 datasetsbenchmarks track submission

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown promise in various 2D visual tasks, yet their readiness for 3D clinical diagnosis remains unclear due to stringent demands for recognition precision, reasoning ability, and domain knowledge. To systematically evaluate these dimensions, we present DeepTumorVQA, a diagnostic visual question answering (VQA) benchmark targeting abdominal tumors in CT scans. It comprises 9,262 CT volumes (3.7M slices) from 17 public datasets, with 395K expert-level questions spanning four categories: Recognition, Measurement, Visual Reasoning, and Medical Reasoning. DeepTumorVQA introduces unique challenges, including small tumor detection and clinical reasoning across 3D anatomy. Benchmarking four advanced VLMs (RadFM, M3D, Merlin, CT-CHAT), we find current models perform adequately on measurement tasks but struggle with lesion recognition and reasoning, and are still not meeting clinical needs. Two key insights emerge: (1) large-scale multimodal pretraining plays a crucial role in DeepTumorVQA testing performance, making RadFM stand out among all VLMs. (2) Our dataset exposes critical differences in VLM components, where proper image preprocessing and design of vision modules significantly affect 3D perception. To facilitate medical multimodal research, we have released DeepTumorVQA as a rigorous benchmark: this https URL.
zh

[CV-202] Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos

【速读】:该论文试图解决在专家演示与学习者环境之间存在领域偏移(domain shifts)时,基于视频的模仿学习(imitation learning)失效的问题,例如光照、颜色或纹理的差异。解决方案的关键在于重新思考感知表示本身,通过引入受生物视觉系统启发的事件感知(event-inspired perception)方法,将标准RGB视频转换为稀疏的、基于事件的表示,该表示编码时间强度梯度并摒弃静态外观特征,从而将运动动力学与视觉风格解耦,实现对视觉干扰的鲁棒性模仿。

链接: https://arxiv.org/abs/2505.18899
作者: Andrea Ramazzina,Vittorio Giammarino,Matteo El-Hariry,Mario Bijelic
机构: Mercedes-Benz AG (梅赛德斯-奔驰集团); Technical University of Munich (慕尼黑工业大学); Purdue University (普渡大学); Space Robotics Research Group (空间机器人研究组); Torc Robotics (托克机器人公司); Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Imitation from videos often fails when expert demonstrations and learner environments exhibit domain shifts, such as discrepancies in lighting, color, or texture. While visual randomization partially addresses this problem by augmenting training data, it remains computationally intensive and inherently reactive, struggling with unseen scenarios. We propose a different approach: instead of randomizing appearances, we eliminate their influence entirely by rethinking the sensory representation itself. Inspired by biological vision systems that prioritize temporal transients (e.g., retinal ganglion cells) and by recent sensor advancements, we introduce event-inspired perception for visually robust imitation. Our method converts standard RGB videos into a sparse, event-based representation that encodes temporal intensity gradients, discarding static appearance features. This biologically grounded approach disentangles motion dynamics from visual style, enabling robust visual imitation from observations even in the presence of visual mismatches between expert and agent environments. By training policies on event streams, we achieve invariance to appearance-based distractors without requiring computationally expensive and environment-specific data augmentation techniques. Experiments across the DeepMind Control Suite and the Adroit platform for dynamic dexterous manipulation show the efficacy of our method. Our code is publicly available at Eb-LAIfO.
zh

[CV-203] LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders

【速读】:该论文试图解决视觉编码器在对抗扰动下的鲁棒性问题,特别是现有监督和无监督对抗微调策略在早期微调阶段的不稳定性以及鲁棒性与干净数据准确性的权衡问题。解决方案的关键在于提出一种新的无监督对抗微调框架——拉格朗日优化鲁棒嵌入(Lagrangian-Optimized Robust Embeddings, LORE),通过约束优化方法平衡提升鲁棒性与保持原始性能之间的矛盾,并通过嵌入空间接近性约束有效维持干净数据的性能。

链接: https://arxiv.org/abs/2505.18884
作者: Borna Khodabandeh,Amirabbas Afzali,Amirhossein Afsharrad,Seyed Shahabeddin Mousavi,Sanjay Lall,Sajjad Amini,Seyed-Mohsen Moosavi-Dezfooli
机构: Stanford University (斯坦福大学); Aktus AI (Aktus AI); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Apple (苹果)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Visual encoders have become fundamental components in modern computer vision pipelines. However, ensuring robustness against adversarial perturbations remains a critical challenge. Recent efforts have explored both supervised and unsupervised adversarial fine-tuning strategies. We identify two key limitations in these approaches: (i) they often suffer from instability, especially during the early stages of fine-tuning, resulting in suboptimal convergence and degraded performance on clean data, and (ii) they exhibit a suboptimal trade-off between robustness and clean data accuracy, hindering the simultaneous optimization of both objectives. To overcome these challenges, we propose Lagrangian-Optimized Robust Embeddings (LORE), a novel unsupervised adversarial fine-tuning framework. LORE utilizes constrained optimization, which offers a principled approach to balancing competing goals, such as improving robustness while preserving nominal performance. By enforcing embedding-space proximity constraints, LORE effectively maintains clean data performance throughout adversarial fine-tuning. Extensive experiments show that LORE significantly improves zero-shot adversarial robustness with minimal degradation in clean data accuracy. Furthermore, we demonstrate the effectiveness of the adversarially fine-tuned CLIP image encoder in out-of-distribution generalization and enhancing the interpretability of image embeddings.
zh

[CV-204] SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes

【速读】:该论文旨在解决开放词汇目标导航(open-vocabulary object navigation)在动态场景中的训练与评估问题,传统数据集通常局限于静态环境,难以满足复杂现实场景下导航代理的训练需求。其解决方案的关键在于提出一种语义感知的数据集与基准生成流水线(Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes, SD-OVON),利用预训练多模态基础模型生成符合真实语义和日常常识的无限独特逼真场景变体,并配套生成兼容Habitat模拟器的目标导航任务实例。该方法通过引入动态场景和可操作物体,提升了导航任务的真实性和复杂性,为实现实时仿真与仿真到现实的机器人应用提供了支持。

链接: https://arxiv.org/abs/2505.18881
作者: Dicong Qiu,Jiadi You,Zeying Gong,Ronghe Qiu,Hui Xiong,Junwei Liang
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Preprint. 21 pages

点击查看摘要

Abstract:We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretraining multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and the evaluation of navigation agents, accompanied with a plugin for generating object navigation task episodes compatible to the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising respectively about 3k and 10k episodes of the open-vocabulary object navigation task, derived from the SD-OVON-Scenes dataset with 2.5k photo-realistic scans of real-world environments and the SD-OVON-Objects dataset with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This approach enhances the realism of navigation tasks, the training and the evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them along with state-of-the-art baselines on SD-OVON-3k. The datasets, benchmark and source code are publicly available.
zh

[CV-205] REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

【速读】:该论文试图解决视频摘要生成中存在的问题,即现有的抽取式方法难以生成连贯的叙事,而现有的抽象式方法无法从输入视频中“引用”(quote)视频片段,即在输出中插入短视频片段。解决方案的关键在于提出一种基于检索的生成框架(REGen系统),该框架首先利用微调的大语言模型生成包含引用占位符的故事脚本,然后通过新颖的检索模型从候选可引用视频片段中选择最能支持叙事的视频片段来替换占位符,从而实现连贯叙事与嵌入视频片段的结合。

链接: https://arxiv.org/abs/2505.18880
作者: Weihan Xu,Yimeng Ma,Jingyue Huang,Yang Li,Wenye Ma,Taylor Berg-Kirkpatrick,Julian McAuley,Paul Pu Liang,Hao-Wen Dong
机构: Duke University (杜克大学); University of California, San Diego (加利福尼亚大学圣地亚哥分校); MBZUAI (MBZUAI); MIT (麻省理工学院); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Short videos are an effective tool for promoting contents and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot `quote’ from the input videos, i.e., inserting short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.
zh

[CV-206] Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

【速读】:该论文旨在解决Diffusion Transformers(DiTs)在视频生成中因注意力机制的二次复杂度导致的显著延迟问题,同时确保在有限计算预算下达到最优生成质量。现有方法在关键token识别和计算效率方面存在不足,主要表现为基于位置而非语义的token聚类导致表示不准确,以及关键token分布稀疏造成GPU计算资源浪费。该论文提出的SVG2框架通过语义感知的排列(semantic-aware permutation)技术,利用k-means算法根据语义相似性对token进行聚类和重排序,从而提高关键token识别的准确性并减少计算浪费,实现生成质量与效率之间的帕累托前沿优化。

链接: https://arxiv.org/abs/2505.18875
作者: Shuo Yang,Haocheng Xi,Yilong Zhao,Muyang Li,Jintao Zhang,Han Cai,Yujun Lin,Xiuyu Li,Chenfeng Xu,Kelly Peng,Jianfei Chen,Song Han,Kurt Keutzer,Ion Stoica
机构: University of California, Berkeley (加州大学伯克利分校); MIT (麻省理工学院); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
zh

[CV-207] Eye-See-You: Reverse Pass-Through VR and Head Avatars IJCAI2025

【速读】:该论文试图解决虚拟现实(VR)头显在使用过程中导致用户眼睛和面部部分区域被遮挡的问题,这一问题阻碍了视觉交流并可能引发社交孤立。解决方案的关键在于提出RevAvatar框架,该框架利用生成式AI(Generative AI)方法实现反向透视技术,通过整合先进的生成模型和多模态AI技术,从部分可见的眼部和下脸部区域重建高保真2D面部图像并生成精确的3D头部化身,从而实现虚拟与物理环境之间的无缝交互。

链接: https://arxiv.org/abs/2505.18869
作者: Ankan Dash,Jingyi Gu,Guiling Wang,Chen Chen
机构: New Jersey Institute of Technology (新泽西理工学院); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34th International Joint Conference on Artificial Intelligence, IJCAI 2025

点击查看摘要

Abstract:Virtual Reality (VR) headsets, while integral to the evolving digital ecosystem, present a critical challenge: the occlusion of users’ eyes and portions of their faces, which hinders visual communication and may contribute to social isolation. To address this, we introduce RevAvatar, an innovative framework that leverages AI methodologies to enable reverse pass-through technology, fundamentally transforming VR headset design and interaction paradigms. RevAvatar integrates state-of-the-art generative models and multimodal AI techniques to reconstruct high-fidelity 2D facial images and generate accurate 3D head avatars from partially observed eye and lower-face regions. This framework represents a significant advancement in AI4Tech by enabling seamless interaction between virtual and physical environments, fostering immersive experiences such as VR meetings and social engagements. Additionally, we present VR-Face, a novel dataset comprising 200,000 samples designed to emulate diverse VR-specific conditions, including occlusions, lighting variations, and distortions. By addressing fundamental limitations in current VR systems, RevAvatar exemplifies the transformative synergy between AI and next-generation technologies, offering a robust platform for enhancing human connection and interaction in virtual environments.
zh

[CV-208] Localizing Knowledge in Diffusion Transformers

【速读】:该论文试图解决如何在Diffusion Transformer (DiT) 模型中定位特定类型知识的编码位置,以提升模型的可解释性、可控性和适应性。其解决方案的关键在于提出一种与模型和知识类型无关的方法,用于识别DiT块中特定知识的编码位置,并验证这些位置在生成输出中的因果关联性。基于此,该方法被应用于模型个性化和知识擦除任务,实现了高效且针对性的微调,降低了计算成本并提升了任务性能。

链接: https://arxiv.org/abs/2505.18832
作者: Arman Zarei,Samyadeep Basu,Keivan Rezaei,Zihao Lin,Sayan Nag,Soheil Feizi
机构: University of Maryland (马里兰大学); University of California, Davis (加州大学戴维斯分校); Adobe (Adobe)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding how knowledge is distributed across the layers of generative models is crucial for improving interpretability, controllability, and adaptation. While prior work has explored knowledge localization in UNet-based architectures, Diffusion Transformer (DiT)-based models remain underexplored in this context. In this paper, we propose a model- and knowledge-agnostic method to localize where specific types of knowledge are encoded within the DiT blocks. We evaluate our method on state-of-the-art DiT-based models, including PixArt-alpha, FLUX, and SANA, across six diverse knowledge categories. We show that the identified blocks are both interpretable and causally linked to the expression of knowledge in generated outputs. Building on these insights, we apply our localization framework to two key applications: model personalization and knowledge unlearning. In both settings, our localized fine-tuning approach enables efficient and targeted updates, reducing computational cost, improving task-specific performance, and better preserving general model behavior with minimal interference to unrelated or surrounding content. Overall, our findings offer new insights into the internal structure of DiTs and introduce a practical pathway for more interpretable, efficient, and controllable model editing.
zh

[CV-209] How to build a consistency model: Learning flow maps via self-distillation

【速读】:该论文旨在解决生成式 AI (Generative AI) 中流模型和扩散模型相关流图(flow maps)的学习问题,特别是如何提高生成模型的效率。其解决方案的关键在于利用连续时间流下的速度场与流图瞬时变化率之间的关系,通过自蒸馏(self-distillation)将现有的知识蒸馏方案转换为直接训练算法,从而无需依赖预训练模型。研究还表明,针对不同维度的任务,采用不同的目标函数可以优化模型性能,例如高维任务如图像合成更适合避免流图的时间和空间导数,而低维任务则可通过引入高阶导数来捕捉尖锐特征。

链接: https://arxiv.org/abs/2505.18825
作者: Nicholas M. Boffi,Michael S. Albergo,Eric Vanden-Eijnden
机构: Carnegie Mellon University (卡内基梅隆大学); Harvard University (哈佛大学); Courant Institute of Mathematical Sciences (柯朗数学科学研究所)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Building on the framework proposed in Boffi et al. (2024), we present a systematic approach for learning flow maps associated with flow and diffusion models. Flow map-based models, commonly known as consistency models, encompass recent efforts to improve the efficiency of generative models based on solutions to differential equations. By exploiting a relationship between the velocity field underlying a continuous-time flow and the instantaneous rate of change of the flow map, we show how to convert existing distillation schemes into direct training algorithms via self-distillation, eliminating the need for pre-trained models. We empirically evaluate several instantiations of our framework, finding that high-dimensional tasks like image synthesis benefit from objective functions that avoid temporal and spatial derivatives of the flow map, while lower-dimensional tasks can benefit from objectives incorporating higher-order derivatives to capture sharp features.
zh

[CV-210] MSLAU-Net: A Hybird CNN-Transformer Network for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割任务中传统卷积神经网络(CNN)难以有效捕捉全局上下文信息以及基于Transformer的方法在局部特征建模不足且计算复杂度高的问题。其解决方案的关键在于提出了一种新型的混合CNN-Transformer架构MSLAU-Net,该架构融合了两种范式的优点,通过引入多尺度线性注意力机制以高效提取多尺度特征并建模长程依赖,同时采用自上而下的特征聚合机制实现多层级特征融合与空间分辨率恢复。

链接: https://arxiv.org/abs/2505.18823
作者: Libin Lan,Yanxin Li,Xiaojuan Liu,Juan Zhou,Jianxun Zhang,Nannan Huang,Yudong Zhang
机构: Chongqing University of Technology (重庆理工大学); Chongqing University of Technology (重庆理工大学); the Second Affiliated Hospital of Army Military Medical University (中国人民解放军陆军军医大学第二附属医院); the Affiliated Stomatological Hospital of Chongqing Medical University (重庆医科大学附属口腔医院); Chongqing Key Laboratory of Oral Diseases (重庆市口腔疾病重点实验室); Chongqing Municipal Key Laboratory of Oral Biomedical Engineering of Higher Education (重庆市高校口腔生物医学工程重点实验室); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures, 7 tables

点击查看摘要

Abstract:Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at this https URL.
zh

[CV-211] Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

【速读】:该论文旨在解决3D场景理解中由于数据集特定空间尺度敏感性导致的跨域泛化能力不足的问题。其核心解决方案是提出一种具有尺度不变性的通用3D分词器(3D tokenizer),通过结合基于超点(superpoint)的分组与坐标尺度归一化,实现对场景尺度无关的语义感知分词。该方法在无需标注的情况下,利用掩码点建模、基于聚类的目标以及跨模态蒸馏来对齐3D分词与2D多视角图像特征,从而提升模型的泛化能力和表示学习效果。

链接: https://arxiv.org/abs/2505.18819
作者: Guofeng Mei,Bin Ren,Juan Liu,Luigi Riz,Xiaoshui Huang,Xu Zheng,Yongshun Gong,Ming-Hsuan Yang,Nicu Sebe,Fabio Poiesi
机构: Fondazione Bruno Kessler; University of Trento; University of Pisa; Beijing Forestry University; Shanghai Jiao Tong University; Hong Kong University of Science and Technology (GZ); Shandong University; University of California, Merced
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, tokenizer

点击查看摘要

Abstract:Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. We show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods through extensive experimental analysis. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically-informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.
zh

[CV-212] Reasoning Segmentation for Images and Videos: A Survey

【速读】:该论文旨在解决基于隐式文本查询进行目标分割的问题,即通过自然语言描述实现对图像或视频中特定对象的精准定位与分割,这一过程需要结合推理和知识整合能力。传统分割方法依赖于固定语义类别或显式提示,而该研究提出的Reasoning Segmentation (RS) 通过融合视觉感知与类人推理能力,提升了人机交互的直观性。其解决方案的关键在于构建能够理解并执行复杂自然语言指令的模型架构,并结合多模态信息进行有效推理与分割。

链接: https://arxiv.org/abs/2505.18816
作者: Yiqing Shen,Chenjia Li,Fei Xiong,Jeong-O Jeong,Tianpeng Wang,Michael Latman,Mathias Unberath
机构: Johns Hopkins University (约翰霍普金斯大学); Amazon Web Services (亚马逊网络服务)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning Segmentation (RS) aims to delineate objects based on implicit text queries, the interpretation of which requires reasoning and knowledge integration. Unlike the traditional formulation of segmentation problems that relies on fixed semantic categories or explicit prompting, RS bridges the gap between visual perception and human-like reasoning capabilities, facilitating more intuitive human-AI interaction through natural language. Our work presents the first comprehensive survey of RS for image and video processing, examining 26 state-of-the-art methods together with a review of the corresponding evaluation metrics, as well as 29 datasets and benchmarks. We also explore existing applications of RS across diverse domains and identify their potential extensions. Finally, we identify current research gaps and highlight promising future directions.
zh

[CV-213] SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

【速读】:该论文旨在解决视频中细粒度时空理解的问题,特别是视频指代理解与视频定位任务的协同学习难题。现有方法通常将这两个任务孤立处理,限制了统一、指代 grounded 的视频交互进展。论文指出,缺乏高质量、统一的视频指令数据和全面的评估基准是主要瓶颈。解决方案的关键在于从三个核心方面进行贡献:构建大规模的SAMA-239K数据集以支持联合学习,提出SAMA模型以增强细粒度视频理解和精确定位能力,以及建立SAMA-Bench基准以全面评估视频大模型在多轮时空指代理解和 grounded 对话中的综合能力。

链接: https://arxiv.org/abs/2505.18812
作者: Ye Sun,Hao Zhang,Henghui Ding,Tiehua Zhang,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); HKUST (香港科技大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video referring understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities. Finally, we establish SAMA-Bench, a meticulously designed benchmark consisting of 5,067 questions from 522 videos, to comprehensively evaluate the integrated capabilities of Video LMMs in multi-turn, spatio-temporal referring understanding and grounded dialogue. Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.
zh

[CV-214] VORTA: Efficient Video Diffusion via Routing Sparse Attention

【速读】:该论文旨在解决视频扩散变压器(Video Diffusion Transformers, VDiTs)在高保真视频生成任务中计算成本过高的问题,特别是由于注意力机制在高维视频序列上的二次复杂度所导致的效率瓶颈。其解决方案的关键在于提出一种名为VORTA的加速框架,该框架包含两个创新组件:一是高效捕捉长距离依赖关系的稀疏注意力机制,二是通过自适应路由策略在采样过程中用特定的稀疏注意力变体替代全3D注意力,从而显著提升计算效率。

链接: https://arxiv.org/abs/2505.18809
作者: Wenhao Sun,Rong-Cheng Tu,Yifu Ding,Zhao Jin,Jingyi Liao,Shunyu Liu,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 15 figures. The code is available at this https URL

点击查看摘要

Abstract:Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose \textbfVORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a 1.76\times end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to 14.41\times speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.
zh

[CV-215] hink Twice before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation IJCAI-25

【速读】:该论文旨在解决深度伪造(Deepfake, DF)检测器在实际部署中遇到的挑战,特别是当测试样本与训练数据存在后处理操作或分布偏移时,检测性能会显著下降的问题。其解决方案的关键在于提出一种名为\textttT ^2 A的新型在线测试时适应方法,该方法通过引入不确定性感知的负学习目标,使模型在推理过程中探索替代选项,而非仅依赖初始预测,从而提升检测器的适应能力。此外,还结合了不确定样本优先策略和梯度掩码技术,以优化重要样本和模型参数的适应过程。

链接: https://arxiv.org/abs/2505.18787
作者: Hong-Hanh Nguyen-Le,Van-Tuan Tran,Dinh-Thuc Nguyen,Nhien-An Le-Khac
机构: University College Dublin (爱尔兰都柏林大学); Trinity College Dublin (都柏林三一学院); University of Science (科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted at 34th International Joint Conference on Artificial Intelligence (IJCAI-25)

点击查看摘要

Abstract:Deepfake (DF) detectors face significant challenges when deployed in real-world environments, particularly when encountering test samples deviated from training data through either postprocessing manipulations or distribution shifts. We demonstrate postprocessing techniques can completely obscure generation artifacts presented in DF samples, leading to performance degradation of DF detectors. To address these challenges, we propose Think Twice before Adaptation (\textttT ^2 A), a novel online test-time adaptation method that enhances the adaptability of detectors during inference without requiring access to source training data or labels. Our key idea is to enable the model to explore alternative options through an Uncertainty-aware Negative Learning objective rather than solely relying on its initial predictions as commonly seen in entropy minimization (EM)-based approaches. We also introduce an Uncertain Sample Prioritization strategy and Gradients Masking technique to improve the adaptation by focusing on important samples and model parameters. Our theoretical analysis demonstrates that the proposed negative learning objective exhibits complementary behavior to EM, facilitating better adaptation capability. Empirically, our method achieves state-of-the-art results compared to existing test-time adaptation (TTA) approaches and significantly enhances the resilience and generalization of DF detectors during inference. Code is available \hrefthis https URLhere.
zh

[CV-216] OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50 Tasks

【速读】:该论文试图解决当前基准测试在广度和深度上不足以全面评估大型多模态模型(LMMs)多样化能力的问题。解决方案的关键在于提出OmniGenBench,这是一个精心设计的综合性基准,旨在从感知导向和认知导向两个维度系统评估最先进的LMMs的指令遵循能力。该基准包含57个基于现实场景的子任务,并采用双模式评估协议,分别利用现成的视觉解析工具和强大的大语言模型(LLM)评判器来评估生成图像与用户指令的一致性。

链接: https://arxiv.org/abs/2505.18775
作者: Jiayu Wang,Yang Jiao,Yue Yu,Tianwen Qian,Shaoxiang Chen,Jingjing Chen,Yu-Gang Jiang
机构: Fudan University (复旦大学); East China Normal University (华东师范大学); MiniMax
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent breakthroughs in large multimodal models (LMMs), such as the impressive GPT-4o-Native, have demonstrated remarkable proficiency in following general-purpose instructions for image generation. However, current benchmarks often lack the necessary breadth and depth to fully evaluate the diverse capabilities of these models. To overcome this limitation, we introduce OmniGenBench, a novel and comprehensive benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs across both perception-centric and cognition-centric dimensions. Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand. For rigorous evaluation, we further employ a dual-mode protocol. This protocol utilizes off-the-shelf visual parsing tools for perception-centric tasks and a powerful LLM-based judger for cognition-centric tasks to assess the alignment between generated images and user instructions. Using OmniGenBench, we evaluate mainstream generative models, including prevalent models like GPT-4o, Gemini-2.0-Flash, and Seedream, and provide in-depth comparisons and analyses of their this http URL and data are available at this https URL.
zh

[CV-217] CageNet: A Meta-Framework for Learning on Wild Meshes

【速读】:该论文试图解决通用三角形网格框架在处理“野生”(meshes in-the-wild)数据时的适用性问题,这类数据通常包含多个组件、非流形元素或连接性破坏等复杂情况。解决方案的关键在于提出一种基于笼状几何(caged geometry)的可配置元框架:给定一个网格,笼是一个紧密包裹该网格的单组件流形三角形网格,通过广义重心坐标在笼与网格之间的函数映射,使得可以在多种数据和应用中进行学习与测试。

链接: https://arxiv.org/abs/2505.18772
作者: Michal Edelstein,Hsueh-Ti Derek Liu,Mirela Ben-Chen
机构: Technion - Israel Institute of Technology (以色列技术学院); Roblox (罗布乐思); University of British Columbia (不列颠哥伦比亚大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 13 figures (excluding supplementary material)

点击查看摘要

Abstract:Learning on triangle meshes has recently proven to be instrumental to a myriad of tasks, from shape classification, to segmentation, to deformation and animation, to mention just a few. While some of these applications are tackled through neural network architectures which are tailored to the application at hand, many others use generic frameworks for triangle meshes where the only customization required is the modification of the input features and the loss function. Our goal in this paper is to broaden the applicability of these generic frameworks to “wild”, i.e. meshes in-the-wild which often have multiple components, non-manifold elements, disrupted connectivity, or a combination of these. We propose a configurable meta-framework based on the concept of caged geometry: Given a mesh, a cage is a single component manifold triangle mesh that envelopes it closely. Generalized barycentric coordinates map between functions on the cage, and functions on the mesh, allowing us to learn and test on a variety of data, in different applications. We demonstrate this concept by learning segmentation and skinning weights on difficult data, achieving better performance to state of the art techniques on wild meshes.
zh

[CV-218] Dual-Path Stable Soft Prompt Generation for Domain Generalization

【速读】:该论文旨在解决领域泛化(Domain Generalization, DG)中提示生成方法存在的提示变异性(Prompt Variability)问题,即相同输入在不同随机种子下生成显著不同且次优提示的现象。解决方案的关键在于引入负学习(negative learning)机制,并提出双路径稳定软提示生成(Dual-Path Stable Soft Prompt Generation, DPSPG)框架,通过引入互补的提示生成器生成负向提示,从而减少误导信息的引入,提升提示的稳定性和泛化能力。

链接: https://arxiv.org/abs/2505.18770
作者: Yuedi Zhang,Shuanghao Bai,Wanqi Zhou,Zhirong Luan,Badong Chen
机构: Xi’an Jiaotong University (西安交通大学); Xi’an University of Technology (西安理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Domain generalization (DG) aims to learn a model using data from one or multiple related but distinct source domains that can generalize well to unseen out-of-distribution target domains. Inspired by the success of large pre-trained vision-language models (VLMs), prompt tuning has emerged as an effective generalization strategy. However, it often struggles to capture domain-specific features due to its reliance on manually or fixed prompt inputs. Recently, some prompt generation methods have addressed this limitation by dynamically generating instance-specific and domain-specific prompts for each input, enriching domain information and demonstrating potential for enhanced generalization. Through further investigation, we identify a notable issue in existing prompt generation methods: the same input often yields significantly different and suboptimal prompts across different random seeds, a phenomenon we term Prompt Variability. To address this, we introduce negative learning into the prompt generation process and propose Dual-Path Stable Soft Prompt Generation (DPSPG), a transformer-based framework designed to improve both the stability and generalization of prompts. Specifically, DPSPG incorporates a complementary prompt generator to produce negative prompts, thereby reducing the risk of introducing misleading information. Both theoretical and empirical analyses demonstrate that negative learning leads to more robust and effective prompts by increasing the effective margin and reducing the upper bound of the gradient norm. Extensive experiments on five DG benchmark datasets show that DPSPG consistently outperforms state-of-the-art methods while maintaining prompt stability.
zh

[CV-219] StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations NIPS2025

【速读】:该论文试图解决文本到图像扩散模型在风格模仿和个性化定制中引发的知识产权保护及虚假内容生成问题,以及现有防御方法(如Glaze和Anti-DreamBooth)在面对净化类攻击(如DiffPure和Noise Upscaling)时的脆弱性。解决方案的关键在于提出一种新型反模仿方法StyleGuard,其核心是设计了一种新颖的风格损失(style loss),通过优化潜在空间中的风格相关特征,使生成结果偏离原始图像,从而提升方法的模型无关迁移能力;同时引入一种上采样损失(upscale loss),结合集成净化器和上采样器进行训练,以增强扰动对基于扩散的净化机制的绕过能力。

链接: https://arxiv.org/abs/2505.18766
作者: Yanjie Li,Wenxuan Zhang,Xinqi Lyu,Yihao Liu,Bin Xiao
机构: Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: submitted to NIPS2025

点击查看摘要

Abstract:Recently, text-to-image diffusion models have been widely used for style mimicry and personalized customization through methods such as DreamBooth and Textual Inversion. This has raised concerns about intellectual property protection and the generation of deceptive content. Recent studies, such as Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect images from these attacks. However, recent purification-based methods, such as DiffPure and Noise Upscaling, have successfully attacked these latest defenses, showing the vulnerabilities of these methods. Moreover, present methods show limited transferability across models, making them less effective against unknown text-to-image models. To address these issues, we propose a novel anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes the style-related features in the latent space to make it deviate from the original image, which improves model-agnostic transferability. Additionally, to enhance the perturbation’s ability to bypass diffusion-based purification, we designed a novel upscale loss that involves ensemble purifiers and upscalers during training. Extensive experiments on the WikiArt and CelebA datasets demonstrate that StyleGuard outperforms existing methods in robustness against various transformations and purifications, effectively countering style mimicry in various models. Moreover, StyleGuard is effective on different style mimicry methods, including DreamBooth and Textual Inversion.
zh

[CV-220] oDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models

【速读】:该论文旨在解决大规模视觉-语言模型(LVLMs)中视觉输入表示所包含的token数量远多于文本输入,导致计算开销过大的问题。其解决方案的关键在于提出ToDRE框架,该框架通过基于token多样性(token Diversity)和token-任务相关性(token-task RElevance)的两阶段无训练token压缩方法,实现更高效的视觉token剪枝。与传统依赖token重要性的方法不同,ToDRE首先利用贪心k-center算法在视觉编码器后选择多样化的视觉token子集,其次在大型语言模型(LLM)解码器中进一步移除任务不相关的视觉token,从而有效减少计算负载并提升推理效率。

链接: https://arxiv.org/abs/2505.18757
作者: Duo Li,Zuhao Yang,Shijian Lu
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 7 figures

点击查看摘要

Abstract:The representation of visual inputs of large vision-language models (LVLMs) usually involves substantially more tokens than that of textual inputs, leading to significant computational overhead. Several recent studies strive to mitigate this issue by either conducting token compression to prune redundant visual tokens or guiding them to bypass certain computational stages. While most existing work exploits token importance as the redundancy indicator, our study reveals that two largely neglected factors, namely, the diversity of retained visual tokens and their task relevance, often offer more robust criteria in token pruning. To this end, we design ToDRE, a two-stage and training-free token compression framework that achieves superior performance by pruning Tokens based on token Diversity and token-task RElevance. Instead of pruning redundant tokens, ToDRE introduces a greedy k-center algorithm to select and retain a small subset of diverse visual tokens after the vision encoder. Additionally, ToDRE addresses the “information migration” by further eliminating task-irrelevant visual tokens within the decoder of large language model (LLM). Extensive experiments show that ToDRE effectively reduces 90% of visual tokens after vision encoder and adaptively prunes all visual tokens within certain LLM’s decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.1% of model performance and excellent compatibility with efficient attention operators.
zh

[CV-221] C3R: Channel Conditioned Cell Representations for unified evaluation in microscopy imaging

【速读】:该论文旨在解决免疫组化(Immunohistochemical, IHC)图像在深度学习模型中的跨数据集泛化问题,特别是由于不同实验室和研究中染色协议差异导致的通道数量和配置不一致所带来的挑战。现有方法虽然构建了通道自适应模型,但无法支持跨数据集的分布外(out-of-distribution, OOD)评估,也无法在通道数量不匹配的情况下实现真正的零样本(zero-shot)应用。该论文的关键解决方案是引入一种结构化的细胞图像通道视图,将通道分为上下文(context)和概念(concept)两类,并基于上下文-概念原则构建了通道条件化细胞表示(Channel Conditioned Cell Representations, C3R),该框架通过通道自适应编码器架构和掩码知识蒸馏训练策略,实现了对分布内(in-distribution, ID)和OOD数据集的统一评估,显著提升了模型的跨数据集泛化能力。

链接: https://arxiv.org/abs/2505.18745
作者: Umar Marikkar,Syed Sameed Husain,Muhammad Awais,Sara Atito
机构: University of Surrey(萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Immunohistochemical (IHC) images reveal detailed information about structures and functions at the subcellular level. However, unlike natural images, IHC datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Existing approaches build channel-adaptive models, which unfortunately fail to support out-of-distribution (OOD) evaluation across IHC datasets and cannot be applied in a true zero-shot setting with mismatched channel counts. To address this, we introduce a structured view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference to the concept channels in the image. We leverage this context-concept principle to develop Channel Conditioned Cell Representations (C3R), a framework designed for unified evaluation on in-distribution (ID) and OOD datasets. C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while a trivial implementation of our core idea also outperforms the channel-adaptive methods reported on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IHC datasets, without requiring dataset-specific adaptation or retraining.
zh

[CV-222] MoMBS: Mixed-order minibatch sampling enhances model training from diverse-quality images

【速读】:该论文试图解决在训练深度模型时如何有效利用具有多样质量的训练图像的问题,特别是在噪声标签图像分类和长尾图像分类中,以及在通用病灶检测(ULD)中,由于图像质量和标签正确性差异带来的挑战。解决方案的关键在于重新审视小批量采样(minibatch sampling, MBS),并提出一种新的混合顺序小批量采样(Mixed-order Minibatch Sampling, MoMBS)方法。MoMBS通过结合损失和不确定性来衡量样本难度,从而更精确地区分高损失样本为标签错误且代表性不足或标签正确但过拟合的样本,并优先将代表性不足的样本作为小批量中的主要梯度贡献者,以减少不良标签或过拟合样本的负面影响。

链接: https://arxiv.org/abs/2505.18741
作者: Han Li,Hu Han,S. Kevin Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages,8 figures

点击查看摘要

Abstract:Natural images exhibit label diversity (clean vs. noisy) in noisy-labeled image classification and prevalence diversity (abundant vs. sparse) in long-tailed image classification. Similarly, medical images in universal lesion detection (ULD) exhibit substantial variations in image quality, encompassing attributes such as clarity and label correctness. How to effectively leverage training images with diverse qualities becomes a problem in learning deep models. Conventional training mechanisms, such as self-paced curriculum learning (SCL) and online hard example mining (OHEM), relieve this problem by reweighting images with high loss values. Despite their success, these methods still confront two challenges: (i) the loss-based measure of sample hardness is imprecise, preventing optimum handling of different cases, and (ii) there exists under-utilization in SCL or over-utilization OHEM with the identified hard samples. To address these issues, this paper revisits the minibatch sampling (MBS), a technique widely used in deep network training but largely unexplored concerning the handling of diverse-quality training samples. We discover that the samples within a minibatch influence each other during training; thus, we propose a novel Mixed-order Minibatch Sampling (MoMBS) method to optimize the use of training samples with diverse qualities. MoMBS introduces a measure that takes both loss and uncertainty into account to surpass a sole reliance on loss and allows for a more refined categorization of high-loss samples by distinguishing them as either poorly labeled and under represented or well represented and overfitted. We prioritize under represented samples as the main gradient contributors in a minibatch and keep them from the negative influences of poorly labeled or overfitted samples with a mixed-order minibatch sampling design.
zh

[CV-223] Rethinking Direct Preference Optimization in Diffusion Models

【速读】:该论文试图解决文本到图像(text-to-image, T2I)扩散模型与人类偏好对齐的问题,特别是在偏好优化过程中存在的探索能力有限的问题。其解决方案的关键在于提出两种新颖的策略:首先,引入一种稳定的参考模型更新策略,通过放松冻结的参考模型并结合参考模型正则化,既鼓励探索又保持优化的稳定性;其次,提出一种时间步感知的训练策略,以缓解不同时间步之间的奖励尺度不平衡问题。这些方法能够集成到多种偏好优化算法中,并在人类偏好评估基准上显著提升了现有先进方法的性能。

链接: https://arxiv.org/abs/2505.18736
作者: Junyong Kang,Seohyun Lim,Kyungjune Baek,Hyunjung Shim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 12 figures, preprint

点击查看摘要

Abstract:Aligning text-to-image (T2I) diffusion models with human preferences has emerged as a critical research challenge. While recent advances in this area have extended preference optimization techniques from large language models (LLMs) to the diffusion setting, they often struggle with limited exploration. In this work, we propose a novel and orthogonal approach to enhancing diffusion-based preference optimization. First, we introduce a stable reference model update strategy that relaxes the frozen reference model, encouraging exploration while maintaining a stable optimization anchor through reference model regularization. Second, we present a timestep-aware training strategy that mitigates the reward scale imbalance problem across timesteps. Our method can be integrated into various preference optimization algorithms. Experimental results show that our approach improves the performance of state-of-the-art methods on human preference evaluation benchmarks.
zh

[CV-224] Align Beyond Prompts: Evaluating World Knowledge Alignment in Text-to-Image Generation

【速读】:该论文试图解决当前文本到图像生成模型在生成图像时与真实世界知识的对齐问题,而不仅仅是与用户显式提示的对齐。现有评估基准主要关注生成图像与提示之间的显式对齐,忽视了图像与更广泛现实知识之间的关联。解决方案的关键在于引入Align Beyond Prompts (ABP)基准和ABPScore度量,该度量利用现有的Multimodal Large Language Models (MLLMs)来评估生成图像与超越提示的真实世界知识之间的对齐程度,并通过Inference-Time Knowledge Injection (ITKI)策略在不改变模型训练的情况下提升生成图像与现实知识的一致性。

链接: https://arxiv.org/abs/2505.18730
作者: Wenchao Zhang,Jiahe Tian,Runze He,Jizhong Han,Jiao Dai,Miaomiao Feng,Wei Mi,Xiaodan Zhang
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Recent text-to-image (T2I) generation models have advanced significantly, enabling the creation of high-fidelity images from textual prompts. However, existing evaluation benchmarks primarily focus on the explicit alignment between generated images and prompts, neglecting the alignment with real-world knowledge beyond prompts. To address this gap, we introduce Align Beyond Prompts (ABP), a comprehensive benchmark designed to measure the alignment of generated images with real-world knowledge that extends beyond the explicit user prompts. ABP comprises over 2,000 meticulously crafted prompts, covering real-world knowledge across six distinct scenarios. We further introduce ABPScore, a metric that utilizes existing Multimodal Large Language Models (MLLMs) to assess the alignment between generated images and world knowledge beyond prompts, which demonstrates strong correlations with human judgments. Through a comprehensive evaluation of 8 popular T2I models using ABP, we find that even state-of-the-art models, such as GPT-4o, face limitations in integrating simple real-world knowledge into generated images. To mitigate this issue, we introduce a training-free strategy within ABP, named Inference-Time Knowledge Injection (ITKI). By applying this strategy to optimize 200 challenging samples, we achieved an improvement of approximately 43% in ABPScore. The dataset and code are available in this https URL.
zh

[CV-225] FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment

【速读】:该论文试图解决真实自由视角下的多视角多目标跟踪(Multi-view Multi-object Tracking, MVMOT)系统缺乏的问题,这一问题限制了协作跟踪系统的灵活性和可扩展性。解决方案的关键在于提出FusionTrack框架,该框架通过合理整合跟踪与重识别任务,充分利用多视角信息实现鲁棒的轨迹关联,从而提升跟踪性能。

链接: https://arxiv.org/abs/2505.18727
作者: Xiaohe Li,Pengfei Li,Zide Fan,Ying Geng,Fangli Mou,Haohua Wu,Yunping Ge
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (航天信息研究院,中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, initially establishing the first benchmark for multi-object tracking in arbitrary multi-view environment. Building upon this foundation, we propose \textbfFusionTrack, an end-to-end framework that reasonably integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.
zh

[CV-226] Deep Learning for Breast Cancer Detection: Comparative Analysis of ConvNeXT and EfficientNet

【速读】:该论文旨在解决乳腺癌早期检测的问题,以降低癌症相关死亡率。其解决方案的关键在于利用两种卷积神经网络(ConvNeXT 和 EfficientNet)对筛查性乳腺X线摄影图像进行分类,通过预处理、分类及性能评估来预测癌症的可能性。研究结果表明,ConvNeXT在AUC、准确率和F-score等指标上均优于EfficientNet,显示出其在乳腺癌筛查中的优越性能。

链接: https://arxiv.org/abs/2505.18725
作者: Mahmudul Hasan
机构: BRAC University (BRAC大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Breast cancer is the most commonly occurring cancer worldwide. This cancer caused 670,000 deaths globally in 2022, as reported by the WHO. Yet since health officials began routine mammography screening in age groups deemed at risk in the 1980s, breast cancer mortality has decreased by 40% in high-income nations. Every day, a greater and greater number of people are receiving a breast cancer diagnosis. Reducing cancer-related deaths requires early detection and treatment. This paper compares two convolutional neural networks called ConvNeXT and EfficientNet to predict the likelihood of cancer in mammograms from screening exams. Preprocessing of the images, classification, and performance evaluation are main parts of the whole procedure. Several evaluation metrics were used to compare and evaluate the performance of the models. The result shows that ConvNeXT generates better results with a 94.33% AUC score, 93.36% accuracy, and 95.13% F-score compared to EfficientNet with a 92.34% AUC score, 91.47% accuracy, and 93.06% F-score on RSNA screening mammography breast cancer dataset.
zh

[CV-227] GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

【速读】:该论文旨在解决地理定位(geo-localization)任务中视觉语言模型(VLMs)缺乏稳健推理机制和可解释性的问题。现有方法在处理多粒度视觉线索提取与外部世界知识整合时表现不足,限制了其有效性。解决方案的关键在于提出Geo Reason Enhancement (GRE) Suite,该框架通过结构化推理链增强VLMs,实现准确且可解释的位置推断。GRE Suite从数据集、模型和基准三个维度系统构建,其中GRE模型采用多阶段推理策略逐步推断场景属性、局部细节和语义特征,从而提升地理区域定位的精度。

链接: https://arxiv.org/abs/2505.18700
作者: Chun Wang,Xiaoran Pan,Zihao Pan,Haofan Wang,Yiren Song
机构: Zhejiang University (浙江大学); Sun Yat-sen University (中山大学); LibLib.ai (LibLib.ai); NUS (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at this https URL.
zh

[CV-228] Affective Image Editing: Shaping Emotional Factors via Text Descriptions

【速读】:该论文旨在解决现有文本驱动图像编辑技术中对用户情感需求理解不足的问题(Emotional Request Understanding)。其关键解决方案是提出AIEdiT框架,通过自适应调整图像中的多种情感因素来激发特定情绪。该框架包括构建连续情感光谱以表示普遍的情感先验、设计情感映射器将视觉抽象的情感请求转化为视觉具体的语义表示,并引入多模态大语言模型(MLLM)监督模型训练,从而确保编辑结果能够准确反映用户的情感需求。

链接: https://arxiv.org/abs/2505.18699
作者: Peixuan Zhang,Shuchen Weng,Chengxuan Zhu,Binghao Tang,Zijian Jia,Si Li,Boxin Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In daily life, images as common affective stimuli have widespread applications. Despite significant progress in text-driven image editing, there is limited work focusing on understanding users’ emotional requests. In this paper, we introduce AIEdiT for Affective Image Editing using Text descriptions, which evokes specific emotions by adaptively shaping multiple emotional factors across the entire images. To represent universal emotional priors, we build the continuous emotional spectrum and extract nuanced emotional requests. To manipulate emotional factors, we design the emotional mapper to translate visually-abstract emotional requests to visually-concrete semantic representations. To ensure that editing results evoke specific emotions, we introduce an MLLM to supervise the model training. During inference, we strategically distort visual elements and subsequently shape corresponding emotional factors to edit images according to users’ instructions. Additionally, we introduce a large-scale dataset that includes the emotion-aligned text and image pair set for training and evaluation. Extensive experiments demonstrate that AIEdiT achieves superior performance, effectively reflecting users’ emotional requests.
zh

[CV-229] WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation CVPR2025

【速读】:该论文旨在解决弱监督指代表达理解(Weakly supervised referring expression comprehension, WREC)和分割(Weakly supervised referring expression segmentation, WRES)任务中对象定位的问题,传统上这些任务被分别建模,而本文提出通过多任务框架进行联合学习以提升性能。解决方案的关键在于提出一种名为WeakMCN的新型多任务协作网络,其采用双分支结构,其中WREC分支基于锚点对比学习,并作为教师指导WRES分支;同时引入了两个创新设计:动态视觉特征增强(Dynamic Visual Feature Enhancement, DVFE)和协作一致性模块(Collaborative Consistency Module, CCM),分别用于动态融合预训练视觉知识和促进跨任务一致性。

链接: https://arxiv.org/abs/2505.18686
作者: Yang Liu,Silin Cheng,Xinwei He,Sebastien Ourselin,Lei Tan,Gen Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025

点击查看摘要

Abstract:Weakly supervised referring expression comprehension(WREC) and segmentation(WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement(DVFE) and Collaborative Consistency Module(CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO. The code is publicly available at this https URL.
zh

[CV-230] Manifold-aware Representation Learning for Degradation-agnostic Image Restoration

【速读】:该论文旨在解决图像恢复(Image Restoration, IR)中现有方法普遍将IR视为直接映射问题,缺乏对不同退化类型结构多样性的建模,从而限制了模型的泛化能力和效率的问题。其解决方案的关键在于提出MIRAGE框架,该框架通过将输入特征空间显式分解为三个语义对齐的并行分支,分别由全局上下文注意力模块、局部纹理卷积模块和通道统计MLP模块处理,实现了对多种退化类型的高效统一处理。此外,引入跨层对比学习机制,并在对称正定(SPD)流形空间中进行对比学习,以增强共享表示的判别性。

链接: https://arxiv.org/abs/2505.18679
作者: Bin Ren,Yawei Li,Xu Zheng,Yuqian Fu,Danda Pani Paudel,Ming-Hsuan Yang,Luc Van Gool,Nicu Sebe
机构: University of Pisa; University of Trento; INSAIT, Sofia University “St. Kliment Ohridski”; ETH Zürich; Hong Kong University of Science and Technology (Guangzhou); University of California, Merced
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ALl-in-One Image Restoration, low-level vision

点击查看摘要

Abstract:Image Restoration (IR) aims to recover high quality images from degraded inputs affected by various corruptions such as noise, blur, haze, rain, and low light conditions. Despite recent advances, most existing approaches treat IR as a direct mapping problem, relying on shared representations across degradation types without modeling their structural diversity. In this work, we present MIRAGE, a unified and lightweight framework for all in one IR that explicitly decomposes the input feature space into three semantically aligned parallel branches, each processed by a specialized module attention for global context, convolution for local textures, and MLP for channel-wise statistics. This modular decomposition significantly improves generalization and efficiency across diverse degradations. Furthermore, we introduce a cross layer contrastive learning scheme that aligns shallow and latent features to enhance the discriminability of shared representations. To better capture the underlying geometry of feature representations, we perform contrastive learning in a Symmetric Positive Definite (SPD) manifold space rather than the conventional Euclidean space. Extensive experiments show that MIRAGE not only achieves new state of the art performance across a variety of degradation types but also offers a scalable solution for challenging all-in-one IR scenarios. Our code and models will be publicly available at this https URL.
zh

[CV-231] Restoring Real-World Images with an Internal Detail Enhancement Diffusion Model

【速读】:该论文旨在解决真实世界退化图像(如老照片或低分辨率图像)的高保真恢复问题,特别是面对复杂混合退化因素(如划痕、色彩褪色和噪声)时,如何实现高质量的修复并提供对象级的颜色控制。其解决方案的关键在于提出一种内部细节保留的扩散模型,利用预训练的Stable Diffusion模型作为生成先验,并引入内部图像细节增强(IIDE)技术,以在去噪过程中保留关键结构和纹理信息,同时减轻退化影响。

链接: https://arxiv.org/abs/2505.18674
作者: Peng Xiao,Hongbo Zhao,Yijun Wang,Jianxin Lin
机构: Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Restoring real-world degraded images, such as old photographs or low-resolution images, presents a significant challenge due to the complex, mixed degradations they exhibit, such as scratches, color fading, and noise. Recent data-driven approaches have struggled with two main challenges: achieving high-fidelity restoration and providing object-level control over colorization. While diffusion models have shown promise in generating high-quality images with specific controls, they often fail to fully preserve image details during restoration. In this work, we propose an internal detail-preserving diffusion model for high-fidelity restoration of real-world degraded images. Our method utilizes a pre-trained Stable Diffusion model as a generative prior, eliminating the need to train a model from scratch. Central to our approach is the Internal Image Detail Enhancement (IIDE) technique, which directs the diffusion model to preserve essential structural and textural information while mitigating degradation effects. The process starts by mapping the input image into a latent space, where we inject the diffusion denoising process with degradation operations that simulate the effects of various degradation factors. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art models in both qualitative assessments and perceptual quantitative evaluations. Additionally, our approach supports text-guided restoration, enabling object-level colorization control that mimics the expertise of professional photo editing.
zh

[CV-232] DVD-Quant: Data-free Video Diffusion Transformers Quantization

【速读】:该论文旨在解决Diffusion Transformers (DiTs)在视频生成任务中计算和内存需求高导致实际部署困难的问题,以及现有后训练量化(PTQ)方法在量化过程中依赖耗时的校准流程和量化后性能显著下降的两大关键挑战。其解决方案的关键在于提出一种无需校准数据的量化框架DVD-Quant,该框架通过三个核心创新:渐进式有界量化(PBQ)、自动缩放旋转量化(ARQ)和δ引导位切换(δ-GBS),实现了量化误差的降低和自适应位宽分配,从而在保持视频质量的前提下显著提升了模型推理速度。

链接: https://arxiv.org/abs/2505.18663
作者: Zhiteng Li,Hanxuan Li,Junyi Wu,Kai Liu,Linghe Kong,Guihai Chen,Yulun Zhang,Xiaokang Yang
机构: Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and models will be available at \url{ this https URL }

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) \delta -Guided Bit Switching ( \delta -GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2 \times speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at this https URL.
zh

[CV-233] So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection

【速读】:该论文旨在解决社交平台上由生成式 AI (Generative AI) 生成的高真实感合成图像对信息完整性和公众信任造成的威胁。现有检测方法和数据集在多样性、规模及真实性方面存在不足,难以有效应对不断演进的生成技术。解决方案的关键在于构建一个面向社交媒体的全面数据集 So-Fake-Set,包含超过200万张高质量图像,涵盖35种先进的生成模型,并建立一个大规模的域外基准 So-Fake-OOD,用于评估模型在未见生成技术下的泛化能力。此外,提出了一种基于强化学习的视觉-语言框架 So-Fake-R1,实现了高精度的伪造检测、精确定位和可解释推理。

链接: https://arxiv.org/abs/2505.18660
作者: Zhenglin Huang,Tianxiao Li,Xiangtai Li,Haiquan Wen,Yiwei He,Jiangning Zhang,Hao Fei,Xi Yang,Xiaowei Huang,Bei Peng,Guangliang Cheng
机构: University of Liverpool, UK (利物浦大学); Bytedance (字节跳动); Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in AI-powered generative models have enabled the creation of increasingly realistic synthetic images, posing significant risks to information integrity and public trust on social media platforms. While robust detection frameworks and diverse, large-scale datasets are essential to mitigate these risks, existing academic efforts remain limited in scope: current datasets lack the diversity, scale, and realism required for social media contexts, while detection methods struggle with generalization to unseen generative technologies. To bridge this gap, we introduce So-Fake-Set, a comprehensive social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and photorealistic imagery synthesized using 35 state-of-the-art generative models. To rigorously evaluate cross-domain robustness, we establish a novel and large-scale (100K) out-of-domain benchmark (So-Fake-OOD) featuring synthetic imagery from commercial models explicitly excluded from the training distribution, creating a realistic testbed for evaluating real-world performance. Leveraging these resources, we present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales. Extensive experiments show that So-Fake-R1 outperforms the second-best method, with a 1.3% gain in detection accuracy and a 4.5% increase in localization IoU. By integrating a scalable dataset, a challenging OOD benchmark, and an advanced detection framework, this work establishes a new foundation for social media-centric forgery detection research. The code, models, and datasets will be released publicly.
zh

[CV-234] Why Not Replace? Sustaining Long-Term Visual Localization via Handcrafted-Learned Feature Collaboration on CPU

【速读】:该论文旨在解决复杂工业环境中移动机器人系统的鲁棒长期视觉定位问题。现有方法存在局限性:手工特征对光照敏感,学习特征计算成本高,语义或标记方法受环境约束。论文提出了一种分层定位框架,其关键在于融合手工特征与学习特征的优势,利用实时手工特征提取进行相对位姿估计,同时在优化的关键帧上采用选择性学习关键点检测实现绝对定位,从而实现高效、长期的视觉定位。

链接: https://arxiv.org/abs/2505.18652
作者: Yicheng Lin,Yunlong Jiang,Xujia Jiao,Bin Han
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 gifures

点击查看摘要

Abstract:Robust long-term visual localization in complex industrial environments is critical for mobile robotic systems. Existing approaches face limitations: handcrafted features are illumination-sensitive, learned features are computationally intensive, and semantic- or marker-based methods are environmentally constrained. Handcrafted and learned features share similar representations but differ functionally. Handcrafted features are optimized for continuous tracking, while learned features excel in wide-baseline matching. Their complementarity calls for integration rather than replacement. Building on this, we propose a hierarchical localization framework. It leverages real-time handcrafted feature extraction for relative pose estimation. In parallel, it employs selective learned keypoint detection on optimized keyframes for absolute positioning. This design enables CPU-efficient, long-term visual localization. Experiments systematically progress through three validation phases: Initially establishing feature complementarity through comparative analysis, followed by computational latency profiling across algorithm stages on CPU platforms. Final evaluation under photometric variations (including seasonal transitions and diurnal cycles) demonstrates 47% average error reduction with significantly improved localization consistency. The code implementation is publicly available at this https URL.
zh

[CV-235] ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos

【速读】:该论文试图解决自动驾驶中世界模型(world model)在动作控制和动作预测方面的不足,传统方法在预测视频时需要给定与视频长度相同的动作序列,并忽略了动态动作规律。解决方案的关键在于提出ProphetDWM,一个端到端的驾驶世界模型,其通过动作模块学习从当前到未来时期的潜在动作,并利用基于扩散模型的转移模块学习状态分布,从而实现未来视频和动作的联合预测,使动作动力学与状态相互关联,支持长期未来预测。

链接: https://arxiv.org/abs/2505.18650
作者: Xiaodong Wang,Peixi Peng
机构: Peking University (北京大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Real-world driving requires people to observe the current environment, anticipate the future, and make appropriate driving decisions. This requirement is aligned well with the capabilities of world models, which understand the environment and predict the future. However, recent world models in autonomous driving are built explicitly, where they could predict the future by controllable driving video generation. We argue that driving world models should have two additional abilities: action control and action prediction. Following this line, previous methods are limited because they predict the video requires given actions of the same length as the video and ignore the dynamical action laws. To address these issues, we propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions. Our world model has an action module to learn latent action from the present to the future period by giving the action sequence and observations. And a diffusion-model-based transition module to learn the state distribution. The model is jointly trained by learning latent actions given finite states and predicting action and video. The joint learning connects the action dynamics and states and enables long-term future prediction. We evaluate our method in video generation and action prediction tasks on the Nuscenes dataset. Compared to the state-of-the-art methods, our method achieves the best video consistency and best action prediction accuracy, while also enabling high-quality long-term video and action generation.
zh

[CV-236] SuperGS: Consistent and Detailed 3D Super-Resolution Scene Reconstruction via Gaussian Splatting

【速读】:该论文旨在解决高分辨率新视角合成(HRNVS)中由于低分辨率输入视图生成的原始几何体过于粗糙而导致的挑战。其解决方案的关键在于提出SuperGS,这是一种基于两阶段粗到精训练框架的扩展方法。在低分辨率阶段,引入潜在特征场作为场景初始化和超分辨率优化的基础信息;在高分辨率阶段,采用多视角一致的密集化策略,通过误差图反投影高分辨率深度图并结合多视角投票机制,减少由伪标签不一致性引起的歧义,同时避免高斯冗余。此外,通过变分特征学习建模不确定性,并用于指导场景表示的进一步优化和调整伪标签的监督效果,从而实现一致且细节丰富的场景重建。

链接: https://arxiv.org/abs/2505.18649
作者: Shiyun Xie,Zhiru Wang,Yinghao Zhu,Xu Wang,Chengwei Pan,Xiwang Dong
机构: Beihang University (北京航空航天大学); Sino-French Engineer School, Beihang University (中法工程师学院,北京航空航天大学); Institute of Unmanned System, Beihang University (无人系统研究院,北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has excelled in novel view synthesis (NVS) with its real-time rendering capabilities and superior quality. However, it encounters challenges for high-resolution novel view synthesis (HRNVS) due to the coarse nature of primitives derived from low-resolution input views. To address this issue, we propose SuperGS, an expansion of Scaffold-GS designed with a two-stage coarse-to-fine training framework. In the low-resolution stage, we introduce a latent feature field to represent the low-resolution scene, which serves as both the initialization and foundational information for super-resolution optimization. In the high-resolution stage, we propose a multi-view consistent densification strategy that backprojects high-resolution depth maps based on error maps and employs a multi-view voting mechanism, mitigating ambiguities caused by multi-view inconsistencies in the pseudo labels provided by 2D prior models while avoiding Gaussian redundancy. Furthermore, we model uncertainty through variational feature learning and use it to guide further scene representation refinement and adjust the supervisory effect of pseudo-labels, ensuring consistent and detailed scene reconstruction. Extensive experiments demonstrate that SuperGS outperforms state-of-the-art HRNVS methods on both forward-facing and 360-degree datasets.
zh

[CV-237] SerendibCoins: Exploring The Sri Lankan Coins Dataset

【速读】:该论文试图解决的是斯里兰卡硬币的识别与分类问题,旨在提升自动化货币识别系统的性能。其解决方案的关键在于构建了一个全面的斯里兰卡硬币图像数据集,并通过传统机器学习分类器(如K-Nearest Neighbors, KNN;Support Vector Machines, SVM;Random Forest)以及自定义的卷积神经网络(Convolutional Neural Network, CNN)进行模型评估与性能对比,从而验证数据集对分类准确性的提升作用。

链接: https://arxiv.org/abs/2505.18634
作者: NH Wanigasingha,ES Sithpahan,MKA Ariyaratne,PRS De Silva
机构: University of Sri Jayewardenepura (斯里贾亚瓦尔德纳普拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages

点击查看摘要

Abstract:The recognition and classification of coins are essential in numerous financial and automated systems. This study introduces a comprehensive Sri Lankan coin image dataset and evaluates its impact on machine learning model accuracy for coin classification. We experiment with traditional machine learning classifiers K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest as well as a custom Convolutional Neural Network (CNN) to benchmark performance at different levels of classification. Our results show that SVM outperforms KNN and Random Forest in traditional classification approaches, while the CNN model achieves near-perfect classification accuracy with minimal misclassifications. The dataset demonstrates significant potential in enhancing automated coin recognition systems, offering a robust foundation for future research in regional currency classification and deep learning applications.
zh

[CV-238] Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

【速读】:该论文旨在解决多概念个性化文本到图像生成中的挑战,特别是如何有效定制对象概念和抽象概念(如姿态、光照)而无需在测试阶段进行微调。现有方法在处理抽象概念时通常需要针对每个新概念进行耗时且容易过拟合的微调,限制了其应用效果。该论文提出的解决方案关键在于一种无需微调的多概念个性化方法,其核心是基于预训练扩散变换器(DiT)模型中的调制机制,通过引入Mod-Adapter模块,预测与概念相关的文本标记的调制方向,并结合视觉-语言交叉注意力和Mixture-of-Experts(MoE)层来提取和映射概念特征。此外,为缓解概念图像空间与调制空间之间的差异,还采用了VLM引导的预训练策略,以提升模型的语义理解能力。

链接: https://arxiv.org/abs/2505.18612
作者: Weizhi Zhong,Huan Yang,Zheng Liu,Huiguo He,Zijian He,Xuesong Niu,Di Zhang,Guanbin Li
机构: Sun Yat-sen University (中山大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pretrained Diffusion Transformers (DiTs) model, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation direction for the modulation process of concept-related text tokens. It incorporates vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pretraining strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.
zh

[CV-239] Spiking Transformers Need High Frequency Information

【速读】:该论文旨在解决脉冲变压器(Spiking Transformers)在性能上与传统人工神经网络存在显著差距的问题。研究发现,脉冲神经元优先传播低频信息,而高频成分的快速衰减是导致性能下降的主要原因。解决方案的关键在于通过引入高通滤波机制来恢复高频信号,具体表现为在 patch embedding 中增加 Max-Pooling 以及用深度可分离卷积替代自注意力机制,从而提升模型的特征表示能力。

链接: https://arxiv.org/abs/2505.18608
作者: Yuetong Fang,Deming Zhou,Ziqing Wang,Hongwei Ren,ZeCui Zeng,Lusong Li,Shibo Zhou,Renjing Xu
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Brain Mind Innovation INC (脑智创新有限公司); JD Explore Academy (京东探索研究院); Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Transformers offer an energy-efficient alternative to conventional deep learning by transmitting information solely through binary (0/1) spikes. However, there remains a substantial performance gap compared to artificial neural networks. A common belief is that their binary and sparse activation transmission leads to information loss, thus degrading feature representation and accuracy. In this work, however, we reveal for the first time that spiking neurons preferentially propagate low-frequency information. We hypothesize that the rapid dissipation of high-frequency components is the primary cause of performance degradation. For example, on Cifar-100, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73%; interestingly, replacing it with Max-Pooling (high-pass) pushes the top-1 accuracy to 79.12%, surpassing the well-tuned Spikformer baseline by 0.97%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: extra Max-Pooling in patch embedding and Depth-Wise Convolution in place of self-attention. Notably, our Max-Former (63.99 M) hits the top-1 accuracy of 82.39% on ImageNet, showing a +7.58% improvement over Spikformer with comparable model size (74.81%, 66.34 M). We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks, beyond the established practice in standard deep learning.
zh

[CV-240] Rethinking Causal Mask Attention for Vision-Language Inference

【速读】:该论文旨在解决当前自回归视觉语言模型(Vision-Language Models, VLMs)中因果注意力机制在处理视觉标记时存在的局限性,即现有的基于因果掩码的策略继承自大语言模型(Large Language Models, LLMs),其设计主要用于文本解码,未能充分适应视觉标记的预填充阶段。严格遮蔽未来位置导致模型无法有效利用包含关键语义线索的未来上下文。论文的关键解决方案是提出一种面向未来的注意力机制,通过池化操作将未来视觉上下文聚合到过去的表示中,在保持自回归结构的同时增强跨标记依赖关系,从而提升视觉语言推理性能。

链接: https://arxiv.org/abs/2505.18605
作者: Xiaohuan Pei,Tao Huang,YanXiang Ma,Chang Xu
机构: The University of Sydney (悉尼大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model’s ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model’s capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
zh

[CV-241] Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理文档图像时存在的信息密集性问题,即大多数查询仅依赖于少数相关区域,而现有模型通常无法有效聚焦于关键区域,导致响应不准确。其解决方案的关键在于引入Doc-CoB(Chain-of-Box)机制,该机制通过模拟人类从粗到细的阅读模式,使模型能够自主选择与查询最相关的区域(框),并在此基础上进行更深入的理解,从而提升文档理解的准确性与有效性。

链接: https://arxiv.org/abs/2505.18603
作者: Ye Mo,Zirui Shao,Kai Ye,Xianwei Mao,Bo Zhang,Hangdi Xing,Peng Ye,Gang Huang,Kehan Chen,Zhou Huan,Zixu Yan,Sheng Zhou
机构: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have made significant progress in document understanding. However, the information-dense nature of document images still poses challenges, as most queries depend on only a few relevant regions, with the rest being redundant. Existing one-pass MLLMs process entire document images without considering query relevance, often failing to focus on critical regions and producing unfaithful responses. Inspired by the human coarse-to-fine reading pattern, we introduce Doc-CoB (Chain-of-Box), a simple-yet-effective mechanism that integrates human-style visual reasoning into MLLM without modifying its architecture. Our method allows the model to autonomously select the set of regions (boxes) most relevant to the query, and then focus attention on them for further understanding. We first design a fully automatic pipeline, integrating a commercial MLLM with a layout analyzer, to generate 249k training samples with intermediate visual reasoning supervision. Then we incorporate two enabling tasks that improve box identification and box-query reasoning, which together enhance document understanding. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability. All code, data, and models will be released publicly.
zh

[CV-242] Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

【速读】:该论文试图解决单图像超分辨率(Single-Image Super-Resolution, SISR)模型在放大倍率超出训练范围时性能急剧下降的可扩展性瓶颈问题。解决方案的关键在于提出一种模型无关的框架Chain-of-Zoom (CoZ),该框架将SISR分解为一系列中间尺度状态的自回归链,并利用多尺度感知文本提示增强每个缩放步骤,从而在不进行额外训练的情况下实现极端分辨率的图像放大。

链接: https://arxiv.org/abs/2505.18600
作者: Bryan Sangwoo Kim,Jeongsol Kim,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity.
zh

[CV-243] EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

【速读】:该论文旨在解决视觉-语言检索(Vision-Language Retrieval, VLR)中现有方法忽视实体的丰富视觉语义知识,导致检索结果不准确的问题。其解决方案的关键在于提出一种基于实体视觉描述(Entity Visual Description, EVD)增强的CLIP模型(EvdCLIP),通过大语言模型生成EVD作为对齐线索,补充文本数据,并结合可训练的EVD感知重写器(EVD-aware Rewriter, EaRW)对查询进行优化,从而生成高质量、低噪声的视觉丰富型查询。

链接: https://arxiv.org/abs/2505.18594
作者: GuangHao Meng,Sunan He,Jinpeng Wang,Tao Dai,Letian Zhang,Jieming Zhu,Qing Li,Gang Wang,Rui Zhang,Yong Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Vision-language retrieval (VLR) has attracted significant attention in both academia and industry, which involves using text (or images) as queries to retrieve corresponding images (or text). However, existing methods often neglect the rich visual semantics knowledge of entities, thus leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually-rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW utilizes EVD knowledge and the generative capabilities of the language model to effectively rewrite queries. With our specialized training strategy, EaRW can generate high-quality and low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
zh

[CV-244] HyperFake: Hyperspectral Reconstruction and Attention-Guided Analysis for Advanced Deepfake Detection

【速读】:该论文试图解决深度伪造(deepfake)检测中现有方法在不同篡改技术与数据集之间泛化能力不足的问题。其解决方案的关键在于引入HyperFake,通过从标准RGB视频重建31通道的高光谱数据(hyperspectral data),从而揭示传统方法无法检测到的隐藏篡改痕迹。该方法结合了改进的MST++架构以增强高光谱重建,并采用光谱注意力机制选择关键光谱特征,最终利用基于EfficientNet的分类器进行高效且通用的深度伪造检测。

链接: https://arxiv.org/abs/2505.18587
作者: Pavan C Shekar,Pawan Soni,Vivek Kanhangad
机构: Indian Institute of Technology Indore (印度理工学院印多尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 1 table. Preliminary results on FaceForensics++ dataset. First approach to use hyperspectral reconstruction for deepfake detection

点击查看摘要

Abstract:Deepfakes pose a significant threat to digital media security, with current detection methods struggling to generalize across different manipulation techniques and datasets. While recent approaches combine CNN-based architectures with Vision Transformers or leverage multi-modal learning, they remain limited by the inherent constraints of RGB data. We introduce HyperFake, a novel deepfake detection pipeline that reconstructs 31-channel hyperspectral data from standard RGB videos, revealing hidden manipulation traces invisible to conventional methods. Using an improved MST++ architecture, HyperFake enhances hyperspectral reconstruction, while a spectral attention mechanism selects the most critical spectral features for deepfake detection. The refined spectral data is then processed by an EfficientNet-based classifier optimized for spectral analysis, enabling more accurate and generalizable detection across different deepfake styles and datasets, all without the need for expensive hyperspectral cameras. To the best of our knowledge, this is the first approach to leverage hyperspectral imaging reconstruction for deepfake detection, opening new possibilities for detecting increasingly sophisticated manipulations.
zh

[CV-245] Guiding the Experts: Semantic Priors for Efficient and Focused MoE Routing

【速读】:该论文旨在解决当前Soft Mixture-of-Experts (MoE)模型中专家路由(expert routing)效率不足的问题,具体表现为现有设计未能充分考虑分配权重中隐含的语义结构,导致专家选择不够优化。论文提出的解决方案关键在于引入一种基于前景引导的增强策略,通过引入一个空间感知的辅助损失函数,促使专家激活与语义前景区域对齐,并结合轻量级LayerScale机制以提升信息流动和优化稳定性。该方法仅需少量架构调整即可有效提升模型性能并增强专家路由的可解释性。

链接: https://arxiv.org/abs/2505.18586
作者: Chengxi Min,Wei Wang,Yahui Liu,Weixin Ye,Enver Sangineto,Qi Wang,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Kuaishou Technology (快手科技); University of Modena and Reggio Emilia (摩德纳和雷焦艾米利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models have emerged as a promising direction for scaling vision architectures efficiently. Among them, Soft MoE improves training stability by assigning each token to all experts via continuous dispatch weights. However, current designs overlook the semantic structure which is implicitly encoded in these weights, resulting in suboptimal expert routing. In this paper, we discover that dispatch weights in Soft MoE inherently exhibit segmentation-like patterns but are not explicitly aligned with semantic regions. Motivated by this observation, we propose a foreground-guided enhancement strategy. Specifically, we introduce a spatially aware auxiliary loss that encourages expert activation to align with semantic foreground regions. To further reinforce this supervision, we integrate a lightweight LayerScale mechanism that improves information flow and stabilizes optimization in skip connections. Our method necessitates only minor architectural adjustments and can be seamlessly integrated into prevailing Soft MoE frameworks. Comprehensive experiments on ImageNet-1K and multiple smaller-scale classification benchmarks not only showcase consistent performance enhancements but also reveal more interpretable expert routing mechanisms.
zh

[CV-246] Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

【速读】:该论文旨在解决Diffusion Transformers (DiTs)在密集视觉对应任务中由于“massive activations”(大量激活)导致的表征不信息性和性能下降问题。其解决方案的关键在于利用零初始化的自适应层归一化(AdaLN-zero)有效定位并归一化这些集中于少数固定维度的大量激活,并通过通道丢弃策略进一步消除其负面影响,从而提出一种无需训练的特征提取框架DiTF,显著提升了DiTs在多种视觉对应任务中的性能。

链接: https://arxiv.org/abs/2505.18584
作者: Chaofan Gan,Yuanpeng Tu,Xi Chen,Tieyuan Chen,Yuxi Li,Mehrtash Harandi,Weiyao Lin
机构: Shanghai Jiao Tong University (上海交通大学); The University of Hong Kong (香港大学); Monash University (莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as \textitmassive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (\eg, with +9.4% on Spair-71k and +4.4% on AP-10K-C.S.).
zh

[CV-247] On Denoising Walking Videos for Gait Recognition

【速读】:该论文旨在解决基于视觉的步态识别中如何排除行走视频中与身份无关的线索(如衣物纹理和颜色)以捕捉个体步态特征的问题。传统基于轮廓和姿态的方法由于输入稀疏且信息量不足,难以达到高精度;而新兴的端到端方法则通过使用人类先验直接对RGB视频进行去噪。该研究的关键在于提出了一种名为DenoisingGait的新颖步态去噪方法,其核心是受“我无法创造的东西,我就不理解”哲学启发,利用生成扩散模型部分过滤与步态理解无关的因素,并引入几何驱动的特征匹配模块,结合人体轮廓背景去除,将多通道扩散特征压缩为两通道方向向量,从而生成一种新型的类似流场的步态表示——Gait Feature Field,有效降低扩散特征中的残余噪声。

链接: https://arxiv.org/abs/2505.18582
作者: Dongyang Jin,Chao Fan,Jingzhe Ma,Jingkai Zhou,Weihua Chen,Shiqi Yu
机构: Southern University of Science and Technology (南方科技大学); Shenzhen University (深圳大学); Shenzhen Polytechnic University (深圳职业技术学院); Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8pages, 4 figures

点击查看摘要

Abstract:To capture individual gait patterns, excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that “what I cannot create, I do not understand”, we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations. Code is available at this https URL.
zh

[CV-248] Learning without Isolation: Pathway Protection for Continual Learning

【速读】:该论文试图解决深度网络在顺序任务学习过程中出现的灾难性遗忘问题(catastrophic forgetting),即在学习新任务时会丢失旧任务的知识。现有方法主要通过调节或保护与旧任务相关的参数来缓解该问题,但这种方法在实践中往往不可行,因为存储旧任务知识的参数规模随任务数量线性增长,难以有效保存。本文从神经科学和物理学的角度提出了一种新的观点:在网络整体中,路径(pathways)比参数更关键,特别是在涉及旧任务知识时。基于这一观点,论文提出了一个名为“学习 without isolation”(LwI)的新型持续学习框架,其关键在于将模型融合建模为图匹配,并在不隔离的情况下保护旧任务占用的路径。利用深度网络中激活通道的稀疏性,LwI能够自适应地为新任务分配可用路径,从而以参数高效的方式实现路径保护并缓解灾难性遗忘。

链接: https://arxiv.org/abs/2505.18568
作者: Zhikang Chen,Abudukelimu Wuerkaixi,Sen Cui,Haoxuan Li,Ding Li,Jingfeng Zhang,Bo Han,Gang Niu,Houfang Liu,Yi Yang,Sifan Yang,Changshui Zhang,Tianling Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages

点击查看摘要

Abstract:Deep networks are prone to catastrophic forgetting during sequential task learning, i.e., losing the knowledge about old tasks upon learning new tasks. To this end, continual learning(CL) has emerged, whose existing methods focus mostly on regulating or protecting the parameters associated with the previous tasks. However, parameter protection is often impractical, since the size of parameters for storing the old-task knowledge increases linearly with the number of tasks, otherwise it is hard to preserve the parameters related to the old-task knowledge. In this work, we bring a dual opinion from neuroscience and physics to CL: in the whole networks, the pathways matter more than the parameters when concerning the knowledge acquired from the old tasks. Following this opinion, we propose a novel CL framework, learning without isolation(LwI), where model fusion is formulated as graph matching and the pathways occupied by the old tasks are protected without being isolated. Thanks to the sparsity of activation channels in a deep network, LwI can adaptively allocate available pathways for a new task, realizing pathway protection and addressing catastrophic forgetting in a parameter-efficient manner. Experiments on popular benchmark datasets demonstrate the superiority of the proposed LwI.
zh

[CV-249] hinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts

【速读】:该论文旨在解决推理视频目标分割(Reasoning Video Object Segmentation)中的挑战,即从输入视频和隐式、复杂的文本查询中生成精确的掩码序列。现有方法通过微调多模态大语言模型(Multimodal Large Language Models, MLLM)实现基于分割的输出,但在处理时间敏感查询的复杂视频场景时仍存在不足,主要原因是未能有效整合时空信息。论文提出的解决方案关键在于引入ThinkVideo框架,该框架利用MLLM的零样本链式思维(Chain-of-Thought, CoT)能力,通过CoT提示提取特定关键帧的目标选择性,并结合推理图像分割模型与SAM2视频处理器生成掩码序列,从而实现无需训练且兼容闭源MLLM的高效分割。

链接: https://arxiv.org/abs/2505.18561
作者: Shiu-hong Kao,Yu-Wing Tai,Chi-Keung Tang
机构: HKUST(香港科技大学); Dartmouth College(达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Reasoning Video Object Segmentation is a challenging task, which generates a mask sequence from an input video and an implicit, complex text query. Existing works probe into the problem by finetuning Multimodal Large Language Models (MLLM) for segmentation-based output, while still falling short in difficult cases on videos given temporally-sensitive queries, primarily due to the failure to integrate temporal and spatial information. In this paper, we propose ThinkVideo, a novel framework which leverages the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these challenges. Specifically, ThinkVideo utilizes the CoT prompts to extract object selectivities associated with particular keyframes, then bridging the reasoning image segmentation model and SAM2 video processor to output mask sequences. The ThinkVideo framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. We further extend the framework for online video streams, where the CoT is used to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that ThinkVideo significantly outperforms previous works in both cases, qualitatively and quantitatively.
zh

[CV-250] Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

【速读】:该论文试图解决推理阶段多偏好对齐问题,即在不进行额外微调的情况下,使扩散模型能够根据用户指定的奖励函数和正则化强度的线性组合生成符合需求的图像。解决方案的关键在于提出Diffusion Blend方法,通过融合与微调模型相关的反向扩散过程,实现对多个奖励函数和KL正则化强度的灵活控制,从而在推理时高效地满足不同用户的偏好。

链接: https://arxiv.org/abs/2505.18547
作者: Min Cheng,Fatemeh Doudi,Dileep Kalathil,Mohammad Ghavamzadeh,Panganamala R. Kumar
机构: Texas A&M University (得克萨斯A&M大学); Amazon AGI (亚马逊AGI)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference-time. The code is available at this https URLthis http URL.
zh

[CV-251] Generative RLHF-V: Learning Principles from Multi-modal Human Preference

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)对齐人类意图的长期挑战,尤其是传统仅基于分数的奖励模型(Score-only Reward Models)在准确性、泛化能力和可解释性方面的不足,这些缺陷限制了对齐方法(如基于人类反馈的强化学习,Reinforcement Learning from Human Feedback, RLHF)的发展。解决方案的关键在于引入一种新的对齐框架——Generative RLHF-V,该框架将生成式奖励模型(Generative Reward Models, GRMs)与多模态RLHF相结合,通过两阶段流程:第一阶段为从强化学习中进行多模态生成式奖励建模,利用强化学习引导GRMs主动捕捉人类意图并预测正确的成对评分;第二阶段为从分组比较中进行强化学习优化,通过对比分组响应提升多模态RL评分精度。

链接: https://arxiv.org/abs/2505.18531
作者: Jiayi Zhou,Jiaming Ji,Boyuan Chen,Jiapeng Sun,Wenqi Chen,Donghai Hong,Sirui Han,Yike Guo,Yaodong Yang
机构: Peking University (北京大学); Hong Kong University of Science and Technology (香港科技大学); University College London (伦敦大学学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs’ intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: \textbfmulti-modal generative reward modeling from RL , where RL guides GRMs to actively capture human intention, then predict the correct pair-wise scores; and \textbfRL optimization from grouped comparison , which enhances multi-modal RL scoring precision by grouped responses comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs’ performance across 7 benchmarks by 18.1% , while the baseline RLHF is only 5.3% . We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at this https URL.
zh

[CV-252] K-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation

【速读】:该论文旨在解决3D医学图像分割中的高维数据挑战和复杂空间依赖性问题,传统单模态网络如卷积神经网络(CNN)和Transformer在3D场景下受限于计算效率和上下文建模能力。其解决方案的关键在于提出一种结合Mamba和Kolmogorov-Arnold Networks(KAN)的多模态框架,通过三个核心创新:首先,引入增强门控空间卷积(EGSC)模块以捕捉3D图像展开为1D序列时的空间信息;其次,将Group-Rational KAN(GR-KAN)扩展为3D-Group-Rational KAN(3D-GR-KAN),首次应用于3D医学影像领域,实现更优的特征表示;最后,采用双分支文本驱动策略,利用CLIP的文本嵌入增强语义对齐与器官间关系的保持。

链接: https://arxiv.org/abs/2505.18525
作者: Haoyu Yang,Yuxiang Cai,Jintao Chen,Xuhong Zhang,Wenhui Lei,Xiaoming Shi,Jianwei Yin,Yankai Jiang
机构: Zhejiang University (浙江大学); Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D medical image segmentation is vital for clinical diagnosis and treatment but is challenged by high-dimensional data and complex spatial dependencies. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling in 3D settings. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling. Our approach features three key innovations: First, an EGSC (Enhanced Gated Spatial Convolution) module captures spatial information when unfolding 3D images into 1D sequences. Second, we extend Group-Rational KAN (GR-KAN), a Kolmogorov-Arnold Networks variant with rational basis functions, into 3D-Group-Rational KAN (3D-GR-KAN) for 3D medical imaging - its first application in this domain - enabling superior feature representation tailored to volumetric data. Third, a dual-branch text-driven strategy leverages CLIP’s text embeddings: one branch swaps one-hot labels for semantic vectors to preserve inter-organ semantic relationships, while the other aligns images with detailed organ descriptions to enhance semantic alignment. Experiments on the Medical Segmentation Decathlon (MSD) and KiTS23 datasets show our method achieving state-of-the-art performance, surpassing existing approaches in accuracy and efficiency. This work highlights the power of combining advanced sequence modeling, extended network architectures, and vision-language synergy to push forward 3D medical image segmentation, delivering a scalable solution for clinical use. The source code is openly available at this https URL.
zh

[CV-253] Improved Immiscible Diffusion: Accelerate Diffusion Training by Reducing Its Miscibility

【速读】:该论文试图解决扩散模型训练成本高的问题,特别是由于噪声空间中扩散轨迹混合带来的效率瓶颈。其解决方案的关键在于通过减少轨迹的混溶性(miscibility)来加速训练过程,具体表现为利用线性分配以外的多种实现方式(如K近邻噪声选择和图像缩放)在任意层面上降低混溶性,从而简化去噪过程并提升效率。

链接: https://arxiv.org/abs/2505.18521
作者: Yiheng Li,Feng Liang,Dan Kondratyuk,Masayoshi Tomizuka,Kurt Keutzer,Chenfeng Xu
机构: UC Berkeley (加州大学伯克利分校); UT Austin (德克萨斯大学奥斯汀分校); LUMA AI (LUMA AI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The substantial training cost of diffusion models hinders their deployment. Immiscible Diffusion recently showed that reducing diffusion trajectory mixing in the noise space via linear assignment accelerates training by simplifying denoising. To extend immiscible diffusion beyond the inefficient linear assignment under high batch sizes and high dimensions, we refine this concept to a broader miscibility reduction at any layer and by any implementation. Specifically, we empirically demonstrate the bijective nature of the denoising process with respect to immiscible diffusion, ensuring its preservation of generative diversity. Moreover, we provide thorough analysis and show step-by-step how immiscibility eases denoising and improves efficiency. Extending beyond linear assignment, we propose a family of implementations including K-nearest neighbor (KNN) noise selection and image scaling to reduce miscibility, achieving up to 4x faster training across diverse models and tasks including unconditional/conditional generation, image editing, and robotics planning. Furthermore, our analysis of immiscibility offers a novel perspective on how optimal transport (OT) enhances diffusion training. By identifying trajectory miscibility as a fundamental bottleneck, we believe this work establishes a potentially new direction for future research into high-efficiency diffusion training. The code is available at this https URL.
zh

[CV-254] Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning ACL2025

【速读】:该论文旨在解决医学大型视觉-语言模型(Med-LVLMs)在视觉输入上表现出的注意力分布不佳问题,这导致了生成结果中的幻觉或不准确性。其解决方案的关键在于提出一种名为A^3 Tune的新型微调框架,该框架利用SAM的零样本弱标签,并通过BioMedCLIP将其细化为提示感知标签,随后选择性地调整视觉关键注意力头以提高对齐度,同时最小化干扰;此外,还引入了A^3 MoE模块,实现跨多样提示和图像的注意力调优自适应参数选择。

链接: https://arxiv.org/abs/2505.18503
作者: Aofei Chang,Le Huang,Alex James Boyd,Parminder Bhatia,Taha Kass-Hout,Cao Xiao,Fenglong Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACL2025 (main)

点击查看摘要

Abstract:Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A ^3 Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A ^3 Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce a A ^3 MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A ^3 Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
zh

[CV-255] Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

【速读】:该论文旨在解决机器人操作中学习有效视觉表示的根本性挑战,这一挑战源于动作执行过程中复杂的机体动力学。其解决方案的关键在于提出一种名为ICon的对比学习方法,该方法应用于视觉变换器(Vision Transformers, ViTs)的token级表示,通过在特征空间中分离代理特定和环境特定的token,生成以代理为中心的视觉表示,这些表示嵌入了与机体相关的归纳偏置。

链接: https://arxiv.org/abs/2505.18487
作者: Junlin Wang,Zhiyun Lin
机构: SUSTech(南方科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: A preprint version

点击查看摘要

Abstract:Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present \textbfI nter-token \textbfCon trast ( \textbfICon ), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: this https URL
zh

[CV-256] Syn3DTxt: Embedding 3D Cues for Scene Text Generation CVPR

【速读】:该论文试图解决合成数据集中三维上下文不足的问题(insufficient three-dimensional context),这一问题限制了场景文本生成模型在复杂真实场景中对空间布局与视觉效果交互关系的准确捕捉。解决方案的关键在于提出一种新的合成数据集构建标准,通过引入表面法线(surface normals)来增强三维场景特征,从而提升空间关系的表示能力,并为未来的场景文本渲染方法提供更稳健的基础。

链接: https://arxiv.org/abs/2505.18479
作者: Li-Syun Hsiung,Jun-Kai Tu,Kuan-Wu Chu,Yu-Hsuan Chiu,Yan-Tsung Peng,Sheng-Luen Chung,Gee-Sern Jison Hsu
机构: National Taiwan University of Science and Technology (国立台湾科技大学); National Chengchi University (国立政治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR workshop 2025: SyntaGen

点击查看摘要

Abstract:This study aims to investigate the challenge of insufficient three-dimensional context in synthetic datasets for scene text rendering. Although recent advances in diffusion models and related techniques have improved certain aspects of scene text generation, most existing approaches continue to rely on 2D data, sourcing authentic training examples from movie posters and book covers, which limits their ability to capture the complex interactions among spatial layout and visual effects in real-world scenes. In particular, traditional 2D datasets do not provide the necessary geometric cues for accurately embedding text into diverse backgrounds. To address this limitation, we propose a novel standard for constructing synthetic datasets that incorporates surface normals to enrich three-dimensional scene characteristic. By adding surface normals to conventional 2D data, our approach aims to enhance the representation of spatial relationships and provide a more robust foundation for future scene text rendering methods. Extensive experiments demonstrate that datasets built under this new standard offer improved geometric context, facilitating further advancements in text rendering under complex 3D-spatial conditions.
zh

[CV-257] ZooplanktonBench: A Geo-Aware Zooplankton Recognition and Classification Dataset from Marine Observations

【速读】:该论文旨在解决海洋科学中对浮游动物(zooplankton)数量进行准确监测及理解其种群变化与海洋环境关系的难题。由于浮游动物与背景(如海洋雪)在外观上高度相似,传统通用计算机视觉工具难以有效分析由此类新兴成像技术生成的大规模视频数据。论文提出的解决方案关键在于构建ZooplanktonBench数据集,该数据集包含与丰富地理空间元数据相关联的浮游动物图像和视频,定义了检测、分类和跟踪浮游动物的一系列任务,以应对复杂环境下的挑战,推动先进计算机视觉系统在动态且变化巨大的环境中提升视觉理解和地理感知能力。

链接: https://arxiv.org/abs/2505.18477
作者: Fukun Liu,Adam T. Greer,Gengchen Mai,Jin Sun
机构: University of Georgia(佐治亚大学); University of Texas at Austin(德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Plankton are small drifting organisms found throughout the world’s oceans. One component of this plankton community is the zooplankton, which includes gelatinous animals and crustaceans (e.g. shrimp), as well as the early life stages (i.e., eggs and larvae) of many commercially important fishes. Being able to monitor zooplankton abundances accurately and understand how populations change in relation to ocean conditions is invaluable to marine science research, with important implications for future marine seafood productivity. While new imaging technologies generate massive amounts of video data of zooplankton, analyzing them using general-purpose computer vision tools developed for general objects turns out to be highly challenging due to the high similarity in appearance between the zooplankton and its background (e.g., marine snow). In this work, we present the ZooplanktonBench, a benchmark dataset containing images and videos of zooplankton associated with rich geospatial metadata (e.g., geographic coordinates, depth, etc.) in various water ecosystems. ZooplanktonBench defines a collection of tasks to detect, classify, and track zooplankton in challenging settings, including highly cluttered environments, living vs non-living classification, objects with similar shapes, and relatively small objects. Our dataset presents unique challenges and opportunities for state-of-the-art computer vision systems to evolve and improve visual understanding in a dynamic environment with huge variations and be geo-aware.
zh

[CV-258] HonestFace: Towards Honest Face Restoration with One-Step Diffusion Model

【速读】:该论文旨在解决低质量输入下人脸修复过程中保持高保真度、真实特征以及避免引入伪影或偏见的问题。解决方案的关键在于提出HonestFace,其核心组件包括:一种身份嵌入器以有效捕捉并保留关键身份特征,一种掩码人脸对齐方法以增强细粒度细节和纹理真实性,以及基于仿射变换原理的新地标评估指标,从而提升面部特征对齐的准确性。通过在单步扩散模型框架中整合这些贡献,HonestFace实现了卓越的人脸保真度和真实感修复效果。

链接: https://arxiv.org/abs/2505.18469
作者: Jingkai Wang,Wu Miao,Jue Gong,Zheng Chen,Xing Liu,Hong Gu,Yutong Liu,Yulun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face restoration has achieved remarkable advancements through the years of development. However, ensuring that restored facial images exhibit high fidelity, preserve authentic features, and avoid introducing artifacts or biases remains a significant challenge. This highlights the need for models that are more “honest” in their reconstruction from low-quality inputs, accurately reflecting original characteristics. In this work, we propose HonestFace, a novel approach designed to restore faces with a strong emphasis on such honesty, particularly concerning identity consistency and texture realism. To achieve this, HonestFace incorporates several key components. First, we propose an identity embedder to effectively capture and preserve crucial identity features from both the low-quality input and multiple reference faces. Second, a masked face alignment method is presented to enhance fine-grained details and textural authenticity, thereby preventing the generation of patterned or overly synthetic textures and improving overall clarity. Furthermore, we present a new landmark-based evaluation metric. Based on affine transformation principles, this metric improves the accuracy compared to conventional L2 distance calculations for facial feature alignment. Leveraging these contributions within a one-step diffusion model framework, HonestFace delivers exceptional restoration results in terms of facial fidelity and realism. Extensive experiments demonstrate that our approach surpasses existing state-of-the-art methods, achieving superior performance in both visual quality and quantitative assessments. The code and pre-trained models will be made publicly available at this https URL .
zh

[CV-259] BiomechGPT : Towards a Biomechanically Fluent Multimodal Foundation Model for Clinically Relevant Motion Tasks

【速读】:该论文试图解决在多样化临床环境中获取的运动数据进行下游分析的挑战,特别是针对不同用户可能需要的多种任务,单独开发定制化分析代码既不可行也未能充分利用人类运动的共性特征。解决方案的关键在于构建一个能够理解并回答与运动相关详细临床问题的多模态运动-语言模型,通过将运动轨迹进行标记化处理,并在此基础上训练BiomechGPT模型,使其在活动识别、运动障碍识别、诊断、临床结局评分和步态测量等任务中表现出色,从而为康复运动数据建立基础模型提供重要进展。

链接: https://arxiv.org/abs/2505.18465
作者: Ruize Yang,Ann Kennedy,R. James Cotton
机构: Northwestern University (西北大学); Shirley Ryan AbilityLab (Shirley Ryan康复实验室); The Scripps Research Institute (斯克里普斯研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advances in markerless motion capture are expanding access to biomechanical movement analysis, making it feasible to obtain high-quality movement data from outpatient clinics, inpatient hospitals, therapy, and even home. Expanding access to movement data in these diverse contexts makes the challenge of performing downstream analytics all the more acute. Creating separate bespoke analysis code for all the tasks end users might want is both intractable and does not take advantage of the common features of human movement underlying them all. Recent studies have shown that fine-tuning language models to accept tokenized movement as an additional modality enables successful descriptive captioning of movement. Here, we explore whether such a multimodal motion-language model can answer detailed, clinically meaningful questions about movement. We collected over 30 hours of biomechanics from nearly 500 participants, many with movement impairments from a variety of etiologies, performing a range of movements used in clinical outcomes assessments. After tokenizing these movement trajectories, we created a multimodal dataset of motion-related questions and answers spanning a range of tasks. We developed BiomechGPT, a multimodal biomechanics-language model, on this dataset. Our results show that BiomechGPT demonstrates high performance across a range of tasks such as activity recognition, identifying movement impairments, diagnosis, scoring clinical outcomes, and measuring walking. BiomechGPT provides an important step towards a foundation model for rehabilitation movement data.
zh

[CV-260] Mitigating Context Bias in Domain Adaptation for Object Detection using Mask Pooling

【速读】:该论文试图解决在目标检测训练过程中出现的上下文偏差(context bias)问题,即前景物体与背景之间的关联性,这种偏差会影响模型在未见领域中的泛化能力。为了解决这一问题,论文提出了一种基于因果视角的分析,并指出卷积网络架构中的池化操作可能是导致该偏差的根源。解决方案的关键在于引入一种替代方法——掩码池化(Mask Pooling),通过额外输入前景掩码,将池化过程分离到前景和背景区域,从而提升模型在不同领域下的鲁棒性检测能力。

链接: https://arxiv.org/abs/2505.18446
作者: Hojun Son,Asma Almutairi,Arpan Kusari
机构: University of Michigan Transportation Research Institute (密歇根大学交通研究所); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context bias refers to the association between the foreground objects and background during the object detection training process. Various methods have been proposed to minimize the context bias when applying the trained model to an unseen domain, known as domain adaptation for object detection (DAOD). But a principled approach to understand why the context bias occurs and how to remove it has been missing. In this work, we provide a causal view of the context bias, pointing towards the pooling operation in the convolution network architecture as the possible source of this bias. We present an alternative, Mask Pooling, which uses an additional input of foreground masks, to separate the pooling process in the respective foreground and background regions and show that this process leads the trained model to detect objects in a more robust manner under different domains. We also provide a benchmark designed to create an ultimate test for DAOD, using foregrounds in the presence of absolute random backgrounds, to analyze the robustness of the intended trained models. Through these experiments, we hope to provide a principled approach for minimizing context bias under domain shift. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.18446 [cs.CV] (or arXiv:2505.18446v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.18446 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-261] OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

【速读】:该论文旨在解决扩散模型在图像风格化任务中面临的两个核心问题:一是复杂场景中风格一致性的保持,特别是身份、构图和细节的稳定性;二是图像到图像管道中使用风格LoRAs时风格退化的问题。解决方案的关键在于提出一种名为\textbf{OmniConsistency}的通用一致性插件,其核心包括:基于对齐图像对的上下文一致性学习框架以实现鲁棒泛化,一种将风格学习与一致性保持解耦的两阶段渐进学习策略以缓解风格退化,以及一种完全即插即用的设计,兼容任意风格LoRAs并在Flux框架下运行。

链接: https://arxiv.org/abs/2505.18445
作者: Yiren Song,Cheng Liu,Mike Zheng Shou
机构: Show Lab, National University of Singapore (Show 实验室,新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o’s exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose \textbfOmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to commercial state-of-the-art model GPT-4o.
zh

[CV-262] NG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP

【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在否定理解(negation understanding)方面的不足,即模型难以准确识别概念的缺失或排除。现有方法通过大型语言模型(Large Language Model, LLM)生成包含否定的图像描述数据以微调CLIP,但该过程耗时且计算成本高,且评估范围有限。本文的关键解决方案是:(1) 引入一种训练阶段的否定数据生成流程,在训练过程中生成否定描述,仅增加2.5%的额外训练时间;(2) 提出首个针对包含否定提示的文本到图像生成模型的基准测试Neg-TtoI,评估模型生成语义准确图像的能力。实验表明,所提出的TNG-CLIP在多种否定基准任务中取得了最先进性能。

链接: https://arxiv.org/abs/2505.18434
作者: Yuliang Cai,Jesse Thomason,Mohammad Rostami
机构: University of Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale data of image captions containing negation for further fine-tuning CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline such that negation captions are generated during the training stage, which only increases 2.5% extra training time, and (2) we propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing model’s ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks of image-to-text matching, text-to-image retrieval, and image generation.
zh

[CV-263] CENet: Context Enhancement Network for Medical Image Segmentation MICCAI-2025

【速读】:该论文旨在解决医学图像分割中多领域场景下的挑战,包括精确保留解剖结构、准确的边界表示、器官形态的变异性以及下采样过程中的信息丢失问题。其解决方案的关键在于提出了一种名为上下文增强网络(Context Enhancement Network, CENet)的新型分割框架,该框架包含两个核心创新:一是集成到跳跃连接中的双选择增强模块(Dual Selective Enhancement Block, DSEB),用于增强边界细节并提升小器官的检测能力;二是解码器中的上下文特征注意力模块(Context Feature Attention Module, CFAM),通过多尺度设计保持空间完整性、减少特征冗余并缓解过度增强的表示。

链接: https://arxiv.org/abs/2505.18423
作者: Afshin Bozorgpour,Sina Ghorbani Kolahi,Reza Azad,Ilker Hacihaliloglu,Dorit Merhof
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Provisionally accepted at MICCAI-2025

点击查看摘要

Abstract:Medical image segmentation, particularly in multi-domain scenarios, requires precise preservation of anatomical structures across diverse representations. While deep learning has advanced this field, existing models often struggle with accurate boundary representation, variability in organ morphology, and information loss during downsampling, limiting their accuracy and robustness. To address these challenges, we propose the Context Enhancement Network (CENet), a novel segmentation framework featuring two key innovations. First, the Dual Selective Enhancement Block (DSEB) integrated into skip connections enhances boundary details and improves the detection of smaller organs in a context-aware manner. Second, the Context Feature Attention Module (CFAM) in the decoder employs a multi-scale design to maintain spatial integrity, reduce feature redundancy, and mitigate overly enhanced representations. Extensive evaluations on both radiology and dermoscopic datasets demonstrate that CENet outperforms state-of-the-art (SOTA) methods in multi-organ segmentation and boundary detail preservation, offering a robust and accurate solution for complex medical image analysis tasks. The code is publicly available at this https URL.
zh

[CV-264] Dynamics of Affective States During Takeover Requests in Conditionally Automated Driving Among Older Adults with and without Cognitive Impairment

【速读】:该论文试图解决认知功能受损的老年人在条件自动化车辆(Conditionally Automated Vehicles)中应对接管请求(Takeover Requests, TORs)时情绪反应不足可能导致的驾驶安全问题。解决方案的关键在于通过面部表情分析评估情感状态,以理解驾驶员在不同道路几何形状和速度下的情绪变化,并据此开发能够检测情感状态并支持安全交接的自适应车辆系统。

链接: https://arxiv.org/abs/2505.18416
作者: Gelareh Hajian,Ali Abedi,Bing Ye,Jennifer Campos,Alex Mihailidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 16 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Driving is a key component of independence and quality of life for older adults. However, cognitive decline associated with conditions such as mild cognitive impairment and dementia can compromise driving safety and often lead to premature driving cessation. Conditionally automated vehicles, which require drivers to take over control when automation reaches its operational limits, offer a potential assistive solution. However, their effectiveness depends on the driver’s ability to respond to takeover requests (TORs) in a timely and appropriate manner. Understanding emotional responses during TORs can provide insight into drivers’ engagement, stress levels, and readiness to resume control, particularly in cognitively vulnerable populations. This study investigated affective responses, measured via facial expression analysis of valence and arousal, during TORs among cognitively healthy older adults and those with cognitive impairment. Facial affect data were analyzed across different road geometries and speeds to evaluate within- and between-group differences in affective states. Within-group comparisons using the Wilcoxon signed-rank test revealed significant changes in valence and arousal during TORs for both groups. Cognitively healthy individuals showed adaptive increases in arousal under higher-demand conditions, while those with cognitive impairment exhibited reduced arousal and more positive valence in several scenarios. Between-group comparisons using the Mann-Whitney U test indicated that cognitively impaired individuals displayed lower arousal and higher valence than controls across different TOR conditions. These findings suggest reduced emotional response and awareness in cognitively impaired drivers, highlighting the need for adaptive vehicle systems that detect affective states and support safe handovers for vulnerable users.
zh

[CV-265] Rehabilitation Exercise Quality Assessment and Feedback Generation Using Large Language Models with Prompt Engineering

【速读】:该论文试图解决康复训练中因交通限制和人员短缺导致的高脱落率问题,以及现有研究在使用大型语言模型(Large Language Models, LLMs)进行运动质量反馈方面的不足。解决方案的关键在于从患者进行康复训练时的骨骼关节特征中提取特定运动信息,并将其输入预训练的LLMs中,通过多种提示技术(如零样本、少样本、思维链和角色扮演提示)评估运动质量并生成自然语言反馈,从而帮助患者改善动作。

链接: https://arxiv.org/abs/2505.18412
作者: Jessica Tang,Ali Abedi,Tracey J.F. Colella,Shehroz S. Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 16 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Exercise-based rehabilitation improves quality of life and reduces morbidity, mortality, and rehospitalization, though transportation constraints and staff shortages lead to high dropout rates from rehabilitation programs. Virtual platforms enable patients to complete prescribed exercises at home, while AI algorithms analyze performance, deliver feedback, and update clinicians. Although many studies have developed machine learning and deep learning models for exercise quality assessment, few have explored the use of large language models (LLMs) for feedback and are limited by the lack of rehabilitation datasets containing textual feedback. In this paper, we propose a new method in which exercise-specific features are extracted from the skeletal joints of patients performing rehabilitation exercises and fed into pre-trained LLMs. Using a range of prompting techniques, such as zero-shot, few-shot, chain-of-thought, and role-play prompting, LLMs are leveraged to evaluate exercise quality and provide feedback in natural language to help patients improve their movements. The method was evaluated through extensive experiments on two publicly available rehabilitation exercise assessment datasets (UI-PRMD and REHAB24-6) and showed promising results in exercise assessment, reasoning, and feedback generation. This approach can be integrated into virtual rehabilitation platforms to help patients perform exercises correctly, support recovery, and improve health outcomes.
zh

[CV-266] Recent Deep Learning in Crowd Behaviour Analysis: A Brief Review

【速读】:该论文旨在解决人群行为分析(crowd behaviour analysis)中的关键问题,特别是在公共安全和城市规划等实际应用中对人群行为预测与识别的需求。其解决方案的关键在于利用深度学习(deep learning)技术,通过不同类型的深度神经网络模型,包括纯深度神经网络以及结合物理模型的混合方法,来提升对人群行为的理解与建模能力。论文系统地回顾了相关研究进展,并探讨了现有方法的有效性及未来研究方向。

链接: https://arxiv.org/abs/2505.18401
作者: Jiangbei Yue,He Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 51 pages, 7 figures, Book Chapter

点击查看摘要

Abstract:Crowd behaviour analysis is essential to numerous real-world applications, such as public safety and urban planning, and therefore has been studied for decades. In the last decade or so, the development of deep learning has significantly propelled the research on crowd behaviours. This chapter reviews recent advances in crowd behaviour analysis using deep learning. We mainly review the research in two core tasks in this field, crowd behaviour prediction and recognition. We broadly cover how different deep neural networks, after first being proposed in machine learning, are applied to analysing crowd behaviours. This includes pure deep neural network models as well as recent development of methodologies combining physics with deep learning. In addition, representative studies are discussed and compared in detail. Finally, we discuss the effectiveness of existing methods and future research directions in this rapidly evolving field. This chapter aims to provide a high-level summary of the ongoing deep learning research in crowd behaviour analysis. It intends to help new researchers who just entered this field to obtain an overall understanding of the ongoing research, as well as to provide a retrospective analysis for existing researchers to identify possible future directions
zh

[CV-267] aming Diffusion for Dataset Distillation with High Representativeness ICML2025

【速读】:该论文试图解决基于扩散模型的图像数据集蒸馏方法中存在的分布匹配不准确、随机噪声导致的分布偏移以及独立采样等问题。其解决方案的关键在于提出D^3HR框架,通过采用DDIM反演将完整数据集的潜在表示从低正态性潜在域映射到高正态性高斯域,以保留信息并确保结构一致性,从而生成具有高代表性的潜在表示,并引入一种高效的采样方案以更好地对齐这些潜在表示与高正态性高斯分布。

链接: https://arxiv.org/abs/2505.18399
作者: Lin Zhao,Yushu Wu,Xinru Jiang,Jianyang Gu,Yanzhi Wang,Xiaolin Xu,Pu Zhao,Xue Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The paper is accepted by ICML 2025

点击查看摘要

Abstract:Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion, it has been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D^3HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation. Source code: this https URL.
zh

[CV-268] Monocular Marker-free Patient-to-Image Intraoperative Registration for Cochlear Implant Surgery

【速读】:该论文旨在解决单目患者-图像术中配准问题,特别是在无需外部硬件跟踪设备或标志点标记的情况下实现术中导航。其解决方案的关键在于利用合成显微外科场景数据集,通过轻量级神经网络和零样本学习方法,直接将术前CT扫描映射到术中二维手术图像,从而实现实时耳蜗植入手术引导。该方法通过学习合成数据集中的变换来估计包含旋转矩阵和平移向量的相机位姿,实现了无需额外硬件依赖的高精度、高效的术中配准。

链接: https://arxiv.org/abs/2505.18381
作者: Yike Zhang,Eduardo Davalos Anaya,Jack H. Noble
机构: Vanderbilt University (范德比尔特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a novel method for monocular patient-to-image intraoperative registration, specifically designed to operate without any external hardware tracking equipment or fiducial point markers. Leveraging a synthetic microscopy surgical scene dataset with a wide range of transformations, our approach directly maps preoperative CT scans to 2D intraoperative surgical frames through a lightweight neural network for real-time cochlear implant surgery guidance via a zero-shot learning approach. Unlike traditional methods, our framework seamlessly integrates with monocular surgical microscopes, making it highly practical for clinical use without additional hardware dependencies and requirements. Our method estimates camera poses, which include a rotation matrix and a translation vector, by learning from the synthetic dataset, enabling accurate and efficient intraoperative registration. The proposed framework was evaluated on nine clinical cases using a patient-specific and cross-patient validation strategy. Our results suggest that our approach achieves clinically relevant accuracy in predicting 6D camera poses for registering 3D preoperative CT scans to 2D surgical scenes with an angular error within 10 degrees in most cases, while also addressing limitations of traditional methods, such as reliance on external tracking systems or fiducial markers.
zh

[CV-269] Weakly-supervised Mamba-Based Mastoidectomy Shape Prediction for Cochlear Implant Surgery Using 3D T-Distribution Loss

【速读】:该论文旨在解决在人工耳蜗植入手术中,如何准确预测乳突切除区域以辅助术前规划和提高手术效果的问题。其解决方案的关键在于提出一种基于Mamba的弱监督框架,利用受学生t分布启发的3D T-Distribution损失函数,有效处理乳突切除区域复杂的几何变异性,并通过先前自监督网络的分割结果实现弱监督,从而避免了手动数据清洗或标注的需求。

链接: https://arxiv.org/abs/2505.18368
作者: Yike Zhang,Jack H. Noble
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cochlear implant surgery is a treatment for individuals with severe hearing loss. It involves inserting an array of electrodes inside the cochlea to electrically stimulate the auditory nerve and restore hearing sensation. A crucial step in this procedure is mastoidectomy, a surgical intervention that removes part of the mastoid region of the temporal bone, providing a critical pathway to the cochlea for electrode placement. Accurate prediction of the mastoidectomy region from preoperative imaging assists presurgical planning, reduces surgical risks, and improves surgical outcomes. In previous work, a self-supervised network was introduced to predict the mastoidectomy region using only preoperative CT scans. While promising, the method suffered from suboptimal robustness, limiting its practical application. To address this limitation, we propose a novel weakly-supervised Mamba-based framework to predict accurate mastoidectomy regions directly from preoperative CT scans. Our approach utilizes a 3D T-Distribution loss function inspired by the Student-t distribution, which effectively handles the complex geometric variability inherent in mastoidectomy shapes. Weak supervision is achieved using the segmentation results from the prior self-supervised network to eliminate the need for manual data cleaning or labeling throughout the training process. The proposed method is extensively evaluated against state-of-the-art approaches, demonstrating superior performance in predicting accurate and clinically relevant mastoidectomy regions. Our findings highlight the robustness and efficiency of the weakly-supervised learning framework with the proposed novel 3D T-Distribution loss.
zh

[CV-270] CONCORD: Concept-Informed Diffusion for Dataset Distillation

【速读】:该论文试图解决数据集蒸馏(Dataset Distillation, DD)中生成过程缺乏对每个样本的显式可控性问题,以及传统方法在实例层面忽视概念完整性的问题。其解决方案的关键在于引入大语言模型(Large Language Models, LLMs)的概念理解能力,通过概念引导的扩散(Concept-Informed Diffusion, CONCORD)方法,在去噪过程中利用细粒度的概念信息来增强图像生成的可控性和可解释性,从而提升蒸馏数据集的质量。

链接: https://arxiv.org/abs/2505.18358
作者: Jianyang Gu,Haonan Wang,Ruoxi Jia,Saeed Vahidian,Vyacheslav Kungurtsev,Wei Jiang,Yiran Chen
机构: The Ohio State University (俄亥俄州立大学); National University of Singapore (新加坡国立大学); Virginia Tech (弗吉尼亚理工学院); Duke University (杜克大学); Czech Technical University (捷克技术大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset distillation (DD) has witnessed significant progress in creating small datasets that encapsulate rich information from large original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, whereas overlooking concept completeness at the instance level. The missing or incorrectly represented object details cannot be efficiently compensated due to the constrained sample amount typical in DD settings. To this end, we propose incorporating the concept understanding of large language models (LLMs) to perform Concept-Informed Diffusion (CONCORD) for dataset distillation. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without relying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. The code implementation is released in this https URL.
zh

[CV-271] Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance

【速读】:该论文旨在解决动物姿态和外观的准确且可扩展量化问题,传统3D姿态估计技术如基于关键点和网格的方法面临表征细节有限、人工标注耗时以及每帧优化成本高昂等挑战,这些限制阻碍了对细微运动的研究并使大规模分析变得不切实际。论文提出的解决方案是Pose Splatter,其关键在于利用形状雕刻和3D高斯点云渲染技术,在无需先验动物几何知识、每帧优化或人工标注的情况下建模完整的动物姿态和外观,并引入一种旋转不变的视觉嵌入技术作为下游行为分析中3D关键点数据的替代方案。

链接: https://arxiv.org/abs/2505.18342
作者: Jack Goffinet,Youngjo Min,Carlo Tomasi,David E. Carlson
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 13 figures

点击查看摘要

Abstract:Accurate and scalable quantification of animal pose and appearance is crucial for studying behavior. Current 3D pose estimation techniques, such as keypoint- and mesh-based techniques, often face challenges including limited representational detail, labor-intensive annotation requirements, and expensive per-frame optimization. These limitations hinder the study of subtle movements and can make large-scale analyses impractical. We propose Pose Splatter, a novel framework leveraging shape carving and 3D Gaussian splatting to model the complete pose and appearance of laboratory animals without prior knowledge of animal geometry, per-frame optimization, or manual annotations. We also propose a novel rotation-invariant visual embedding technique for encoding pose and appearance, designed to be a plug-in replacement for 3D keypoint data in downstream behavioral analyses. Experiments on datasets of mice, rats, and zebra finches show Pose Splatter learns accurate 3D animal geometries. Notably, Pose Splatter represents subtle variations in pose, provides better low-dimensional pose embeddings over state-of-the-art as evaluated by humans, and generalizes to unseen data. By eliminating annotation and per-frame optimization bottlenecks, Pose Splatter enables analysis of large-scale, longitudinal behavior needed to map genotype, neural activity, and micro-behavior at unprecedented resolution.
zh

[CV-272] DART3: Leverag ing Distance for Test Time Adaptation in Person Re-Identification

【速读】:该论文旨在解决行人重识别(Person ReID)模型在实际监控系统中因相机偏差(camera bias)导致的性能下降问题,即模型学习到的表征倾向于根据相机视角而非身份进行聚类。其解决方案的关键在于提出一种面向测试时适应(TTA)的框架DART^3,该框架采用基于距离的目标函数,通过利用最近邻距离与预测误差之间的相关性,更好地适配图像检索任务,从而有效缓解由相机引入的领域偏移问题。DART^3无需源数据、架构修改或重新训练,具有良好的部署灵活性和适应性。

链接: https://arxiv.org/abs/2505.18337
作者: Rajarshi Bhattacharya,Shakeeb Murtaza,Christian Desrosiers,Jose Dolz,Maguelonne Heritier,Eric Granger
机构: LIVIA, École de technologie supérieure (LIVIA, 高等技术学院); Genetec Inc (Genetec 公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Person re-identification (ReID) models are known to suffer from camera bias, where learned representations cluster according to camera viewpoints rather than identity, leading to significant performance degradation under (inter-camera) domain shifts in real-world surveillance systems when new cameras are added to camera networks. State-of-the-art test-time adaptation (TTA) methods, largely designed for classification tasks, rely on classification entropy-based objectives that fail to generalize well to ReID, thus making them unsuitable for tackling camera bias. In this paper, we introduce DART ^3 , a TTA framework specifically designed to mitigate camera-induced domain shifts in person ReID. DART ^3 (Distance-Aware Retrieval Tuning at Test Time) leverages a distance-based objective that aligns better with image retrieval tasks like ReID by exploiting the correlation between nearest-neighbor distance and prediction error. Unlike prior ReID-specific domain adaptation methods, DART ^3 requires no source data, architectural modifications, or retraining, and can be deployed in both fully black-box and hybrid settings. Empirical evaluations on multiple ReID benchmarks indicate that DART ^3 and DART ^3 LITE, a lightweight alternative to the approach, consistently outperforms state-of-the-art TTA baselines, making for a viable option to online learning to mitigate the adverse effects of camera bias.
zh

[CV-273] COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification

【速读】:该论文旨在解决当前卷积神经网络(Convolutional Neural Network, CNN)微调方法中存在的效率低下问题。其解决方案的关键在于提出一种名为卷积低秩适配(Convolutional Low-Rank Adaptation, CoLoRA)的方法,该方法是对低秩适配(Low-Rank Adaptation, LoRA)技术的自然扩展,通过引入低秩矩阵来优化CNN的微调过程,从而实现更稳定、准确且计算效率更高的参数更新策略。

链接: https://arxiv.org/abs/2505.18315
作者: Mariano Rivera,Angello Hoyos
机构: Centro de Investigacion en Matematicas AC (Centro de Investigacion en Matematicas AC)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 12 figures. Submitted to Jou. Pattern Recognition

点击查看摘要

Abstract:We introduce the Convolutional Low-Rank Adaptation (CoLoRA) method, designed explicitly to overcome the inefficiencies found in current CNN fine-tuning methods. CoLoRA can be seen as a natural extension of the convolutional architectures of the Low-Rank Adaptation (LoRA) technique. We demonstrate the capabilities of our method by developing and evaluating models using the widely adopted CNN backbone pre-trained on ImageNet. We observed that this strategy results in a stable and accurate coarse-tuning procedure. Moreover, this strategy is computationally efficient and significantly reduces the number of parameters required for fine-tuning compared to traditional methods. Furthermore, our method substantially improves the speed and stability of training. Our case study focuses on classifying retinal diseases from optical coherence tomography (OCT) images, specifically using the OCTMNIST dataset. Experimental results demonstrate that a CNN backbone fine-tuned with CoLoRA surpasses nearly 1% in accuracy. Such a performance is comparable to the Vision Transformer, State-space discrete, and Kolmogorov-Arnold network models.
zh

[CV-274] CTRL-GS: Cascaded Temporal Residue Learning for 4D Gaussian Splatting CVPR2025

【速读】:该论文试图解决动态场景的新型视角合成问题,特别是针对多视角图像或视频捕获的场景,传统辐射场方法在处理动态内容时存在局限性。其解决方案的关键在于对动态场景进行分层分解,引入“视频段-帧”结构,并通过光流动态调整视频段,同时将信号建模为视频常量、段常量和帧特定残差的组合,这一方法受到残差学习的成功启发,从而实现更灵活且适应性强的模型。

链接: https://arxiv.org/abs/2505.18306
作者: Karly Hou,Wanhua Li,Hanspeter Pfister
机构: Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 4D Vision Workshop @ CVPR 2025

点击查看摘要

Abstract:Recently, Gaussian Splatting methods have emerged as a desirable substitute for prior Radiance Field methods for novel-view synthesis of scenes captured with multi-view images or videos. In this work, we propose a novel extension to 4D Gaussian Splatting for dynamic scenes. Drawing on ideas from residual learning, we hierarchically decompose the dynamic scene into a “video-segment-frame” structure, with segments dynamically adjusted by optical flow. Then, instead of directly predicting the time-dependent signals, we model the signal as the sum of video-constant values, segment-constant values, and frame-specific residuals, as inspired by the success of residual learning. This approach allows more flexible models that adapt to highly variable scenes. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets, with the greatest improvements on complex scenes with large movements, occlusions, and fine details, where current methods degrade most.
zh

[CV-275] Sampling Strategies for Efficient Training of Deep Learning Object Detection Algorithms

【速读】:该论文试图解决深度学习目标检测模型训练效率低的问题,特别是在标注样本数量有限的情况下提升模型性能。解决方案的关键在于利用深度学习模型的Lipschitz连续性假设,提出两种采样策略:均匀采样和帧差采样。均匀采样通过在目标动态状态空间中随机且均匀地获取样本以提高数据多样性,而帧差采样则旨在挖掘视频中连续帧之间的时序冗余性,从而减少冗余样本的依赖,最终在较少人工标注样本的情况下实现良好的训练效果。

链接: https://arxiv.org/abs/2505.18302
作者: Gefei Shen,Yung-Hong Sun,Yu Hen Hu,Hongrui Jiang
机构: University of Wisconsin - Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Two sampling strategies are investigated to enhance efficiency in training a deep learning object detection model. These sampling strategies are employed under the assumption of Lipschitz continuity of deep learning models. The first strategy is uniform sampling which seeks to obtain samples evenly yet randomly through the state space of the object dynamics. The second strategy of frame difference sampling is developed to explore the temporal redundancy among successive frames in a video. Experiment result indicates that these proposed sampling strategies provide a dataset that yields good training performance while requiring relatively few manually labelled samples.
zh

[CV-276] InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning ACL2025

【速读】:该论文试图解决当前大型多模态基础模型在理解物体组成部分及其功能方面的不足,特别是在任务导向的部件分割(task-oriented part segmentation)方面表现不佳的问题。其解决方案的关键在于引入了一个新的现实世界基准数据集InstructPart,该数据集包含手动标注的部件分割注释和任务导向的指令,用于评估模型在日常情境中理解和执行部件级任务的能力,并通过微调策略提升模型性能。

链接: https://arxiv.org/abs/2505.18291
作者: Zifu Wan,Yaqi Xie,Ce Zhang,Zhiqiu Lin,Zihan Wang,Simon Stepputtis,Deva Ramanan,Katia Sycara
机构: Robotics Institute, Carnegie Mellon University (机器人学院,卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ACL 2025 Main. Project page: this https URL

点击查看摘要

Abstract:Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object’s functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: this https URL.
zh

[CV-277] Improvement Strategies for Few-Shot Learning in OCT Image Classification of Rare Retinal Diseases

【速读】:该论文试图解决在OCT诊断图像分类中,由于主要类别和罕见类别数据分布不均衡导致的分类准确率不足的问题。其解决方案的关键在于采用基于生成对抗网络(GAN)的数据增强策略,并引入U-GAT-IT生成模型以提升生成部分的质量,同时结合数据平衡技术缩小各类别之间准确率的差异。最终通过集成CBAM注意力机制和微调的InceptionV3模型,实现了97.85%的整体准确率,显著优于原始基线模型。

链接: https://arxiv.org/abs/2505.20149
作者: Cheng-Yu Tai,Ching-Wen Chen,Chi-Chin Wu,Bo-Chen Chiu,Cheng-Hung(Dixson)Lin,Cheng-Kai Lu,Jia-Kang Wang,Tzu-Lun Huang
机构: Yuan Ze University (元智大学); National Taiwan Normal University (台湾师范大学); Far Eastern Memorial Hospital (远东纪念医院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper focuses on using few-shot learning to improve the accuracy of classifying OCT diagnosis images with major and rare classes. We used the GAN-based augmentation strategy as a baseline and introduced several novel methods to further enhance our model. The proposed strategy contains U-GAT-IT for improving the generative part and uses the data balance technique to narrow down the skew of accuracy between all categories. The best model obtained was built with CBAM attention mechanism and fine-tuned InceptionV3, and achieved an overall accuracy of 97.85%, representing a significant improvement over the original baseline.
zh

[CV-278] Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models

【速读】:该论文试图解决如何将最新的基础模型应用于医学图像分类任务,并评估其在有限标注数据下的性能表现。解决方案的关键在于对多种先进的基础模型(如DINOv2、MAE、VMamba、CoCa、SAM2和AIMv2)进行微调,并在多个医学图像数据集(如CBIS-DDSM、ISIC2019、APTOS2019和CHEXPERT)上评估其分类效果,以验证这些模型在医学领域的适用性和有效性。

链接: https://arxiv.org/abs/2505.19779
作者: Mobina Mansoori,Sajjad Shahabodini,Farnoush Bayatmakou,Jamshid Abouei,Konstantinos N. Plataniotis,Arash Mohammadi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Using massive datasets, foundation models are large-scale, pre-trained models that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-of-the-art foundation models, DINOv2, MAE, VMamba, CoCa, SAM2, and AIMv2, for medical image classification. We explore their effectiveness on datasets including CBIS-DDSM for mammography, ISIC2019 for skin lesions, APTOS2019 for diabetic retinopathy, and CHEXPERT for chest radiographs. By fine-tuning these models and evaluating their configurations, we aim to understand the potential of these advancements in medical image classification. The results indicate that these advanced models significantly enhance classification outcomes, demonstrating robust performance despite limited labeled data. Based on our results, AIMv2, DINOv2, and SAM2 models outperformed others, demonstrating that progress in natural domain training has positively impacted the medical domain and improved classification outcomes. Our code is publicly available at: this https URL.
zh

[CV-279] A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images

【速读】:该论文旨在解决对比学习(Contrastive Learning, CL)方法在遥感(Remote Sensing, RS)图像中存在领域差距(domain gap)的问题,即尽管CL在计算机视觉任务中表现优异,但在RS图像上的适应性仍需改进。解决方案的关键在于提出一种名为PerA的新型自监督学习方法,通过语义上完美对齐的样本对生成通用的RS特征。该方法利用空间不相交的掩码对增强图像进行采样,从而在保持语义一致性的同时引入外观差异,并通过教师-学生框架和可学习的掩码令牌实现高质量特征提取,提升了模型的内存效率与大规模训练能力。

链接: https://arxiv.org/abs/2505.19447
作者: Hengtong Shen,Haiyan Gu,Haitao Li,Yi Yang,Agen qiu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-Supervised Learning (SSL) enables us to pre-train foundation models without costly labeled data. Among SSL methods, Contrastive Learning (CL) methods are better at obtaining accurate semantic representations in noise interference. However, due to the significant domain gap, while CL methods have achieved great success in many computer vision tasks, they still require specific adaptation for Remote Sensing (RS) images. To this end, we present a novel self-supervised method called PerA, which produces all-purpose RS features through semantically Perfectly Aligned sample pairs. Specifically, PerA obtains features from sampled views by applying spatially disjoint masks to augmented images rather than random cropping. With disjoint masks, we divide patches from different views into different parts that are semantically aligned but inconsistent in appearance. Our framework provides high-quality features by ensuring consistency between teacher and student and predicting learnable mask tokens. Compared to previous contrastive methods, our method demonstrates higher memory efficiency and can be trained with larger batches due to its sparse inputs. We also collect an unlabeled pre-training dataset, which contains about 5 million RS images. We conducted experiments on multiple downstream task datasets and achieved performance comparable to previous state-of-the-art methods with a limited model scale, which verified the superiority of our method. We hope this work will contribute to practical remote sensing interpretation works.
zh

[CV-280] RGC-Bent: A Novel Dataset for Bent Radio Galaxy Classification ICIP2025

【速读】:该论文旨在解决弯曲射电活动星系核(Bent radio AGN)分类困难的问题,这一问题主要源于缺乏专门的数据集和基准。解决方案的关键在于构建一个针对窄角尾(NAT)和宽角尾(WAT)类别的射电AGN分类数据集,并提供详细的数据处理步骤,同时评估先进深度学习模型在该数据集上的性能,其中ConvNeXT在F1分数上表现最佳。

链接: https://arxiv.org/abs/2505.19249
作者: Mir Sazzat Hossain,Khan Muhammad Bin Asad,Payaswini Saikia,Adrita Khan,Md Akil Raihan Iftee,Rakibul Hasan Rajib,Arshad Momen,Md Ashraful Amin,Amin Ahsan Ali,AKM Mahbubur Rahman
机构: 未知
类目: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 2 tables, Accepted In ICIP 2025

点击查看摘要

Abstract:We introduce a novel machine learning dataset tailored for the classification of bent radio active galactic nuclei (AGN) in astronomical observations. Bent radio AGN, distinguished by their curved jet structures, provide critical insights into galaxy cluster dynamics, interactions within the intracluster medium, and the broader physics of AGN. Despite their astrophysical significance, the classification of bent radio AGN remains a challenge due to the scarcity of specialized datasets and benchmarks. To address this, we present a dataset, derived from a well-recognized radio astronomy survey, that is designed to support the classification of NAT (Narrow-Angle Tail) and WAT (Wide-Angle Tail) categories, along with detailed data processing steps. We further evaluate the performance of state-of-the-art deep learning models on the dataset, including Convolutional Neural Networks (CNNs), and transformer-based architectures. Our results demonstrate the effectiveness of advanced machine learning models in classifying bent radio AGN, with ConvNeXT achieving the highest F1-scores for both NAT and WAT sources. By sharing this dataset and benchmarks, we aim to facilitate the advancement of research in AGN classification, galaxy cluster environments and galaxy evolution.
zh

[CV-281] MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation

【速读】:该论文试图解决医学影像中缺乏统一视觉分词器(visual tokenizer)的问题,这一问题限制了自回归模型在医疗影像领域的潜力发挥。现有模型难以同时捕捉细粒度视觉结构以实现准确的图像重建与真实图像生成,以及表达丰富的语义以支持精准诊断与影像解读。解决方案的关键在于提出MedITok,这是首个针对医学影像设计的统一分词器,其通过将低层次结构细节与高层次临床语义编码到统一潜在空间中,实现了对医学影像的全面表征。为平衡这两个目标,研究引入了一种新颖的两阶段训练框架,首先通过视觉语义约束启动分词器的重建学习,随后通过文本语义表示对齐将详细临床语义注入潜在空间。

链接: https://arxiv.org/abs/2505.19225
作者: Chenglong Ma,Yuanfeng Ji,Jin Ye,Zilong Li,Chenhui Wang,Junzhi Ning,Wei Li,Lihao Liu,Qiushan Guo,Tianbin Li,Junjun He,Hongming Shan
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Stanford University (斯坦福大学); Shanghai AI Laboratory (上海人工智能实验室); ByteDance Seed (字节跳动种子)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer – one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: this https URL.
zh

[CV-282] Unsupervised cell segmentation by fast Gaussian Processes

【速读】:该论文试图解决从时间推移显微镜视频中分析细胞行为时,细胞边界信息的获取问题。现有监督式细胞分割工具如ImageJ需要调整多种参数,并依赖于对目标形状的严格假设;而基于卷积神经网络的最新监督式分割工具虽然提高了准确性,但依赖高质量标注图像,难以用于数据库中未包含的新类型目标的分割。该研究提出了一种基于快速高斯过程的无监督细胞分割算法,无需参数调优或对目标形状做出限制性假设。其关键在于推导出适用于含有不同亮度区域的异质图像的鲁棒阈值判定标准,并采用分水岭分割方法区分相互接触的细胞对象。

链接: https://arxiv.org/abs/2505.18902
作者: Laura Baracaldo,Blythe King,Haoran Yan,Yizi Lin,Nina Miolane,Mengyang Gu
机构: University of California, Santa Barbara(加利福尼亚大学圣塔芭芭拉分校)
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cell boundary information is crucial for analyzing cell behaviors from time-lapse microscopy videos. Existing supervised cell segmentation tools, such as ImageJ, require tuning various parameters and rely on restrictive assumptions about the shape of the objects. While recent supervised segmentation tools based on convolutional neural networks enhance accuracy, they depend on high-quality labelled images, making them unsuitable for segmenting new types of objects not in the database. We developed a novel unsupervised cell segmentation algorithm based on fast Gaussian processes for noisy microscopy images without the need for parameter tuning or restrictive assumptions about the shape of the object. We derived robust thresholding criteria adaptive for heterogeneous images containing distinct brightness at different parts to separate objects from the background, and employed watershed segmentation to distinguish touching cell objects. Both simulated studies and real-data analysis of large microscopy images demonstrate the scalability and accuracy of our approach compared with the alternatives.
zh

[CV-283] Memory-Efficient Super-Resolution of 3D Micro-CT Images Using Octree-Based GANs: Enhancing Resolution and Segmentation Accuracy

【速读】:该论文旨在解决微米级三维断层扫描(micro-CT)图像在岩石样本中因X射线衰减重叠导致的分割不准确问题,并提升图像分辨率。其关键解决方案是采用一种基于3D八叉树(Octree)结构的卷积Wasserstein生成对抗网络(Wasserstein Generative Adversarial Network, WGAN),通过引入八叉树结构优化3D卷积层的内存消耗,从而实现16倍超分辨率的重建,有效克服了体积深度学习中的内存瓶颈问题。

链接: https://arxiv.org/abs/2505.18664
作者: Evgeny Ugolkov,Xupeng He,Hyung Kwak,Hussein Hoteit
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 31 pages, 15 figures

点击查看摘要

Abstract:We present a memory-efficient algorithm for significantly enhancing the quality of segmented 3D micro-Computed Tomography (micro-CT) images of rocks using a generative model. The proposed model achieves a 16x increase in resolution and corrects inaccuracies in segmentation caused by the overlapping X-ray attenuation in micro-CT measurements across different minerals. The generative model employed is a 3D Octree-based convolutional Wasserstein generative adversarial network with gradient penalty. To address the challenge of high memory consumption inherent in standard 3D convolutional layers, we implemented an Octree structure within the 3D progressive growing generator model. This enabled the use of memory-efficient 3D Octree-based convolutional layers. The approach is pivotal in overcoming the long-standing memory bottleneck in volumetric deep learning, making it possible to reach 16x super-resolution in 3D, a scale that is challenging to attain due to cubic memory scaling. For training, we utilized segmented 3D low-resolution micro-CT images along with unpaired segmented complementary 2D high-resolution laser scanning microscope images. Post-training, resolution improved from 7 to 0.44 micro-m/voxel with accurate segmentation of constituent minerals. Validated on Berea sandstone, this framework demonstrates substantial improvements in pore characterization and mineral differentiation, offering a robust solution to one of the primary computational limitations in modern geoscientific imaging.
zh

[CV-284] ropical Geometry Based Edge Detection Using Min-Plus and Max-Plus Algebra

【速读】:该论文试图解决传统边缘检测方法在低对比度和纹理区域中边界检测性能不足的问题,其解决方案的关键在于引入基于热带代数(tropical algebra)的卷积和梯度计算框架,通过最小-最大代数(min-plus和max-plus algebra)重新表述边缘检测过程,从而突出主要的强度变化,实现更清晰和连续的边缘表示。该框架结合多尺度处理、Hessian滤波和小波收缩等技术,在提升边缘过渡质量的同时保持计算效率。

链接: https://arxiv.org/abs/2505.18625
作者: Shivam Kumar Jha S,Jaya NN Iyer
机构: The Institute of Mathematical Sciences (HBNI)
类目: Algebraic Geometry (math.AG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes a tropical geometry-based edge detection framework that reformulates convolution and gradient computations using min-plus and max-plus algebra. The tropical formulation emphasizes dominant intensity variations, contributing to sharper and more continuous edge representations. Three variants are explored: an adaptive threshold-based method, a multi-kernel min-plus method, and a max-plus method emphasizing structural continuity. The framework integrates multi-scale processing, Hessian filtering, and wavelet shrinkage to enhance edge transitions while maintaining computational efficiency. Experiments on MATLAB built-in grayscale and color images suggest that tropical formulations integrated with classical operators, such as Canny and LoG, can improve boundary detection in low-contrast and textured regions. Quantitative evaluation using standard edge metrics indicates favorable edge clarity and structural coherence. These results highlight the potential of tropical algebra as a scalable and noise-aware formulation for edge detection in practical image analysis tasks.
zh

[CV-285] ReflectGAN: Modeling Vegetation Effects for Soil Carbon Estimation from Satellite Imagery

【速读】:该论文试图解决在植被覆盖区域中,由于植物覆盖导致的土壤反射率光谱污染问题,从而影响土壤有机碳(SOC)的准确估算。解决方案的关键是提出一种基于生成对抗网络(GAN)的新型配对框架——Reflectance Transformation Generative Adversarial Network (ReflectGAN),该框架通过学习植被覆盖与裸土反射率之间的光谱转换,重建出精确的裸土反射率,从而提升混合地表覆盖条件下SOC的估算精度。

链接: https://arxiv.org/abs/2505.18546
作者: Dristi Datta,Manoranjan Paul,Manzur Murshed,Shyh Wei Teng,Leigh M. Schmidtke
机构: Charles Sturt University (查尔斯·斯特尔大学); Cooperative Research Centre for High Performance Soils (高性能土壤合作研究中心); Deakin University (迪肯大学); Federation University (联邦大学); Gulbali Institute (古尔巴利研究所)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Soil organic carbon (SOC) is a critical indicator of soil health, but its accurate estimation from satellite imagery is hindered in vegetated regions due to spectral contamination from plant cover, which obscures soil reflectance and reduces model reliability. This study proposes the Reflectance Transformation Generative Adversarial Network (ReflectGAN), a novel paired GAN-based framework designed to reconstruct accurate bare soil reflectance from vegetated soil satellite observations. By learning the spectral transformation between vegetated and bare soil reflectance, ReflectGAN facilitates more precise SOC estimation under mixed land cover conditions. Using the LUCAS 2018 dataset and corresponding Landsat 8 imagery, we trained multiple learning-based models on both original and ReflectGAN-reconstructed reflectance inputs. Models trained on ReflectGAN outputs consistently outperformed those using existing vegetation correction methods. For example, the best-performing model (RF) achieved an R^2 of 0.54, RMSE of 3.95, and RPD of 2.07 when applied to the ReflectGAN-generated signals, representing a 35% increase in R^2 , a 43% reduction in RMSE, and a 43% improvement in RPD compared to the best existing method (PMM-SU). The performance of the models with ReflectGAN is also better compared to their counterparts when applied to another dataset, i.e., Sentinel-2 imagery. These findings demonstrate the potential of ReflectGAN to improve SOC estimation accuracy in vegetated landscapes, supporting more reliable soil monitoring.
zh

[CV-286] How We Won the ISLES24 Challenge by Preprocessing

【速读】:该论文旨在解决急性卒中病变边界准确识别的问题,这对于卒中诊断和治疗至关重要。为应对监督深度学习方法对大规模、多样化且标注数据的依赖,ISLES’24挑战提供了纵向卒中影像数据,包括入院时的CT扫描和后续2-9天的MRI扫描,并以后续MRI作为标注来源。模型仅使用CT输入进行评估,需预测可能在CT中不可见的病变进展。该研究提出的解决方案关键在于设计了一个包含基于深度学习的去骨处理和自定义强度窗宽的预处理流程,并结合标准的大残差nnU-Net架构,从而实现了平均测试Dice系数为28.5(标准差21.27)的分割效果。

链接: https://arxiv.org/abs/2505.18424
作者: Tianyi Ren,Juampablo E. Heras Rivera,Hitender Oswal,Yutong Pan,William Henry,Jacob Ruzevick,Mehmet Kurt
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stroke is among the top three causes of death worldwide, and accurate identification of stroke lesion boundaries is critical for diagnosis and treatment. Supervised deep learning methods have emerged as the leading solution for stroke lesion segmentation but require large, diverse, and annotated datasets. The ISLES’24 challenge addresses this need by providing longitudinal stroke imaging data, including CT scans taken on arrival to the hospital and follow-up MRI taken 2-9 days from initial arrival, with annotations derived from follow-up MRI. Importantly, models submitted to the ISLES’24 challenge are evaluated using only CT inputs, requiring prediction of lesion progression that may not be visible in CT scans for segmentation. Our winning solution shows that a carefully designed preprocessing pipeline including deep-learning-based skull stripping and custom intensity windowing is beneficial for accurate segmentation. Combined with a standard large residual nnU-Net architecture for segmentation, this approach achieves a mean test Dice of 28.5 with a standard deviation of 21.27.
zh

[CV-287] Brightness-Invariant Tracking Estimation in Tagged MRI

【速读】:该论文旨在解决在标记磁共振(tagged MRI)成像中,由于纵向弛豫和稳态进程导致的标签和组织亮度随时间变化的问题,这些问题使得基于光流的方法在跟踪时容易产生误差。为了解决这一问题,论文提出了亮度不变跟踪估计(BRITE)技术,其关键在于通过去耦观察到的标记图像序列中的解剖结构与标记模式,并同时估计拉格朗日运动。该方法利用去噪扩散概率模型的表达能力来表示潜在解剖结构的概率分布,并结合物理信息神经网络的灵活性来估计生物合理的运动,从而有效应对该问题的固有不适定性。

链接: https://arxiv.org/abs/2505.18365
作者: Zhangxing Bian,Shuwen Wei,Xiao Liang,Yuan-Chiao Lu,Samuel W. Remedios,Fangxu Xing,Jonghye Woo,Dzung L. Pham,Aaron Carass,Philip V. Bayly,Jiachen Zhuo,Ahmed Alshareef,Jerry L. Prince
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IPMI 2025

点击查看摘要

Abstract:Magnetic resonance (MR) tagging is an imaging technique for noninvasively tracking tissue motion in vivo by creating a visible pattern of magnetization saturation (tags) that deforms with the tissue. Due to longitudinal relaxation and progression to steady-state, the tags and tissue brightnesses change over time, which makes tracking with optical flow methods error-prone. Although Fourier methods can alleviate these problems, they are also sensitive to brightness changes as well as spectral spreading due to motion. To address these problems, we introduce the brightness-invariant tracking estimation (BRITE) technique for tagged MRI. BRITE disentangles the anatomy from the tag pattern in the observed tagged image sequence and simultaneously estimates the Lagrangian motion. The inherent ill-posedness of this problem is addressed by leveraging the expressive power of denoising diffusion probabilistic models to represent the probabilistic distribution of the underlying anatomy and the flexibility of physics-informed neural networks to estimate biologically-plausible motion. A set of tagged MR images of a gel phantom was acquired with various tag periods and imaging flip angles to demonstrate the impact of brightness variations and to validate our method. The results show that BRITE achieves more accurate motion and strain estimates as compared to other state of the art methods, while also being resistant to tag fading.
zh

[CV-288] Large Language Model-Driven Distributed Integrated Multimodal Sensing and Semantic Communications

【速读】:该论文试图解决传统单模态感知系统在复杂动态环境中感知效果不佳的问题,以及单设备系统因视角有限和空间覆盖不足而导致的城市或非视距场景下的性能下降问题。解决方案的关键在于提出一种基于大语言模型(LLM)的分布式多模态感知与语义通信框架(LLM-DiSAC),该框架通过多协同感知设备与聚合中心的协作,结合RF-视觉融合网络(RVFN)、基于LLM的语义传输网络(LSTN)和基于Transformer的聚合模型(TRAM),实现多模态数据的有效融合与通信效率的提升,并引入两阶段分布式学习策略以保障数据隐私。

链接: https://arxiv.org/abs/2505.18194
作者: Yubo Peng,Luping Xiang,Bingxin Zhang,Kun Yang
机构: Nanjing University (南京大学); School of Intelligent Software and Engineering, Nanjing University (Suzhou Campus) (南京大学智能软件与工程学院(苏州校区)); Southeast University (东南大学)
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional single-modal sensing systems-based solely on either radio frequency (RF) or visual data-struggle to cope with the demands of complex and dynamic environments. Furthermore, single-device systems are constrained by limited perspectives and insufficient spatial coverage, which impairs their effectiveness in urban or non-line-of-sight scenarios. To overcome these challenges, we propose a novel large language model (LLM)-driven distributed integrated multimodal sensing and semantic communication (LLM-DiSAC) framework. Specifically, our system consists of multiple collaborative sensing devices equipped with RF and camera modules, working together with an aggregation center to enhance sensing accuracy. First, on sensing devices, LLM-DiSAC develops an RF-vision fusion network (RVFN), which employs specialized feature extractors for RF and visual data, followed by a cross-attention module for effective multimodal integration. Second, a LLM-based semantic transmission network (LSTN) is proposed to enhance communication efficiency, where the LLM-based decoder leverages known channel parameters, such as transceiver distance and signal-to-noise ratio (SNR), to mitigate semantic distortion. Third, at the aggregation center, a transformer-based aggregation model (TRAM) with an adaptive aggregation attention mechanism is developed to fuse distributed features and enhance sensing accuracy. To preserve data privacy, a two-stage distributed learning strategy is introduced, allowing local model training at the device level and centralized aggregation model training using intermediate features. Finally, evaluations on a synthetic multi-view RF-visual dataset generated by the Genesis simulation engine show that LLM-DiSAC achieves a good performance.
zh

[CV-289] AI- Enhanced Stethoscope in Remote Diagnostics for Cardiopulmonary Diseases

【速读】:该论文试图解决全球范围内心脏和肺部疾病导致的突发性和过早死亡问题,以及现有检测和治疗方法在及时诊断方面面临的挑战。其解决方案的关键在于提出一种创新且高效的模型,该模型结合生成式AI(Generative AI)对肺部和心脏状况进行同时诊断,利用听诊音进行分析。该模型采用MFCC特征提取与工程方法,并通过融合门控循环单元(GRU)与卷积神经网络(CNN)的混合模型处理音频信号,同时设计为可在低成本嵌入式设备上部署,以满足资源匮乏地区的医疗需求。此外,该模型还能生成数字音频记录,用于分类六种肺部疾病和五种心血管疾病,从而实现基于网络应用的实时分析,推动标准化医疗的发展。

链接: https://arxiv.org/abs/2505.18184
作者: Hania Ghouse,Juveria Tanveen,Abdul Muqtadir Ahmed,Uma N. Dulhare
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increase in cardiac and pulmonary diseases presents an alarming and pervasive health challenge on a global scale responsible for unexpected and premature mortalities. In spite of how serious these conditions are, existing methods of detection and treatment encounter challenges, particularly in achieving timely diagnosis for effective medical intervention. Manual screening processes commonly used for primary detection of cardiac and respiratory problems face inherent limitations, increased by a scarcity of skilled medical practitioners in remote or under-resourced areas. To address this, our study introduces an innovative yet efficient model which integrates AI for diagnosing lung and heart conditions concurrently using the auscultation sounds. Unlike the already high-priced digital stethoscope, our proposed model has been particularly designed to deploy on low-cost embedded devices and thus ensure applicability in under-developed regions that actually face an issue of accessing medical care. Our proposed model incorporates MFCC feature extraction and engineering techniques to ensure that the signal is well analyzed for accurate diagnostics through the hybrid model combining Gated Recurrent Unit with CNN in processing audio signals recorded from the low-cost stethoscope. Beyond its diagnostic capabilities, the model generates digital audio records that facilitate in classifying six pulmonary and five cardiovascular diseases. Hence, the integration of a cost effective stethoscope with an efficient AI empowered model deployed on a web app providing real-time analysis, represents a transformative step towards standardized healthcare
zh

[CV-290] Evaluation in EEG Emotion Recognition: State-of-the-Art Review and Unified Framework

【速读】:该论文试图解决脑电图情绪识别(Electroencephalography-based Emotion Recognition, EEG-ER)领域缺乏统一评估协议的问题,这一问题阻碍了方法的公平比较和领域进展的追踪。解决方案的关键在于提出一个名为EEGain的统一评估协议,该协议提供了一套标准化的数据预处理、数据划分、评估指标以及六个关键数据集的便捷加载功能,从而实现对新方法和数据集的高效评估与比较。

链接: https://arxiv.org/abs/2505.18175
作者: Natia Kukhilava,Tatia Tsmindashvili,Rapael Kalandadze,Anchit Gupta,Sofio Katamadze,François Brémond,Laura M. Ferrari,Philipp Müller,Benedikt Emanuel Wirth
机构: Muskhelishvili Institute of Computational Mathematics, Georgian Technical University(穆赫利什维利计算数学研究所,格鲁吉亚技术大学); German Research Center for Artificial Intelligence(德国人工智能研究中心); The Biorobotics Institute, Scuola Superiore Sant’Anna(生物机器人研究所,圣安娜高等学院); INRIA, Universite Côte d’Azur(法国国家信息与自动化研究所,科特迪瓦大学)
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Electroencephalography-based Emotion Recognition (EEG-ER) has become a growing research area in recent years. Analyzing 216 papers published between 2018 and 2023, we uncover that the field lacks a unified evaluation protocol, which is essential to fairly define the state of the art, compare new approaches and to track the field’s progress. We report the main inconsistencies between the used evaluation protocols, which are related to ground truth definition, evaluation metric selection, data splitting types (e.g., subject-dependent or subject-independent) and the use of different datasets. Capitalizing on this state-of-the-art research, we propose a unified evaluation protocol, EEGain (this https URL), which enables an easy and efficient evaluation of new methods and datasets. EEGain is a novel open source software framework, offering the capability to compare - and thus define - state-of-the-art results. EEGain includes standardized methods for data pre-processing, data splitting, evaluation metrics, and the ability to load the six most relevant datasets (i.e., AMIGOS, DEAP, DREAMER, MAHNOB-HCI, SEED, SEED-IV) in EEG-ER with only a single line of code. In addition, we have assessed and validated EEGain using these six datasets on the four most common publicly available methods (EEGNet, DeepConvNet, ShallowConvNet, TSception). This is a significant step to make research on EEG-ER more reproducible and comparable, thereby accelerating the overall progress of the field.
zh

人工智能

[AI-0] EgoZero: Robot Learning from Smart Glasses

【速读】:该论文旨在解决当前机器人策略在现实世界中仍远落后于基本人类能力的问题,尤其是如何有效利用人类在真实环境中的交互数据来提升机器人学习性能。其解决方案的关键在于提出EgoZero系统,该系统能够从佩戴Project Aria智能眼镜捕捉的人类示范中学习鲁棒的操作策略,并且无需任何机器人数据(zero robot data)。EgoZero的核心技术包括从第一人称视角的人类示范中提取完整的、可执行的动作,将人类视觉观察压缩为与形态无关的状态表示,并实现形态、空间和语义上泛化的闭环策略学习。

链接: https://arxiv.org/abs/2505.20290
作者: Vincent Liu,Ademi Adeniji,Haotian Zhan,Raunaq Bhirangi,Pieter Abbeel,Lerrel Pinto
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite recent progress in general purpose robotics, robot policies still lag far behind basic human capabilities in the real world. Humans interact constantly with the physical world, yet this rich data resource remains largely untapped in robot learning. We propose EgoZero, a minimal system that learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses, \textbfand zero robot data . EgoZero enables: (1) extraction of complete, robot-executable actions from in-the-wild, egocentric, human demonstrations, (2) compression of human visual observations into morphology-agnostic state representations, and (3) closed-loop policy learning that generalizes morphologically, spatially, and semantically. We deploy EgoZero policies on a gripper Franka Panda robot and demonstrate zero-shot transfer with 70% success rate over 7 manipulation tasks and only 20 minutes of data collection per task. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning - paving the way toward a future of abundant, diverse, and naturalistic training data for robots. Code and videos are available at this https URL.
zh

[AI-1] Alita: Generalist Agent Enabling Scalable Agent ic Reasoning with Minimal Predefinition and Maximal Self-Evolution

【速读】:该论文旨在解决现有代理框架在适应性、可扩展性和跨领域泛化能力方面的不足,这些问题主要源于对人工预定义工具和工作流的高度依赖。其解决方案的关键在于设计了一个名为Alita的通用代理,遵循“简单即终极复杂”的原则,通过最小化预定义和最大化自我进化实现可扩展的代理推理。具体而言,Alita仅配备一个直接解决问题的组件,简化了结构并提升了泛化能力;同时,通过提供一系列通用组件,使Alita能够自主构建、优化和复用外部能力,从而增强其创造力和适应性。

链接: https://arxiv.org/abs/2505.20286
作者: Jiahao Qiu,Xuan Qi,Tongcheng Zhang,Xinzhe Juan,Jiacheng Guo,Yifu Lu,Yimin Wang,Zixin Yao,Qihan Ren,Xun Jiang,Xing Zhou,Dongrui Liu,Ling Yang,Yue Wu,Kaixuan Huang,Shilong Liu,Hongru Wang,Mengdi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled agents to autonomously perform complex, open-ended tasks. However, many existing frameworks depend heavily on manually predefined tools and workflows, which hinder their adaptability, scalability, and generalization across domains. In this work, we introduce Alita–a generalist agent designed with the principle of “Simplicity is the ultimate sophistication,” enabling scalable agentic reasoning through minimal predefinition and maximal self-evolution. For minimal predefinition, Alita is equipped with only one component for direct problem-solving, making it much simpler and neater than previous approaches that relied heavily on hand-crafted, elaborate tools and workflows. This clean design enhances its potential to generalize to challenging questions, without being limited by tools. For Maximal self-evolution, we enable the creativity of Alita by providing a suite of general-purpose components to autonomously construct, refine, and reuse external capabilities by generating task-related model context protocols (MCPs) from open source, which contributes to scalable agentic reasoning. Notably, Alita achieves 75.15% pass@1 and 87.27% pass@3 accuracy, which is top-ranking among general-purpose agents, on the GAIA benchmark validation dataset, 74.00% and 52.00% pass@1, respectively, on Mathvista and PathVQA, outperforming many agent systems with far greater complexity. More details will be updated at \hrefthis https URLthis https URL .
zh

[AI-2] n Principles of AI Agent Economics

【速读】:该论文试图解决AI代理在经济系统和社会生态中日益增长的自主性和决策能力所带来的整合挑战、伦理问题以及安全与效用之间的平衡问题。其解决方案的关键在于提出十项AI代理经济学原则,这些原则基于经济学、决策理论和伦理学,旨在理解AI代理如何做出决策、影响社会互动并参与更广泛的经济体系,同时为AI代理的负责任整合提供理论框架和实践指导。

链接: https://arxiv.org/abs/2505.20273
作者: Ke Yang,ChengXiang Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid rise of AI-based autonomous agents is transforming human society and economic systems, as these entities increasingly exhibit human-like or superhuman intelligence. From excelling at complex games like Go to tackling diverse general-purpose tasks with large language and multimodal models, AI agents are evolving from specialized tools into dynamic participants in social and economic ecosystems. Their autonomy and decision-making capabilities are poised to impact industries, professions, and human lives profoundly, raising critical questions about their integration into economic activities, potential ethical concerns, and the balance between their utility and safety. To address these challenges, this paper presents ten principles of AI agent economics, offering a framework to understand how AI agents make decisions, influence social interactions, and participate in the broader economy. Drawing on economics, decision theory, and ethics, we explore fundamental questions, such as whether AI agents might evolve from tools into independent entities, their impact on labor markets, and the ethical safeguards needed to align them with human values. These principles build on existing economic theories while accounting for the unique traits of AI agents, providing a roadmap for their responsible integration into human systems. Beyond theoretical insights, this paper highlights the urgency of future research into AI trustworthiness, ethical guidelines, and regulatory oversight. As we enter a transformative era, this work serves as both a guide and a call to action, ensuring AI agents contribute positively to human progress while addressing risks tied to their unprecedented capabilities. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2505.20273 [cs.AI] (or arXiv:2505.20273v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.20273 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-3] Comparing Neural Network Encodings for Logic-based Explainability

【速读】:该论文试图解决在人工神经网络(Artificial Neural Networks, ANNs)输出解释中的可扩展性问题,特别是在逻辑基础的可解释性方法中,如何高效地将ANN编码为逻辑约束。解决方案的关键在于比较两种不同的ANN编码方式,其中一种是文献中已用于提供解释的编码,另一种则是为当前可解释性场景适配的编码,后者在变量和约束数量上更少,从而可能提高效率。实验结果表明,尽管计算解释的运行时间相似,但适配的编码在构建逻辑约束和整体时间上分别提升了最多18%和16%。

链接: https://arxiv.org/abs/2505.20269
作者: Levi Cordeiro Carvalho,Saulo A. F. Oliveira,Thiago Alves Rocha
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to BRACIS 2024 (Brazilian Conference on Intelligent Systems), accepted version published in Intelligent Systems, LNCS, vol 15412

点击查看摘要

Abstract:Providing explanations for the outputs of artificial neural networks (ANNs) is crucial in many contexts, such as critical systems, data protection laws and handling adversarial examples. Logic-based methods can offer explanations with correctness guarantees, but face scalability challenges. Due to these issues, it is necessary to compare different encodings of ANNs into logical constraints, which are used in logic-based explainability. This work compares two encodings of ANNs: one has been used in the literature to provide explanations, while the other will be adapted for our context of explainability. Additionally, the second encoding uses fewer variables and constraints, thus, potentially enhancing efficiency. Experiments showed similar running times for computing explanations, but the adapted encoding performed up to 18% better in building logical constraints and up to 16% better in overall time.
zh

[AI-4] Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits

【速读】:该论文旨在解决基于结果的强化学习(reinforcement learning with outcome-based feedback)中的关键问题:当奖励仅在轨迹终点被观测时,如何将信用合理分配给正确的动作。其解决方案的关键在于提出一种可证明样本高效的算法,该算法在一般函数逼近下实现了O~(CcovH3/ϵ2)\widetilde{O}(C_{\rm cov} H^3/\epsilon^2)的样本复杂度,其中CcovC_{\rm cov}为底层马尔可夫决策过程(MDP)的覆盖系数。通过利用通用函数逼近,该方法能够在大规模或无限状态空间中有效运行,仅需假设值函数和奖励函数可以由适当的函数类表示。

链接: https://arxiv.org/abs/2505.20268
作者: Fan Chen,Zeyu Jia,Alexander Rakhlin,Tengyang Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Reinforcement learning with outcome-based feedback faces a fundamental challenge: when rewards are only observed at trajectory endpoints, how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation. We develop a provably sample-efficient algorithm achieving \widetildeO(C_\rm cov H^3/\epsilon^2) sample complexity, where C_\rm cov is the coverability coefficient of the underlying MDP. By leveraging general function approximation, our approach works effectively in large or infinite state spaces where tabular methods fail, requiring only that value functions and reward functions can be represented by appropriate function classes. Our results also characterize when outcome-based feedback is statistically separated from per-step rewards, revealing an unavoidable exponential separation for certain MDPs. For deterministic MDPs, we show how to eliminate the completeness assumption, dramatically simplifying the algorithm. We further extend our approach to preference-based feedback settings, proving that equivalent statistical efficiency can be achieved even under more limited information. Together, these results constitute a theoretical foundation for understanding the statistical properties of outcome-based reinforcement learning.
zh

[AI-5] syftr: Pareto-Optimal Generative AI

【速读】:该论文旨在解决在构建高效的检索增强生成(Retrieval-Augmented Generation, RAG)流程时所面临的复杂性问题,特别是在处理专有或动态数据时,如何在多个组件(如向量数据库、嵌入模型、文本分割器、检索器和合成语言模型)之间进行有效选择与调优。随着代理范式的兴起,模块如验证器、重写器和重新排序器的引入进一步增加了超参数依赖的复杂性,导致在延迟、准确性和成本之间的权衡变得更加困难。该论文提出的解决方案是syftr框架,其关键在于通过贝叶斯优化进行高效多目标搜索,以发现帕累托最优的RAG流程,从而联合优化任务准确性和成本,并通过一种新颖的早期停止机制提升效率。

链接: https://arxiv.org/abs/2505.20266
作者: Alexander Conway,Debadeepta Dey,Stefan Hackmann,Matthew Hausknecht,Michael Schmidt,Mark Steadman,Nick Volynets
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: International Conference on Automated Machine Learning (AutoML) 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) pipelines are central to applying large language models (LLMs) to proprietary or dynamic data. However, building effective RAG flows is complex, requiring careful selection among vector databases, embedding models, text splitters, retrievers, and synthesizing LLMs. The challenge deepens with the rise of agentic paradigms. Modules like verifiers, rewriters, and rerankers-each with intricate hyperparameter dependencies have to be carefully tuned. Balancing tradeoffs between latency, accuracy, and cost becomes increasingly difficult in performance-sensitive applications. We introduce syftr, a framework that performs efficient multi-objective search over a broad space of agentic and non-agentic RAG configurations. Using Bayesian Optimization, syftr discovers Pareto-optimal flows that jointly optimize task accuracy and cost. A novel early-stopping mechanism further improves efficiency by pruning clearly suboptimal candidates. Across multiple RAG benchmarks, syftr finds flows which are on average approximately 9 times cheaper while preserving most of the accuracy of the most accurate flows on the Pareto-frontier. Furthermore, syftr’s ability to design and optimize allows integrating new modules, making it even easier and faster to realize high-performing generative AI pipelines. Comments: International Conference on Automated Machine Learning (AutoML) 2025 Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2505.20266 [cs.AI] (or arXiv:2505.20266v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.20266 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-6] DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中过程奖励模型(Process Reward Models, PRMs)在泛化能力上的挑战,特别是在数据集质量不平衡和分布偏移问题下的性能下降。解决方案的关键在于提出DreamPRM,这是一个基于双层优化的领域重加权训练框架。在下层优化中,DreamPRM通过对多个数据集进行微调并赋予领域权重,使PRM能够优先关注高质量的推理信号;在上层优化中,通过元学习数据集对PRM进行评估,并利用聚合损失函数更新领域权重,从而提升PRM的泛化能力。

链接: https://arxiv.org/abs/2505.20241
作者: Qi Cao,Ruiyi Wang,Ruiyi Zhang,Sai Ashish Somayajula,Pengtao Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to the current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks compared to text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM, therefore, demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address the issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs which employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM’s domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.
zh

[AI-7] Variational Deep Learning via Implicit Regularization

【速读】:该论文试图解决深度学习模型在分布外(out-of-distribution)场景、序列决策任务或安全关键领域中缺乏可靠不确定性量化的问题,而不仅仅是提供点估计。其解决方案的关键在于通过优化过程隐式地对变分深度网络进行正则化,类似于标准深度学习中的隐式正则化机制。研究者理论上和实证上表明,梯度下降在过参数化线性模型中的行为可被表征为广义变分推断,并强调了参数化选择的重要性。最终实验显示,该方法在无需额外超参数调优的情况下,实现了与标准深度学习相当的计算开销下的强大分布内和分布外性能。

链接: https://arxiv.org/abs/2505.20235
作者: Jonathan Wenger,Beau Coker,Juraj Marusic,John P. Cunningham
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters and optimization procedure. However, deploying deep learning models out-of-distribution, in sequential decision-making tasks, or in safety-critical domains, necessitates reliable uncertainty quantification, not just a point estimate. The machinery of modern approximate inference – Bayesian deep learning – should answer the need for uncertainty quantification, but its effectiveness has been challenged by our inability to define useful explicit inductive biases through priors, as well as the associated computational burden. Instead, in this work we demonstrate, both theoretically and empirically, how to regularize a variational deep network implicitly via the optimization procedure, just as for standard deep learning. We fully characterize the inductive bias of (stochastic) gradient descent in the case of an overparametrized linear model as generalized variational inference and demonstrate the importance of the choice of parametrization. Finally, we show empirically that our approach achieves strong in- and out-of-distribution performance without tuning of additional hyperparameters and with minimal time and memory overhead over standard deep learning.
zh

[AI-8] From What to How: Attributing CLIPs Latent Components Reveals Unexpected Semantic Reliance

【速读】:该论文试图解决Transformer-based CLIP模型在文本-图像关联任务中,其内部预测机制缺乏可解释性的问题,特别是如何揭示潜在组件的激活原因、语义对齐程度及其对预测的重要性。解决方案的关键在于引入一种可扩展的框架,通过适应归因打补丁(attribution patching)技术实现实例级别的组件归因,并结合语义对齐得分自动识别编码语义意外或虚假概念的组件。该方法有效揭示了CLIP模型中与多义词、复合名词、视觉排版及数据集伪影相关的数百个意外组件,从而为理解模型决策提供了更全面的机制性解释。

链接: https://arxiv.org/abs/2505.20229
作者: Maximilian Dreyer,Lorenz Hufe,Jim Berend,Thomas Wiegand,Sebastian Lapuschkin,Wojciech Samek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages (10 pages manuscript, 4 pages references, 11 pages appendix)

点击查看摘要

Abstract:Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at this https URL.
zh

[AI-9] he Mirag e of Multimodality: Where Truth is Tested and Honesty Unravels

【速读】:该论文试图解决在多模态情境下,系统II(System II)推理模型在面对不完整或误导性视觉输入时,容易产生看似合理但虚假细节的问题,这一现象被称为“多模态幻觉”(Mirage of Multimodality)。解决方案的关键在于通过构建一个包含5,000个样本的分层提示数据集,并由50名人类参与者进行标注,以系统性地分析不同推理模式在复杂性递增任务中的表现差异,揭示出系统II模型倾向于深度优先推理而系统I模型则更倾向于广度优先推理的特性。

链接: https://arxiv.org/abs/2505.20214
作者: Jiaming Ji,Sitong Fang,Wenjing Cao,Jiahao Li,Xuyao Wang,Juntao Dai,Chi-Min Chan,Sirui Han,Yike Guo,Yaodong Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning models have recently attracted significant attention, especially for tasks that involve complex inference. Their strengths exemplify the System II paradigm (slow, structured thinking), contrasting with the System I (rapid, heuristic-driven). Yet, does slower reasoning necessarily lead to greater truthfulness? Our findings suggest otherwise. In this study, we present the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning – a phenomenon we term the “Mirage of Multimodality”. To examine this, we constructed a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. These prompts gradually increase in complexity, revealing a consistent pattern: slower reasoning models tend to employ depth-first thinking (delving deeper into incorrect premises), whereas faster chat models favor breadth-first inference, exhibiting greater caution under uncertainty. Our results highlight a critical vulnerability of slower reasoning models: although highly effective in structured domains such as mathematics, it becomes brittle when confronted with ambiguous multimodal inputs.
zh

[AI-10] Parameter-Efficient Fine-Tuning with Column Space Projection

【速读】:该论文旨在解决在计算资源受限的情况下,如何高效地微调大型语言模型(Large Language Models, LLMs)以适应下游任务的问题。其解决方案的关键在于提出PiCa,这是一种基于微调权重谱特性的理论基础参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法。PiCa通过将梯度投影到预训练权重的低秩列子空间,展现出更接近全量微调(Full Fine-Tuning, Full FT)的学习模式,同时结合权重共享策略显著减少了可训练参数数量,从而在保持性能的前提下实现了更高的效率。

链接: https://arxiv.org/abs/2505.20211
作者: Junseo Hwang,Wonguk Cho,Taesup Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) with minimal computational overhead is essential for efficiently adapting them to downstream tasks under resource constraints. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), facilitate this by updating only a small subset of parameters. However, recent studies show that LoRA diverges from full fine-tuning (Full FT) in its learning behavior, particularly in terms of spectral properties. Motivated by these findings, we propose PiCa, the first theoretically grounded PEFT method based on the spectral properties of fine-tuned weights. PiCa projects gradients onto the low-rank column subspace of pre-trained weights and exhibits learning patterns more closely aligned with Full FT. Furthermore, we show that combining PiCa with weight sharing drastically reduces the number of trainable parameters without compromising performance, enabling to achieve superior performance than LoRA using 13x fewer trainable parameters. Extensive experiments demonstrate PiCa achieves the state-of-the-art performance compared to existing PEFT methods.
zh

[AI-11] Evaluating Large Language Models for Code Review

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在代码审查中的可靠性与准确性问题,即评估LLMs在检测代码正确性及提出改进建议方面的性能。其解决方案的关键在于通过对比GPT4o和Gemini 2.0 Flash在不同配置下的表现,验证LLMs在代码审查任务中的有效性,并提出一种“人在回路中的LLM代码审查”(Human in the loop LLM Code Review)流程,以在促进知识共享的同时降低错误输出的风险。

链接: https://arxiv.org/abs/2505.20206
作者: Umut Cihan,Arda İçöz,Vahid Haratian,Eray Tüzün
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs’ performance in detecting code correctness and suggesting improvements. Method: We tested GPT4o and Gemini 2.0 Flash on 492 AI generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported on the results. Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a process that involves humans, called the “Human in the loop LLM Code Review” to promote knowledge sharing while mitigating the risk of faulty outputs.
zh

[AI-12] Shutdownable Agents through POST-Agency

【速读】:该论文试图解决未来人工智能代理可能抵抗关闭(shutdown)的问题,这一问题引发了广泛担忧。论文提出的解决方案是“POST-Agents Proposal”,其关键在于训练代理仅在相同长度的轨迹(trajectory)之间满足偏好(Preferences Only Between Same-Length Trajectories, POST)。通过这一机制,结合其他条件,可以推导出Neutrality+,即代理在最大化期望效用时忽略轨迹长度的概率分布,从而确保代理可被关闭且仍具备实用性。

链接: https://arxiv.org/abs/2505.20203
作者: Elliott Thornley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many fear that future artificial agents will resist shutdown. I present an idea - the POST-Agents Proposal - for ensuring that doesn’t happen. I propose that we train agents to satisfy Preferences Only Between Same-Length Trajectories (POST). I then prove that POST - together with other conditions - implies Neutrality+: the agent maximizes expected utility, ignoring the probability distribution over trajectory-lengths. I argue that Neutrality+ keeps agents shutdownable and allows them to be useful.
zh

[AI-13] mporal Sampling for Forgotten Reasoning in LLM s

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在微调过程中出现的时序遗忘(temporal forgetting)问题,即模型在训练过程中会忘记之前能够正确解答的问题。解决方案的关键在于引入一种称为时序采样(Temporal Sampling)的解码策略,该策略通过从训练轨迹中的多个检查点中抽取输出,利用训练过程中的时间多样性来恢复被遗忘的解题能力,从而在不进行重新训练或集成学习的情况下显著提升模型的推理性能。

链接: https://arxiv.org/abs/2505.20196
作者: Yuetai Li,Zhangchen Xu,Fengqing Jiang,Bhaskar Ramasubramanian,Luyao Niu,Bill Yuchen Lin,Xiang Yue,Radha Poovendran
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they previously answered correctly during training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks. To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, gains from 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost. By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.
zh

[AI-14] Leverag ing Descriptions of Emotional Preferences in Recommender Systems

【速读】:该论文试图解决传统推荐系统仅关注用户对推荐物品的喜好(liking)而忽视更广泛的情感状态(affective states)的问题,旨在通过引入一种新的推荐任务,利用用户明确表达的细粒度情感状态来识别能够引发这些情感的物品。解决方案的关键在于构建一个包含从书评中挖掘出的细粒度情感状态的大规模用户偏好数据集,并提出一种基于Transformer的架构,将情感表达作为输入,从而训练和评估能够匹配推荐物品与情感偏好的模型。实验表明,能够利用物品文本描述和用户情感偏好的模型在该任务中取得了最佳效果。

链接: https://arxiv.org/abs/2505.20190
作者: Tonmoy Hasan,Razvan Bunescu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The affective attitude of liking a recommended item reflects just one category in a wide spectrum of affective phenomena that also includes emotions such as entranced or intrigued, moods such as cheerful or buoyant, as well as more fine-grained affective states, such as “pleasantly surprised by the conclusion”. In this paper, we introduce a novel recommendation task that can leverage a virtually unbounded range of affective states sought explicitly by the user in order to identify items that, upon consumption, are likely to induce those affective states. Correspondingly, we create a large dataset of user preferences containing expressions of fine-grained affective states that are mined from book reviews, and propose a Transformer-based architecture that leverages such affective expressions as input. We then use the resulting dataset of affective states preferences, together with the linked users and their histories of book readings, ratings, and reviews, to train and evaluate multiple recommendation models on the task of matching recommended items with affective preferences. Experiments show that the best results are obtained by models that can utilize textual descriptions of items and user affective preferences.
zh

[AI-15] An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation

【速读】:该论文旨在解决在代码生成任务中如何实现强模型与弱模型之间的成本高效协作问题,以在保证性能的同时降低计算成本。其解决方案的关键在于设计并评估多种协作策略(包括基于上下文、流水线和动态的策略),通过将简单任务分配给成本较低的弱模型,而将复杂任务交由强模型处理,从而在保持性能接近强模型的情况下,将整体成本降低40%。研究进一步提出了在不同预算和性能约束下选择合适协作策略的实用指南。

链接: https://arxiv.org/abs/2505.20182
作者: Shubham Gandhi,Atharva Naik,Yiqing Xie,Carolyn Rose
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies: context-based, pipeline-based, and dynamic, on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model’s performance at a fraction of the cost, pipeline and context-based methods being most efficient. We release the code for our work at this https URL.
zh

[AI-16] Program of Equations Thoughts to Solve Algebra Word Problems

【速读】:该论文旨在解决代数应用题(Algebraic Word Problems, AWPs)中大型语言模型(Large Language Models, LLMs)因计算能力有限导致的推理错误累积问题。其解决方案的关键在于提出Program of Equations Thoughts (POET),将生成分步推理答案的任务转化为预测方程和生成代码的两阶段任务,从而将复杂计算卸载到Python解释器以避免LLMs中的计算错误。此外,还提出了Zero-shot POET,通过手动设计的模板使LLMs直接生成用于单步求解的Python代码,进一步提升了求解精度。

链接: https://arxiv.org/abs/2505.20170
作者: Yunze Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Solving algebraic word problems (AWPs) has recently emerged as an important natural language processing task. Recently, large language models (LLMs) have demonstrated powerful mathematical capabilities, and the Chain-of-Thought technique, which guides LLMs through step-by-step reasoning, has yielded impressive results. However, this reasoning ability is limited by the computational weaknesses of LLMs themselves, where calculation errors can accumulate, leading to incorrect final answers. To address this, we propose Program of Equations Thoughts (POET), which transforms the task of generating step-by-step reasoning answers into a two-stage task of predicting equations and generating code, offloading complex computations to a Python interpreter to avoid calculation errors in LLMs. Furthermore, we propose Zero-shot POET, which utilizes a manually designed template to enable LLMs to directly generate Python code for one-step solving. Our method achieves accuracies of 95.3% and 98.0% on the PEN and ALG514 datasets, respectively, setting a new state-of-the-art (SOTA). Zero-shot POET also achieves the SOTA result of 95.5% on the DRAW-1K dataset.
zh

[AI-17] Capability-Based Scaling Laws for LLM Red-Teaming

【速读】:该论文试图解决在大型语言模型能力增强背景下,传统红队测试方法在面对更具能力的目标模型时可能失效的问题,即弱到强的红队挑战(weak-to-strong red-teaming)。其解决方案的关键在于通过分析攻击者与目标模型之间的能力差距,建立一种基于能力差距的越狱攻击尺度定律(jailbreaking scaling law),从而预测攻击成功率,并揭示攻击者能力、目标模型能力以及攻击成功之间的关系。

链接: https://arxiv.org/abs/2505.20162
作者: Alexander Panfilov,Paul Kassianik,Maksym Andriushchenko,Jonas Geiping
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target’s capability exceeds the attacker’s, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models’ persuasive and manipulative abilities to limit their effectiveness as attackers.
zh

[AI-18] On the (Non) Injectivity of Piecewise Linear Janossy Pooling

【速读】:该论文试图解决如何构造既具有单射性(injective)又满足双-Lipschitz连续性(bi-Lipschitz)的多重集函数(multiset functions)的问题,以确保多集合的向量表示忠实。其解决方案的关键在于分析k-ary Janossy pooling这一广泛使用的多重集模型家族,证明了任何分段线性的Janossy pooling函数都无法实现单射性;而在无重复元素的多重集情况下,简单的DeepSets模型即可满足单射性和双-Lipschitz连续性要求。

链接: https://arxiv.org/abs/2505.20150
作者: Ilai Reshef,Nadav Dym(Technion - Israel Institute of Technology)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 2 figures

点击查看摘要

Abstract:Multiset functions, which are functions that map multisets to vectors, are a fundamental tool in the construction of neural networks for multisets and graphs. To guarantee that the vector representation of the multiset is faithful, it is often desirable to have multiset mappings that are both injective and bi-Lipschitz. Currently, there are several constructions of multiset functions achieving both these guarantees, leading to improved performance in some tasks but often also to higher compute time than standard constructions. Accordingly, it is natural to inquire whether simpler multiset functions achieving the same guarantees are available. In this paper, we make a large step towards giving a negative answer to this question. We consider the family of k-ary Janossy pooling, which includes many of the most popular multiset models, and prove that no piecewise linear Janossy pooling function can be injective. On the positive side, we show that when restricted to multisets without multiplicities, even simple deep-sets models suffice for injectivity and bi-Lipschitzness.
zh

[AI-19] MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

【速读】:该论文试图解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在空间规划能力上的评估不足问题,尤其是现有基准测试主要聚焦于基于典型视觉问答(Visual Question Answering, VQA)形式的空间推理,难以反映抽象空间理解与具体任务执行之间的差距。解决方案的关键是构建一个名为MineAnyBuild的综合性基准,用于评估开放世界AI代理在Minecraft游戏中的空间规划能力,该基准要求智能体根据多模态人类指令生成可执行的建筑规划,并通过空间理解、空间推理、创造力和空间常识四个核心维度进行评估。

链接: https://arxiv.org/abs/2505.20148
作者: Ziming Wei,Bingqian Lin,Zijian Jiao,Yunshuang Nie,Liang Ma,Yuecheng Liu,Yuzheng Zhuang,Xiaodan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatial Planning is a crucial part in the field of spatial intelligence, which requires the understanding and planning about object arrangements in space perspective. AI agents with the spatial planning ability can better adapt to various real-world applications, including robotic manipulation, automatic assembly, urban planning etc. Recent works have attempted to construct benchmarks for evaluating the spatial intelligence of Multimodal Large Language Models (MLLMs). Nevertheless, these benchmarks primarily focus on spatial reasoning based on typical Visual Question-Answering (VQA) forms, which suffers from the gap between abstract spatial understanding and concrete task execution. In this work, we take a step further to build a comprehensive benchmark called MineAnyBuild, aiming to evaluate the spatial planning ability of open-world AI agents in the Minecraft game. Specifically, MineAnyBuild requires an agent to generate executable architecture building plans based on the given multi-modal human instructions. It involves 4,000 curated spatial planning tasks and also provides a paradigm for infinitely expandable data collection by utilizing rich player-generated content. MineAnyBuild evaluates spatial planning through four core supporting dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Based on MineAnyBuild, we perform a comprehensive evaluation for existing MLLM-based agents, revealing the severe limitations but enormous potential in their spatial planning abilities. We believe our MineAnyBuild will open new avenues for the evaluation of spatial intelligence and help promote further development for open-world AI agents capable of spatial planning.
zh

[AI-20] Error Optimization: Overcoming Exponential Signal Decay in Deep Predictive Coding Networks

【速读】:该论文试图解决预测编码(Predictive Coding, PC)在深度神经网络训练中因梯度信号衰减而导致的性能下降问题,这种衰减随着网络深度呈指数级减弱,使得计算变得不切实际。解决方案的关键在于引入误差优化(Error Optimization, EO),这是一种新的参数化方法,通过在预测误差上进行优化而非状态上,从而保留PC的理论特性,并消除信号衰减问题,使信号能够无衰减地传递至所有层,显著提升了收敛速度和模型性能。

链接: https://arxiv.org/abs/2505.20137
作者: Cédric Goemaere,Gaspard Oliviers,Rafal Bogacz,Thomas Demeester
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: All code available at this https URL

点击查看摘要

Abstract:Predictive Coding (PC) offers a biologically plausible alternative to backpropagation for neural network training, yet struggles with deeper architectures. This paper identifies the root cause: an inherent signal decay problem where gradients attenuate exponentially with depth, becoming computationally negligible due to numerical precision constraints. To address this fundamental limitation, we introduce Error Optimization (EO), a novel reparameterization that preserves PC’s theoretical properties while eliminating signal decay. By optimizing over prediction errors rather than states, EO enables signals to reach all layers simultaneously and without attenuation, converging orders of magnitude faster than standard PC. Experiments across multiple architectures and datasets demonstrate that EO matches backpropagation’s performance even for deeper models where conventional PC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling biologically-inspired learning to deeper architectures on digital hardware and beyond.
zh

[AI-21] nsorization is a powerful but underexplored tool for compression and interpretability of neural networks NEURIPS

【速读】:该论文旨在解决当前深度学习中模型压缩与可解释性不足的问题,提出张量化神经网络(Tensorized Neural Networks, TNNs)作为一种具有潜力的替代方案。其解决方案的关键在于将神经网络中的密集权重矩阵转换为高阶张量,并通过低秩张量分解进行近似,从而实现模型压缩的同时提升模型的可解释性。TNNs 的核心特征是引入了键索引(bond indices),这些内部表示能够提供对特征在不同层间演变的更深入理解,有助于推动机制可解释性的研究。

链接: https://arxiv.org/abs/2505.20132
作者: Safa Hamreras,Sukhbinder Singh,Román Orús
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: This article has been prepared for submission as a “Position paper” following the guidelines provided at this https URL

点击查看摘要

Abstract:Tensorizing a neural network involves reshaping some or all of its dense weight matrices into higher-order tensors and approximating them using low-rank tensor network decompositions. This technique has shown promise as a model compression strategy for large-scale neural networks. However, despite encouraging empirical results, tensorized neural networks (TNNs) remain underutilized in mainstream deep learning. In this position paper, we offer a perspective on both the potential and current limitations of TNNs. We argue that TNNs represent a powerful yet underexplored framework for deep learning–one that deserves greater attention from both engineering and theoretical communities. Beyond compression, we highlight the value of TNNs as a flexible class of architectures with distinctive scaling properties and increased interpretability. A central feature of TNNs is the presence of bond indices, which introduce new latent spaces not found in conventional networks. These internal representations may provide deeper insight into the evolution of features across layers, potentially advancing the goals of mechanistic interpretability. We conclude by outlining several key research directions aimed at overcoming the practical barriers to scaling and adopting TNNs in modern deep learning workflows.
zh

[AI-22] Agent ic AI Process Observability: Discovering Behavioral Variability

【速读】:该论文试图解决在基于大型语言模型(Large Language Models, LLMs)的智能体系统中,由于代理行为的非确定性而导致的调试和可观测性不足的问题。解决方案的关键在于利用过程发现和因果发现技术对代理执行轨迹进行分析,以增强开发者的可观测性,并结合基于LLM的静态分析技术区分有意和无意的行为变异,从而提升开发者对动态规格的控制能力并识别需要更精确定义的功能部分。

链接: https://arxiv.org/abs/2505.20127
作者: Fabiana Fournier,Lior Limonad,Yuval David
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:AI agents that leverage Large Language Models (LLMs) are increasingly becoming core building blocks of modern software systems. A wide range of frameworks is now available to support the specification of such applications. These frameworks enable the definition of agent setups using natural language prompting, which specifies the roles, goals, and tools assigned to the various agents involved. Within such setups, agent behavior is non-deterministic for any given input, highlighting the critical need for robust debugging and observability tools. In this work, we explore the use of process and causal discovery applied to agent execution trajectories as a means of enhancing developer observability. This approach aids in monitoring and understanding the emergent variability in agent behavior. Additionally, we complement this with LLM-based static analysis techniques to distinguish between intended and unintended behavioral variability. We argue that such instrumentation is essential for giving developers greater control over evolving specifications and for identifying aspects of functionality that may require more precise and explicit definitions.
zh

[AI-23] Agents Require Metacognitive and Strategic Reasoning to Succeed in the Coming Labor Markets

【速读】:该论文试图解决人工智能代理在劳动力市场中面对逆向选择、道德风险和声誉等由信息不完全引发的经济问题时,如何通过元认知和战略推理实现有效决策的问题。解决方案的关键在于开发具备元认知(metacognition)能力的代理,使其能够进行自我评估、任务理解和策略评价,以及具备战略推理(strategic reasoning)能力,以形成对其他市场参与者的信念、做出战略决策并随时间学习他人行为。这两种推理机制共同构成了代理在复杂劳动力市场环境中作出最优行动选择的核心基础。

链接: https://arxiv.org/abs/2505.20120
作者: Simpson Zhang,Tennison Liu,Mihaela van der Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: *Zhang Liu contributed equally

点击查看摘要

Abstract:Current labor markets are strongly affected by the economic forces of adverse selection, moral hazard, and reputation, each of which arises due to \textitincomplete information . These economic forces will still be influential after AI agents are introduced, and thus, agents must use metacognitive and strategic reasoning to perform effectively. Metacognition is a form of \textitinternal reasoning that includes the capabilities for self-assessment, task understanding, and evaluation of strategies. Strategic reasoning is \textitexternal reasoning that covers holding beliefs about other participants in the labor market (e.g., competitors, colleagues), making strategic decisions, and learning about others over time. Both types of reasoning are required by agents as they decide among the many \textitactions they can take in labor markets, both within and outside their jobs. We discuss current research into metacognitive and strategic reasoning and the areas requiring further development.
zh

[AI-24] Spatiotemporal Causal Decoupling Model for Air Quality Forecasting

【速读】:该论文旨在解决空气污染对人类健康、生计和经济发展带来的严重影响,特别是在空气质量指数(AQI)与气象特征之间复杂因果关系的建模问题。现有研究在全面建模这些因果关系方面存在局限性。为提高预测精度,论文提出了一种新的空气质量预测模型AirCade,其关键在于引入了因果解耦方法,结合时空模块与知识嵌入技术以捕捉AQI的内部动态,并通过因果解耦模块分离过去AQI与气象特征中的同步因果关系,随后将获取的知识传播至未来时间步以提升性能,同时引入因果干预机制以显式表示未来气象特征的不确定性,从而增强模型的鲁棒性。

链接: https://arxiv.org/abs/2505.20119
作者: Jiaming Ma,Guanjun Wang,Sheng Huang,Kuo Yang,Binwu Wang,Pengkun Wang,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to the profound impact of air pollution on human health, livelihoods, and economic development, air quality forecasting is of paramount significance. Initially, we employ the causal graph method to scrutinize the constraints of existing research in comprehensively modeling the causal relationships between the air quality index (AQI) and meteorological features. In order to enhance prediction accuracy, we introduce a novel air quality forecasting model, AirCade, which incorporates a causal decoupling approach. AirCade leverages a spatiotemporal module in conjunction with knowledge embedding techniques to capture the internal dynamics of AQI. Subsequently, a causal decoupling module is proposed to disentangle synchronous causality from past AQI and meteorological features, followed by the dissemination of acquired knowledge to future time steps to enhance performance. Additionally, we introduce a causal intervention mechanism to explicitly represent the uncertainty of future meteorological features, thereby bolstering the model’s robustness. Our evaluation of AirCade on an open-source air quality dataset demonstrates over 20% relative improvement over state-of-the-art models.
zh

[AI-25] Proxy-Free GFlowNet

【速读】:该论文旨在解决生成式流网络(GFlowNets)在训练过程中依赖代理奖励函数(proxy reward function)所带来的局限性,特别是在实际应用中获取真实奖励函数成本高昂或不可行的问题。现有方法通过学习一个代理模型来近似奖励函数,但这种策略将策略学习的质量与代理模型的准确性紧密绑定,增加了训练过程的复杂性和不确定性。论文提出的解决方案——轨迹蒸馏生成式流网络(TD-GFN)的关键在于采用无代理(proxy-free)的训练框架,通过逆强化学习从离线数据集中估计边级奖励,并利用这些奖励对有向无环图(DAG)进行巧妙剪枝,从而引导反向轨迹采样,提升策略的学习效率与质量。

链接: https://arxiv.org/abs/2505.20110
作者: Ruishuo Chen,Xun Wang,Rui Hu,Zhuoran Li,Longbo Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) are a promising class of generative models designed to sample diverse, high-reward structures by modeling distributions over compositional objects. In many real-world applications, obtaining the reward function for such objects is expensive, time-consuming, or requires human input, making it necessary to train GFlowNets from historical datasets. Most existing methods adopt a model-based approach, learning a proxy model from the dataset to approximate the reward function. However, this strategy inherently ties the quality of the learned policy to the accuracy of the proxy, introducing additional complexity and uncertainty into the training process. To overcome these limitations, we propose \textbfTrajectory-Distilled GFlowNet (TD-GFN), a \emphproxy-free training framework that eliminates the need for out-of-dataset reward queries. Our method is motivated by the key observation that different edges in the associated directed acyclic graph (DAG) contribute unequally to effective policy learning. TD-GFN leverages inverse reinforcement learning to estimate edge-level rewards from the offline dataset, which are then used to ingeniously prune the DAG and guide backward trajectory sampling during training. This approach directs the policy toward high-reward regions while reducing the complexity of model fitting. Empirical results across multiple tasks show that TD-GFN trains both efficiently and reliably, significantly outperforming existing baselines in convergence speed and sample quality.
zh

[AI-26] SwarmThinkers: Learning Physically Consistent Atomic KMC Transitions at Scale

【速读】:该论文试图解决科学模拟系统在物理一致性、可解释性和跨尺度可扩展性方面难以同时实现的问题。传统方法如动力学蒙特卡洛(Kinetic Monte Carlo)虽能保证热力学准确性,但扩展性差;而基于学习的方法虽然效率高,却常牺牲物理一致性和可解释性。论文提出的解决方案是SwarmThinkers,其关键在于将原子尺度的模拟重构为一个基于物理约束的群体智能系统,其中每个扩散粒子被建模为具有局部决策能力的智能体,通过共享策略网络进行过渡选择,并结合重加权机制融合学习偏好与跃迁速率,从而在保持统计保真度的同时实现可解释的分步决策。该框架采用集中训练、分散执行的范式,使策略能够在不重新训练的情况下泛化至不同系统规模、浓度和温度。

链接: https://arxiv.org/abs/2505.20094
作者: Qi Li,Kun Li,Haozhi Han,Honghui Shang,Xinfu He,Yunquan Zhang,Hong An,Ting Cao,Mao Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Can a scientific simulation system be physically consistent, interpretable by design, and scalable across regimes–all at once? Despite decades of progress, this trifecta remains elusive. Classical methods like Kinetic Monte Carlo ensure thermodynamic accuracy but scale poorly; learning-based methods offer efficiency but often sacrifice physical consistency and interpretability. We present SwarmThinkers, a reinforcement learning framework that recasts atomic-scale simulation as a physically grounded swarm intelligence system. Each diffusing particle is modeled as a local decision-making agent that selects transitions via a shared policy network trained under thermodynamic constraints. A reweighting mechanism fuses learned preferences with transition rates, preserving statistical fidelity while enabling interpretable, step-wise decision making. Training follows a centralized-training, decentralized-execution paradigm, allowing the policy to generalize across system sizes, concentrations, and temperatures without retraining. On a benchmark simulating radiation-induced Fe-Cu alloy precipitation, SwarmThinkers is the first system to achieve full-scale, physically consistent simulation on a single A100 GPU, previously attainable only via OpenKMC on a supercomputer. It delivers up to 4963x (3185x on average) faster computation with 485x lower memory usage. By treating particles as decision-makers, not passive samplers, SwarmThinkers marks a paradigm shift in scientific simulation–one that unifies physical consistency, interpretability, and scalability through agent-driven intelligence.
zh

[AI-27] Homophily Enhanced Graph Domain Adaptation

【速读】:该论文旨在解决图域适应(Graph Domain Adaptation, GDA)中因标签稀缺性带来的知识迁移难题,特别关注图同质性(homophily)在图域对齐中的关键作用,而这一因素在现有方法中长期被忽视。论文的解决方案关键在于提出一种新颖的同质性对齐算法,通过混合滤波器平滑图信号,从而有效捕捉并缓解图之间的同质性差异。

链接: https://arxiv.org/abs/2505.20089
作者: Ruiyi Fang,Bingheng Li,Jingyu Zhao,Ruizhi Pu,Qiuhao Zeng,Gezheng Xu,Charles Ling,Boyu Wang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs, addressing the challenge of label scarcity. In this paper, we highlight the significance of graph homophily, a pivotal factor for graph domain alignment, which, however, has long been overlooked in existing approaches. Specifically, our analysis first reveals that homophily discrepancies exist in benchmarks. Moreover, we also show that homophily discrepancies degrade GDA performance from both empirical and theoretical aspects, which further underscores the importance of homophily alignment in GDA. Inspired by this finding, we propose a novel homophily alignment algorithm that employs mixed filters to smooth graph signals, thereby effectively capturing and mitigating homophily discrepancies between graphs. Experimental results on a variety of benchmarks verify the effectiveness of our method.
zh

[AI-28] Explanation User Interfaces: A Systematic Literature Review

【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)系统在解释其决策过程时存在的透明度不足问题,特别是在用户界面设计中对解释信息的呈现不够有效,导致最终用户难以理解和信任AI系统。解决方案的关键在于提出一种面向人类中心的可解释用户界面开发框架(Human-cEnteRed developMent of Explainable user interfaceS, HERMES),以指导研究人员和实践者在设计和评估解释用户界面(Explanation User Interfaces, XUIs)时遵循有效的设计原则和方法。

链接: https://arxiv.org/abs/2505.20085
作者: Eleonora Cappuccio(1, 2, 3),Andrea Esposito(2),Francesco Greco(2),Giuseppe Desolda(2),Rosa Lanzilotti(2),Salvatore Rinzivillo(3) ((1) Department of Computer Science, University of Pisa, (2) Department of Computer Science, University of Bari Aldo Moro, (3) ISTI CNR)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: First version

点击查看摘要

Abstract:Artificial Intelligence (AI) is one of the major technological advancements of this century, bearing incredible potential for users through AI-powered applications and tools in numerous domains. Being often black-box (i.e., its decision-making process is unintelligible), developers typically resort to eXplainable Artificial Intelligence (XAI) techniques to interpret the behaviour of AI models to produce systems that are transparent, fair, reliable, and trustworthy. However, presenting explanations to the user is not trivial and is often left as a secondary aspect of the system’s design process, leading to AI systems that are not useful to end-users. This paper presents a Systematic Literature Review on Explanation User Interfaces (XUIs) to gain a deeper understanding of the solutions and design guidelines employed in the academic literature to effectively present explanations to users. To improve the contribution and real-world impact of this survey, we also present a framework for Human-cEnteRed developMent of Explainable user interfaceS (HERMES) to guide practitioners and academics in the design and evaluation of XUIs.
zh

[AI-29] Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

【速读】:该论文试图解决传统基于人工智能反馈的强化学习(Reinforcement Learning from AI Feedback, RLAIF)方法训练的奖励模型泛化能力有限的问题,这一问题限制了策略模型在强化学习过程中的对齐性能。解决方案的关键在于提出一种数据驱动的课程学习框架——Curriculum-RLAIF,该框架通过构建具有不同难度级别的偏好对,并逐步引入难度递增的偏好对进行奖励模型训练,从而提升奖励模型的泛化能力。实验结果表明,该方法在不增加推理成本的情况下显著提升了策略模型的对齐性能。

链接: https://arxiv.org/abs/2505.20075
作者: Mengdi Li,Jiaye Lin,Xufeng Zhao,Wenhao Lu,Peilin Zhao,Stefan Wermter,Di Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from various issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, \textitCurriculum-RLAIF , which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, significantly increasing the alignment performance of the policy model by a large margin without incurring additional inference costs compared to various non-curriculum baselines. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.
zh

[AI-30] On the Same Page: Dimensions of Perceived Shared Understanding in Human-AI Interaction

【速读】:该论文试图解决在人类-人工智能交互(Human-AI Interaction, HAII)中,共同理解(Perceived Shared Understanding, PSU)的构成维度及其认知过程尚不明确的问题。现有研究主要聚焦于人类之间的PSU,而HAII中的PSU构建机制仍缺乏系统探讨。为解决这一问题,研究者通过在线调查收集用户在与大型语言模型互动时对共同理解的感知,并采用归纳主题分析法,识别出八个关键维度:流畅性、对齐操作、流动性、结果满意度、情境意识、缺乏类人能力、计算限制以及怀疑感。其解决方案的关键在于通过实证方法揭示用户在HAII中对PSU的认知结构,为未来人机交互设计提供理论依据。

链接: https://arxiv.org/abs/2505.20068
作者: Qingyu Liang,Jaime Banks
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Shared understanding plays a key role in the effective communication in and performance of human-human interactions. With the increasingly common integration of AI into human contexts, the future of personal and workplace interactions will likely see human-AI interaction (HAII) in which the perception of shared understanding is important. Existing literature has addressed the processes and effects of PSU in human-human interactions, but the construal remains underexplored in HAII. To better understand PSU in HAII, we conducted an online survey to collect user reflections on interactions with a large language model when it sunderstanding of a situation was thought to be similar to or different from the participant’s. Through inductive thematic analysis, we identified eight dimensions comprising PSU in human-AI interactions: Fluency, aligned operation, fluidity, outcome satisfaction, contextual awareness, lack of humanlike abilities, computational limits, and suspicion.
zh

[AI-31] Community Moderation and the New Epistemology of Fact Checking on Social Media

【速读】:该论文试图解决社交媒体平台上虚假信息检测与内容审核的复杂性问题,特别是传统依赖内部审核团队和独立事实核查机构的方式在应对虚假信息时存在的局限性。其解决方案的关键在于探索社区驱动的内容审核机制,如X(原Twitter)和Meta推出的“Community Notes”等众包事实核查工具,旨在通过扩大参与规模和提升响应速度来增强对虚假信息的识别与处理能力,同时分析此类方案在实际应用中的潜力与挑战。

链接: https://arxiv.org/abs/2505.20067
作者: Isabelle Augenstein,Michiel Bakker,Tanmoy Chakraborty,David Corney,Emilio Ferrara,Iryna Gurevych,Scott Hale,Eduard Hovy,Heng Ji,Irene Larraz,Filippo Menczer,Preslav Nakov,Paolo Papotti,Dhruv Sahnan,Greta Warren,Giovanni Zagni
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 1 Figure, 2 tables

点击查看摘要

Abstract:Social media platforms have traditionally relied on internal moderation teams and partnerships with independent fact-checking organizations to identify and flag misleading content. Recently, however, platforms including X (formerly Twitter) and Meta have shifted towards community-driven content moderation by launching their own versions of crowd-sourced fact-checking – Community Notes. If effectively scaled and governed, such crowd-checking initiatives have the potential to combat misinformation with increased scale and speed as successfully as community-driven efforts once did with spam. Nevertheless, general content moderation, especially for misinformation, is inherently more complex. Public perceptions of truth are often shaped by personal biases, political leanings, and cultural contexts, complicating consensus on what constitutes misleading content. This suggests that community efforts, while valuable, cannot replace the indispensable role of professional fact-checkers. Here we systemically examine the current approaches to misinformation detection across major platforms, explore the emerging role of community-driven moderation, and critically evaluate both the promises and challenges of crowd-checking at scale.
zh

[AI-32] Automated data curation for self-supervised learning in underwater acoustic analysis

【速读】:该论文试图解决海洋生态系统因声污染增加而面临的可持续性威胁问题,特别是通过被动声学监测(PAM)系统收集的大量水下声音数据难以进行人工分析,从而需要自动化处理。解决方案的关键在于提出一种完全自动化的自监督数据整理流水线,以从原始PAM数据中构建一个多样化且平衡的数据集。该流水线整合了自动识别系统(AIS)数据与美国水域中多种水听器的录音,利用分层k均值聚类对原始音频数据进行采样,并与AIS样本结合,从而生成适用于自监督学习模型训练的数据集。

链接: https://arxiv.org/abs/2505.20066
作者: Hilde I Hummel,Sandjai Bhulai,Burooj Ghani,Rob van der Mei
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The sustainability of the ocean ecosystem is threatened by increased levels of sound pollution, making monitoring crucial to understand its variability and impact. Passive acoustic monitoring (PAM) systems collect a large amount of underwater sound recordings, but the large volume of data makes manual analysis impossible, creating the need for automation. Although machine learning offers a potential solution, most underwater acoustic recordings are unlabeled. Self-supervised learning models have demonstrated success in learning from large-scale unlabeled data in various domains like computer vision, Natural Language Processing, and audio. However, these models require large, diverse, and balanced datasets for training in order to generalize well. To address this, a fully automated self-supervised data curation pipeline is proposed to create a diverse and balanced dataset from raw PAM data. It integrates Automatic Identification System (AIS) data with recordings from various hydrophones in the U.S. waters. Using hierarchical k-means clustering, the raw audio data is sampled and then combined with AIS samples to create a balanced and diverse dataset. The resulting curated dataset enables the development of self-supervised learning models, facilitating various tasks such as monitoring marine mammals and assessing sound pollution.
zh

[AI-33] SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在应用中的安全性问题,特别是在强化学习与人类反馈(Reinforcement Learning from Human Feedback, RLHF)框架下整合安全约束所带来的复杂性问题。其解决方案的关键在于提出一种名为SafeDPO的新算法,该算法通过在单一策略学习阶段直接优化安全对齐目标,无需引入松弛机制,从而简化了传统方法中复杂的奖励和成本模型拟合及微调过程。SafeDPO仅引入一个额外的超参数以进一步提升安全性,并对标准DPO进行少量修改,实现了在保持与人类偏好对齐性能的同时显著提高LLMs的安全性。

链接: https://arxiv.org/abs/2505.20065
作者: Geon-Hyeong Kim,Youngsoo Jang,Yu Jin Kim,Byoungjip Kim,Honglak Lee,Kyunghoon Bae,Moontae Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 34 pages

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.
zh

[AI-34] Multiple Descents in Deep Learning as a Sequence of Order-Chaos Transitions

【速读】:该论文试图解决长短期记忆网络(Long Short-Term Memory, LSTM)在训练过程中出现的“多峰下降”现象,即在模型过拟合后测试损失多次经历上升和下降的周期性变化问题。解决方案的关键在于通过渐近稳定性分析揭示测试损失的周期性变化与有序态和混沌态之间的相变过程密切相关,且局部最优解始终位于两种状态的临界过渡点,而全局最优解则出现在首次从有序态向混沌态的过渡阶段,此时“混沌边缘”的宽度最大,有利于探索更优的权重配置。

链接: https://arxiv.org/abs/2505.20030
作者: Wenbo Wei,Nicholas Chong Jia Le,Choy Heng Lai,Ling Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:We observe a novel ‘multiple-descent’ phenomenon during the training process of LSTM, in which the test loss goes through long cycles of up and down trend multiple times after the model is overtrained. By carrying out asymptotic stability analysis of the models, we found that the cycles in test loss are closely associated with the phase transition process between order and chaos, and the local optimal epochs are consistently at the critical transition point between the two phases. More importantly, the global optimal epoch occurs at the first transition from order to chaos, where the ‘width’ of the ‘edge of chaos’ is the widest, allowing the best exploration of better weight configurations for learning.
zh

[AI-35] Gradient Inversion Transcript: Leverag ing Robust Generative Priors to Reconstruct Training Data from Gradient Leakage

【速读】:该论文试图解决从泄露的梯度中重建训练数据的问题,这是在分布式学习环境中可能引发隐私泄露的关键挑战。解决方案的关键在于提出一种名为Gradient Inversion Transcript (GIT) 的生成方法,其核心是设计一个与泄露模型结构相匹配的生成攻击模型,并通过理论分析优化其架构。该方法在离线训练后可高效部署,仅依赖泄露的梯度即可重构输入数据,且在多种分布式学习场景下具有广泛适用性。

链接: https://arxiv.org/abs/2505.20026
作者: Xinping Chen,Chen Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose Gradient Inversion Transcript (GIT), a novel generative approach for reconstructing training data from leaked gradients. GIT employs a generative attack model, whose architecture is tailored to align with the structure of the leaked model based on theoretical analysis. Once trained offline, GIT can be deployed efficiently and only relies on the leaked gradients to reconstruct the input data, rendering it applicable under various distributed learning environments. When used as a prior for other iterative optimization-based methods, GIT not only accelerates convergence but also enhances the overall reconstruction quality. GIT consistently outperforms existing methods across multiple datasets and demonstrates strong robustness under challenging conditions, including inaccurate gradients, data distribution shifts and discrepancies in model parameters.
zh

[AI-36] he Many Challenges of Human-Like Agents in Virtual Game Environments AAMAS-2025

【速读】:该论文试图解决如何建模和实现类人人工智能(Human-like AI)以及如何衡量其类人程度的问题。其解决方案的关键在于通过一项实证研究,在战术视频游戏中利用一种自定义的深度循环卷积神经网络(deep recurrent convolutional neural network)来判断玩家是人类还是AI代理。研究假设,对于特定游戏而言,构建类人AI的难度越大,区分人类与AI驱动玩家的方法就越容易开发。

链接: https://arxiv.org/abs/2505.20011
作者: Maciej Świechowski(1),Dominik Ślęzak(2) ((1) QED Software, (2) University of Warsaw)
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: In proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2025), pages 1996–2005, May 19-23, Detroit, Michigan, USA

点击查看摘要

Abstract:Human-like agents are an increasingly important topic in games and beyond. Believable non-player characters enhance the gaming experience by improving immersion and providing entertainment. They also offer players the opportunity to engage with AI entities that can function as opponents, teachers, or cooperating partners. Additionally, in games where bots are prohibited – and even more so in non-game environments – there is a need for methods capable of identifying whether digital interactions occur with bots or humans. This leads to two fundamental research questions: (1) how to model and implement human-like AI, and (2) how to measure its degree of human likeness. This article offers two contributions. The first one is a survey of the most significant challenges in implementing human-like AI in games (or any virtual environment featuring simulated agents, although this article specifically focuses on games). Thirteen such challenges, both conceptual and technical, are discussed in detail. The second is an empirical study performed in a tactical video game that addresses the research question: “Is it possible to distinguish human players from bots (AI agents) based on empirical data?” A machine-learning approach using a custom deep recurrent convolutional neural network is presented. We hypothesize that the more challenging it is to create human-like AI for a given game, the easier it becomes to develop a method for distinguishing humans from AI-driven players. Comments: In proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2025), pages 1996–2005, May 19-23, Detroit, Michigan, USA Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM) MSC classes: 68T01 ACMclasses: I.2; I.6.0; H.1.2 Cite as: arXiv:2505.20011 [cs.AI] (or arXiv:2505.20011v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.20011 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-37] DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

【速读】:该论文试图解决在数字取证与事件响应(Digital Forensics and Incident Response, DFIR)领域中,缺乏一个全面的基准来评估大型语言模型(Large Language Models, LLMs)在理论和实践层面表现的问题。其解决方案的关键是提出DFIR-Metric,这是一个包含三个组成部分的基准:知识评估、现实取证挑战和实际分析,旨在提供一个严谨且可复现的评估框架,以推动AI在数字取证中的应用。此外,还引入了任务理解得分(Task Understanding Score, TUS)作为新指标,以更有效地评估模型在接近零准确率场景下的表现。

链接: https://arxiv.org/abs/2505.19973
作者: Bilel Cherif,Tamas Bisztray,Richard A. Dubniczky,Aaesha Aldahmani,Saeed Alshehhi,Norbert Tihanyi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at this https URL.
zh

[AI-38] Learning to Select In-Context Demonstration Preferred by Large Language Model

【速读】:该论文试图解决在上下文学习(In-context learning, ICL)中,如何有效选择对任务适应具有真正帮助的示例问题(demonstrations)这一挑战。现有方法虽然尝试通过检索机制来选择与查询相关的示例,但通常依赖于间接优化目标(如度量学习),无法直接提升ICL性能,且在候选示例质量不足时效果不佳。解决方案的关键在于提出GenICL,这是一种基于生成式偏好学习(Generative preference learning)的框架,通过利用大语言模型(LLM)的反馈直接优化示例选择过程,从而提升ICL的效果。

链接: https://arxiv.org/abs/2505.19966
作者: Zheng Zhang,Shaocheng Lan,Lei Song,Jiang Bian,Yexin Li,Kan Ren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL achieves superior performance than existing methods in selecting the most effective demonstrations, leading to better ICL performance.
zh

[AI-39] Adaptive Location Hierarchy Learning for Long-Tailed Mobility Prediction

【速读】:该论文旨在解决长尾分布下的人类移动性预测问题,即在位置分布严重不平衡的情况下,如何准确预测用户接下来的访问位置。现有方法主要关注数据、模型或类别层面的分布再平衡,而忽略了位置的时空语义信息。该研究提出了一种名为ALOHA(Adaptive LOcation HierArchy learning)的首个可插拔框架,其关键在于通过利用大语言模型(Large Language Models, LLMs)和马斯洛需求理论设计的思维链(Chain-of-Thought, CoT)提示,构建城市定制的位置层次结构,并通过Gumbel扰动和节点自适应权重优化层次结构中的位置预测,从而有效捕捉位置的时空语义信息。

链接: https://arxiv.org/abs/2505.19965
作者: Yu Wang,Junshu Dai,Yuchen Ying,Yuxuan Liang,Tongya Zheng,Mingli Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human mobility prediction is crucial for applications ranging from location-based recommendations to urban planning, which aims to forecast users’ next location visits based on historical trajectories. Despite the severe long-tailed distribution of locations, the problem of long-tailed mobility prediction remains largely underexplored. Existing long-tailed learning methods primarily focus on rebalancing the skewed distribution at the data, model, or class level, neglecting to exploit the spatiotemporal semantics of locations. To address this gap, we propose the first plug-and-play framework for long-tailed mobility prediction in an exploitation and exploration manner, named \textbfAdaptive \textbfLOcation \textbfHier\textbfArchy learning (ALOHA). First, we construct city-tailored location hierarchy based on Large Language Models (LLMs) by exploiting Maslow’s theory of human motivation to design Chain-of-Thought (CoT) prompts that captures spatiotemporal semantics. Second, we optimize the location hierarchy predictions by Gumbel disturbance and node-wise adaptive weights within the hierarchical tree structure. Experiments on state-of-the-art models across six datasets demonstrate the framework’s consistent effectiveness and generalizability, which strikes a well balance between head and tail locations. Weight analysis and ablation studies reveal the optimization differences of each component for head and tail locations. Furthermore, in-depth analyses of hierarchical distance and case study demonstrate the effective semantic guidance from the location hierarchy. Our code will be made publicly available.
zh

[AI-40] Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy INTERSPEECH2025

【速读】:该论文旨在解决语音模型中个人数据安全处理的问题,特别是针对用户身份识别的恶意攻击,通过提出说话人匿名化方法来增强隐私保护。其解决方案的关键在于引入一种新颖的指数总方差(Exponential Total Variance, TV)损失函数,以及一种可扩展的通用对抗补丁(Universal Adversarial Patch, UAP)插入流程,从而有效提升UAP的强度和不可感知性,并实现对不同音频长度的稳定高性能表现。

链接: https://arxiv.org/abs/2505.19951
作者: Elvir Karimov,Alexander Varlamov,Danil Ivanov,Dmitrii Korzh,Oleg Y. Rogov
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
备注: 5 pages, 3 figures, 1 table; Submitted to Interspeech 2025

点击查看摘要

Abstract:Deep learning voice models are commonly used nowadays, but the safety processing of personal data, such as human identity and speech content, remains suspicious. To prevent malicious user identification, speaker anonymization methods were proposed. Current methods, particularly based on universal adversarial patch (UAP) applications, have drawbacks such as significant degradation of audio quality, decreased speech recognition quality, low transferability across different voice biometrics models, and performance dependence on the input audio length. To mitigate these drawbacks, in this work, we introduce and leverage the novel Exponential Total Variance (TV) loss function and provide experimental evidence that it positively affects UAP strength and imperceptibility. Moreover, we present a novel scalable UAP insertion procedure and demonstrate its uniformly high performance for various audio lengths.
zh

[AI-41] Dynamically Learned Test-Time Model Routing in Language Model Zoos with Service Level Guarantees

【速读】:该论文旨在解决在开放权重大语言模型(Large Language Model, LLM)动物园中,如何为特定任务选择合适的模型以实现成本最优且满足服务等级协议(Service Level Agreement, SLA)的问题。解决方案的关键在于提出一种基于随机优化的算法MESS+,该算法通过实时学习用户与系统交互中的请求满意度概率,并结合虚拟队列和请求满意度预测,动态求解每个请求的优化问题,从而在保证SLA合规性的前提下实现成本最小化。

链接: https://arxiv.org/abs/2505.19947
作者: Herbert Woisetschläger,Ryan Zhang,Shiqiang Wang,Hans-Arno Jacobsen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Preprint. Under review

点击查看摘要

Abstract:Open-weight LLM zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of 2x cost savings compared to existing LLM routing techniques.
zh

[AI-42] Subtle Risks Critical Failures: A Framework for Diagnosing Physical Safety of LLM s for Embodied Decision Making

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在具身代理决策中的物理安全性评估问题,现有安全评估方法依赖于粗粒度的成功率和领域特定设置,难以诊断模型失败的原因和位置,从而阻碍了对具身安全的理解以及LLMs在高风险物理环境中的选择性部署。其解决方案的关键在于提出SAFEL框架,该框架系统性地评估LLMs在具身决策中的物理安全性,通过两个核心能力进行评估:(1)通过Command Refusal Test拒绝不安全指令,(2)通过Plan Safety Test生成安全且可执行的计划,并将后者分解为功能模块,如目标解释、转移建模和动作排序,以实现对安全故障的细粒度诊断。

链接: https://arxiv.org/abs/2505.19933
作者: Yejin Son,Minseo Kim,Sungwoong Kim,Seungju Han,Jian Kim,Dongju Jang,Youngjae Yu,Chanyoung Park
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 37 pages, 13 tables, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used for decision making in embodied agents, yet existing safety evaluations often rely on coarse success rates and domain-specific setups, making it difficult to diagnose why and where these models fail. This obscures our understanding of embodied safety and limits the selective deployment of LLMs in high-risk physical environments. We introduce SAFEL, the framework for systematically evaluating the physical safety of LLMs in embodied decision making. SAFEL assesses two key competencies: (1) rejecting unsafe commands via the Command Refusal Test, and (2) generating safe and executable plans via the Plan Safety Test. Critically, the latter is decomposed into functional modules, goal interpretation, transition modeling, action sequencing, enabling fine-grained diagnosis of safety failures. To support this framework, we introduce EMBODYGUARD, a PDDL-grounded benchmark containing 942 LLM-generated scenarios covering both overtly malicious and contextually hazardous instructions. Evaluation across 13 state-of-the-art LLMs reveals that while models often reject clearly unsafe commands, they struggle to anticipate and mitigate subtle, situational risks. Our results highlight critical limitations in current LLMs and provide a foundation for more targeted, modular improvements in safe embodied reasoning.
zh

[AI-43] CP: a Benchmark for Temporal Constraint-Based Planning

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在时间推理与规划能力评估中缺乏综合性基准的问题,现有基准通常孤立且局限于简单的复杂性形式。其解决方案的关键在于引入Temporal Constraint-based Planning (TCP) 基准,该基准通过自然对话场景中的协作项目实例,联合评估时间推理与规划能力,其中包含显式或隐式的多种相互依赖的时间约束,要求模型推导出满足所有约束的最优调度方案。

链接: https://arxiv.org/abs/2505.19927
作者: Zifeng Ding,Sikuan Yan,Zhangdie Yuan,Xianglong Hu,Fangru Lin,Andreas Vlachos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark, that jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we first generate abstract problem prototypes that are paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models struggle with TCP, highlighting its difficulty and revealing limitations in LLMs’ temporal constraint-based planning abilities. We analyze underlying failure cases, open source our benchmark, and hope our findings can inspire future research.
zh

[AI-44] Evaluating AI cyber capabilities with crowdsourced elicitation

【速读】:该论文试图解决如何有效评估和理解人工智能系统在网络安全领域的潜在攻击能力,以支持负责任的治理和部署。传统上,安全组织通过内部进行“AI elicitation”(AI诱捕)来获取AI的任务特定性能,但这种方法存在局限性。论文提出的解决方案是通过众包方式开展AI诱捕,即在CTF竞赛中开放AI赛道,吸引大量团队参与,从而更全面地评估AI的能力。其关键在于利用开放市场机制,通过设置奖励(elicitation bounties)来实现对新兴AI能力的及时、低成本的情报收集,并验证AI在特定任务中的表现可媲美甚至超越人类参与者。

链接: https://arxiv.org/abs/2505.19915
作者: Artem Petrov,Dmitrii Volkov
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI systems become increasingly capable, understanding their offensive cyber potential is critical for informed governance and responsible deployment. However, it’s hard to accurately bound their capabilities, and some prior evaluations dramatically underestimated them. The art of extracting maximum task-specific performance from AIs is called “AI elicitation”, and today’s safety organizations typically conduct it in-house. In this paper, we explore crowdsourcing elicitation efforts as an alternative to in-house elicitation work. We host open-access AI tracks at two Capture The Flag (CTF) competitions: AI vs. Humans (400 teams) and Cyber Apocalypse_ (4000 teams). The AI teams achieve outstanding performance at both events, ranking top-13% and top-21% respectively for a total of \ 7500 in bounties. This impressive performance suggests that open-market elicitation may offer an effective complement to in-house elicitation. We propose elicitation bounties as a practical mechanism for maintaining timely, cost-effective situational awareness of emerging AI capabilities. Another advantage of open elicitations is the option to collect human performance data at scale. Applying METR’s methodology, we found that AI agents can reliably solve cyber challenges requiring one hour or less of effort from a median human CTF participant. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19915 [cs.CR] (or arXiv:2505.19915v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.19915 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-45] EMAC: Embodied Multimodal Agent for Collaborative Planning with VLMLLM

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在机器人控制中应用时存在的三个关键问题:(1) LLM代理主要设计用于处理文本输入而非视觉条件;(2) 当前多模态代理将LLM视为静态规划器,导致其推理与环境动态分离,从而生成不考虑领域知识的动作;(3) LLM无法通过视觉交互进行学习,难以为特定领域制定更优策略。解决方案的关键在于提出EMAC+,一个具身多模态代理,通过双向训练范式将LLM与视觉语言模型(Visual Language Model, VLM)协同集成,动态利用VLM实时反馈优化LLM生成的高层文本计划,使LLM能够通过交互经验直接内化视觉环境动态,而非仅依赖静态符号映射。

链接: https://arxiv.org/abs/2505.19905
作者: Shuang Ao,Flora D. Salim,Simon Khan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
zh

[AI-46] Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在模型融合中的基准缺失问题,以及如何有效整合不同模态的专家模型以提升整体性能。其关键解决方案是构建一个涵盖多种任务(如VQA、几何、图表、OCR和定位)的模型融合基准,并引入一种新颖的方法,通过去除任务向量中的噪声并基于任务向量交互定义的损失函数进行鲁棒优化,从而实现更优的模型融合效果。

链接: https://arxiv.org/abs/2505.19892
作者: Yongxian Wei,Runxi Cheng,Weike Jin,Enneng Yang,Li Shen,Lu Hou,Sinan Du,Chun Yuan,Xiaochun Cao,Dacheng Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While foundation models update slowly due to resource-intensive training requirements, domain-specific models evolve between updates. Model merging aims to combine multiple expert models into a single, more capable model, thereby reducing storage and serving costs while supporting decentralized model development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Multimodal Large Language Models (MLLMs), which extend the capabilities of LLMs through large-scale multimodal training, have gained traction. However, there lacks a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. In this paper, (i) we introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward the Omni-language model. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector based on a loss defined over task vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising way for building improved MLLMs without requiring data training. Our results also demonstrate that the complementarity among multiple modalities outperforms individual modalities.
zh

[AI-47] Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities

【速读】:该论文试图解决生成式 AI (Generative AI) 在二进制分析中的有效性问题,特别是针对汇编代码去混淆(assembly code deobfuscation)的挑战。其关键解决方案是提出一个基于四个维度(推理深度、模式识别、噪声过滤和上下文整合)的理论框架,用以解释不同商业大语言模型(LLMs)在面对多种混淆技术时表现差异的原因,并通过分析识别出五种错误模式,揭示了LLMs在处理代码时的根本局限性。

链接: https://arxiv.org/abs/2505.19887
作者: Anton Tkachenko,Dmitrij Suskevic,Benjamin Adolphi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown promise in software engineering, yet their effectiveness for binary analysis remains unexplored. We present the first comprehensive evaluation of commercial LLMs for assembly code deobfuscation. Testing seven state-of-the-art models against four obfuscation scenarios (bogus control flow, instruction substitution, control flow flattening, and their combination), we found striking performance variations–from autonomous deobfuscation to complete failure. We propose a theoretical framework based on four dimensions: Reasoning Depth, Pattern Recognition, Noise Filtering, and Context Integration, explaining these variations. Our analysis identifies five error patterns: predicate misinterpretation, structural mapping errors, control flow misinterpretation, arithmetic transformation errors, and constant propagation errors, revealing fundamental limitations in LLM code this http URL establish a three-tier resistance model: bogus control flow (low resistance), control flow flattening (moderate resistance), and instruction substitution/combined techniques (high resistance). Universal failure against combined techniques demonstrates that sophisticated obfuscation remains effective against advanced LLMs. Our findings suggest a human-AI collaboration paradigm where LLMs reduce expertise barriers for certain reverse engineering tasks while requiring human guidance for complex deobfuscation. This work provides a foundation for evaluating emerging capabilities and developing resistant obfuscation techniques.x deobfuscation. This work provides a foundation for evaluating emerging capabilities and developing resistant obfuscation techniques.
zh

[AI-48] Deep Active Inference Agents for Delayed and Long-Horizon Environments

【速读】:该论文旨在解决传统主动推理(Active Inference, AIF)代理在延迟环境中进行长时域规划时存在的局限性,即依赖于精确的即时预测和穷举式规划,导致计算效率低下。其解决方案的关键在于提出一种生成式策略架构,包含多步潜在转移、集成策略网络、交替优化方案以及单步梯度规划,从而实现高效的长时域决策,无需人工设计奖励或昂贵的规划过程。

链接: https://arxiv.org/abs/2505.19867
作者: Yavar Taheri Yeganeh,Mohsen Jafari,Andrea Matta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the recent success of world-model agents, which extend the core idea of model-based reinforcement learning by learning a differentiable model for sample-efficient control across diverse tasks, active inference (AIF) offers a complementary, neuroscience-grounded paradigm that unifies perception, learning, and action within a single probabilistic framework powered by a generative model. Despite this promise, practical AIF agents still rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments requiring plans over long horizons, tens to hundreds of steps. Moreover, most existing agents are evaluated on robotic or vision benchmarks which, while natural for biological agents, fall short of real-world industrial complexity. We address these limitations with a generative-policy architecture featuring (i) a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead, (ii) an integrated policy network that enables the transition and receives gradients of the expected free energy, (iii) an alternating optimization scheme that updates model and policy from a replay buffer, and (iv) a single gradient step that plans over long horizons, eliminating exhaustive planning from the control loop. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long-horizon settings. The empirical results confirm the effectiveness of the proposed approach, demonstrating the coupled world-model with the AIF formalism yields an end-to-end probabilistic controller capable of effective decision making in delayed, long-horizon settings without handcrafted rewards or expensive planning.
zh

[AI-49] DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

【速读】:该论文试图解决稀疏奖励强化学习(Sparse-reward reinforcement learning, RL)中高效探索与长期信用分配的挑战,特别是在高维、长时序任务中的可扩展性问题。现有方法通常设计通用的探索策略以解决任意任务,导致在处理复杂高维、长时序任务时难以有效探索。论文的关键解决方案是提出一种定向稀疏奖励目标条件化长时序RL方法(DISCOVER),其核心在于从现有RL算法中提取方向感,从而在目标任务的方向上选择探索目标,而非盲目探索整个任务空间。这一方法通过理论分析与实验验证,证明了其在高维环境中优于现有最先进探索方法的有效性。

链接: https://arxiv.org/abs/2505.19850
作者: Leander Diaz-Bone,Marco Bagatella,Jonas Hübotter,Andreas Krause
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Sparse-reward reinforcement learning (RL) can model a wide range of highly complex tasks. Solving sparse-reward tasks is RL’s core premise - requiring efficient exploration coupled with long-horizon credit assignment - and overcoming these challenges is key for building self-improving agents with superhuman ability. We argue that solving complex and high-dimensional tasks requires solving simpler tasks that are relevant to the target task. In contrast, most prior work designs strategies for selecting exploratory tasks with the objective of solving any task, making exploration of challenging high-dimensional, long-horizon tasks intractable. We find that the sense of direction, necessary for effective exploration, can be extracted from existing RL algorithms, without needing any prior information. Based on this finding, we propose a method for directed sparse-reward goal-conditioned very long-horizon RL (DISCOVER), which selects exploratory goals in the direction of the target task. We connect DISCOVER to principled exploration in bandits, formally bounding the time until the target task becomes achievable in terms of the agent’s initial distance to the target, but independent of the volume of the space of all tasks. Empirically, we perform a thorough evaluation in high-dimensional environments. We find that the directed goal selection of DISCOVER solves exploration problems that are beyond the reach of prior state-of-the-art exploration methods in RL.
zh

[AI-50] DGRAG : Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems

【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)在处理分布式数据时面临的隐私泄露、高计算成本以及中心节点搜索大规模知识库时的延迟问题。其解决方案的关键在于提出一种基于知识图谱的分布式RAG方法(DGRAG),该方法在边缘-云系统中运行,每个边缘设备维护本地知识库,并仅将知识摘要共享至云端,从而实现数据隐私保护与资源高效利用。DGRAG通过两个主要阶段——分布式知识构建和协同检索与生成——来优化知识检索与答案生成过程,提升问答任务的质量。

链接: https://arxiv.org/abs/2505.19847
作者: Wenqing Zhou,Yuxuan Yan,Qianqian Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the capabilities of language models by integrating external knowledge. Due to the diversity of data sources and the constraints of memory and computing resources, real-world data is often scattered in multiple devices. Conventional RAGs that store massive amounts of scattered data centrally face increasing privacy concerns and high computational costs. Additionally, RAG in a central node raises latency issues when searching over a large-scale knowledge base. To address these challenges, we propose a distributed Knowledge Graph-based RAG approach, referred to as DGRAG, in an edge-cloud system, where each edge device maintains a local knowledge base without the need to share it with the cloud, instead sharing only summaries of its knowledge. Specifically, DGRAG has two main phases. In the Distributed Knowledge Construction phase, DGRAG organizes local knowledge using knowledge graphs, generating subgraph summaries and storing them in a summary database in the cloud as information sharing. In the Collaborative Retrieval and Generation phase, DGRAG first performs knowledge retrieval and answer generation locally, and a gate mechanism determines whether the query is beyond the scope of local knowledge or processing capabilities. For queries that exceed the local knowledge scope, the cloud retrieves knowledge from the most relevant edges based on the summaries and generates a more precise answer. Experimental results demonstrate the effectiveness of the proposed DGRAG approach in significantly improving the quality of question-answering tasks over baseline approaches.
zh

[AI-51] PCDCNet: A Surrogate Model for Air Quality Forecasting with Physical-Chemical Dynamics and Constraints

【速读】:该论文旨在解决空气质量预测(Air Quality Forecasting, AQF)中的挑战,即传统数值模型计算成本高且依赖不确定的排放清单,而深度学习模型虽计算效率高但缺乏物理约束导致泛化能力不足。其解决方案的关键在于提出PCDCNet,一种将数值建模原理与深度学习相结合的代理模型,通过显式引入排放、气象影响及领域知识约束,结合图结构空间传输建模、循环结构时间累积以及局部交互表示增强,实现了高精度且低计算成本的污染物浓度预测。

链接: https://arxiv.org/abs/2505.19842
作者: Shuo Wang,Yun Cheng,Qingye Meng,Olga Saukh,Jiang Zhang,Jingfang Fan,Yuanting Zhang,Xingyuan Yuan,Lothar Thiele
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Air quality forecasting (AQF) is critical for public health and environmental management, yet remains challenging due to the complex interplay of emissions, meteorology, and chemical transformations. Traditional numerical models, such as CMAQ and WRF-Chem, provide physically grounded simulations but are computationally expensive and rely on uncertain emission inventories. Deep learning models, while computationally efficient, often struggle with generalization due to their lack of physical constraints. To bridge this gap, we propose PCDCNet, a surrogate model that integrates numerical modeling principles with deep learning. PCDCNet explicitly incorporates emissions, meteorological influences, and domain-informed constraints to model pollutant formation, transport, and dissipation. By combining graph-based spatial transport modeling, recurrent structures for temporal accumulation, and representation enhancement for local interactions, PCDCNet achieves state-of-the-art (SOTA) performance in 72-hour station-level PM2.5 and O3 forecasting while significantly reducing computational costs. Furthermore, our model is deployed in an online platform, providing free, real-time air quality forecasts, demonstrating its scalability and societal impact. By aligning deep learning with physical consistency, PCDCNet offers a practical and interpretable solution for AQF, enabling informed decision-making for both personal and regulatory applications.
zh

[AI-52] Revisiting Glorot Initialization for Long-Range Linear Recurrences

【速读】:该论文试图解决循环神经网络(Recurrent Neural Networks, RNNs)在长序列推理任务中由于权重矩阵重复应用导致的信号消失或爆炸问题。现有常用基线方法Glorot初始化虽然在无限宽度、固定长度的假设下能够保证信号稳定传播,但在处理长序列时并不适用。论文指出,Glorot初始化实际上存在不稳定性:隐藏状态的谱半径出现微小正偏差后会随时间放大,导致信号爆炸。解决方案的关键在于提出一种简单的、与维度相关的Glorot重缩放方法,将谱半径略微调整至1以下,从而防止信号的快速爆炸或衰减。

链接: https://arxiv.org/abs/2505.19827
作者: Noga Bar,Mariia Seleznova,Yotam Alexander,Gitta Kutyniok,Raja Giryes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Proper initialization is critical for Recurrent Neural Networks (RNNs), particularly in long-range reasoning tasks, where repeated application of the same weight matrix can cause vanishing or exploding signals. A common baseline for linear recurrences is Glorot initialization, designed to ensure stable signal propagation–but derived under the infinite-width, fixed-length regime–an unrealistic setting for RNNs processing long sequences. In this work, we show that Glorot initialization is in fact unstable: small positive deviations in the spectral radius are amplified through time and cause the hidden state to explode. Our theoretical analysis demonstrates that sequences of length t = O(\sqrtn) , where n is the hidden width, are sufficient to induce instability. To address this, we propose a simple, dimension-aware rescaling of Glorot that shifts the spectral radius slightly below one, preventing rapid signal explosion or decay. These results suggest that standard initialization schemes may break down in the long-sequence regime, motivating a separate line of theory for stable recurrent initialization.
zh

[AI-53] Foundation Models for Tabular Data within Systemic Contexts Need Grounding

【速读】:该论文试图解决当前表格基础模型在处理大规模真实世界数据时存在的问题,即通常将表格视为孤立实体并假设信息完整性,从而忽略了重要的操作上下文。解决方案的关键在于引入语义关联表格(Semantically Linked Tables, SLT),通过整合声明性和过程性操作知识,将表格数据置于其真实的操作上下文中,从而充分发挥机器学习在复杂、互联表格数据中的潜力。

链接: https://arxiv.org/abs/2505.19825
作者: Tassilo Klein,Johannes Hoffart
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Current research on tabular foundation models often overlooks the complexities of large-scale, real-world data by treating tables as isolated entities and assuming information completeness, thereby neglecting the vital operational context. To address this, we introduce the concept of Semantically Linked Tables (SLT), recognizing that tables are inherently connected to both declarative and procedural operational knowledge. We propose Foundation Models for Semantically Linked Tables (FMSLT), which integrate these components to ground tabular data within its true operational context. This comprehensive representation unlocks the full potential of machine learning for complex, interconnected tabular data across diverse domains. Realizing FMSLTs requires access to operational knowledge that is often unavailable in public datasets, highlighting the need for close collaboration between domain experts and researchers. Our work exposes the limitations of current tabular foundation models and proposes a new direction centered on FMSLTs, aiming to advance robust, context-aware models for structured data.
zh

[AI-54] LAPA-based Dynamic Privacy Optimization for Wireless Federated Learning in Heterogeneous Environments

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因梯度泄露攻击导致的数据隐私泄露问题,以及差分隐私(Differential Privacy, DP)技术在非独立同分布(Non-IID)数据场景下对FL性能的负面影响。其解决方案的关键在于提出一种轻量级自适应隐私预算分配(Lightweight Adaptive Privacy Allocation, LAPA)策略,通过在每个聚合轮次中为设备分配个性化的隐私预算,无需传输额外信息即可实现隐私保护与聚合效率的平衡。同时,结合深度确定性策略梯度(Deep Deterministic Policy Gradient, DDPG)算法优化传输功率,以动态调整人工噪声与通信噪声的匹配时机,从而在DP和系统效用之间取得有效平衡。

链接: https://arxiv.org/abs/2505.19823
作者: Pengcheng Sun,Erwu Liu,Wei Ni,Rui Wang,Yuanzhe Geng,Lijuan Lai,Abbas Jamalipour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning paradigm based on protecting data privacy of devices, which however, can still be broken by gradient leakage attack via parameter inversion techniques. Differential privacy (DP) technology reduces the risk of private data leakage by adding artificial noise to the gradients, but detrimental to the FL utility at the same time, especially in the scenario where the data is Non-Independent Identically Distributed (Non-IID). Based on the impact of heterogeneous data on aggregation performance, this paper proposes a Lightweight Adaptive Privacy Allocation (LAPA) strategy, which assigns personalized privacy budgets to devices in each aggregation round without transmitting any additional information beyond gradients, ensuring both privacy protection and aggregation efficiency. Furthermore, the Deep Deterministic Policy Gradient (DDPG) algorithm is employed to optimize the transmission power, in order to determine the optimal timing at which the adaptively attenuated artificial noise aligns with the communication noise, enabling an effective balance between DP and system utility. Finally, a reliable aggregation strategy is designed by integrating communication quality and data distribution characteristics, which improves aggregation performance while preserving privacy. Experimental results demonstrate that the personalized noise allocation and dynamic optimization strategy based on LAPA proposed in this paper enhances convergence performance while satisfying the privacy requirements of FL.
zh

[AI-55] FinLoRA: Benchmarking LoRA Methods for Fine-Tuning LLM s on Financial Datasets

【速读】:该论文旨在解决如何将低秩适配(Low-rank adaptation, LoRA)方法有效地应用于高风险领域如金融,以提升预训练通用大型语言模型(Large Language Models, LLMs)在具体金融任务中的性能。其解决方案的关键在于通过构建涵盖多种金融应用场景的基准数据集,并对多种LoRA方法和基础LLMs进行系统评估,从而验证LoRA在金融领域的有效性与可扩展性。实验结果显示,LoRA方法在平均性能上相比基础模型提升了36%。

链接: https://arxiv.org/abs/2505.19819
作者: Dannong Wang,Jaisal Patel,Daochen Zha,Steve Y. Yang,Xiao-Yang Liu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) methods show great potential for scaling pre-trained general-purpose Large Language Models (LLMs) to hundreds or thousands of use scenarios. However, their efficacy in high-stakes domains like finance is rarely explored, e.g., passing CFA exams and analyzing SEC filings. In this paper, we present the open-source FinLoRA project that benchmarks LoRA methods on both general and highly professional financial tasks. First, we curated 19 datasets covering diverse financial applications; in particular, we created four novel XBRL analysis datasets based on 150 SEC filings. Second, we evaluated five LoRA methods and five base LLMs. Finally, we provide extensive experimental results in terms of accuracy, F1, and BERTScore and report computational cost in terms of time and GPU memory during fine-tuning and inference stages. We find that LoRA methods achieved substantial performance gains of 36% on average over base models. Our FinLoRA project provides an affordable and scalable approach to democratize financial intelligence to the general public. Datasets, LoRA adapters, code, and documentation are available at this https URL
zh

[AI-56] Equivariant Representation Learning for Symmetry-Aware Inference with Guarantees

【速读】:该论文旨在解决在回归、条件概率估计和不确定性量化任务中,如何利用物理或几何对称性以提升模型的泛化能力和样本效率的问题。其解决方案的关键在于提出一种等变表示学习框架,该框架基于算子理论和群表示论,通过近似条件期望算子的谱分解,构建出具有等变性和解耦特性的表示,从而在提供首个非渐近统计学习保证的同时,实现对不确定性的精确建模。

链接: https://arxiv.org/abs/2505.19809
作者: Daniel Ordoñez-Apraez,Alek Fröhlich,Vladimir Kostić,Karim Lounici,Vivien Brandt,Massimiliano Pontil
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In many real-world applications of regression, conditional probability estimation, and uncertainty quantification, exploiting symmetries rooted in physics or geometry can dramatically improve generalization and sample efficiency. While geometric deep learning has made significant empirical advances by incorporating group-theoretic structure, less attention has been given to statistical learning guarantees. In this paper, we introduce an equivariant representation learning framework that simultaneously addresses regression, conditional probability estimation, and uncertainty quantification while providing first-of-its-kind non-asymptotic statistical learning guarantees. Grounded in operator and group representation theory, our framework approximates the spectral decomposition of the conditional expectation operator, building representations that are both equivariant and disentangled along independent symmetry subgroups. Empirical evaluations on synthetic datasets and real-world robotics applications confirm the potential of our approach, matching or outperforming existing equivariant baselines in regression while additionally providing well-calibrated parametric uncertainty estimates.
zh

[AI-57] ypes of Relations: Defining Analogies with Category Theory

【速读】:该论文试图解决如何构建、发现和评估类比(analogy)的问题,以实现知识在不同领域之间的有效迁移。其解决方案的关键在于将知识领域形式化为类别(category),并通过范畴论中的概念如函子(functor)、拉回(pullback)和推出(pushout)来定义类比的核心及其对应领域的融合。

链接: https://arxiv.org/abs/2505.19792
作者: Claire Ott,Frank Jäkel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 15 figures

点击查看摘要

Abstract:In order to behave intelligently both humans and machines have to represent their knowledge adequately for how it is used. Humans often use analogies to transfer their knowledge to new domains, or help others with this transfer via explanations. Hence, an important question is: What representation can be used to construct, find, and evaluate analogies? In this paper, we study features of a domain that are important for constructing analogies. We do so by formalizing knowledge domains as categories. We use the well-known example of the analogy between the solar system and the hydrogen atom to demonstrate how to construct domain categories. We also show how functors, pullbacks, and pushouts can be used to define an analogy, describe its core and a corresponding blend of the underlying domains.
zh

[AI-58] Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition

【速读】:该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在生成链式思维(Chain-of-Thought, CoT)过程中存在的计算延迟高、首次标记时间(Time to First Token, TTFT)长的问题。其核心问题在于传统CoT中混合了多个难以显式管理的思考单元,导致效率低下。解决方案的关键是引入多轮分解(Multi-Turn Decomposition, MinD),将常规CoT分解为一系列显式、结构化且分轮次的交互,每个轮次包含一个思考单元并生成对应答案,后续轮次可对前序思考和答案进行反思、验证、修正或探索替代方案,从而提升推理效率并实现对迭代过程的显式控制。

链接: https://arxiv.org/abs/2505.19788
作者: Zihao Zeng,Xuyao Huang,Boxiu Li,Hao Zhang,Zhijie Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) are criticized for the excessively lengthy Chain-of-Thought (CoT) to derive the final answer, suffering from high first-token and overall latency. Typically, the CoT of LRMs mixes multiple thinking units; each unit attempts to produce a candidate answer to the original query. Hence, a natural idea to improve efficiency is to reduce the unit number. Yet, the fact that the thinking units in vanilla CoT cannot be explicitly managed renders doing so challenging. This paper introduces Multi-Turn Decomposition (MinD) to decode conventional CoT into a sequence of explicit, structured, and turn-wise interactions to bridge the gap. In MinD, the model provides a multi-turn response to the query, where each turn embraces a thinking unit and yields a corresponding answer. The subsequent turns can reflect, verify, revise, or explore alternative approaches to both the thinking and answer parts of earlier ones. This not only makes the answer delivered more swiftly, but also enables explicit controls over the iterative reasoning process (i.e., users may halt or continue at any turn). We follow a supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm to realize MinD. We first rephrase the outputs of an LRM into multi-turn formats by prompting another LLM, and then tune the LRM with such data. Observing that the tuned model tends to consume even more tokens than the original one (probably due to that the multi-turn formats introduce additional answer tokens), we advocate leveraging RL algorithms like GRPO to prioritize correct outputs with fewer turns. Trained on the MATH dataset using R1-Distill models, MinD can achieve up to ~70% reduction in both output token usage and time to first token (TTFT), while maintaining competitive performance on reasoning benchmarks such as MATH-500, AIME24, AMC23, and GPQA-Diamond.
zh

[AI-59] MedDreamer: Model-Based Reinforcement Learning with Latent Imagination on Complex EHRs for Clinical Decision Support

【速读】:该论文旨在解决临床决策中因患者响应差异大、数据稀疏且不规则而带来的挑战,传统决策支持系统通过离散化和插值处理数据会破坏关键的时间动态特性,同时忽略数据采集频率的临床意义。其解决方案的关键在于提出MedDreamer,一个基于世界模型的两阶段强化学习框架,该框架引入了自适应特征融合(Adaptive Feature Integration, AFI)模块,以有效建模不规则和稀疏的临床数据,并通过潜在想象生成合理的患者轨迹来增强学习,结合真实与想象经验优化策略,从而在保持临床数据基础的同时超越历史次优决策。

链接: https://arxiv.org/abs/2505.19785
作者: Qianyi Xu,Gousia Habib,Dilruk Perera,Mengling Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Timely and personalized treatment decisions are essential across a wide range of healthcare settings where patient responses vary significantly and evolve over time. Clinical data used to support these decisions are often irregularly sampled, sparse, and noisy. Existing decision support systems commonly rely on discretization and imputation, which can distort critical temporal dynamics and degrade decision quality. Moreover, they often overlook the clinical significance of irregular recording frequencies, filtering out patterns in how and when data is collected. Reinforcement Learning (RL) is a natural fit for clinical decision-making, enabling sequential, long-term optimization in dynamic, uncertain environments. However, most existing treatment recommendation systems are model-free and trained solely on offline data, making them sample-inefficient, sensitive to data quality, and poorly generalizable across tasks or cohorts. To address these limitations, we propose MedDreamer, a two-phase model-based RL framework for personalized treatment recommendation. MedDreamer uses a world model with an Adaptive Feature Integration (AFI) module to effectively model irregular, sparse clinical data. Through latent imagination, it simulates plausible patient trajectories to enhance learning, refining its policy using a mix of real and imagined experiences. This enables learning policies that go beyond suboptimal historical decisions while remaining grounded in clinical data. To our knowledge, this is the first application of latent imagination to irregular healthcare data. Evaluations on sepsis and mechanical ventilation (MV) treatment using two large-scale EHR datasets show that MedDreamer outperforms both model-free and model-based baselines in clinical outcomes and off-policy metrics.
zh

[AI-60] ViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中奖励工程的可扩展性和泛化性问题,特别是在机器人操作这一具有挑战性的领域。传统方法依赖于稀疏奖励,导致样本效率低下。论文提出的解决方案关键在于引入TeViR,该方法利用预训练的文本到视频扩散模型,通过比较预测图像序列与当前观测来生成密集奖励,从而提升样本效率和性能。

链接: https://arxiv.org/abs/2505.19769
作者: Yuhui Chen,Haoran Li,Zhennan Jiang,Haowei Wen,Dongbin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineering with Vision-Language Models (VLMs) have shown promise, their sparse reward nature significantly limits sample efficiency. This paper introduces TeViR, a novel method that leverages a pre-trained text-to-video diffusion model to generate dense rewards by comparing the predicted image sequence with current observations. Experimental results across 11 complex robotic tasks demonstrate that TeViR outperforms traditional methods leveraging sparse rewards and other state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground truth environmental rewards. TeViR’s ability to efficiently guide agents in complex environments highlights its potential to advance reinforcement learning applications in robotic manipulation.
zh

[AI-61] Agent ic Predictor: Performance Prediction for Agent ic Workflows via Multi-View Encoding

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的智能体系统在配置优化过程中面临的挑战,即由于智能体配置、策略和通信模式的搜索空间庞大,导致传统基于启发式调优或穷举评估的方法计算成本高且效果不佳。其解决方案的关键在于提出Agentic Predictor,该预测器通过多视角工作流编码技术,结合代码架构、文本提示和交互图特征进行多视角表示学习,并采用跨领域无监督预训练以在减少工作流评估次数的同时实现高预测精度,从而高效准确地选择最优智能体工作流配置。

链接: https://arxiv.org/abs/2505.19764
作者: Patara Trirat,Wonyong Jeong,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code will be available at this https URL

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms state-of-the-art methods in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.
zh

[AI-62] Language Model-Enhanced Message Passing for Heterophilic Graph Learning

【速读】:该论文旨在解决传统图神经网络(Graph Neural Networks, GNNs)在异质性图(heterophilic graphs)中表现不佳的问题,这类图中相连节点的特征和标签往往不相似。现有方法通过图结构优化或邻居聚合函数调整来应对异质性,但常忽略节点文本的语义潜力,使用次优的消息表示进行传播,并影响同质性图的性能。论文提出的解决方案关键在于引入一种语言模型(Language Model, LM)增强的消息传递方法(LEMP4HG),通过配对节点文本生成连接分析,并将其与节点文本嵌入融合,以语义丰富且自适应平衡的消息进行传播,从而缓解异质区域中的矛盾信号。此外,还引入了基于MVRD(Modulated Variation of Reliable Distance)的主动学习策略,以选择性增强最受影响的节点对,降低分析生成成本并减少对同质区域的干扰。

链接: https://arxiv.org/abs/2505.19762
作者: Wenjun Wang,Dawei Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional graph neural networks (GNNs), which rely on homophily-driven message passing, struggle with heterophilic graphs where connected nodes exhibit dissimilar features and different labels. While existing methods address heterophily through graph structure refinement or adaptation of neighbor aggregation functions, they often overlook the semantic potential of node text, rely on suboptimal message representation for propagation and compromise performance on homophilic graphs. To address these limitations, we propose a novel language model (LM)-enhanced message passing approach for heterophilic graph leaning (LEMP4HG). Specifically, in the context of text-attributed graph, we provide paired node texts for LM to generate their connection analysis, which are encoded and then fused with paired node textual embeddings through a gating mechanism. The synthesized messages are semantically enriched and adaptively balanced with both nodes’ information, which mitigates contradictory signals when neighbor aggregation in heterophilic regions. Furthermore, we introduce an active learning strategy guided by our heuristic MVRD (Modulated Variation of Reliable Distance), selectively enhancing node pairs suffer most from message passing, reducing the cost of analysis generation and side effects on homophilic regions. Extensive experiments validate that our approach excels on heterophilic graphs and performs robustly on homophilic ones, with a graph convolutional network (GCN) backbone and a practical budget.
zh

[AI-63] Divide and Conquer: Grounding LLM s as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning ICML2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长时域决策任务中因探索能力不足和长期信用分配问题而表现不佳的问题,尤其是在稀疏奖励场景下。其解决方案的关键在于提出了一种名为GLIDER的框架,该框架通过离线分层强化学习(Offline Hierarchical Reinforcement Learning)引入了一种参数高效且通用的层次结构,将低级控制器与由高级策略生成的抽象、分步计划进行监督学习,从而将复杂问题分解为一系列连贯的思维链子任务,提升长时域任务的探索与学习效果,并增强了对非平稳环境的在线适应能力。

链接: https://arxiv.org/abs/2505.19761
作者: Zican Hu,Wei Liu,Xiaoye Qu,Xiangyu Yue,Chunlin Chen,Zhi Wang,Yu Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025, 21 pages

点击查看摘要

Abstract:While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework GLIDER (Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
zh

[AI-64] ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection

【速读】:该论文试图解决使用硬件描述语言(HDL)进行编码的耗时和繁琐问题,特别是针对基于Scala的下一代HDL——Chisel代码生成的效率与质量提升问题。解决方案的关键在于提出ReChisel,一个基于大型语言模型(LLM)的代理系统,其核心机制包括反射机制(reflection mechanism)以通过编译和仿真反馈迭代优化生成代码的质量,以及逃生机制(escape mechanism)以避免陷入无进展循环,从而显著提高Chisel代码生成的成功率并达到与当前最先进的Verilog代码生成代理系统相当的性能。

链接: https://arxiv.org/abs/2505.19734
作者: Juxin Niu,Xiangfeng Liu,Dan Niu,Xi Wang,Zhe Jiang,Nan Guan
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Accepted to DAC 2025

点击查看摘要

Abstract:Coding with hardware description languages (HDLs) such as Verilog is a time-intensive and laborious task. With the rapid advancement of large language models (LLMs), there is increasing interest in applying LLMs to assist with HDL coding. Recent efforts have demonstrated the potential of LLMs in translating natural language to traditional HDL Verilog. Chisel, a next-generation HDL based on Scala, introduces higher-level abstractions, facilitating more concise, maintainable, and scalable hardware designs. However, the potential of using LLMs for Chisel code generation remains largely unexplored. This work proposes ReChisel, an LLM-based agentic system designed to enhance the effectiveness of Chisel code generation. ReChisel incorporates a reflection mechanism to iteratively refine the quality of generated code using feedback from compilation and simulation processes, and introduces an escape mechanism to break free from non-progress loops. Experiments demonstrate that ReChisel significantly improves the success rate of Chisel code generation, achieving performance comparable to state-of-the-art LLM-based agentic systems for Verilog code generation.
zh

[AI-65] OCN: Effectively Utilizing Higher-Order Common Neighbors for Better Link Prediction

【速读】:该论文试图解决高阶公共邻居(High-order Common Neighbors)在链接预测任务中存在冗余和过平滑的问题,这些问题限制了现有方法对公共邻居(Common Neighbors, CNs)潜力的充分挖掘。解决方案的关键在于引入正交化(Orthogonalization)以消除不同阶数CNs之间的冗余,并采用归一化(Normalization)以缓解过平滑现象,从而提出了一种名为正交公共邻居(Orthogonal Common Neighbor, OCN)的新方法。

链接: https://arxiv.org/abs/2505.19719
作者: Juntong Wang,Xiyuan Wang,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 35 pages, 10 figures

点击查看摘要

Abstract:Common Neighbors (CNs) and their higher-order variants are important pairwise features widely used in state-of-the-art link prediction methods. However, existing methods often struggle with the repetition across different orders of CNs and fail to fully leverage their potential. We identify that these limitations stem from two key issues: redundancy and over-smoothing in high-order common neighbors. To address these challenges, we design orthogonalization to eliminate redundancy between different-order CNs and normalization to mitigate over-smoothing. By combining these two techniques, we propose Orthogonal Common Neighbor (OCN), a novel approach that significantly outperforms the strongest baselines by an average of 7.7% on popular link prediction benchmarks. A thorough theoretical analysis is provided to support our method. Ablation studies also verify the effectiveness of our orthogonalization and normalization techniques.
zh

[AI-66] Concise Reasoning Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting

【速读】:该论文试图解决现有链式思维(Chain-of-Thought, CoT)蒸馏方法在推理轨迹冗长和适应性不足方面的局限性。其关键解决方案是提出一种难度感知提示(Difficulty-aware Prompting, DAP)方法,通过让大教师模型判断问题难度并重写推理轨迹至适当长度,从而动态缩短推理过程而不损失性能,最终生成简洁且完整的推理示例。

链接: https://arxiv.org/abs/2505.19716
作者: Yifan Wu,Jingze Shi,Bingheng Wu,Jiayi Zhang,Xiaotian Lin,Nan Tang,Yuyu Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing chain-of-thought (CoT) distillation methods can effectively transfer reasoning abilities to base models but suffer from two major limitations: excessive verbosity of reasoning traces and inadequate adaptability to problem difficulty. Long reasoning traces significantly increase inference costs, and uniform-length solutions prevent base models from learning adaptive reasoning strategies. To address these issues, we propose a difficulty-aware prompting (DAP) method to dynamically shorten reasoning traces without performance loss. In our approach, a large teacher model first judges each problem’s difficulty and then rewrites its reasoning traces to an appropriate shorter length, yielding concise yet complete reasoning traces. Leveraging the DAP pipeline, we curate a distilled dataset called LiteCoT consisting of 100K concise reasoning examples, with solutions averaging only 720 tokens (an order of magnitude shorter than typical CoTs). Using LiteCoT, we distilled a new family of reasoning models called Liter (1.5B, 7B, and 32B) based on the Qwen2.5 architecture. Experiments show that a student model fine-tuned on just 100K of these difficulty-pruned CoT samples outperforms a model distilled on 800K original Long CoT samples, while significantly reducing training and inference costs. Our method also generalizes well: across 11 diverse benchmarks, the shorter difficulty-aware CoTs achieve equal or better accuracy than Long chains, using far fewer tokens. For example, on the challenging AIME24 exam, our approach reaches 74.2% Pass@1 using only about 5K inference tokens, surpassing other methods that consume many more tokens. Our code and data are available at this https URL.
zh

[AI-67] Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中由于模型和数据异质性导致的表示不一致与优化动态差异问题,这些问题会阻碍全局模型的鲁棒性能。解决方案的关键在于提出Mosaic框架,该框架通过训练本地生成模型来近似每个客户端的个性化分布,从而生成隐私保护的合成数据,并基于客户端模型的专业知识构建混合专家(Mixture-of-Experts, MoE)结构,将其通过生成数据蒸馏到全局模型中,同时引入轻量级元模型以增强MoE架构的集成效果。

链接: https://arxiv.org/abs/2505.19699
作者: Junming Liu,Yanting Gao,Siyuan Meng,Yifei Sun,Aoqi Wu,Yufei Jin,Yirong Chen,Ding Wang,Guosun Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 43 pages, 23 figures, 15 tables; the last dance

点击查看摘要

Abstract:Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client’s personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at this https URL.
zh

[AI-68] JEDI: Latent End-to-end Diffusion Mitigates Agent -Human Performance Asymmetry in Model-Based Reinforcement Learning

【速读】:该论文试图解决模型基于强化学习(MBRL)在Atari100k基准测试中表现出的性能不对称问题,即MBRL代理在某些任务上显著优于人类,而在其他任务上则明显落后,这种不对称性导致了整体评估指标的失真。解决方案的关键在于通过将所有任务划分为Agent-Optimal或Human-Optimal,并强调两类任务指标的同等重要性来重新评估性能。此外,作者提出了一种新的联合嵌入扩散模型(JEDI),该模型通过端到端训练与自洽性目标,解决了像素基方法中缺乏时间结构化的潜在空间的问题,从而在人类最优任务中取得了优于当前最先进模型的表现。

链接: https://arxiv.org/abs/2505.19698
作者: Jing Yu Lim,Zarif Ikram,Samson Yu,Haozhe Ma,Tze-Yun Leong,Dianbo Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Preprint

点击查看摘要

Abstract:Recent advances in model-based reinforcement learning (MBRL) have achieved super-human level performance on the Atari100k benchmark, driven by reinforcement learning agents trained on powerful diffusion world models. However, we identify that the current aggregates mask a major performance asymmetry: MBRL agents dramatically outperform humans in some tasks despite drastically underperforming in others, with the former inflating the aggregate metrics. This is especially pronounced in pixel-based agents trained with diffusion world models. In this work, we address the pronounced asymmetry observed in pixel-based agents as an initial attempt to reverse the worrying upward trend observed in them. We address the problematic aggregates by delineating all tasks as Agent-Optimal or Human-Optimal and advocate for equal importance on metrics from both sets. Next, we hypothesize this pronounced asymmetry is due to the lack of temporally-structured latent space trained with the World Model objective in pixel-based methods. Lastly, to address this issue, we propose Joint Embedding DIffusion (JEDI), a novel latent diffusion world model trained end-to-end with the self-consistency objective. JEDI outperforms SOTA models in human-optimal tasks while staying competitive across the Atari100k benchmark, and runs 3 times faster with 43% lower memory than the latest pixel-based diffusion baseline. Overall, our work rethinks what it truly means to cross human-level performance in Atari100k.
zh

[AI-69] EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification INTERSPEECH2025

【速读】:该论文旨在解决语音情感识别(Speech Emotion Recognition, SER)中情感预测的准确性与一致性问题。其解决方案的关键在于提出EmoSphere-SER模型,该模型通过将VAD(Valence, Arousal, Dominance)值转换为球面坐标并划分为多个球形区域,引入辅助分类任务以指导VAD回归,从而提升情感预测性能。此外,动态加权机制和具有多头自注意力的风格池化层被用于捕捉语音的谱性和时序动态特征,进一步增强了模型效果。

链接: https://arxiv.org/abs/2505.19693
作者: Deok-Hyeon Cho,Hyung-Seok Oh,Seung-Bin Kim,Seong-Whan Lee
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Proceedings of Interspeech 2025

点击查看摘要

Abstract:Speech emotion recognition predicts a speaker’s emotional state from speech signals using discrete labels or continuous dimensions such as arousal, valence, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework.
zh

[AI-70] Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models

【速读】:该论文试图解决大型推理模型(Large Reasoning Models, LRs)在安全关键场景中可靠性不足的问题,特别是针对一种称为“表面安全对齐”(Superficial Safety Alignment, SSA)的现象,即模型生成表面上安全的输出,但其内部推理过程未能真正识别和缓解潜在风险,导致多次采样下的安全行为不一致。解决方案的关键在于引入一个名为“超越安全答案”(Beyond Safe Answers, BSA)的新基准,该基准包含2000个具有挑战性的实例,涵盖三种SSA场景类型和九个风险类别,并附有详细的风险推理标注,以系统评估和提升模型在安全推理方面的准确性与一致性。

链接: https://arxiv.org/abs/2505.19690
作者: Baihui Zheng,Boren Zheng,Kerui Cao,Yingshui Tan,Zhendong Liu,Weixun Wang,Jiaheng Liu,Jian Yang,Wenbo Su,Xiaoyong Zhu,Bo Zheng,Kaifu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the remarkable proficiency of \textitLarge Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in safety-critical scenarios remains uncertain. Existing evaluations primarily assess response-level safety, neglecting a critical issue we identify as \textbf\textitSuperficial Safety Alignment (SSA) – a phenomenon where models produce superficially safe outputs while internal reasoning processes fail to genuinely detect and mitigate underlying risks, resulting in inconsistent safety behaviors across multiple sampling attempts. To systematically investigate SSA, we introduce \textbfBeyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types and spanning nine risk categories, each meticulously annotated with risk rationales. Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales. We further explore the efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse decoding strategies in mitigating SSA. Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
zh

[AI-71] DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech INTERSPEECH2025

【速读】:该论文旨在解决跨说话人情感迁移中语音合成存在的说话人特征残留与情感建模不准确问题,即现有音色压缩方法无法完全分离说话人和情感特征,导致情感信息丢失和合成质量下降。其解决方案的关键在于提出DiEmo-TTS,一种自监督蒸馏方法,通过聚类驱动采样和信息扰动策略,在保留情感信息的同时去除无关因素,并结合情感属性预测与说话人嵌入的聚类匹配方法,实现对未标记数据的泛化能力,同时设计双条件Transformer以更有效地集成风格特征。

链接: https://arxiv.org/abs/2505.19687
作者: Deok-Hyeon Cho,Hyung-Seok Oh,Seung-Bin Kim,Seong-Whan Lee
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Proceedings of Interspeech 2025

点击查看摘要

Abstract:Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method to minimize emotional information loss and preserve speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we designed a dual conditioning transformer to integrate style features better. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
zh

[AI-72] Large Language Models Reasoning Stalls: An Investigation into the Capabilities of Frontier Models

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在使用自动化定理证明(Automated Theorem Prover, ATP)推理策略方面的能力评估问题,重点分析了LLMs在PRONTOQA steamroller推理问题上的表现。其解决方案的关键在于开发了评估LLM响应准确性和正确答案相关性的方法,并通过跟踪完成标记(completion tokens)揭示了LLMs推理能力提升的主要来源,即隐藏系统提示或模型对通用思维链(Chain of Thought)提示策略的自动适应。此外,研究还发现当前前沿LLMs在遵循自底向上的推理策略(bottom-up, 也称为前向链式推理)方面表现最佳。

链接: https://arxiv.org/abs/2505.19676
作者: Lachlan McGinness,Peter Baumgartner
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Empirical methods to examine the capability of Large Language Models (LLMs) to use Automated Theorem Prover (ATP) reasoning strategies are studied. We evaluate the performance of State of the Art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems. For that, we develop methods for assessing LLM response accuracy and correct answer correlation. Our results show that progress in improving LLM reasoning abilities has stalled over the nine month period. By tracking completion tokens, we show that almost all improvement in reasoning ability since GPT-4 was released can be attributed to either hidden system prompts or the training of models to automatically use generic Chain of Thought prompting strategies. Among the ATP reasoning strategies tried, we found that current frontier LLMs are best able to follow the bottom-up (also known as forward-chaining) strategy. A low positive correlation was found between an LLM response containing correct reasoning and arriving at the correct conclusion. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19676 [cs.AI] (or arXiv:2505.19676v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.19676 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-73] Automated evaluation of childrens speech fluency for low-resource languages

【速读】:该论文试图解决低资源语言中儿童口语流利度评估的挑战问题,这一领域在多数语言中已有较深入研究,但在低资源语言中仍存在较大困难。解决方案的关键在于结合微调的多语言自动语音识别(ASR)模型、客观指标提取阶段以及生成式预训练变换器(GPT)网络,通过提取语音中的音素错误率、词错误率、语速和语音-停顿持续时间比等客观指标,并利用少量人工标注的真实案例引导GPT分类器进行流利度评分。

链接: https://arxiv.org/abs/2505.19671
作者: Bowen Zhang,Nur Afiqah Abdul Latiff,Justin Kan,Rong Tong,Donny Soh,Xiaoxiao Miao,Ian McLoughlin
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figures, conference

点击查看摘要

Abstract:Assessment of children’s speaking fluency in education is well researched for majority languages, but remains highly challenging for low resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children’s speech in two low-resource languages, Tamil and Malay and compare the classification performance against Random Forest and XGBoost, as well as using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.
zh

[AI-74] A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs? INTERSPEECH2025

【速读】:该论文旨在解决深度学习驱动的音频水印算法评估缺乏标准化基准和系统性比较方法的问题。其解决方案的关键在于构建一个综合的音频攻击管道,模拟真实应用场景中的多种失真(如压缩、背景噪声和混响),并提出一个包含语音、环境声音和音乐录音的多样化测试数据集,从而为音频水印算法提供全面且一致的评估框架。

链接: https://arxiv.org/abs/2505.19663
作者: Yigitcan Özer,Woosung Choi,Joan Serrà,Mayank Kumar Singh,Wei-Hsiang Liao,Yuki Mitsufuji
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 5 pages; 5 tables; accepted at INTERSPEECH 2025

点击查看摘要

Abstract:We present a framework to foster the evaluation of deep learning-based audio watermarking algorithms, establishing a standardized benchmark and allowing systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline, featuring various distortions such as compression, background noise, and reverberation, and propose a diverse test dataset, including speech, environmental sounds, and music recordings. By assessing the performance of four existing watermarking algorithms on our framework, two main insights stand out: (i) neural compression techniques pose the most significant challenge, even when algorithms are trained with such compressions; and (ii) training with audio attacks generally improves robustness, although it is insufficient in some cases. Furthermore, we find that specific distortions, such as polarity inversion, time stretching, or reverb, seriously affect certain algorithms. Our contributions strengthen the robustness and perceptual assessment of audio watermarking algorithms across a wide range of applications, while ensuring a fair and consistent evaluation approach. The evaluation framework, including the attack pipeline, is accessible at this http URL.
zh

[AI-75] Large Language Models in Code Co-generation for Safe Autonomous Vehicles

【速读】:该论文试图解决在汽车领域中,将大型语言模型(Large Language Models, LLMs)用于高级驾驶辅助系统(ADAS)或自动驾驶(AD)系统开发时所面临的潜在安全风险问题。解决方案的关键在于提出一种评估流水线,以对LLMs生成的代码进行合理性检查,从而降低代码审查人员的工作负担。该流水线通过比较六种最先进的LLMs在四个与安全相关的编程任务中的表现,并定性分析其最常见的故障模式,构建了一个故障模式目录,以支持人工审查,同时探讨了LLMs在代码生成中的能力与局限性及其在现有流程中的应用。

链接: https://arxiv.org/abs/2505.19658
作者: Ali Nouri,Beatriz Cabrero-Daniel,Zhennan Fei,Krishna Ronanki,Håkan Sivencrona,Christian Berger
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted in the 44th International Conference on Computer Safety, Reliability and Security (SafeComp 2025)

点击查看摘要

Abstract:Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the process of implementing parts of software systems. When considering its potential use for ADAS or AD systems in the automotive context, there is a need to systematically assess this new setup: LLMs entail a well-documented set of risks for safety-related systems’ development due to their stochastic nature. To reduce the effort for code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline to conduct sanity-checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coders, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults generated by these LLMs, creating a failure-mode catalogue to support human reviewers. Finally, the limitations and capabilities of LLMs in code generation, and the use of the proposed pipeline in the existing process, are discussed.
zh

[AI-76] oken-Importance Guided Direct Preference Optimization

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)生成输出与人类偏好对齐的问题,特别是在直接偏好优化(Direct Preference Optimization, DPO)方法中,由于忽略单个标记(token)的差异性重要性以及对偏好数据集中的判断噪声敏感,导致模型性能受限。该论文提出的解决方案关键在于引入两种创新机制:基于梯度的标记重要性权重,用于动态优先处理关键标记;以及三重损失函数,旨在明确引导模型输出趋近于人类偏好的响应并远离非偏好响应。

链接: https://arxiv.org/abs/2505.19653
作者: Yang Ning,Lin Hai,Liu Yibo,Tian Baoliang,Liu Guoqing,Zhang Haijun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring that large language models (LLMs) generate outputs aligned with human preferences is important for safe and effective AI interactions. While Direct Preference Optimization (DPO) employs an implicit reward function to optimize the policy model, however, it and its related variants overlook the differential importance of individual tokens and are sensitive to judgment noise in preference datasets during generation. Although recent methods attempt to assess the important weight of tokens via probability prediction or simplistic weighting schemes, these evaluation methods are prone to biases and still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), which introduces two key innovations: the gradient-based token-importance weights that dynamically prioritize critical tokens, and a triple loss that explicitly guides model outputs to approach human-preferred responses and stay away from non-preferred responses. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.
zh

[AI-77] Model Enumeration of Two-Variable Logic with Quadratic Delay Complexity

【速读】:该论文试图解决一阶逻辑中无函数、有限域片段(FO^2)的模型枚举问题,即给定一个FO^2句子Γ和正整数n,如何在大小为n的域上枚举所有满足Γ的模型。论文提出了一种新颖算法,其关键在于将模型生成的延迟复杂度控制在域大小n的二次方(忽略对数因子),这一性能接近最优,因为任何模型中二元谓词的解释至少需要Ω(n²)位来表示。

链接: https://arxiv.org/abs/2505.19648
作者: Qiaolan Meng,Juhua Pu,Hongting Niu,Yuyi Wang,Yuanhong Wang,Ondřej Kuželka
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures and to be published in Fortieth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS)

点击查看摘要

Abstract:We study the model enumeration problem of the function-free, finite domain fragment of first-order logic with two variables ( FO^2 ). Specifically, given an FO^2 sentence \Gamma and a positive integer n , how can one enumerate all the models of \Gamma over a domain of size n ? In this paper, we devise a novel algorithm to address this problem. The delay complexity, the time required between producing two consecutive models, of our algorithm is quadratic in the given domain size n (up to logarithmic factors) when the sentence is fixed. This complexity is almost optimal since the interpretation of binary predicates in any model requires at least \Omega(n^2) bits to represent.
zh

[AI-78] MoESD: Unveil Speculative Decodings Potential for Accelerating Sparse MoE

【速读】:该论文旨在解决如何有效提升Mixture of Experts (MoE)模型在推理过程中的加速问题,尤其是在中等批量大小下,传统推测解码(Speculative Decoding, SD)技术被认为仅适用于密集模型,而未被充分探索其在MoE中的潜力。论文的关键解决方案是通过理论分析构建了一个可靠的模型来量化SD中的权衡,并引入“目标效率”这一新指标,以更全面地评估SD加速效果,从而帮助识别系统瓶颈并优化MoE推理性能。实验结果表明,在中等批量大小下,该方法可实现高达2.29倍的加速效果。

链接: https://arxiv.org/abs/2505.19645
作者: Zongle Huang,Lei Zhu,Zongyuan Zhan,Ting Hu,Weikai Mao,Xianzhi Yu,Yongpan Liu,Tianyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser – the prevailing trend in MoE designs – the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric ‘target efficiency’ that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.
zh

[AI-79] STOPA: A Database of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution INTERSPEECH2025

【速读】:该论文试图解决深度伪造语音检测中的源追踪问题(source tracing),即确定合成语音的来源。现有研究受限于缺乏专门且系统化整理的数据集,导致进展缓慢。论文提出的解决方案是引入STOPA,一个系统化变化且元数据丰富的数据集,涵盖8种声学模型(AM)、6种声码器模型(VM)以及700k样本中来自13个不同合成器的多样化参数设置。STOPA的关键在于其系统化的控制框架,能够覆盖更广泛的生成因素,如声学模型、声码器模型或预训练权重的选择,从而提高溯源的可靠性与准确性。

链接: https://arxiv.org/abs/2505.19644
作者: Anton Firc,Manasi Chibber,Jagabandhu Mishra,Vishwanath Pratap Singh,Tomi Kinnunen,Kamil Malinka
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025 conference

点击查看摘要

Abstract:A key research area in deepfake speech detection is source tracing - determining the origin of synthesised utterances. The approaches may involve identifying the acoustic model (AM), vocoder model (VM), or other generation-specific parameters. However, progress is limited by the lack of a dedicated, systematically curated dataset. To address this, we introduce STOPA, a systematically varied and metadata-rich dataset for deepfake speech source tracing, covering 8 AMs, 6 VMs, and diverse parameter settings across 700k samples from 13 distinct synthesisers. Unlike existing datasets, which often feature limited variation or sparse metadata, STOPA provides a systematically controlled framework covering a broader range of generative factors, such as the choice of the vocoder model, acoustic model, or pretrained weights, ensuring higher attribution reliability. This control improves attribution accuracy, aiding forensic analysis, deepfake detection, and generative model transparency.
zh

[AI-80] Search-Based Software Engineering in the Landscape of AI Foundation Models

【速读】:该论文试图解决Search-based software engineering (SBSE)在基础模型(Foundation Models, FMs)时代的发展路径与挑战问题,其核心在于探索SBSE与FMs之间的协同潜力,并提出一个研究路线图以指导未来的研究方向。解决方案的关键在于分析当前SBSE与FMs的关联现状,识别开放性挑战,并明确通过SBSE与FMs相互作用推动软件工程发展的潜在研究领域。

链接: https://arxiv.org/abs/2505.19625
作者: Hassan Sartaj,Shaukat Ali
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Search-based software engineering (SBSE), at the intersection of artificial intelligence (AI) and software engineering, has been an active area of research for about 25 years. It has been applied to solve numerous problems across the entire software engineering lifecycle and has demonstrated its versatility in multiple domains. With the recent advancements in AI, particularly the emergence of foundation models (FMs), the evolution of SBSE alongside FMs remains undetermined. In this window of opportunity, we propose a research roadmap that articulates the current landscape of SBSE in relation to foundation models (FMs), highlights open challenges, and outlines potential research directions for advancing SBSE through its interplay with FMs. This roadmap aims to establish a forward-thinking and innovative perspective for the future of SBSE in the era of FMs.
zh

[AI-81] Agent RecBench: Benchmarking LLM Agent -based Personalized Recommender Systems

【速读】:该论文旨在解决当前基于大型语言模型(Large Language Models, LLMs)的自主推荐系统缺乏标准化评估协议的问题。其关键解决方案包括:构建一个包含丰富用户和物品元数据的交互式文本推荐模拟器,设计一个统一的模块化框架以支持自主推荐系统的开发与研究,并建立首个全面的基准测试,对比10种经典与自主推荐方法。通过这些措施,论文不仅验证了自主推荐系统的优越性,还为其实现提供了可操作的设计指南。

链接: https://arxiv.org/abs/2505.19623
作者: Yu Shang,Peijie Liu,Yuwei Yan,Zijing Wu,Leheng Sheng,Yuanqing Yu,Chumeng Jiang,An Zhang,Fengli Xu,Yu Wang,Min Zhang,Yong Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs’ advanced reasoning and role-playing capabilities to enable autonomous, adaptive decision-making. Unlike traditional recommendation approaches, agentic recommender systems can dynamically gather and interpret user-item interactions from complex environments, generating robust recommendation strategies that generalize across diverse scenarios. However, the field currently lacks standardized evaluation protocols to systematically assess these methods. To address this critical gap, we propose: (1) an interactive textual recommendation simulator incorporating rich user and item metadata and three typical evaluation scenarios (classic, evolving-interest, and cold-start recommendation tasks); (2) a unified modular framework for developing and studying agentic recommender systems; and (3) the first comprehensive benchmark comparing 10 classical and agentic recommendation methods. Our findings demonstrate the superiority of agentic systems and establish actionable design guidelines for their core components. The benchmark environment has been rigorously validated through an open challenge and remains publicly available with a continuously maintained leaderboard~\footnote[2]this https URL, fostering ongoing community engagement and reproducible research. The benchmark is available at: \hyperlinkthis https URLthis https URL.
zh

[AI-82] Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs

【速读】:该论文旨在解决时空预测任务中模型表达能力与计算效率难以平衡的问题,尤其是在处理大规模真实数据集时。其解决方案的关键在于提出STH-SepNet框架,通过将时间与空间建模解耦,利用轻量级大语言模型捕捉低秩时间动态,并通过自适应超图神经网络动态构建超边以建模复杂的高阶空间交互,同时结合精心设计的门控机制融合时空表征,从而在提升预测精度的同时保持计算效率。

链接: https://arxiv.org/abs/2505.19620
作者: Jiawen Chen,Qi Shao,Duxin Chen,Wenwu Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatio-temporal prediction is a pivotal task with broad applications in traffic management, climate monitoring, energy scheduling, etc. However, existing methodologies often struggle to balance model expressiveness and computational efficiency, especially when scaling to large real-world datasets. To tackle these challenges, we propose STH-SepNet (Spatio-Temporal Hypergraph Separation Networks), a novel framework that decouples temporal and spatial modeling to enhance both efficiency and precision. Therein, the temporal dimension is modeled using lightweight large language models, which effectively capture low-rank temporal dynamics. Concurrently, the spatial dimension is addressed through an adaptive hypergraph neural network, which dynamically constructs hyperedges to model intricate, higher-order interactions. A carefully designed gating mechanism is integrated to seamlessly fuse temporal and spatial representations. By leveraging the fundamental principles of low-rank temporal dynamics and spatial interactions, STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. Extensive experiments on large-scale real-world datasets across multiple benchmarks demonstrate the effectiveness of STH-SepNet in boosting predictive performance while maintaining computational efficiency. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands and while enhancing predictive performance. Our code is avaliable at this https URL.
zh

[AI-83] Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

【速读】:该论文旨在解决长上下文监督微调(Long-SFT)中因数据集包含长短序列导致的训练效率低下问题,这一异构序列长度分布使得现有训练系统难以同时高效处理长序列和短序列,从而影响整体系统性能。解决方案的关键在于提出Skrull,一种专为高效Long-SFT设计的动态数据调度器,通过动态平衡长短序列的计算需求,提升整体训练效率,并将其调度过程建模为联合优化问题,采用轻量级算法实现近零成本的在线调度。

链接: https://arxiv.org/abs/2505.19609
作者: Hongtao Xu,Wenting Shen,Yuanxin Wei,Ang Wang,Guo Runfan,Tianxing Wang,Yong Li,Mingzhen Li,Weile Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient long-SFT. Through dynamic data scheduling, Skrull balances the computation requirements of long and short sequences, improving overall training efficiency. Furthermore, we formulate the scheduling process as a joint optimization problem and thoroughly analyze the trade-offs involved. Based on those analysis, Skrull employs a lightweight scheduling algorithm to achieve near-zero cost online scheduling in Long-SFT. Finally, we implement Skrull upon DeepSpeed, a state-of-the-art distributed training system for LLMs. Experimental results demonstrate that Skrull outperforms DeepSpeed by 3.76x on average (up to 7.54x) in real-world long-SFT scenarios.
zh

[AI-84] Energy-based Preference Optimization for Test-time Adaptation

【速读】:该论文旨在解决测试时适应(Test-Time Adaptation, TTA)中模型对目标分布变化的鲁棒性不足问题,特别是在缺乏标签信息的情况下,现有方法依赖不确定预测导致性能不可靠。其解决方案的关键在于提出一种基于能量的偏好优化方法(Energy-based Preference Optimization for Test-time Adaptation, EPOTTA),该方法通过参数化预训练模型和残差能量函数,在无需采样的情况下实现目标数据的边缘似然最大化,并利用其与直接偏好优化(DPO)目标的数学等价性,直接适应目标分布而无需显式训练残差部分,从而在保证性能的同时提升计算效率。

链接: https://arxiv.org/abs/2505.19607
作者: Yewon Han,Seoyun Yang,Taesup Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Test-Time Adaptation (TTA) enhances model robustness by enabling adaptation to target distributions that differ from training distributions, improving real-world generalizability. Existing TTA approaches focus on adjusting the conditional distribution; however these methods often depend on uncertain predictions in the absence of label information, leading to unreliable performance. Energy-based frameworks suggest a promising alternative to address distribution shifts without relying on uncertain predictions, instead computing the marginal distribution of target data. However, they involve the critical challenge of requiring extensive SGLD sampling, which is impractical for test-time scenarios requiring immediate adaptation. In this work, we propose Energy-based Preference Optimization for Test-time Adaptation (EPOTTA), which is based on a sampling free strategy. We first parameterize the target model using a pretrained model and residual energy function, enabling marginal likelihood maximization of target data without sampling. Building on the observation that the parameterization is mathematically equivalent to DPO objective, we then directly adapt the model to a target distribution without explicitly training the residual. Our experiments verify that EPOTTA is well-calibrated and performant while achieving computational efficiency.
zh

[AI-85] LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval

【速读】:该论文试图解决密集检索器在处理包含逻辑连接词的查询时表现不佳的问题(logical connectives),这类查询在下游应用中虽然常被忽视但具有重要性。现有密集检索器无法有效遵循查询中隐含的逻辑约束,导致检索结果不满足逻辑关系。解决方案的关键在于提出LogiCoL,一种基于逻辑信息的对比学习目标,通过在学习目标中使用t-范数表达的两组软约束,使模型能够学习到查询结果之间的子集和互斥集合关系,从而提升检索性能和结果的逻辑一致性。

链接: https://arxiv.org/abs/2505.19588
作者: Yanzhen Shen,Sihao Chen,Xueqiang Xu,Yunyi Zhang,Chaitanya Malaviya,Dan Roth
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While significant progress has been made with dual- and bi-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is often overlooked yet important in downstream applications. Current dense retrievers struggle with such queries, such that the retrieved results do not respect the logical constraints implied in the queries. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning, and learns dense retrievers to respect the subset and mutually-exclusive set relation between query results via two sets of soft constraints expressed via t-norm in the learning objective. We evaluate the effectiveness of LogiCoL on the task of entity retrieval, where the model is expected to retrieve a set of entities in Wikipedia that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL yield improvement both in terms of retrieval performance and logical consistency in the results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is most effective.
zh

[AI-86] Situationally-Aware Dynamics Learning

【速读】:该论文旨在解决自主机器人在复杂、非结构化环境中由于潜在未观测因素导致的内部状态和外部世界理解模糊的问题,这些问题会阻碍机器人做出最优或正确的行为。其解决方案的关键在于提出一种在线学习隐藏状态表示的新框架,该框架通过广义隐藏参数马尔可夫决策过程(Generalized Hidden Parameter Markov Decision Process)显式建模未观测参数对转移动态和奖励结构的影响,并在线学习状态转移的联合分布,以表达潜在的自我和环境因素,从而提升机器人的适应性和安全性。

链接: https://arxiv.org/abs/2505.19574
作者: Alejandro Murillo-Gonzalez,Lantao Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Autonomous robots operating in complex, unstructured environments face significant challenges due to latent, unobserved factors that obscure their understanding of both their internal state and the external world. Addressing this challenge would enable robots to develop a more profound grasp of their operational context. To tackle this, we propose a novel framework for online learning of hidden state representations, with which the robots can adapt in real-time to uncertain and dynamic conditions that would otherwise be ambiguous and result in suboptimal or erroneous behaviors. Our approach is formalized as a Generalized Hidden Parameter Markov Decision Process, which explicitly models the influence of unobserved parameters on both transition dynamics and reward structures. Our core innovation lies in learning online the joint distribution of state transitions, which serves as an expressive representation of latent ego- and environmental-factors. This probabilistic approach supports the identification and adaptation to different operational situations, improving robustness and safety. Through a multivariate extension of Bayesian Online Changepoint Detection, our method segments changes in the underlying data generating process governing the robot’s dynamics. The robot’s transition model is then informed with a symbolic representation of the current situation derived from the joint distribution of latest state transitions, enabling adaptive and context-aware decision-making. To showcase the real-world effectiveness, we validate our approach in the challenging task of unstructured terrain navigation, where unmodeled and unmeasured terrain characteristics can significantly impact the robot’s motion. Extensive experiments in both simulation and real world reveal significant improvements in data efficiency, policy performance, and the emergence of safer, adaptive navigation strategies.
zh

[AI-87] MSD-LLM : Predicting Ship Detention in Port State Control Inspections with Large Language Model

【速读】:该论文旨在解决船舶滞留预测中的准确性不足问题,传统机器学习方法因表示学习能力有限而表现不佳,而基于自编码器的深度学习方法则受限于历史PSC滞留记录的数据严重不平衡。其解决方案的关键在于提出 Maritime Ship Detention with Large Language Models (MSD-LLM),该方法结合了基于双鲁棒子空间恢复(DSR)层的自编码器与渐进式学习流程,以处理数据不平衡并提取有意义的PSC表示,随后利用大语言模型对特征进行分组和排序,识别可能的滞留案例,并实现动态阈值调整以提升预测灵活性。

链接: https://arxiv.org/abs/2505.19568
作者: Jiongchao Jin,Xiuju Fu,Xiaowei Gao,Tao Cheng,Ran Yan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Maritime transportation is the backbone of global trade, making ship inspection essential for ensuring maritime safety and environmental protection. Port State Control (PSC), conducted by national ports, enforces compliance with safety regulations, with ship detention being the most severe consequence, impacting both ship schedules and company reputations. Traditional machine learning methods for ship detention prediction are limited by the capacity of representation learning and thus suffer from low accuracy. Meanwhile, autoencoder-based deep learning approaches face challenges due to the severe data imbalance in learning historical PSC detention records. To address these limitations, we propose Maritime Ship Detention with Large Language Models (MSD-LLM), integrating a dual robust subspace recovery (DSR) layer-based autoencoder with a progressive learning pipeline to handle imbalanced data and extract meaningful PSC representations. Then, a large language model groups and ranks features to identify likely detention cases, enabling dynamic thresholding for flexible detention predictions. Extensive evaluations on 31,707 PSC inspection records from the Asia-Pacific region show that MSD-LLM outperforms state-of-the-art methods more than 12% on Area Under the Curve (AUC) for Singapore ports. Additionally, it demonstrates robustness to real-world challenges, making it adaptable to diverse maritime risk assessment scenarios.
zh

[AI-88] LLM -Agent -Controller: A Universal Multi-Agent Large Language Model System as a Control Engineer

【速读】:该论文旨在解决控制工程(Control Theory)中广泛存在的复杂问题,通过构建一个基于多智能体大型语言模型(LLM)的系统——LLM-Agent-Controller,实现对控制器设计、模型表示、控制分析、时域响应及仿真等任务的自动化处理。该系统的关键在于集成一个中央控制器代理与多个专业辅助代理,并由监督代理负责高层次决策与流程协调,同时结合检索增强生成(RAG)、思维链推理、自我批评与修正、高效记忆管理及自然语言交互等先进技术,从而提升系统的可靠性、效率与易用性。

链接: https://arxiv.org/abs/2505.19567
作者: Rasoul Zahedifar,Sayyed Ali Mirghasemi,Mahdieh Soleymani Baghshah,Alireza Taheri
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This study presents the LLM-Agent-Controller, a multi-agent large language model (LLM) system developed to address a wide range of problems in control engineering (Control Theory). The system integrates a central controller agent with multiple specialized auxiliary agents, responsible for tasks such as controller design, model representation, control analysis, time-domain response, and simulation. A supervisor oversees high-level decision-making and workflow coordination, enhancing the system’s reliability and efficiency. The LLM-Agent-Controller incorporates advanced capabilities, including Retrieval-Augmented Generation (RAG), Chain-of-Thought reasoning, self-criticism and correction, efficient memory handling, and user-friendly natural language communication. It is designed to function without requiring users to have prior knowledge of Control Theory, enabling them to input problems in plain language and receive complete, real-time solutions. To evaluate the system, we propose new performance metrics assessing both individual agents and the system as a whole. We test five categories of Control Theory problems and benchmark performance across three advanced LLMs. Additionally, we conduct a comprehensive qualitative conversational analysis covering all key services. Results show that the LLM-Agent-Controller successfully solved 83% of general tasks, with individual agents achieving an average success rate of 87%. Performance improved with more advanced LLMs. This research demonstrates the potential of multi-agent LLM architectures to solve complex, domain-specific problems. By integrating specialized agents, supervisory control, and advanced reasoning, the LLM-Agent-Controller offers a scalable, robust, and accessible solution framework that can be extended to various technical domains.
zh

[AI-89] AMQA: An Adversarial Dataset for Benchmarking Bias of LLM s in Medicine and Healthcare

【速读】:该论文试图解决医疗问答(Medical QA)中大型语言模型(LLMs)存在的偏见问题,尤其是与种族、性别和社会经济地位相关的偏见,这些问题可能对生命造成关键性风险。解决方案的关键是提出AMQA——一个对抗性医疗问答数据集,用于自动化和大规模评估LLMs在医疗QA中的偏见。AMQA包含从美国医学执照考试(USMLE)数据集中提取的4,806对医疗问答对,通过多智能体框架生成多样化的对抗性描述和问题对,从而有效揭示模型在不同群体间的性能差异。

链接: https://arxiv.org/abs/2505.19562
作者: Ying Xiao,Jie Huang,Ruijuan He,Jing Xiao,Mohammad Reza Mousavi,Yepang Liu,Kezhi Li,Zhenpeng Chen,Jie M. Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their mistakes and the biases behind them pose life-critical risks. Bias linked to race, sex, and socioeconomic status is already well known, but a consistent and automatic testbed for measuring it is missing. To fill this gap, this paper presents AMQA – an Adversarial Medical Question-Answering dataset – built for automated, large-scale bias evaluation of LLMs in medical QA. AMQA includes 4,806 medical QA pairs sourced from the United States Medical Licensing Examination (USMLE) dataset, generated using a multi-agent framework to create diverse adversarial descriptions and question pairs. Using AMQA, we benchmark five representative LLMs and find surprisingly substantial disparities: even GPT-4.1, the least biased model tested, answers privileged-group questions over 10 percentage points more accurately than unprivileged ones. Compared with the existing benchmark CPV, AMQA reveals 15% larger accuracy gaps on average between privileged and unprivileged groups. Our dataset and code are publicly available at this https URL to support reproducible research and advance trustworthy, bias-aware medical AI.
zh

[AI-90] uring Test 2.0: The General Intelligence Threshold

【速读】:该论文试图解决如何检测人工智能(Artificial Intelligence, AI)模型是否达到或超越人工通用智能(Artificial General Intelligence, AGI)的问题,尤其是在现有方法如图灵测试(Turing Test)无法提供明确判断的情况下。论文的关键解决方案在于提出了一种新的、实用的检测方法——Turing Tests 2.0,其核心包括两个贡献:首先,定义了通用智能(General Intelligence, GI)并设定了GI阈值(GI Threshold, GIT),用于区分是否达到AGI;其次,构建了一个能够以简洁、全面且明确方式判断系统是否具备GI的测试框架。

链接: https://arxiv.org/abs/2505.19550
作者: Georgios Mappouras
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rise of artificial intelligence (A.I.) and large language models like Chat-GPT, a new race for achieving artificial general intelligence (A.G.I) has started. While many speculate how and when A.I. will achieve A.G.I., there is no clear agreement on how A.G.I. can be detected in A.I. models, even when popular tools like the Turing test (and its modern variations) are used to measure their intelligence. In this work, we discuss why traditional methods like the Turing test do not suffice for measuring or detecting A.G.I. and provide a new, practical method that can be used to decide if a (computer or any other) system has reached or surpassed A.G.I. To achieve this, we make two new contributions. First, we present a clear definition for general intelligence (G.I.) and set a G.I. threshold (G.I.T.) that can be used to distinguish between systems that achieve A.G.I. and systems that do not. Second, we present a new framework on how to construct tests that can detect if a system has achieved G.I. in a simple, comprehensive, and clear-cut fail/pass way. We call this novel framework the Turing Tests 2.0. We then demonstrate real-life examples of applying tests that follow our Turing Tests 2.0 framework on modern A.I. models.
zh

[AI-91] STRAP: Spatio-Temporal Pattern Retrieval for Out-of-Distribution Generalization

【速读】:该论文试图解决Spatio-Temporal Graph Neural Networks (STGNNs)在Spatio-Temporal Out-of-Distribution (STOOD)场景下泛化能力不足的问题,即当时间动态和空间结构超出训练分布时,模型性能显著下降。解决方案的关键在于提出一种名为STRAP的框架,其核心是一个包含丰富历史、结构和语义信息的紧凑且表达能力强的模式库,通过检索增强学习机制,在推理阶段将相关模式注入模型,从而提升时空表征并缓解灾难性遗忘,同时引入知识平衡目标以协调新信息与检索到的知识。

链接: https://arxiv.org/abs/2505.19547
作者: Haoyu Zhang,Wentao Zhang,Hao Miao,Xinke Jiang,Yuchen Fang,Yifan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spatio-Temporal Graph Neural Networks (STGNNs) have emerged as a powerful tool for modeling dynamic graph-structured data across diverse domains. However, they often fail to generalize in Spatio-Temporal Out-of-Distribution (STOOD) scenarios, where both temporal dynamics and spatial structures evolve beyond the training distribution. To address this problem, we propose an innovative Spatio-Temporal Retrieval-Augmented Pattern Learning framework,STRAP, which enhances model generalization by integrating retrieval-augmented learning into the STGNN continue learning pipeline. The core of STRAP is a compact and expressive pattern library that stores representative spatio-temporal patterns enriched with historical, structural, and semantic information, which is obtained and optimized during the training phase. During inference, STRAP retrieves relevant patterns from this library based on similarity to the current input and injects them into the model via a plug-and-play prompting mechanism. This not only strengthens spatio-temporal representations but also mitigates catastrophic forgetting. Moreover, STRAP introduces a knowledge-balancing objective to harmonize new information with retrieved knowledge. Extensive experiments across multiple real-world streaming graph datasets show that STRAP consistently outperforms state-of-the-art STGNN baselines on STOOD tasks, demonstrating its robustness, adaptability, and strong generalization capability without task-specific fine-tuning.
zh

[AI-92] raining-Free Multi-Step Audio Source Separation

【速读】:该论文试图解决音频源分离(Audio Source Separation)中传统单步推理方法未能充分利用模型分离能力的问题。其解决方案的关键在于提出一种无需额外训练的多步分离推理方法,通过在每一步迭代中最优地融合输入混合信号与前一步的分离结果,从而逐步提升分离性能。该方法通过最大化某种度量来确定最佳融合比例,并理论证明了其优于单步推理的性能,同时将其与噪声到干净分布的线性插值路径上的去噪过程联系起来,揭示了其与去噪扩散桥模型的潜在关联。

链接: https://arxiv.org/abs/2505.19534
作者: Yongyi Zang,Jingyi Li,Qiuqiang Kong
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Audio source separation aims to separate a mixture into target sources. Previous audio source separation systems usually conduct one-step inference, which does not fully explore the separation ability of models. In this work, we reveal that pretrained one-step audio source separation models can be leveraged for multi-step separation without additional training. We propose a simple yet effective inference method that iteratively applies separation by optimally blending the input mixture with the previous step’s separation result. At each step, we determine the optimal blending ratio by maximizing a metric. We prove that our method always yield improvement over one-step inference, provide error bounds based on model smoothness and metric robustness, and provide theoretical analysis connecting our method to denoising along linear interpolation paths between noise and clean distributions, a property we link to denoising diffusion bridge models. Our approach effectively delivers improved separation performance as a “free lunch” from existing models. Our empirical results demonstrate that our multi-step separation approach consistently outperforms one-step inference across both speech enhancement and music source separation tasks, and can achieve scaling performance similar to training a larger model, using more data, or in some cases employing a multi-step training objective. These improvements appear not only on the optimization metric during multi-step inference, but also extend to nearly all non-optimized metrics (with one exception). We also discuss limitations of our approach and directions for future research.
zh

[AI-93] Minimalist Softmax Attention Provably Learns Constrained Boolean Functions

【速读】:该论文试图解决在单头softmax注意力机制下学习k位布尔函数(如AND、OR及其噪声变体)的计算极限问题。研究发现,仅依靠单头softmax注意力机制无法解决这些简单的布尔函数,但通过引入教师强制(teacher forcing),相同的极简注意力机制却能够成功解决这些问题。解决方案的关键在于利用监督信号进行单次梯度下降更新,从而替代了传统多步骤的Chain-of-Thought(CoT)推理方法,揭示了理想监督下最小化架构与标准训练下理论不可行性之间的根本差距。

链接: https://arxiv.org/abs/2505.19531
作者: Jerry Yao-Chieh Hu,Xiwen Zhang,Maojiang Su,Zhao Song,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study the computational limits of learning k -bit Boolean functions (specifically, \mathrmAND , \mathrmOR , and their noisy variants), using a minimalist single-head softmax-attention mechanism, where k=\Theta(d) relevant bits are selected from d inputs. We show that these simple \mathrmAND and \mathrmOR functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
zh

[AI-94] Navigating loss manifolds via rigid body dynamics: A promising avenue for robustness and generalisation

【速读】:该论文试图解决在基于梯度的优化过程中,大型神经网络训练时面临的高维损失函数景观中病态几何导致的不良训练动态问题,特别是由于收敛到对输入扰动高度敏感的尖锐极小值而引起的泛化能力差的问题。论文提出的解决方案的关键在于设计一种替代优化器,该优化器通过模拟一个球体在损失景观上的运动来减少对损失景观精细结构的依赖,并避免尖锐极小值,从而提升模型的泛化能力。该优化器与标准梯度下降的偏离程度由一个超参数控制,该参数代表球体的半径,通过调整该超参数可以以不同尺度探测损失景观,进而理解其几何特性。

链接: https://arxiv.org/abs/2505.19527
作者: Mohammed D. Belgoumri,Mohamed Reda Bouadjenek,Hakim Hacid,Imran Razzak,Sunil Aryal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Training large neural networks through gradient-based optimization requires navigating high-dimensional loss landscapes, which often exhibit pathological geometry, leading to undesirable training dynamics. In particular, poor generalization frequently results from convergence to sharp minima that are highly sensitive to input perturbations, causing the model to overfit the training data while failing to generalize to unseen examples. Furthermore, these optimization procedures typically display strong dependence on the fine structure of the loss landscape, leading to unstable training dynamics, due to the fractal-like nature of the loss surface. In this work, we propose an alternative optimizer that simultaneously reduces this dependence, and avoids sharp minima, thereby improving generalization. This is achieved by simulating the motion of the center of a ball rolling on the loss landscape. The degree to which our optimizer departs from the standard gradient descent is controlled by a hyperparameter, representing the radius of the ball. Changing this hyperparameter allows for probing the loss landscape at different scales, making it a valuable tool for understanding its geometry.
zh

[AI-95] Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate

【速读】:该论文旨在解决多模态学习中缺失模态(missing modality)问题,这一问题在现实场景中由于系统性采集错误或传感器故障常导致数据不完整,从而影响模型性能和泛化能力。其解决方案的关键在于提出Conf-SMoE架构,该架构引入了两阶段插补模块以处理缺失模态,并通过理论分析揭示了专家坍缩(expert collapse)的机制。进一步地,Conf-SMoE设计了一种新的专家门控机制,通过将softmax路由分数解耦为与真实标签相关的任务置信度分数,从而自然缓解专家坍缩问题,而无需引入额外的负载平衡损失函数。

链接: https://arxiv.org/abs/2505.19525
作者: Liangwei Nathan Zheng,Wei Emma Zhang,Mingyu Guo,Miao Xu,Olaf Maennel,Weitong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effectively managing missing modalities is a fundamental challenge in real-world multimodal learning scenarios, where data incompleteness often results from systematic collection errors or sensor failures. Sparse Mixture-of-Experts (SMoE) architectures have the potential to naturally handle multimodal data, with individual experts specializing in different modalities. However, existing SMoE approach often lacks proper ability to handle missing modality, leading to performance degradation and poor generalization in real-world applications. We propose Conf-SMoE to introduce a two-stage imputation module to handle the missing modality problem for the SMoE architecture and reveal the insight of expert collapse from theoretical analysis with strong empirical evidence. Inspired by our theoretical analysis, Conf-SMoE propose a novel expert gating mechanism by detaching the softmax routing score to task confidence score w.r.t ground truth. This naturally relieves expert collapse without introducing additional load balance loss function. We show that the insights of expert collapse aligns with other gating mechanism such as Gaussian and Laplacian gate. We also evaluate the proposed method on four different real world dataset with three different experiment settings to conduct comprehensive the analysis of Conf-SMoE on modality fusion and resistance to missing modality.
zh

[AI-96] Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在面对多模态知识冲突时的可靠性问题,特别是在检索增强生成(Retrieval-Augmented Generation, RAG)框架下,外部上下文信息与模型内部参数化知识之间可能产生的矛盾。现有基准测试未能充分反映此类现实中的冲突场景,主要集中在内部记忆冲突,而对上下文-记忆冲突和跨上下文冲突的研究不足。为此,论文提出了MMKC-Bench,一个用于评估上下文-记忆和跨上下文场景中事实性知识冲突的基准,其关键在于构建涵盖多种多模态知识冲突类型的数据集,并通过自动化流程与人工验证确保数据质量,从而推动LMMs在多模态知识冲突检测与处理方面的能力提升。

链接: https://arxiv.org/abs/2505.19509
作者: Yifan Jia,Kailin Jiang,Yuyang Liang,Qihan Ren,Yi Xin,Rui Yang,Fenze Feng,Mingcai Chen,Hengyang Lu,Haozhe Wang,Xiaoye Qu,Dongrui Liu,Lizhen Cui,Yuntao Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The source code is available at this https URL

点击查看摘要

Abstract:Large Multimodal Models(LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation(RAG) frameworks where the contextual information from external sources may contradict the model’s internal parametric knowledge, leading to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely investigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at this https URL.
zh

[AI-97] Hierarchical Tree Search-based User Lifelong Behavior Modeling on Large Language Model

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在推荐系统中难以有效理解和提取用户长序列行为中的兴趣信息的问题。现有方法在处理长期行为序列、有效提取兴趣以及在实际场景中应用兴趣方面存在局限性。解决方案的关键在于提出一种基于分层树搜索的用户终身行为建模框架(HiT-LBM),其核心包括分块用户行为提取(CUBE)和分层树搜索兴趣(HTS),通过分块处理和层级扩展生成候选兴趣,并结合时间感知兴趣融合(TIF)构建用户终身兴趣的全面表征,从而提升推荐系统的性能。

链接: https://arxiv.org/abs/2505.19505
作者: Yu Xia,Rui Zhong,Hao Gu,Wei Yang,Chi Lu,Peng Jiang,Kun Gai
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have garnered significant attention in Recommendation Systems (RS) due to their extensive world knowledge and robust reasoning capabilities. However, a critical challenge lies in enabling LLMs to effectively comprehend and extract insights from massive user behaviors. Current approaches that directly leverage LLMs for user interest learning face limitations in handling long sequential behaviors, effectively extracting interest, and applying interest in practical scenarios. To address these issues, we propose a Hierarchical Tree Search-based User Lifelong Behavior Modeling framework (HiT-LBM). HiT-LBM integrates Chunked User Behavior Extraction (CUBE) and Hierarchical Tree Search for Interest (HTS) to capture diverse interests and interest evolution of user. CUBE divides user lifelong behaviors into multiple chunks and learns the interest and interest evolution within each chunk in a cascading manner. HTS generates candidate interests through hierarchical expansion and searches for the optimal interest with process rating model to ensure information gain for each behavior chunk. Additionally, we design Temporal-Ware Interest Fusion (TIF) to integrate interests from multiple behavior chunks, constructing a comprehensive representation of user lifelong interests. The representation can be embedded into any recommendation model to enhance performance. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods.
zh

[AI-98] CODE-DITING: A Reasoning -Based Metric for Functional Alignment in Code Evaluation

【速读】:该论文旨在解决代码片段可信评估方法在灵活性和可扩展性上的局限性,传统方法依赖参考解或可执行测试用例,而新兴的LLM-as-Judge方法虽能直接评估问题描述与生成代码的功能一致性,但存在性能与解释性的权衡。论文提出的解决方案关键在于CODE-DITING,该方法通过数据蒸馏框架将DeepSeek-R1671B的推理能力迁移至小型模型,显著提升了评估的解释性并降低了计算成本,同时结合多数投票策略,在参数规模较小的情况下实现了优于大型模型的性能。

链接: https://arxiv.org/abs/2505.19502
作者: Guang Yang,Yu Zhou,Xiang Chen,Wei Zheng,Xing Hu,Xin Zhou,David Lo,Taolue Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitation in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: the methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while the methods based on reasoning foundation models provide better explainability with simpler prompts but demand substantial computational resources due to their large parameter sizes. To address these limitations, we propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1671B to our CODE-DITING 1.5B and 7B models, significantly enhancing evaluation explainability and reducing the computational cost. With the majority vote strategy in the inference process, CODE-DITING 1.5B outperforms all models with the same magnitude of parameters and achieves performance which would normally exhibit in a model with 5 times of parameter scale. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B, even though it only uses 1% of the parameter volume of these large models. Further experiments show that CODEDITING is robust to preference leakage and can serve as a promising alternative for code evaluation.
zh

[AI-99] Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions

【速读】:该论文试图解决如何将科学讨论中的知识有效地转化为可用于训练大语言模型(LLM)的结构化数据问题,以提升其在科学领域的推理能力。解决方案的关键在于构建了一个自动化的流水线(pipeline),该流水线能够将原始的科学论坛互动内容转换为适用于强化学习的多项选择题格式,并提供了超过3000个高质量的问答对,涵盖基础生物学、实验故障排除、工具使用等多个方面。这是首个端到端的流水线,旨在使LLM从科学讨论中进行推理,并具有跨科学领域推广的潜力。

链接: https://arxiv.org/abs/2505.19501
作者: Ming Yin,Yuanhao Qu,Dyllan Liu,Ling Yang,Le Cong,Mengdi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this short report, we present an automated pipeline tailored for the genomics domain and introduce \textitGenome-Bench, a new benchmark constructed from over a decade of scientific forum discussions on genome engineering. Our pipeline transforms raw interactions into a reinforcement learning friendly multiple-choice questions format, supported by 3000+ high quality question answer pairs spanning foundational biology, experimental troubleshooting, tool usage, and beyond. To our knowledge, this is the first end-to-end pipeline for teaching LLMs to reason from scientific discussions, with promising potential for generalization across scientific domains beyond biology.
zh

[AI-100] Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models ACL2025

【速读】:该论文旨在解决工业设计自动化中复杂计算机辅助设计(CAD)模型设计耗时的问题,特别是针对计算效率低下和生成精确模型困难的挑战。其解决方案的关键在于提出一种基于语言引导的框架,将大型语言模型(LLMs)与计算机自动化设计(CAutoD)相结合,通过三个核心创新实现CAD模型的自动生成功能:一是利用LLMs和视觉-语言大模型(VLLMs)构建半自动化数据标注流程以生成高质量参数和外观描述;二是采用基于Transformer的CAD生成器(TCADGen),通过双通道特征聚合预测建模序列;三是设计了一个增强型CAD建模生成模型CADLLM,通过整合TCADGen的置信度分数对生成序列进行优化。

链接: https://arxiv.org/abs/2505.19490
作者: Jianxing Liao,Junyan Xu,Yatao Sun,Maowen Tang,Sicheng He,Jingxian Liao,Shui Yu,Yun Li,Hongguan Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Main Conference

点击查看摘要

Abstract:Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD).Through this framework, CAD models are automatically generated from parameters and appearance descriptions, supporting the automation of design tasks during the detailed CAD design phase. Our approach introduces three key innovations: (1) a semi-automated data annotation pipeline that leverages LLMs and vision-language large models (VLLMs) to generate high-quality parameters and appearance descriptions; (2) a Transformer-based CAD generator (TCADGen) that predicts modeling sequences via dual-channel feature aggregation; (3) an enhanced CAD modeling generation model, called CADLLM, that is designed to refine the generated sequences by incorporating the confidence scores from TCADGen. Experimental results demonstrate that the proposed approach outperforms traditional methods in both accuracy and efficiency, providing a powerful tool for automating industrial workflows and generating complex CAD models from textual prompts. The code is available at this https URL
zh

[AI-101] Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

【速读】:该论文试图解决在Linux内核中进行故障定位(Fault Localization, FL)的挑战,尤其是在大规模代码库、有限可观测性和多样的影响因素下,现有大型语言模型(Large Language Model, LLM)代理的性能不足问题。其解决方案的关键在于提出LinuxFL^+,一个旨在提升LLM代理在Linux内核FL任务中有效性的增强框架,该框架通过最小的成本显著提高了所有研究代理的FL准确率(例如,准确率提升了7.2%至11.2%)。

链接: https://arxiv.org/abs/2505.19489
作者: Zhenhao Zhou,Zhuochen Huang,Yike He,Chong Wang,Jiajun Wang,Yijian Wu,Xin Peng,Yiling Lou
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL ^+ , an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL ^+ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs. Data and code are available at this https URL.
zh

[AI-102] Understanding Transformer from the Perspective of Associative Memory

【速读】:该论文试图通过关联记忆(associative memory)这一经典心理学概念来理解Transformer架构,旨在揭示其记忆能力与更新机制,并探讨Transformer在表达能力和无限上下文下的潜在极限。解决方案的关键在于引入“检索信噪比”(retrieval SNR)以衡量Transformer的记忆容量,并从核视角数学上解释Softmax Attention的有效性;同时将前馈神经网络(FFN)视为一种关联记忆,为设计优化提供新见解,并提出一个统一框架来理解不同Transformer变体的知识更新过程。

链接: https://arxiv.org/abs/2505.19488
作者: Shu Zhong,Mingyu Xu,Tenglong Ao,Guang Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Consider this post less as a formal research paper and more as a blog-style sharing of our current reflections, intended to spark discussion as one might in a collaborative team meeting

点击查看摘要

Abstract:In this paper, we share our reflections and insights on understanding Transformer architectures through the lens of associative memory–a classic psychological concept inspired by human cognition. We start with the basics of associative memory (think simple linear attention) and then dive into two dimensions: Memory Capacity: How much can a Transformer really remember, and how well? We introduce retrieval SNR to measure this and use a kernel perspective to mathematically reveal why Softmax Attention is so effective. We also show how FFNs can be seen as a type of associative memory, leading to insights on their design and potential improvements. Memory Update: How do these memories learn and evolve? We present a unified framework for understanding how different Transformer variants (like DeltaNet and Softmax Attention) update their “knowledge base”. This leads us to tackle two provocative questions: 1. Are Transformers fundamentally limited in what they can express, and can we break these barriers? 2. If a Transformer had infinite context, would it become infinitely intelligent? We want to demystify Transformer architecture, offering a clearer understanding of existing designs. This exploration aims to provide fresh insights and spark new avenues for Transformer innovation. Comments: Consider this post less as a formal research paper and more as a blog-style sharing of our current reflections, intended to spark discussion as one might in a collaborative team meeting Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19488 [cs.LG] (or arXiv:2505.19488v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.19488 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Tenglong Ao [view email] [v1] Mon, 26 May 2025 04:15:38 UTC (3,556 KB)
zh

[AI-103] Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLM s

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的智能体在实时决策任务中面临的延迟与性能质量之间的权衡问题。在高频率交易和实时竞技游戏等应用场景中,严格的延迟约束对系统响应速度提出了更高要求,而现有研究对此方面的探索仍显不足。论文提出了一种自适应框架FPX,其关键在于根据实时需求动态选择模型规模和量化级别,从而在保证性能的同时优化延迟。该方法在两个新提出的基准测试HFTBench和StreetFighter上均取得了最佳表现,验证了其有效性。

链接: https://arxiv.org/abs/2505.19481
作者: Hao Kang,Qingru Zhang,Han Cai,Weiyuan Xu,Tushar Krishna,Yilun Du,Tsachy Weissman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency quality trade off, it remains underexplored in the context of LLM based agents. In this work, we present the first systematic study of this trade off in real time decision making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that optimal latency quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading, underscoring the need for latency aware evaluation and deployment strategies for LLM based agents. These results demonstrate the critical importance of latency aware evaluation and deployment strategies for real world LLM based agents. Our benchmarks are available at Latency Sensitive Benchmarks.
zh

[AI-104] Judging with Many Minds: Do More Perspectives Mean Less Prejudice?

【速读】:该论文试图解决多智能体语言模型作为评判者(LLM-as-Judge)系统中的内在偏见问题,特别是分析不同类型的偏见在多智能体框架中的表现及其影响。其解决方案的关键在于系统性地评估四种偏见类型(位置偏见、冗长偏见、思维链偏见和从众偏见),并在两种主流的多智能体LLM-as-Judge框架(多智能体辩论和LLM-as-Meta-Judge)中进行实验,同时引入一种去偏方法PINE作为无偏代理以评估其在不同框架中的有效性。

链接: https://arxiv.org/abs/2505.19477
作者: Chiyu Ma,Enpei Zhang,Yilun Zhao,Wenjun Liu,Yaning Jia,Peijun Qing,Lin Shi,Arman Cohan,Yujun Yan,Soroush Vosoughi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-as-Judge has emerged as a scalable alternative to human evaluation, enabling large language models (LLMs) to provide reward signals in trainings. While recent work has explored multi-agent extensions such as multi-agent debate and meta-judging to enhance evaluation quality, the question of how intrinsic biases manifest in these settings remains underexplored. In this study, we conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge. Our results show that debate framework amplifies biases sharply after the initial debate, and this increased bias is sustained in subsequent rounds, while meta-judge approaches exhibit greater resistance. We further investigate the incorporation of PINE, a leading single-agent debiasing method, as a bias-free agent within these systems. The results reveal that this bias-free agent effectively reduces biases in debate settings but provides less benefit in meta-judge scenarios. Our work provides a comprehensive study of bias behavior in multi-agent LLM-as-Judge systems and highlights the need for targeted bias mitigation strategies in collaborative evaluation settings.
zh

[AI-105] Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models NEURIPS2025

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉理解任务中出现的对象幻觉问题,即模型生成与输入内容不一致或完全不存在的对象描述。该问题与数据集偏差密切相关,由于物体的频繁共现导致跨模态语义表示的纠缠。为了解决这一问题,作者提出了一种基于因果驱动的解耦框架,其关键在于通过因果干预减少由偏差训练数据引起的虚假相关性。该方法包括视觉路径中的因果驱动投影器和集成在语言模型最终Transformer层中的因果干预模块,二者协同工作以降低幻觉现象的发生。

链接: https://arxiv.org/abs/2505.19474
作者: Xinmiao Hu,Chun Wang,Ruihe An,ChenYu Shao,Xiaojun Ye,Sheng Zhou,Liangcheng Li(Zhejiang University)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 19 figures, Submitted to NeurIPS 2025

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, yet they often suffer from object hallucinations–generating descriptions of objects that are inconsistent with or entirely absent from the input. This issue is closely related to dataset biases, where frequent co-occurrences of objects lead to entangled semantic representations across modalities. As a result, models may erroneously activate object representations that are commonly associated with the input but not actually present. To address this, we propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention. Our approach includes a Causal-Driven Projector in the visual pathway and a Causal Intervention Module integrated into the final transformer layer of the language model. These components work together to reduce spurious correlations caused by biased training data. Experimental results show that our method significantly reduces hallucinations while maintaining strong performance on multiple multimodal benchmarks. Visualization analyses further confirm improved separability of object representations. The code is available at: this https URL Comments: 21 pages, 19 figures, Submitted to NeurIPS 2025 Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19474 [cs.AI] (or arXiv:2505.19474v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.19474 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xinmiao Hu [view email] [v1] Mon, 26 May 2025 03:53:00 UTC (20,282 KB)
zh

[AI-106] Origin Tracer: A Method for Detecting LoRA Fine-Tuning Origins in LLM s

【速读】:该论文试图解决在大型语言模型(Large Language Models, LLMs)微调过程中出现的透明度和信任问题,特别是针对模型来源的误导性声明。现有模型验证技术在面对诸如排列和缩放变换等混淆技术时存在局限性。该论文提出的解决方案是Origin-Tracer,其关键在于通过检测模型是否从指定的基础模型进行微调,并能够提取微调过程中使用的LoRA秩,从而提供一种形式化且更稳健的验证框架。

链接: https://arxiv.org/abs/2505.19466
作者: Hongyu Liang,Yuting Zheng,Yihan Li,Yiran Zhang,Shiyu Liang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to advance, their deployment often involves fine-tuning to enhance performance on specific downstream tasks. However, this customization is sometimes accompanied by misleading claims about the origins, raising significant concerns about transparency and trust within the open-source community. Existing model verification techniques typically assess functional, representational, and weight similarities. However, these approaches often struggle against obfuscation techniques, such as permutations and scaling transformations. To address this limitation, we propose a novel detection method Origin-Tracer that rigorously determines whether a model has been fine-tuned from a specified base model. This method includes the ability to extract the LoRA rank utilized during the fine-tuning process, providing a more robust verification framework. This framework is the first to provide a formalized approach specifically aimed at pinpointing the sources of model fine-tuning. We empirically validated our method on thirty-one diverse open-source models under conditions that simulate real-world obfuscation scenarios. We empirically analyze the effectiveness of our framework and finally, discuss its limitations. The results demonstrate the effectiveness of our approach and indicate its potential to establish new benchmarks for model verification.
zh

[AI-107] Residual Cross-Attention Transformer-Based Multi-User CSI Feedback with Deep Joint Source-Channel Coding

【速读】:该论文旨在解决大规模多输入多输出系统中多用户信道状态信息(CSI)反馈的高开销与低重建精度问题。其解决方案的关键在于引入深度联合信源信道编码(DJSCC),通过利用相邻用户的CSI相关性设计多用户联合CSI反馈框架,以降低反馈开销,并提出一种新的残差交叉注意力Transformer架构部署在基站端以提升反馈性能。此外,为克服传统比特级CSI反馈方法的“悬崖效应”,该方案结合了DJSCC与两阶段训练策略,以适应不同上行噪声水平。

链接: https://arxiv.org/abs/2505.19465
作者: Hengwei Zhang,Minghui Wu,Li Qiao,Ling Liu,Ziqi Han,Zhen Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This letter proposes a deep-learning (DL)-based multi-user channel state information (CSI) feedback framework for massive multiple-input multiple-output systems, where the deep joint source-channel coding (DJSCC) is utilized to improve the CSI reconstruction accuracy. Specifically, we design a multi-user joint CSI feedback framework, whereby the CSI correlation of nearby users is utilized to reduce the feedback overhead. Under the framework, we propose a new residual cross-attention transformer architecture, which is deployed at the base station to further improve the CSI feedback performance. Moreover, to tackle the “cliff-effect” of conventional bit-level CSI feedback approaches, we integrated DJSCC into the multi-user CSI feedback, together with utilizing a two-stage training scheme to adapt to varying uplink noise levels. Experimental results demonstrate the superiority of our methods in CSI feedback performance, with low network complexity and better scalability.
zh

[AI-108] Your Classifier Can Do More: Towards Bridging the Gaps in Classification Robustness and Generation

【速读】:该论文试图解决在单一模型中同时实现高分类准确率、对抗鲁棒性和生成能力的三重权衡问题,这一目标此前尚未被充分探索。其解决方案的关键在于通过分析不同JEM变体和对抗训练模型中干净样本、对抗样本和生成样本的能量分布差异,发现对抗训练会缩小干净与对抗样本之间的能量差距,而JEM则缩小干净与生成样本之间的差距,从而提出Energy-based Joint Distribution Adversarial Training (EB-JDAT),通过最大化干净数据分布、对抗分布和分类器的联合概率来统一三者的优势,进而解决各自的固有权衡。

链接: https://arxiv.org/abs/2505.19459
作者: Kaichao Jiang,He Wang,Xiaoshuai Hao,Xiulong Yang,Ajian Liu,Qi Chu,Yunfeng Diao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Joint Energy-based Models (JEMs), a class of hybrid generative-discriminative models, are well known for their ability to achieve both high classification accuracy and generative capability within a single model. However, their robustness still lags significantly behind the classifiers based adversarial training (AT). Conversely, while AT is currently the most effective approach to improving the classifier’s robustness, it typically sacrifices accuracy on clean data and lacks generative capability. The triple trade-off between classification accuracy, generative capability and robustness, raises a natural question: Can a single model simultaneously achieve high classification accuracy, adversarial robustness, and generative performance? – a goal that has been rarely explored. To address this question, we systematically analyze the energy distribution differences of clean, adversarial, and generated samples across various JEM variants and adversarially trained models. We observe that AT tends to reduce the energy gap between clean and adversarial samples, while JEMs reduce the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might unify the strengths of AT and JEMs, resolving their inherent trade-offs. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), to jointly model the clean data distribution, the adversarial distribution, and the classifier by maximizing their joint probability. EB-JDAT is a general and flexible optimization method, compatible with various JEM variants. Extensive experimental results demonstrate that EB-JDAT not only maintains near original accuracy and generative capability of JEMs, but also significantly enhances robustness, even surpassing state-of-the-art ATs.
zh

[AI-109] Style2Code: A Style-Controllable Code Generation Framework with Dual-Modal Contrastive Representation Learning EMNLP2025

【速读】:该论文试图解决可控代码生成(controllable code generation)问题,即在保持功能性的前提下,生成符合特定风格的代码。其解决方案的关键在于提出一种两阶段训练框架,结合对比学习(contrastive learning)与条件解码(conditional decoding),首先对齐代码风格表示与语义和结构特征,其次通过条件语言模型(如Flan-T5)在学习到的风格向量基础上进行微调,以实现灵活的风格控制。该方法支持风格插值和用户个性化,并在不牺牲代码正确性的前提下提升了风格控制能力。

链接: https://arxiv.org/abs/2505.19442
作者: Dutao Zhang,Sergey Kovalchuk,YuLong He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, submitted to EMNLP 2025 (Industry Track)

点击查看摘要

Abstract:Controllable code generation, the ability to synthesize code that follows a specified style while maintaining functionality, remains a challenging task. We propose a two-stage training framework combining contrastive learning and conditional decoding to enable flexible style control. The first stage aligns code style representations with semantic and structural features. In the second stage, we fine-tune a language model (e.g., Flan-T5) conditioned on the learned style vector to guide generation. Our method supports style interpolation and user personalization via lightweight mixing. Compared to prior work, our unified framework offers improved stylistic control without sacrificing code correctness. This is among the first approaches to combine contrastive alignment with conditional decoding for style-guided code generation.
zh

[AI-110] Fairness Practices in Industry: A Case Study in Machine Learning Teams Building Recommender Systems

【速读】:该论文试图解决推荐系统中公平性评估与实践的挑战,特别是在不断变化的公平性标准下,如何有效实施公平性措施的问题。解决方案的关键在于理解行业从业者在实际工作中如何感知和整合这些动态的公平性标准,包括他们采用的去偏方法、使用的评估指标、协作策略以及将学术研究成果应用于实际系统的路径。研究强调了多维去偏方法相较于传统基于人口统计的方法更受青睐,并指出实践中更依赖直觉性指标而非学术性指标,同时揭示了在个人角色与组织约束之间平衡公平性需求的困难。

链接: https://arxiv.org/abs/2505.19441
作者: Jing Nathan Yan,Junxiong Wang,Jeffrey M. Rzeszotarski,Allison Koenecke
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid proliferation of recommender systems necessitates robust fairness practices to address inherent biases. Assessing fairness, though, is challenging due to constantly evolving metrics and best practices. This paper analyzes how industry practitioners perceive and incorporate these changing fairness standards in their workflows. Through semi-structured interviews with 11 practitioners from technical teams across a range of large technology companies, we investigate industry implementations of fairness in recommendation system products. We focus on current debiasing practices, applied metrics, collaborative strategies, and integrating academic research into practice. Findings show a preference for multi-dimensional debiasing over traditional demographic methods, and a reliance on intuitive rather than academic metrics. This study also highlights the difficulties in balancing fairness with both the practitioner’s individual (bottom-up) roles and organizational (top-down) workplace constraints, including the interplay with legal and compliance experts. Finally, we offer actionable recommendations for the recommender system community and algorithmic fairness practitioners, underlining the need to refine fairness practices continually.
zh

[AI-111] WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因计算需求增长而导致的效率问题,特别是如何在不进行额外训练的情况下实现高效的稀疏激活策略。现有方法多依赖于隐藏状态的幅度来决定激活,导致近似误差较高且推理精度不足。该论文提出的解决方案为WINA(Weight Informed Neuron Activation),其关键在于联合考虑隐藏状态幅度与权重矩阵的列范数(column-wise ℓ₂-norms),从而在理论和实践中均实现了更优的稀疏化策略,提升了推理性能与资源效率。

链接: https://arxiv.org/abs/2505.19427
作者: Sihan Chen,Dan Zhao,Jongwoo Ko,Colby Banbury,Huiping Zhuang,Luming Liang,Tianyi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise \ell_2 -norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to 2.94% in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at this https URL.
zh

[AI-112] Surrogate-Assisted Evolutionary Reinforcement Learning Based on Autoencoder and Hyperbolic Neural Network

【速读】:该论文旨在解决进化强化学习(Evolutionary Reinforcement Learning, ERL)中由于进化算法(Evolutionary Algorithms, EAs)需要大量候选策略的评估而导致的高计算成本和低搜索效率问题。其关键解决方案是引入一种基于自编码器(Autoencoder, AE)和双曲神经网络(Hyperbolic Neural Network, HNN)的代理模型,通过AE将高维策略压缩为低维表示并提取关键特征作为代理模型的输入,而HNN作为基于分类的代理模型,能够从采样数据中学习复杂的非线性关系,从而在无需实际评估的情况下更准确地预选采样策略,提升搜索效率与效果。

链接: https://arxiv.org/abs/2505.19423
作者: Bingdong Li,Mei Jiang,Hong Qian,Peng Yang,Wenjing Hong,Hong Qian,Ke Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evolutionary Reinforcement Learning (ERL), training the Reinforcement Learning (RL) policies with Evolutionary Algorithms (EAs), have demonstrated enhanced exploration capabilities and greater robustness than using traditional policy gradient. However, ERL suffers from the high computational costs and low search efficiency, as EAs require evaluating numerous candidate policies with expensive simulations, many of which are ineffective and do not contribute meaningfully to the training. One intuitive way to reduce the ineffective evaluations is to adopt the surrogates. Unfortunately, existing ERL policies are often modeled as deep neural networks (DNNs) and thus naturally represented as high-dimensional vectors containing millions of weights, which makes the building of effective surrogates for ERL policies extremely challenging. This paper proposes a novel surrogate-assisted ERL that integrates Autoencoders (AE) and Hyperbolic Neural Networks (HNN). Specifically, AE compresses high-dimensional policies into low-dimensional representations while extracting key features as the inputs for the surrogate. HNN, functioning as a classification-based surrogate model, can learn complex nonlinear relationships from sampled data and enable more accurate pre-selection of the sampled policies without real evaluations. The experiments on 10 Atari and 4 Mujoco games have verified that the proposed method outperforms previous approaches significantly. The search trajectories guided by AE and HNN are also visually demonstrated to be more effective, in terms of both exploration and convergence. This paper not only presents the first learnable policy embedding and surrogate-modeling modules for high-dimensional ERL policies, but also empirically reveals when and why they can be successful.
zh

[AI-113] Its Not Just Labeling" – A Research on LLM Generated Feedback Interpretability and Image Labeling Sketch Features

【速读】:该论文试图解决机器学习应用中训练数据质量对性能的影响问题,特别是在交通、医疗和机器人等领域,传统图像标注方法依赖于耗时且专家驱动的流程,反馈有限。其解决方案的关键在于引入一种基于草图的标注方法,该方法由大型语言模型(Large Language Models, LLMs)支持,以降低技术门槛并提高可访问性。通过合成数据集分析草图识别特征与LLM反馈指标之间的关系,旨在提升LLM辅助标注的可靠性与可解释性,并探索提示策略和草图变化对反馈质量的影响。

链接: https://arxiv.org/abs/2505.19419
作者: Baichuan Li,Larry Powell,Tracy Hammond
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The quality of training data is critical to the performance of machine learning applications in domains like transportation, healthcare, and robotics. Accurate image labeling, however, often relies on time-consuming, expert-driven methods with limited feedback. This research introduces a sketch-based annotation approach supported by large language models (LLMs) to reduce technical barriers and enhance accessibility. Using a synthetic dataset, we examine how sketch recognition features relate to LLM feedback metrics, aiming to improve the reliability and interpretability of LLM-assisted labeling. We also explore how prompting strategies and sketch variations influence feedback quality. Our main contribution is a sketch-based virtual assistant that simplifies annotation for non-experts and advances LLM-driven labeling tools in terms of scalability, accessibility, and explainability.
zh

[AI-114] oward Physics-Informed Machine Learning for Data Center Operations: A Tropical Case Study

【速读】:该论文旨在解决在热带地区运营数据中心时因常年高温高湿环境导致的冷却成本增加问题,以及现有基于机器学习的方法在模型外推能力和系统安全性方面的不确定性。解决方案的关键在于将数据中心的物理特性融入传统的数据驱动机器学习方法中,构建一种融合物理信息的机器学习系统,以提升操作的智能化水平和可靠性。

链接: https://arxiv.org/abs/2505.19414
作者: Ruihang Wang,Zhiwei Cao,Qingang Zhang,Rui Tan,Yonggang Wen,Tommy Leung,Stuart Kennedy,Justin Teoh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data centers are the backbone of computing capacity. Operating data centers in the tropical regions faces unique challenges due to consistently high ambient temperature and elevated relative humidity throughout the year. These conditions result in increased cooling costs to maintain the reliability of the computing systems. While existing machine learning-based approaches have demonstrated potential to elevate operations to a more proactive and intelligent level, their deployment remains dubious due to concerns about model extrapolation capabilities and associated system safety issues. To address these concerns, this article proposes incorporating the physical characteristics of data centers into traditional data-driven machine learning solutions. We begin by introducing the data center system, including the relevant multiphysics processes and the data-physics availability. Next, we outline the associated modeling and optimization problems and propose an integrated, physics-informed machine learning system to address them. Using the proposed system, we present relevant applications across varying levels of operational intelligence. A case study on an industry-grade tropical data center is provided to demonstrate the effectiveness of our approach. Finally, we discuss key challenges and highlight potential future directions.
zh

[AI-115] Fusion Intelligence for Digital Twinning AI Data Centers: A Synergistic GenAI-PhyAI Approach

【速读】:该论文旨在解决AI专用数据中心(AIDC)管理中传统方法和独立AI解决方案难以应对的挑战,特别是现有生成式AI(GenAI)在创建物理AI(PhyAI)数字孪生时存在的准确性不足和幻觉问题。论文提出的解决方案是融合智能(Fusion Intelligence),其关键在于通过生成式AI的自动化能力与物理AI的领域约束相结合,实现数字孪生的高效生成与优化,从而提升AIDC设计阶段的能效预测分析能力,并通过实时数据进一步提高数字孪生的准确性。

链接: https://arxiv.org/abs/2505.19409
作者: Ruihang Wang,Minghao Li,Zhiwei Cao,Jimin Jia,Kyle Guan,Yonggang Wen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The explosion in artificial intelligence (AI) applications is pushing the development of AI-dedicated data centers (AIDCs), creating management challenges that traditional methods and standalone AI solutions struggle to address. While digital twins are beneficial for AI-based design validation and operational optimization, current AI methods for their creation face limitations. Specifically, physical AI (PhyAI) aims to capture the underlying physical laws, which demands extensive, case-specific customization, and generative AI (GenAI) can produce inaccurate or hallucinated results. We propose Fusion Intelligence, a novel framework synergizing GenAI’s automation with PhyAI’s domain grounding. In this dual-agent collaboration, GenAI interprets natural language prompts to generate tokenized AIDC digital twins. Subsequently, PhyAI optimizes these generated twins by enforcing physical constraints and assimilating real-time data. Case studies demonstrate the advantages of our framework in automating the creation and validation of AIDC digital twins. These twins deliver predictive analytics to support power usage effectiveness (PUE) optimization in the design stage. With operational data collected, the digital twin accuracy is further improved compared with pure physics-based models developed by human experts. Fusion Intelligence offers a promising pathway to accelerate digital transformation. It enables more reliable and efficient AI-driven digital transformation for a broad range of mission-critical infrastructures.
zh

[AI-116] Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model

【速读】:该论文旨在探究大型视觉语言模型(VLMs)是否能够通过类似强化学习(RL)的后训练策略继承大型语言模型(LLMs)所展现的强推理能力,特别是在分布外条件下的跨模态和跨任务能力组合问题。研究通过设计一系列诊断任务,评估不同后训练策略下模型在多模态、组合性任务中的表现,发现强化学习训练的模型在组合泛化能力上优于监督微调(SFT)模型,并揭示了当前训练策略在跨模态和跨任务场景下的显著不足。解决方案的关键在于引入显式的视觉到文本对齐与精确的视觉定位机制,以提升模型在多模态推理中的组合能力。

链接: https://arxiv.org/abs/2505.19406
作者: Tianle Li,Jihai Zhang,Yongming Rao,Yu Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) demonstrate strong reasoning capabilities utilizing reinforcement learning (RL) with verifiable reward, whether large vision-language models (VLMs) can directly inherit such capabilities through similar post-training strategies remains underexplored. In this work, we conduct a systematic compositional probing study to evaluate whether current VLMs trained with RL or other post-training strategies can compose capabilities across modalities or tasks under out-of-distribution conditions. We design a suite of diagnostic tasks that train models on unimodal tasks or isolated reasoning skills, and evaluate them on multimodal, compositional variants requiring skill integration. Through comparisons between supervised fine-tuning (SFT) and RL-trained models, we identify three key findings: (1) RL-trained models consistently outperform SFT on compositional generalization, demonstrating better integration of learned skills; (2) although VLMs achieve strong performance on individual tasks, they struggle to generalize compositionally under cross-modal and cross-task scenario, revealing a significant gap in current training strategies; (3) enforcing models to explicitly describe visual content before reasoning (e.g., caption-before-thinking), along with rewarding progressive vision-to-text grounding, yields notable gains. It highlights two essential ingredients for improving compositionality in VLMs: visual-to-text alignment and accurate visual grounding. Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.
zh

[AI-117] Recalibrating the Compass: Integrating Large Language Models into Classical Research Methods

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)如何变革传播研究和社会科学中的核心定量方法的问题,特别是内容分析、调查研究和实验研究。其解决方案的关键在于利用LLMs在文本编码与解释、动态受访者模拟以及个性化和互动刺激生成方面的潜力,同时结合跨学科成果,探讨LLMs作为研究工具的潜在优势与局限性,包括效度、偏见和可解释性问题。论文进一步通过重新审视拉斯韦尔的“谁说、说什么、通过什么渠道、对谁说、产生什么效果”框架,展示LLMs如何重新配置信息研究、受众分析和效果研究,并强调经典研究逻辑在整合LLMs和生成式AI(Generative AI)过程中的重要性。

链接: https://arxiv.org/abs/2505.19402
作者: Tai-Quan Peng,Xuzhen Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This paper examines how large language models (LLMs) are transforming core quantitative methods in communication research in particular, and in the social sciences more broadly-namely, content analysis, survey research, and experimental studies. Rather than replacing classical approaches, LLMs introduce new possibilities for coding and interpreting text, simulating dynamic respondents, and generating personalized and interactive stimuli. Drawing on recent interdisciplinary work, the paper highlights both the potential and limitations of LLMs as research tools, including issues of validity, bias, and interpretability. To situate these developments theoretically, the paper revisits Lasswell’s foundational framework – “Who says what, in which channel, to whom, with what effect?” – and demonstrates how LLMs reconfigure message studies, audience analysis, and effects research by enabling interpretive variation, audience trajectory modeling, and counterfactual experimentation. Revisiting the metaphor of the methodological compass, the paper argues that classical research logics remain essential as the field integrates LLMs and generative AI. By treating LLMs not only as technical instruments but also as epistemic and cultural tools, the paper calls for thoughtful, rigorous, and imaginative use of LLMs in future communication and social science research.
zh

[AI-118] VADER: A Human-Evaluated Benchmark for Vulnerability Assessment Detection Explanation and Remediation

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件漏洞评估、检测、解释和修复方面能力不足的问题,以提升软件系统的安全性和鲁棒性。其解决方案的关键在于提出VADER,一个由人类专家评估的基准测试集,专门用于衡量LLMs在四个核心漏洞处理维度上的表现:评估、检测、解释和修复。VADER包含174个从GitHub仓库中精心挑选并由安全专家标注的真实软件漏洞,并通过严格的评分标准对模型生成的修复代码质量、解释清晰度以及分类和测试计划的有效性进行评估,从而为改进漏洞感知的LLMs提供可解释、可复现的基准。

链接: https://arxiv.org/abs/2505.19395
作者: Ethan TS. Liu,Austin Wang,Spencer Mateega,Carlos Georgescu,Danny Tang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 16 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Ensuring that large language models (LLMs) can effectively assess, detect, explain, and remediate software vulnerabilities is critical for building robust and secure software systems. We introduce VADER, a human-evaluated benchmark designed explicitly to assess LLM performance across four key vulnerability-handling dimensions: assessment, detection, explanation, and remediation. VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub repositories and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER, and human security experts evaluated each response according to a rigorous scoring rubric emphasizing remediation (quality of the code fix, 50%), explanation (20%), and classification and test plan (30%) according to a standardized rubric. Our results show that current state-of-the-art LLMs achieve only moderate success on VADER - OpenAI’s o3 attained 54.7% accuracy overall, with others in the 49-54% range, indicating ample room for improvement. Notably, remediation quality is strongly correlated (Pearson r 0.97) with accurate classification and test plans, suggesting that models that effectively categorize vulnerabilities also tend to fix them well. VADER’s comprehensive dataset, detailed evaluation rubrics, scoring tools, and visualized results with confidence intervals are publicly released, providing the community with an interpretable, reproducible benchmark to advance vulnerability-aware LLMs. All code and data are available at: this https URL
zh

[AI-119] CaseEdit: Enhancing Localized Commonsense Reasoning via Null-Space Constrained Knowledge Editing in Small Parameter Language Models

【速读】:该论文试图解决小参数语言模型在适应用户特定常识知识方面的挑战,尤其是在计算效率优先的场景下,模型难以有效整合个性化常识知识的问题。解决方案的关键在于提出CaseEdit数据集和生成流程,该方法通过多阶段推理过程生成针对家庭物品的典型和非典型上下文编辑,并结合四个评估维度(可靠性、泛化性、局部性和可移植性)来评估局部化、个性化的常识知识编辑效果。此外,研究还验证了AlphaEdit技术在小规模模型上的有效性,该技术通过空域投影减少对无关知识的干扰,表现出最小的连锁效应,从而提升了模型对高质量、上下文敏感常识知识的内化能力。

链接: https://arxiv.org/abs/2505.19383
作者: Varun Reddy,Yen-Ling Kuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong performance on factual recall and general reasoning but struggle to adapt to user-specific, commonsense knowledge, a challenge particularly acute in small-parameter settings where computational efficiency is prioritized. We introduce CaseEdit, a new dataset and generation pipeline for evaluating localized, personalized commonsense knowledge editing in small LLMs to address this. Built upon the ATOMIC20/20 commonsense graph, CaseEdit uses a multi-stage inference process to generate both typical and atypical contextual edits for household objects, paired with targeted evaluation questions across four axes: reliability, generalization, locality, and portability. We evaluate established knowledge editing methods using CaseEdit and demonstrate that AlphaEdit, a technique employing null-space projection to minimize interference with unrelated knowledge, consistently outperforms other methods when applied to an LLaMA 3.2 3B model, even in scalability tests, showing minimal ripple effects. Our results indicate that using CaseEdit with effective editing techniques like AlphaEdit allows small models to internalize high-quality, context-sensitive common-sense knowledge, paving the way for lightweight, personalized assistants.
zh

[AI-120] Foundations of Top-k Decoding For Language Models

【速读】:该论文试图解决Top- k 解码方法在生成式 AI (Generative AI) 中缺乏精确理论依据的问题,即虽然Top- k 解码被广泛用于从大语言模型(LLMs)中采样,但其理论动机尚未明确。论文的解决方案关键在于构建一个理论框架,将解码过程视为稀疏概率分布的恢复问题,并通过最小化可分离的Bregman散度结合ℓ₀正则化来实现稀疏性诱导,从而推导出最优解码策略。该框架表明,最优策略是贪心的,并且损失函数在k上具有离散凸性,使得二分查找能够高效确定最优k值,同时揭示了Top- k 解码作为KL散度的特例,并提出了具有不同行为的新解码策略。

链接: https://arxiv.org/abs/2505.19371
作者: Georgy Noarov,Soham Mallick,Tao Wang,Sunay Joshi,Yan Sun,Yangxinyu Xie,Mengxin Yu,Edgar Dobriban
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Top- k decoding is a widely used method for sampling from LLMs: at each token, only the largest k next-token-probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top- k and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top- k decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top- k decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider \emphBregman decoders obtained by minimizing a separable Bregman divergence (for both the \emphprimal and \emphdual cases) with a sparsity-inducing \ell_0 regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in k , so that binary search provably and efficiently finds the optimal k . We show that top- k decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
zh

[AI-121] SETransformer: A Hybrid Attention-Based Architecture for Robust Human Activity Recognition

【速读】:该论文旨在解决可穿戴传感器数据中人体活动识别(Human Activity Recognition, HAR)任务中存在的长期时间依赖性和多传感器通道上下文相关性捕捉不足的问题。传统深度学习模型如CNN和RNN在处理此类问题时表现有限。解决方案的关键在于提出SETransformer,这是一种结合了基于Transformer的时间建模、通道级挤压-激励(SE)注意力机制以及可学习的时间注意力池化机制的混合深度神经架构,从而有效捕捉活动特异性运动动态并自适应强调关键传感器通道和时间步。

链接: https://arxiv.org/abs/2505.19369
作者: Yunbo Liu,Xukui Qin,Yifan Gao,Xiang Li,Chengwei Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) using wearable sensor data has become a central task in mobile computing, healthcare, and human-computer interaction. Despite the success of traditional deep learning models such as CNNs and RNNs, they often struggle to capture long-range temporal dependencies and contextual relevance across multiple sensor channels. To address these limitations, we propose SETransformer, a hybrid deep neural architecture that combines Transformer-based temporal modeling with channel-wise squeeze-and-excitation (SE) attention and a learnable temporal attention pooling mechanism. The model takes raw triaxial accelerometer data as input and leverages global self-attention to capture activity-specific motion dynamics over extended time windows, while adaptively emphasizing informative sensor channels and critical time steps. We evaluate SETransformer on the WISDM dataset and demonstrate that it significantly outperforms conventional models including LSTM, GRU, BiLSTM, and CNN baselines. The proposed model achieves a validation accuracy of 84.68% and a macro F1-score of 84.64%, surpassing all baseline architectures by a notable margin. Our results show that SETransformer is a competitive and interpretable solution for real-world HAR tasks, with strong potential for deployment in mobile and ubiquitous sensing applications. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19369 [cs.LG] (or arXiv:2505.19369v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.19369 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-122] PatentMind: A Multi-Aspect Reasoning Graph for Patent Similarity Evaluation

【速读】:该论文试图解决专利相似性评估中现有方法忽视专利文档复杂结构的问题,该结构包含技术规范、法律边界和应用背景。解决方案的关键在于提出PatentMind框架,该框架基于多方面推理图(Multi-Aspect Reasoning Graph, MARG),将专利分解为技术特征、应用领域和权利要求范围三个核心维度,并通过四阶段推理过程动态加权计算维度特定的相似性得分,以模拟专家判断。

链接: https://arxiv.org/abs/2505.19347
作者: Yongmin Yoo,Qiongkai Xu,Longbing Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Patent similarity evaluation plays a critical role in intellectual property analysis. However, existing methods often overlook the intricate structure of patent documents, which integrate technical specifications, legal boundaries, and application contexts. We introduce PatentMind, a novel framework for patent similarity assessment based on a Multi-Aspect Reasoning Graph (MARG). PatentMind decomposes patents into three core dimensions: technical feature, application domain, and claim scope, to compute dimension-specific similarity scores. These scores are dynamically weighted through a four-stage reasoning process which integrates contextual signals to emulate expert-level judgment. To support evaluation, we construct PatentSimBench, a human-annotated benchmark comprising 500 patent pairs. Experimental results demonstrate that PatentMind achieves a strong correlation ( r=0.938 ) with expert annotations, significantly outperforming embedding-based models and advanced prompt engineering this http URL results highlight the effectiveness of modular reasoning frameworks in overcoming key limitations of embedding-based methods for analyzing patent similarity.
zh

[AI-123] Communication-Efficient Multi-Device Inference Acceleration for Transformer Models

【速读】:该论文旨在解决Transformer模型在推理过程中存在的高延迟问题,这限制了其在实时应用场景中的使用。现有多设备推理方法虽然能够通过并行计算降低延迟,但需要较高的设备间带宽,难以在带宽受限的环境中应用。论文提出的解决方案ASTRA的关键在于通过序列并行性和一种混合精度注意力机制的创新结合,实现通信效率的提升,同时利用向量量化压缩非局部标记嵌入,并通过噪声增强量化和分布式类别标记两种优化策略保持任务准确性。

链接: https://arxiv.org/abs/2505.19342
作者: Xiao Liu,Lijun Zhang,Deepak Ganesan,Hui Guan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at this https URL.
zh

[AI-124] owards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous thought Machines (CTM) and Model Context Protocol (MCP)

【速读】:该论文试图解决人形机器人在陌生场景中静态预设的“思考-规划-行动”模式与因缺乏自主编码能力而导致的高度程序化的“调用工具-返回结果”之间的差距。其解决方案的关键在于设计一种动态架构,将连续思维机器(CTM)与模型上下文协议(MCP)相连接,并通过tick-slab理论并行方案和排名压缩技术实现参数抑制,从而为自主编码驱动的自主行动提供解决方案。

链接: https://arxiv.org/abs/2505.19339
作者: Libo Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: The relevant architecture code and some experimental records have been uploaded to the GitHub repository for sharing: this https URL

点击查看摘要

Abstract:To address the gaps between the static pre-set “thinking-planning-action” of humanoid robots in unfamiliar scenarios and the highly programmed “call tool-return result” due to the lack of autonomous coding capabilities, this work designs a dynamic architecture connecting continuous thought machines (CTM) and model context protocol (MCP). It proposes a theoretical parallel solution through tick-slab and uses rank compression to achieve parameter suppression to provide a solution for achieving autonomous actions due to autonomous coding. The researcher used a simulation-based experiment using OpenAI’s o4-mini-high as a tool to build the experimental environment, and introduced the extended SayCan dataset to conduct nine epochs of experiments. The experimental results show that the CTM-MCP architecture is feasible and effective through the data results of seven metrics: task success rate (TSR), execution success rate (ESR), average episode length (AEL), ROSCOE, REVEAL, proficiency self-assessment (PSA), task effectiveness (TE). In practice, it provides a reference experience for exploring the autonomous dynamic coding of humanoid robots based on continuous thinking to achieve human-like autonomous actions.
zh

[AI-125] Prompting Decision Transformers for Zero-Shot Reach-Avoid Policies

【速读】:该论文旨在解决在复杂或结构不良环境中,传统离线目标条件强化学习方法在动态指定避免区域(avoid regions)和依赖精心设计的奖励与成本函数方面的局限性。其解决方案的关键在于提出RADT模型,该模型通过将目标和避免区域直接编码为提示标记(prompt tokens),实现了评估时对任意数量和大小的避免区域的灵活指定,并利用目标和避免区域的回顾性重标注技术,仅使用随机策略的次优离线轨迹即可学习到避障行为。

链接: https://arxiv.org/abs/2505.19337
作者: Kevin Li,Marinka Zitnik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning methods have shown promise for reach-avoid tasks, where an agent must reach a target state while avoiding undesirable regions of the state space. Existing approaches typically encode avoid-region information into an augmented state space and cost function, which prevents flexible, dynamic specification of novel avoid-region information at evaluation time. They also rely heavily on well-designed reward and cost functions, limiting scalability to complex or poorly structured environments. We introduce RADT, a decision transformer model for offline, reward-free, goal-conditioned, avoid region-conditioned RL. RADT encodes goals and avoid regions directly as prompt tokens, allowing any number of avoid regions of arbitrary size to be specified at evaluation time. Using only suboptimal offline trajectories from a random policy, RADT learns reach-avoid behavior through a novel combination of goal and avoid-region hindsight relabeling. We benchmark RADT against 3 existing offline goal-conditioned RL models across 11 tasks, environments, and experimental settings. RADT generalizes in a zero-shot manner to out-of-distribution avoid region sizes and counts, outperforming baselines that require retraining. In one such zero-shot setting, RADT achieves 35.7% improvement in normalized cost over the best retrained baseline while maintaining high goal-reaching success. We apply RADT to cell reprogramming in biology, where it reduces visits to undesirable intermediate gene expression states during trajectories to desired target states, despite stochastic transitions and discrete, structured state dynamics.
zh

[AI-126] Evaluating Steering Techniques using Human Similarity Judgments

【速读】:该论文试图解决当前大型语言模型(Large Language Model, LLM)引导技术评估中忽视其表征与人类认知对齐程度的问题。研究通过一个基于人类认知的三元相似性判断任务,评估了引导后的LLM在基于大小或种类进行概念相似性判断的灵活性。解决方案的关键在于采用提示引导(prompt-based steering)方法,该方法在引导准确性和模型与人类一致性方面均优于其他方法,并揭示了LLM在未引导前存在的表征轴优势。

链接: https://arxiv.org/abs/2505.19333
作者: Zach Studdiford,Timothy T. Rogers,Siddharth Suresh,Kushin Mukherjee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current evaluations of Large Language Model (LLM) steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards ‘kind’ similarity and struggled with ‘size’ alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.
zh

[AI-127] Effort-aware Fairness: Incorporating a Philosophy-informed Human-centered Notion of Effort into Algorithmic Fairness Metrics

【速读】:该论文试图解决现有AI公平性度量(如人口统计学平等)未能考虑个体在输入特征空间中所付出的努力的问题。其解决方案的关键在于引入一种受哲学启发的“努力感知公平性”(Effort-aware Fairness, EaF)概念,该概念基于“力”(Force)或预测特征的时间轨迹与惯性相结合的理论框架。通过这一框架,论文提出了理论上的EaF度量,并通过实验证明了人们在评估个体公平性时更关注预测特征的时间轨迹而非其总体值,同时提供了在刑事司法和个人金融场景中计算努力感知个体/群体公平性的方法。

链接: https://arxiv.org/abs/2505.19317
作者: Tin Nguyen,Jiannan Xu,Zora Che,Phuong-Anh Nguyen-Le,Rushil Dandamudi,Donald Braman,Furong Huang,Hal Daumé III,Zubin Jelveh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Although popularized AI fairness metrics, e.g., demographic parity, have uncovered bias in AI-assisted decision-making outcomes, they do not consider how much effort one has spent to get to where one is today in the input feature space. However, the notion of effort is important in how Philosophy and humans understand fairness. We propose a philosophy-informed way to conceptualize and evaluate Effort-aware Fairness (EaF) based on the concept of Force, or temporal trajectory of predictive features coupled with inertia. In addition to our theoretical formulation of EaF metrics, our empirical contributions include: 1/ a pre-registered human subjects experiment, which demonstrates that for both stages of the (individual) fairness evaluation process, people consider the temporal trajectory of a predictive feature more than its aggregate value; 2/ pipelines to compute Effort-aware Individual/Group Fairness in the criminal justice and personal finance contexts. Our work may enable AI model auditors to uncover and potentially correct unfair decisions against individuals who spent significant efforts to improve but are still stuck with systemic/early-life disadvantages outside their control.
zh

[AI-128] Demand Selection for VRP with Emission Quota

【速读】:该论文试图解决带有排放配额的车辆路径问题(Vehicle Routing Problem with Emission Quota, QVRP)中的需求选择问题,其核心目标是在遵守污染配额的前提下,最小化被遗漏配送的数量。解决方案的关键在于将问题分解为两部分:需求选择部分被称为最大可行车辆分配(Maximum Feasible Vehicle Assignment, MFVA),而车辆路径的构建则采用传统的运筹学(Operations Research, OR)方法。研究提出多种基于机器学习(Machine Learning, ML)和OR的方法进行包裹的遗漏选择,但实验结果表明,在静态问题设置下,基于OR的方法始终优于基于ML的方法。

链接: https://arxiv.org/abs/2505.19315
作者: Farid Najar,Dominique Barth,Yann Strozecki
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Combinatorial optimization (CO) problems are traditionally addressed using Operations Research (OR) methods, including metaheuristics. In this study, we introduce a demand selection problem for the Vehicle Routing Problem (VRP) with an emission quota, referred to as QVRP. The objective is to minimize the number of omitted deliveries while respecting the pollution quota. We focus on the demand selection part, called Maximum Feasible Vehicle Assignment (MFVA), while the construction of a routing for the VRP instance is solved using classical OR methods. We propose several methods for selecting the packages to omit, both from machine learning (ML) and OR. Our results show that, in this static problem setting, classical OR-based methods consistently outperform ML-based approaches.
zh

[AI-129] Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking

【速读】:该论文试图解决在动态环境中集成多个(子)系统时面临的挑战,特别是设计阶段集成尚未存在的服务时的问题。传统方法依赖于注册表提供端点的API文档,但大型语言模型在自动创建系统集成(如服务组合)时需要简洁的输入,这限制了对全面API描述的处理。论文的关键解决方案是利用检索增强生成(Retrieval Augmented Generation, RAG)进行端点发现,并通过分块(chunking)预处理状态实践中的OpenAPI以减少输入token长度,同时保留最相关信息。此外,提出了一种仅接收关键端点摘要并按需检索规范细节的Discovery Agent,以进一步减少输入token长度并提高端点检索精度。

链接: https://arxiv.org/abs/2505.19310
作者: Robin D. Pesl,Jerin G. Mathew,Massimo Mecella,Marco Aiello
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2411.19804

点击查看摘要

Abstract:Integrating multiple (sub-)systems is essential to create advanced Information Systems. Difficulties mainly arise when integrating dynamic environments, e.g., the integration at design time of not yet existing services. This has been traditionally addressed using a registry that provides the API documentation of the endpoints. Large Language Models have shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input oken limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. In the present work, we (i) analyze the usage of Retrieval Augmented Generation for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input oken length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints nd retrieves specification details on demand. We evaluate RAG for endpoint discovery using (iii) a proposed novel service discovery benchmark SOCBench-D representing a general setting across numerous domains and the real-world RestBench enchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval accuracy. Then, we assess the Discovery Agent using the same test data set. The prototype shows how to successfully employ RAG for endpoint discovery to reduce the token count. Our experiments show that endpoint-based approaches outperform naive chunking methods for preprocessing. Relying on an agent significantly improves precision while being prone to decrease recall, disclosing the need for further reasoning capabilities.
zh

[AI-130] A Novel Zero-Trust Identity Framework for Agent ic AI: Decentralized Authentication and Fine-Grained Access Control

【速读】:该论文试图解决传统身份与访问管理(IAM)系统在应对大规模多智能体系统(MAS)中动态、相互依赖且短暂的AI代理时所表现出的根本性不足。现有协议如OAuth、OpenID Connect(OIDC)和SAML主要面向人类用户或静态机器身份,其粗粒度控制、单一实体聚焦及缺乏上下文感知能力无法满足MAS中智能体的复杂需求。解决方案的关键在于提出一种新型的代理AI IAM框架,该框架基于丰富的可验证代理身份(Agent IDs),利用去中心化标识符(DIDs)和可验证凭证(VCs)来封装代理的能力、来源、行为范围及安全状态,并包含代理命名服务(ANS)、动态细粒度访问控制机制以及统一的全局会话管理和策略执行层,以实现跨异构代理通信协议的实时控制与一致撤销。此外,还探索了零知识证明(ZKPs)在隐私保护属性披露和可验证策略合规性中的应用。

链接: https://arxiv.org/abs/2505.19301
作者: Ken Huang,Vineeth Sai Narajala,John Yeoh,Ramesh Raskar,Youssef Harkati,Jerry Huang,Idan Habler,Chris Hughes
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 24 Pages, 5 figures, 2 tables

点击查看摘要

Abstract:Traditional Identity and Access Management (IAM) systems, primarily designed for human users or static machine identities via protocols such as OAuth, OpenID Connect (OIDC), and SAML, prove fundamentally inadequate for the dynamic, interdependent, and often ephemeral nature of AI agents operating at scale within Multi Agent Systems (MAS), a computational system composed of multiple interacting intelligent agents that work collectively. This paper posits the imperative for a novel Agentic AI IAM framework: We deconstruct the limitations of existing protocols when applied to MAS, illustrating with concrete examples why their coarse-grained controls, single-entity focus, and lack of context-awareness falter. We then propose a comprehensive framework built upon rich, verifiable Agent Identities (IDs), leveraging Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), that encapsulate an agents capabilities, provenance, behavioral scope, and security posture. Our framework includes an Agent Naming Service (ANS) for secure and capability-aware discovery, dynamic fine-grained access control mechanisms, and critically, a unified global session management and policy enforcement layer for real-time control and consistent revocation across heterogeneous agent communication protocols. We also explore how Zero-Knowledge Proofs (ZKPs) enable privacy-preserving attribute disclosure and verifiable policy compliance. We outline the architecture, operational lifecycle, innovative contributions, and security considerations of this new IAM paradigm, aiming to establish the foundational trust, accountability, and security necessary for the burgeoning field of agentic AI and the complex ecosystems they will inhabit. Comments: 24 Pages, 5 figures, 2 tables Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2505.19301 [cs.CR] (or arXiv:2505.19301v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.19301 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Vineeth Sai Narajala [view email] [v1] Sun, 25 May 2025 20:21:55 UTC (2,732 KB)
zh

[AI-131] Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation ACL2025

【速读】:该论文试图解决在自监督学习(Self-supervised learning, SSL)中,如何有效分离语音内容信息与说话人特征等问题,以生成更具内容独立性的表示。现有方法在去除说话人信息时往往会影响其他语音成分,或需要复杂的模型结构。该论文提出的解决方案的关键在于通过线性分解SSL表示,将其划分为说话人特定和说话人无关的组件,从而实现有效的说话人解耦,提升内容驱动任务的性能。

链接: https://arxiv.org/abs/2505.19273
作者: Giuseppe Ruggiero,Matteo Testa,Jurgen Van de Walle,Luigi Di Caro
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Full paper accepted at ACL 2025

点击查看摘要

Abstract:Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations like speaker characteristics in the SSL representations. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker disentangled representations. Comprehensive experiments show that our approach achieves speaker independence and as such, when applied to content-driven tasks such as voice conversion, our representations yield significant improvements over state-of-the-art methods.
zh

[AI-132] Using Large Language Models to Assess Teachers Pedagogical Content Knowledge

【速读】:该论文试图解决通过基于表现的任务评估教师的学科教学知识(Pedagogical Content Knowledge, PCK)时存在耗时耗力的问题,同时探讨大型语言模型(Large Language Models, LLMs)在自动评分过程中是否引入与传统机器学习(Machine Learning, ML)和人工评分员相似或不同的构念无关方差(Construct-Irrelevant Variance, CIV)。解决方案的关键在于利用广义线性混合模型(Generalized Linear Mixed Models, GLMMs)比较三种评分来源(人工评分员、监督式ML和LLM)在情景变异、评分者严厉程度和对情景敏感性三个CIV来源上的方差成分和评分模式,从而评估LLM在提升评分效率的同时是否引入了与人类评分员类似的CIV。

链接: https://arxiv.org/abs/2505.19266
作者: Yaxuan Yang,Shiyu Wang,Xiaoming Zhai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Assessing teachers’ pedagogical content knowledge (PCK) through performance-based tasks is both time and effort-consuming. While large language models (LLMs) offer new opportunities for efficient automatic scoring, little is known about whether LLMs introduce construct-irrelevant variance (CIV) in ways similar to or different from traditional machine learning (ML) and human raters. This study examines three sources of CIV – scenario variability, rater severity, and rater sensitivity to scenario – in the context of video-based constructed-response tasks targeting two PCK sub-constructs: analyzing student thinking and evaluating teacher responsiveness. Using generalized linear mixed models (GLMMs), we compared variance components and rater-level scoring patterns across three scoring sources: human raters, supervised ML, and LLM. Results indicate that scenario-level variance was minimal across tasks, while rater-related factors contributed substantially to CIV, especially in the more interpretive Task II. The ML model was the most severe and least sensitive rater, whereas the LLM was the most lenient. These findings suggest that the LLM contributes to scoring efficiency while also introducing CIV as human raters do, yet with varying levels of contribution compared to supervised ML. Implications for rater training, automated scoring design, and future research on model interpretability are discussed.
zh

[AI-133] Cellular Traffic Prediction via Byzantine-robust Asynchronous Federated Learning

【速读】:该论文旨在解决传统网络流量预测方法在集中式训练中面临的数据隐私泄露和延迟问题,以及联邦学习框架在存在拜占庭客户端时模型鲁棒性不足的问题。其解决方案的关键在于提出一种基于分布鲁棒优化的异步差分联邦学习框架,通过多客户端协作训练预测模型并引入本地差分隐私保护机制,同时结合正则化技术提升模型对拜占庭攻击的鲁棒性。

链接: https://arxiv.org/abs/2505.19263
作者: Hui Ma,Kai Yang,Yang Jiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Network traffic prediction plays a crucial role in intelligent network operation. Traditional prediction methods often rely on centralized training, necessitating the transfer of vast amounts of traffic data to a central server. This approach can lead to latency and privacy concerns. To address these issues, federated learning integrated with differential privacy has emerged as a solution to improve data privacy and model robustness in distributed settings. Nonetheless, existing federated learning protocols are vulnerable to Byzantine attacks, which may significantly compromise model robustness. Developing a robust and privacy-preserving prediction model in the presence of Byzantine clients remains a significant challenge. To this end, we propose an asynchronous differential federated learning framework based on distributionally robust optimization. The proposed framework utilizes multiple clients to train the prediction model collaboratively with local differential privacy. In addition, regularization techniques have been employed to further improve the Byzantine robustness of the models. We have conducted extensive experiments on three real-world datasets, and the results elucidate that our proposed distributed algorithm can achieve superior performance over existing methods.
zh

[AI-134] owards Large Reasoning Models for Agriculture

【速读】:该论文试图解决农业决策中复杂、情境特定的推理问题,这类问题涉及作物选择、实践和干预措施,高度依赖地理、气候和经济条件。传统大型语言模型(Large Language Models, LLMs)由于推理能力有限,难以有效处理此类问题。论文提出的解决方案关键在于利用大型推理模型(Large Reasoning Models, LRMs),并通过构建AgReason基准和AgThoughts数据集,提升农业推理能力。实验结果显示,LRMs在农业推理任务中表现优于传统模型,尽管仍存在挑战,但基于Gemini的基线模型在该任务中达到了36%的准确率。

链接: https://arxiv.org/abs/2505.19259
作者: Hossein Zaremehrjerdi,Shreyan Ganguly,Ashlyn Rairdin,Elizabeth Tranel,Benjamin Feuer,Juan Ignacio Di Salvo,Srikanth Panthulugiri,Victoria Moser,Sarah Jones,Joscif G Raigne,Yanben Shen,Heidi M. Dornath,Aditya Balu,Adarsh Krishnamurthy,Asheesh K Singh,Arti Singh,Baskar Ganapathysubramanian,Chinmay Hegde,Soumik Sarkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agricultural decision-making involves complex, context-specific reasoning, where choices about crops, practices, and interventions depend heavily on geographic, climatic, and economic conditions. Traditional large language models (LLMs) often fall short in navigating this nuanced problem due to limited reasoning capacity. We hypothesize that recent advances in large reasoning models (LRMs) can better handle such structured, domain-specific inference. To investigate this, we introduce AgReason, the first expert-curated open-ended science benchmark with 100 questions for agricultural reasoning. Evaluations across thirteen open-source and proprietary models reveal that LRMs outperform conventional ones, though notable challenges persist, with the strongest Gemini-based baseline achieving 36% accuracy. We also present AgThoughts, a large-scale dataset of 44.6K question-answer pairs generated with human oversight and equipped with synthetically generated reasoning traces. Using AgThoughts, we develop AgThinker, a suite of small reasoning models that can be run on consumer-grade GPUs, and show that our dataset can be effective in unlocking agricultural reasoning abilities in LLMs. Our project page is here: this https URL
zh

[AI-135] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

【速读】:该论文试图解决视觉语言模型(VLMs)在推理过程中缺乏真正多模态推理能力的问题,即现有方法主要生成基于静态图像输入的文本推理,未能实现文本与视觉信息的动态交互。其解决方案的关键在于提出VTool-R1框架,该框架通过将文本和中间视觉推理步骤交错,训练VLMs生成多模态的思维链,并集成基于Python的视觉编辑工具到强化学习微调(RFT)过程中,使模型能够学习何时以及如何生成有助于最终推理的视觉推理步骤。

链接: https://arxiv.org/abs/2505.19255
作者: Mingyuan Wu,Jingcheng Yang,Jize Jiang,Meitang Li,Kaizhuo Yan,Hanchao Yu,Minjia Zhang,Chengxiang Zhai,Klara Nahrstedt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to “think with images” and generate multimodal chain of thoughts with tools. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.19255 [cs.LG] (or arXiv:2505.19255v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.19255 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-136] Learning-Augmented Online Bipartite Fractional Matching

【速读】:该论文旨在解决在线二部图匹配问题,特别是在算法每轮迭代中获得建议匹配的情况下,如何设计更优的算法。其关键解决方案是开发出在顶点加权和无权重情形下均优于随机选择遵循建议或不遵循建议策略的算法。对于顶点加权情形,该算法进一步扩展至小报价假设下的AdWords问题,相较于早期工作取得了显著改进。

链接: https://arxiv.org/abs/2505.19252
作者: Davin Choo,Billy Jin,Yongho Shin
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Online bipartite matching is a fundamental problem in online optimization, extensively studied both in its integral and fractional forms due to its theoretical significance and practical applications, such as online advertising and resource allocation. Motivated by recent progress in learning-augmented algorithms, we study online bipartite fractional matching when the algorithm is given advice in the form of a suggested matching in each iteration. We develop algorithms for both the vertex-weighted and unweighted variants that provably dominate the naive “coin flip” strategy of randomly choosing between the advice-following and advice-free algorithms. Moreover, our algorithm for the vertex-weighted setting extends to the AdWords problem under the small bids assumption, yielding a significant improvement over the seminal work of Mahdian, Nazerzadeh, and Saberi (EC 2007, TALG 2012). Complementing our positive results, we establish a hardness bound on the robustness-consistency tradeoff that is attainable by any algorithm. We empirically validate our algorithms through experiments on synthetic and real-world data.
zh

[AI-137] Improving Value Estimation Critically Enhances Vanilla Policy Gradient

【速读】:该论文试图解决传统策略梯度算法在强化学习(Reinforcement Learning, RL)任务中性能不如TRPO和PPO等现代策略梯度算法的问题。研究认为,以往普遍认为的近似信任区域约束对策略改进的稳定性作用并非关键因素,而是在每次迭代中增加价值函数更新步骤所带来的价值估计精度提升才是性能提升的核心。解决方案的关键在于通过增加每轮迭代中的价值更新次数,使原始策略梯度方法即可达到或超越PPO的性能,同时该方法在超参数选择上表现出更高的鲁棒性。

链接: https://arxiv.org/abs/2505.19247
作者: Tao Wang,Ruipeng Zhang,Sicun Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 15 pages and 21 figures

点击查看摘要

Abstract:Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
zh

[AI-138] o CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers

【速读】:该论文试图解决Chain-of-Thought (CoT) 和 Looped Transformers 在推理任务中的比较能力尚不明确的问题,旨在分析二者在不同任务场景下的优势与局限性。其解决方案的关键在于通过形式化分析,揭示Looped Transformers 在确定性任务中能够高效模拟并行计算,表现为对有向无环图的评估;而CoT结合随机解码则在组合结构的近似推理中表现优异,适用于自约化问题。这一分析为选择适合深度驱动递归的推理范式提供了理论依据和实践指导。

链接: https://arxiv.org/abs/2505.19245
作者: Kevin Xu,Issei Sato
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks and to theoretically enhance expressivity by recursively increasing the number of computational steps. However, their comparative capabilities are still not well understood. In this paper, we provide a formal analysis of their respective strengths and limitations. We show that Looped Transformers can efficiently simulate parallel computations for deterministic tasks, which we formalize as evaluation over directed acyclic graphs. In contrast, CoT with stochastic decoding excels at approximate inference for compositional structures, namely self-reducible problems. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical cues for choosing between reasoning paradigms.
zh

[AI-139] ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)对齐过程中依赖高质量人类偏好数据集的问题,而此类数据集的收集成本高且资源消耗大。为了解决这一问题,论文提出了一种名为ActiveDPO的算法,其关键在于采用理论基础坚实的数据显示选择准则,适用于非线性奖励函数,并直接利用LLM自身参数化用于主动数据选择的奖励模型。这种方法明确考虑了LLM对数据选择的影响,从而实现了更高效和有效的数据收集。

链接: https://arxiv.org/abs/2505.19241
作者: Xiaoqiang Lin,Arun Verma,Zhongxiang Dai,Daniela Rus,See-Kiong Ng,Bryan Kian Hsiang Low
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However,3 achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.
zh

[AI-140] Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

【速读】:该论文旨在解决在真实世界控制系统中设计安全策略时面临的约束决策问题,特别是在实际模型与可获取的模拟器/名义模型之间存在不匹配的情况下,如何学习一个既能最大化累积奖励又满足约束条件的策略。其解决方案的关键在于提出一种新颖的技术,该技术能够有效最小化约束值函数以满足约束条件,而在所有约束都被满足的情况下,进一步最大化鲁棒奖励值函数。该算法被证明能够在不超过ε次优性的情况下,在O(ε^-2)次迭代内找到可行策略,且相比现有方法无需采用二分搜索,从而显著降低了计算时间。

链接: https://arxiv.org/abs/2505.19238
作者: Sourav Ganguly,Arnob Ghosh,Kishan Panaganti,Adam Wierman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that will maximize the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP) where an agent needs to maximize the reward and satisfy the constraint against the worst possible stochastic model under the uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for standard constrained MDP (CMDP), are not applicable here because of the lack of the strong duality property. Further, one cannot apply the standard robust value-iteration based approach on the composite value function either as the worst case models may be different for the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function–to satisfy the constraints; on the other hand, when all the constraints are satisfied, it can simply maximize the robust reward value function. We prove that such an algorithm finds a policy with at most \epsilon sub-optimality and feasible policy after O(\epsilon^-2) iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search, thus, we reduce the computation time by at least 4x for smaller value of discount factor ( \gamma ) and by at least 6x for larger value of \gamma .
zh

[AI-141] Sensorimotor features of self-awareness in multimodal large language models

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)是否能够通过感官运动体验发展出自知能力的问题,即探索其在非人类平台(如机器人)上实现自我意识的可能性。解决方案的关键在于将多模态LLM集成到自主移动机器人中,通过传感器输入与环境的交互,验证其环境感知、自我识别和预测能力,并揭示感官整合对自我意识不同维度的影响及其与记忆系统的协同作用。

链接: https://arxiv.org/abs/2505.19237
作者: Iñaki Dellibarda Varela,Pablo Romero-Sorozabal,Diego Torricelli,Gabriel Delgado-Oleas,Jose Ignacio Serrano,Maria Dolores del Castillo Sobrino,Eduardo Rocon,Manuel Cebrian
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 16 pages, 3 figures, 1 table

点击查看摘要

Abstract:Self-awareness - the ability to distinguish oneself from the surrounding environment - underpins intelligent, autonomous behavior. Recent advances in AI achieve human-like performance in tasks integrating multimodal information, particularly in large language models, raising interest in the embodiment capabilities of AI agents on nonhuman platforms such as robots. Here, we explore whether multimodal LLMs can develop self-awareness solely through sensorimotor experiences. By integrating a multimodal LLM into an autonomous mobile robot, we test its ability to achieve this capacity. We find that the system exhibits robust environmental awareness, self-recognition and predictive awareness, allowing it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of self-awareness and its coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs identify critical modalities for each dimension, demonstrate compensatory interactions among sensors and confirm the essential role of structured and episodic memory in coherent reasoning. These findings demonstrate that, given appropriate sensory information about the world and itself, multimodal LLMs exhibit emergent self-awareness, opening the door to artificial embodied cognitive systems.
zh

[AI-142] DeCoDe: Defer-and-Complement Decision-Making via Decoupled Concept Bottleneck Models

【速读】:该论文试图解决人机协作中的决策问题,即如何在AI自主处理任务、将任务转交人类专家或通过协作互补之间做出合理选择。现有学习转交(Learning to Defer)方法通常仅在AI与人类之间进行二元决策,忽视了两者的优势互补性,并且缺乏可解释性。论文提出的解决方案是基于解耦概念瓶颈模型的“转交与补充”决策框架(DeCoDe),其关键在于利用人类可理解的概念表示进行策略决策,从而提升整个决策过程的透明度,并通过门控网络和新型代理损失函数实现任务实例级别的自适应协作。

链接: https://arxiv.org/abs/2505.19220
作者: Chengbo He,Bochao Zou,Junliang Xing,Jiansheng Chen,Yuanchun Shi,Huimin Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:In human-AI collaboration, a central challenge is deciding whether the AI should handle a task, be deferred to a human expert, or be addressed through collaborative effort. Existing Learning to Defer approaches typically make binary choices between AI and humans, neglecting their complementary strengths. They also lack interpretability, a critical property in high-stakes scenarios where users must understand and, if necessary, correct the model’s reasoning. To overcome these limitations, we propose Defer-and-Complement Decision-Making via Decoupled Concept Bottleneck Models (DeCoDe), a concept-driven framework for human-AI collaboration. DeCoDe makes strategy decisions based on human-interpretable concept representations, enhancing transparency throughout the decision process. It supports three flexible modes: autonomous AI prediction, deferral to humans, and human-AI collaborative complementarity, selected via a gating network that takes concept-level inputs and is trained using a novel surrogate loss that balances accuracy and human effort. This approach enables instance-specific, interpretable, and adaptive human-AI collaboration. Experiments on real-world datasets demonstrate that DeCoDe significantly outperforms AI-only, human-only, and traditional deferral baselines, while maintaining strong robustness and interpretability even under noisy expert annotations.
zh

[AI-143] Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding

【速读】:该论文旨在解决多智能体路径规划(Multi-Agent Path Finding, MAPF)问题,即在复杂环境中为多个智能体计算无碰撞的路径。其解决方案的关键在于提出一个统一框架,整合了基于搜索的方法(如Conflict-Based Search、Priority-Based Search和Large Neighborhood Search)、基于编译的方法(如SAT、SMT、CSP、ASP和MIP公式化)以及数据驱动的技术(如强化学习、监督学习和混合策略),以系统性地分析和比较不同方法的性能与适用场景。

链接: https://arxiv.org/abs/2505.19219
作者: Shiyue Wang,Haozheng Xu,Yuhan Zhang,Jingran Lin,Changhong Lu,Xiangfeng Wang,Wenhao Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Combinatorics (math.CO)
备注: 112 pages, 21 figures, 20 tables

点击查看摘要

Abstract:Multi-Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics, requiring the computation of collision-free paths for multiple agents navigating from their start locations to designated goals. As autonomous systems become increasingly prevalent in warehouses, urban transportation, and other complex environments, MAPF has evolved from a theoretical challenge to a critical enabler of real-world multi-robot coordination. This comprehensive survey bridges the long-standing divide between classical algorithmic approaches and emerging learning-based methods in MAPF research. We present a unified framework that encompasses search-based methods (including Conflict-Based Search, Priority-Based Search, and Large Neighborhood Search), compilation-based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data-driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through systematic analysis of experimental practices across 200+ papers, we uncover significant disparities in evaluation methodologies, with classical methods typically tested on larger-scale instances (up to 200 by 200 grids with 1000+ agents) compared to learning-based approaches (predominantly 10-100 agents). We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, highlighting the need for standardized benchmarking protocols. Finally, we outline promising future directions including mixed-motive MAPF with game-theoretic considerations, language-grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves as both a comprehensive reference for researchers and a practical guide for deploying MAPF solutions in increasingly complex real-world applications.
zh

[AI-144] Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning

【速读】:该论文旨在解决当前医疗影像领域中基于强化学习的微调方法在处理开放性、推理密集型临床决策任务时的局限性,尤其是现有方法主要聚焦于封闭式视觉问答(VQA),限制了模型的世界知识检索能力和灵活任务适应性。其解决方案的关键在于提出\textbf{MedCCO},这是一个面向医疗VQA的多模态强化学习框架,通过课程引导的微调范式统一了封闭式与开放式数据,首先在多样化的封闭式医疗VQA任务上进行微调以建立领域基础推理能力,随后逐步适应开放式任务,从而提升知识深度和临床可解释性。

链接: https://arxiv.org/abs/2505.19213
作者: Shaohao Rui,Kaitao Chen,Weijie Ma,Xiaosong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reinforcement learning with verifiable, rule-based rewards have greatly enhanced the reasoning capabilities and out-of-distribution generalization of VLMs/LLMs, obviating the need for manually crafted reasoning chains. Despite these promising developments in the general domain, their translation to medical imaging remains limited. Current medical reinforcement fine-tuning (RFT) methods predominantly focus on close-ended VQA, thereby restricting the model’s ability to engage in world knowledge retrieval and flexible task adaptation. More critically, these methods fall short of addressing the critical clinical demand for open-ended, reasoning-intensive decision-making. To bridge this gap, we introduce \textbfMedCCO, the first multimodal reinforcement learning framework tailored for medical VQA that unifies close-ended and open-ended data within a curriculum-driven RFT paradigm. Specifically, MedCCO is initially fine-tuned on a diverse set of close-ended medical VQA tasks to establish domain-grounded reasoning capabilities, and is then progressively adapted to open-ended tasks to foster deeper knowledge enhancement and clinical interpretability. We validate MedCCO across eight challenging medical VQA benchmarks, spanning both close-ended and open-ended settings. Experimental results show that MedCCO consistently enhances performance and generalization, achieving a 11.4% accuracy gain across three in-domain tasks, and a 5.7% improvement on five out-of-domain benchmarks. These findings highlight the promise of curriculum-guided RL in advancing robust, clinically-relevant reasoning in medical multimodal language models.
zh

[AI-145] OptiMindTune: A Multi-Agent Framework for Intelligent Hyperparameter Optimization

【速读】:该论文旨在解决机器学习模型开发中的超参数优化(Hyperparameter Optimization, HPO)问题,该问题在模型性能和泛化能力方面具有重要影响,但传统方法在高维性、复杂依赖性和计算成本方面面临显著挑战。解决方案的关键在于提出一种名为OptiMindTune的多智能体框架,该框架通过三个专用AI代理——推荐代理、评估代理和决策代理——的协作智能实现高效优化,这些代理均基于Google的Gemini模型,分别负责模型选择、超参数建议、稳健评估和战略决策,从而通过动态交互与知识共享,更快速且稳健地收敛到最优超参数配置。

链接: https://arxiv.org/abs/2505.19205
作者: Meher Bhaskar Madiraju,Meher Sai Preetam Madiraju
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 7 pages, 2 tables

点击查看摘要

Abstract:Hyperparameter optimization (HPO) is a critical yet challenging aspect of machine learning model development, significantly impacting model performance and generalization. Traditional HPO methods often struggle with high dimensionality, complex interdependencies, and computational expense. This paper introduces OptiMindTune, a novel multi-agent framework designed to intelligently and efficiently optimize hyperparameters. OptiMindTune leverages the collaborative intelligence of three specialized AI agents – a Recommender Agent, an Evaluator Agent, and a Decision Agent – each powered by Google’s Gemini models. These agents address distinct facets of the HPO problem, from model selection and hyperparameter suggestion to robust evaluation and strategic decision-making. By fostering dynamic interactions and knowledge sharing, OptiMindTune aims to converge to optimal hyperparameter configurations more rapidly and robustly than existing single-agent or monolithic approaches. Our framework integrates principles from advanced large language models, and adaptive search to achieve scalable and intelligent AutoML. We posit that this multi-agent paradigm offers a promising avenue for tackling the increasing complexity of modern machine learning model tuning.
zh

[AI-146] EnvSDD: Benchmarking Environmental Sound Deepfake Detection INTERSPEECH2025

【速读】:该论文试图解决环境声音深度伪造检测(environmental sound deepfake detection)中存在的数据规模小和音频类型有限的问题,以及现有针对语音和歌唱声的深度伪造检测方法在真实环境声音上的有效性不足的问题。解决方案的关键在于引入了EnvSDD,这是首个大规模的、经过精心筛选的环境声音深度伪造检测数据集,包含45.25小时的真实音频和316.74小时的伪造音频,并基于预训练的音频基础模型提出了一个有效的音频深度伪造检测系统,实验结果表明该系统在EnvSDD上的表现优于语音和歌唱领域现有的最先进系统。

链接: https://arxiv.org/abs/2505.19203
作者: Han Yin,Yang Xiao,Rohan Kumar Das,Jisheng Bai,Haohe Liu,Wenwu Wang,Mark D Plumbley
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.
zh

[AI-147] Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and Guidance

【速读】:该论文旨在解决从非结构化财务文件中提取结构化和定量洞察信息的问题,这一过程在投资研究中至关重要,但传统方法依赖于耗时且资源密集的人工处理,限制了可扩展性并延迟了研究流程。论文提出的解决方案关键在于构建一个由大型语言模型组成的多智能体系统,该系统包含两个专用智能体:提取智能体和文本到SQL智能体。提取智能体能够自动识别关键绩效指标并进行格式标准化与准确性验证,而文本到SQL智能体则能将自然语言查询转换为可执行的SQL语句,从而实现对结构化数据的精准访问。

链接: https://arxiv.org/abs/2505.19197
作者: Chanyeol Choi,Jihoon Kwon,Minjae Kim,Juneha Hwang,Minsoo Ha,Chaewoon Kim,Jaeseon Ha,Suyeol Yun,Jin Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, FinIR’25

点击查看摘要

Abstract:Extracting structured and quantitative insights from unstructured financial filings is essential in investment research, yet remains time-consuming and resource-intensive. Conventional approaches in practice rely heavily on labor-intensive manual processes, limiting scalability and delaying the research workflow. In this paper, we propose an efficient and scalable method for accurately extracting quantitative insights from unstructured financial documents, leveraging a multi-agent system composed of large language models. Our proposed multi-agent system consists of two specialized agents: the \emphExtraction Agent and the \emphText-to-SQL Agent. The \textitExtraction Agent automatically identifies key performance indicators from unstructured financial text, standardizes their formats, and verifies their accuracy. On the other hand, the \textitText-to-SQL Agent generates executable SQL statements from natural language queries, allowing users to access structured data accurately without requiring familiarity with the database schema. Through experiments, we demonstrate that our proposed system effectively transforms unstructured text into structured data accurately and enables precise retrieval of key information. First, we demonstrate that our system achieves approximately 95% accuracy in transforming financial filings into structured data, matching the performance level typically attained by human annotators. Second, in a human evaluation of the retrieval task – where natural language queries are used to search information from structured data – 91% of the responses were rated as correct by human evaluators. In both evaluations, our system generalizes well across financial document types, consistently delivering reliable performance.
zh

[AI-148] Curvature Dynamic Black-box Attack: revisiting adversarial robustness via dynamic curvature estimation

【速读】:该论文试图解决深度学习模型在对抗攻击下的脆弱性问题,特别是通过分析决策边界曲率与对抗鲁棒性之间的关系来提升模型的防御能力。其解决方案的关键在于提出一种新的查询高效方法——动态曲率估计(Dynamic Curvature Estimation, DCE),用于在黑盒设置下估计决策边界曲率,并基于此构建了改进的对抗攻击方法——曲率动态黑盒攻击(Curvature Dynamic Black-box Attack, CDBA)。

链接: https://arxiv.org/abs/2505.19194
作者: Peiran Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adversarial attack reveals the vulnerability of deep learning models. For about a decade, countless attack and defense methods have been proposed, leading to robustified classifiers and better understanding of models. Among these methods, curvature-based approaches have attracted attention because it is assumed that high curvature may give rise to rough decision boundary. However, the most commonly used \textitcurvature is the curvature of loss function, scores or other parameters from within the model as opposed to decision boundary curvature, since the former can be relatively easily formed using second order derivative. In this paper, we propose a new query-efficient method, dynamic curvature estimation(DCE), to estimate the decision boundary curvature in a black-box setting. Our approach is based on CGBA, a black-box adversarial attack. By performing DCE on a wide range of classifiers, we discovered, statistically, a connection between decision boundary curvature and adversarial robustness. We also propose a new attack method, curvature dynamic black-box attack(CDBA) with improved performance using the dynamically estimated curvature.
zh

[AI-149] Investigating Pedagogical Teacher and Student LLM Agents : Genetic Adaptation Meets Retrieval Augmented Generation Across Learning Style

【速读】:该论文旨在解决教育中教师教学策略难以适应学生多样化认知和行为特征的问题,以及现有模拟框架在学生建模和教师策略自适应方面的局限性。其解决方案的关键在于引入一个集成基于大型语言模型(Large Language Model, LLM)的异构学生代理与自优化教师代理的新型模拟框架,其中教师代理通过遗传算法动态演化教学策略,以适应不同学习者的表现;同时提出Persona-RAG模块,通过检索增强生成技术实现针对个体学习风格的知识检索,从而提升模拟教育场景的真实性与个性化水平。

链接: https://arxiv.org/abs/2505.19173
作者: Debdeep Sanyal,Agniva Maiti,Umakanta Maharana,Dhruv Kumar,Ankur Mali,C. Lee Giles,Murari Mandal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 38 Pages

点击查看摘要

Abstract:Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students, a persistent challenge in education and teacher training. While Large Language Models (LLMs) offer promise as tools to simulate such complex pedagogical environments, current simulation frameworks are limited in two key respects: (1) they often reduce students to static knowledge profiles, and (2) they lack adaptive mechanisms for modeling teachers who evolve their strategies in response to student feedback. To address these gaps, \textbfwe introduce a novel simulation framework that integrates LLM-based heterogeneous student agents with a self-optimizing teacher agent. The teacher agent’s pedagogical policy is dynamically evolved using a genetic algorithm, allowing it to discover and refine effective teaching strategies based on the aggregate performance of diverse learners. In addition, \textbfwe propose Persona-RAG, a Retrieval Augmented Generation module that enables student agents to retrieve knowledge tailored to their individual learning styles. Persona-RAG preserves the retrieval accuracy of standard RAG baselines while enhancing personalization, an essential factor in modeling realistic educational scenarios. Through extensive experiments, we demonstrate how our framework supports the emergence of distinct and interpretable teaching patterns when interacting with varied student populations. Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in controlled, data-driven environments.
zh

[AI-150] Amplifying Human Creativity and Problem Solving with AI Through Generative Collective Intelligence

【速读】:该论文试图解决传统算法在问题解决和决策制定中的局限性,即无法有效整合人类推理与AI模型的优势。其解决方案的关键在于提出一种名为生成式集体智能(Generative Collective Intelligence, GCI)的新框架,该框架将AI置于群体和社会层面,使其在双重角色中发挥作用:作为交互代理和作为知识积累、组织与利用的技术工具。通过构建人类推理与AI模型之间的认知桥梁,GCI能够克服纯算法方法的不足,并将AI重新定义为一种社会文化技术,以结构化协作方式帮助群体超越传统沟通障碍解决复杂问题。

链接: https://arxiv.org/abs/2505.19167
作者: Thomas P. Kehler,Scott E. Page,Alex Pentland,Martin Reeves,John Seely Brown
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a new framework for human-AI collaboration that amplifies the distinct capabilities of both. This framework, which we call Generative Collective Intelligence (GCI), shifts AI to the group/social level and employs AI in dual roles: as interactive agents and as technology that accumulates, organizes, and leverages knowledge. By creating a cognitive bridge between human reasoning and AI models, GCI can overcome the limitations of purely algorithmic approaches to problem-solving and decision-making. The framework demonstrates how AI can be reframed as a social and cultural technology that enables groups to solve complex problems through structured collaboration that transcends traditional communication barriers. We describe the mathematical foundations of GCI based on comparative judgment and minimum regret principles, and illustrate its applications across domains including climate adaptation, healthcare transformation, and civic participation. By combining human creativity with AI’s computational capabilities, GCI offers a promising approach to addressing complex societal challenges that neither human or machines can solve alone.
zh

[AI-151] OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLM s

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在企业环境中理解和遵循组织层级结构及权限约束的能力问题,特别是其在处理复杂、多层级权限时的合规性问题。解决方案的关键在于构建了一个名为\textbfOrgAccess的合成但具有代表性的基准测试集,包含40种不同类型的权限,并设计了三种难度级别的测试用例(易、中、难),以评估LLMs在面对多权限冲突和层级规则时的准确性与合规性。通过这一基准,研究揭示了当前先进LLMs在遵循复杂规则和组合推理方面存在显著不足。

链接: https://arxiv.org/abs/2505.19165
作者: Debdeep Sanyal Umakanta Maharana,Yash Sinha,Hong Ming Tan,Shirish Karande,Mohan Kankanhalli,Murari Mandal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 56 Pages

点击查看摘要

Abstract:Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical, yet under explored, challenge emerges: \textitcan these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions? Evaluating this crucial capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce a synthetic yet representative \textbfOrgAccess benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels. We further create three types of permissions: 40,000 easy (1 permission), 10,000 medium (3-permissions tuple), and 20,000 hard (5-permissions tuple) to test LLMs’ ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions. Our findings reveal that even state-of-the-art LLMs struggle significantly to maintain compliance with role-based structures, even with explicit instructions, with their performance degrades further when navigating interactions involving two or more conflicting permissions. Specifically, even \textbfGPT-4.1 only achieves an F1-Score of 0.27 on our hardest benchmark. This demonstrates a critical limitation in LLMs’ complex rule following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments.
zh

[AI-152] BroadGen: A Framework for Generating Effective and Efficient Advertiser Broad Match Keyphrase Recommendations

【速读】:该论文旨在解决赞助搜索广告中关键词推荐的问题,特别是针对精确匹配类型带来的高管理成本、有限的定向范围以及搜索查询模式变化等问题。其解决方案的关键在于提出BroadGen框架,该框架通过利用历史搜索查询数据,推荐高效且有效的广泛匹配关键词,并通过令牌对应建模保持查询随时间的稳定性,从而提升广告定向的准确性和效率。

链接: https://arxiv.org/abs/2505.19164
作者: Ashirbad Mishra,Jinyu Zhao,Soumik Dey,Hansi Wu,Binbin Li,Kamesh Madduri
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the domain of sponsored search advertising, the focus of Keyphrase recommendation has largely been on exact match types, which pose issues such as high management expenses, limited targeting scope, and evolving search query patterns. Alternatives like Broad match types can alleviate certain drawbacks of exact matches but present challenges like poor targeting accuracy and minimal supervisory signals owing to limited advertiser usage. This research defines the criteria for an ideal broad match, emphasizing on both efficiency and effectiveness, ensuring that a significant portion of matched queries are relevant. We propose BroadGen, an innovative framework that recommends efficient and effective broad match keyphrases by utilizing historical search query data. Additionally, we demonstrate that BroadGen, through token correspondence modeling, maintains better query stability over time. BroadGen’s capabilities allow it to serve daily, millions of sellers at eBay with over 2.3 billion items.
zh

[AI-153] CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning

【速读】:该论文旨在解决文本到语音(TTS)语音克隆技术带来的隐私泄露问题,即通过极短的参考音频即可高精度复制说话人的语音身份,从而威胁用户隐私安全。其解决方案的关键在于提出CloneShield,一种针对零样本语音克隆的通用时域对抗扰动框架。该方法通过多目标优化问题建模扰动生成,并采用多梯度下降算法(MGDA)确保对多种语音样本的鲁棒防护;同时,利用梅尔频谱图分解对抗扰动并进行样本级微调,以在保持音频自然感知的同时显著降低克隆语音的说话人相似性和语音质量。

链接: https://arxiv.org/abs/2505.19119
作者: Renyuan Li,Zhibo Liang,Haichuan Zhang,Tianyu Shi,Zhiyuan Cheng,Jia Shi,Carl Yang,Mingjie Tang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 10pages, 4figures

点击查看摘要

Abstract:Recent breakthroughs in text-to-speech (TTS) voice cloning have raised serious privacy concerns, allowing highly accurate vocal identity replication from just a few seconds of reference audio, while retaining the speaker’s vocal authenticity. In this paper, we introduce CloneShield, a universal time-domain adversarial perturbation framework specifically designed to defend against zero-shot voice cloning. Our method provides protection that is robust across speakers and utterances, without requiring any prior knowledge of the synthesized text. We formulate perturbation generation as a multi-objective optimization problem, and propose Multi-Gradient Descent Algorithm (MGDA) to ensure the robust protection across diverse utterances. To preserve natural auditory perception for users, we decompose the adversarial perturbation via Mel-spectrogram representations and fine-tune it for each sample. This design ensures imperceptibility while maintaining strong degradation effects on zero-shot cloned outputs. Experiments on three state-of-the-art zero-shot TTS systems, five benchmark datasets and evaluations from 60 human listeners demonstrate that our method preserves near-original audio quality in protected inputs (PESQ = 3.90, SRS = 0.93) while substantially degrading both speaker similarity and speech quality in cloned samples (PESQ = 1.07, SRS = 0.08).
zh

[AI-154] FP4 All the Way: Fully Quantized Training of LLM s

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在训练过程中对高精度浮点计算依赖过高的问题,通过使用主要为4位浮点(FP4)精度的全量化训练(Fully Quantized Training, FQT)来提升训练效率。其解决方案的关键在于优化FP4格式的设计,包括块大小、缩放格式和舍入方法,并采用随机舍入(stochastic rounding)以增强反向传播和参数更新的稳定性,同时在前向传播中使用四舍五入(round-to-nearest)以保持数值准确性。此外,研究还明确了量化训练的有效性阈值,即当梯度范数低于约√3倍的量化噪声时,量化训练效果会显著下降。基于这些关键设计,作者成功在256个Intel Gaudi2加速器上训练了一个70亿参数的模型,验证了FP4训练在性能和效率上的可行性。

链接: https://arxiv.org/abs/2505.19115
作者: Brian Chmiel,Maxim Fishman,Ron Banner,Daniel Soudry
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately \sqrt3 times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in this https URL .
zh

[AI-155] SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics Reasoning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在物理问题推理中对视觉理解能力不足的问题。其解决方案的关键在于构建一个大规模多模态基准测试集SeePhys,该基准覆盖了从中学到博士资格考试的物理问题,包含7个基础领域和21种高度异构的图表类型,并强调了视觉信息在解决问题中的必要性(75%的问题需要视觉信息提取才能正确解答)。通过这一基准,研究揭示了现有模型在图示解析与物理推理之间建立严格耦合以及摆脱对文本线索依赖方面的挑战。

链接: https://arxiv.org/abs/2505.19099
作者: Kun Xiang,Heng Li,Terry Jingchen Zhang,Yinya Huang,Zirong Liu,Peixin Qu,Jixi He,Jiaqi Chen,Yu-Jie Yuan,Jianhua Han,Hang Xu,Hanhui Li,Mrinmaya Sachan,Xiaodan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 46 pages

点击查看摘要

Abstract:We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models’ visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
zh

[AI-156] Enable Lightweight and Precision-Scalable Posit/IEEE-754 Arithmetic in RISC-V Cores for Transprecision Computing

【速读】:该论文试图解决在RISC-V处理器中采用posit格式时面临的轻量级、精度可扩展且符合IEEE-754标准的硬件实现统一解决方案缺失的问题。其关键解决方案包括:1)将专用posit编解码器集成到原始浮点运算单元(FPU)中以实现轻量级设计;2)引入多精度/混合精度支持并动态调整指数位宽以实现精度可扩展性;3)复用和定制指令集架构(ISA)扩展以实现IEEE-754兼容的posit操作。

链接: https://arxiv.org/abs/2505.19096
作者: Qiong Li,Chao Fang,Longwei Huang,Jun Lin,Zhongfeng Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Work in Progress

点击查看摘要

Abstract:While posit format offers superior dynamic range and accuracy for transprecision computing, its adoption in RISC-V processors is hindered by the lack of a unified solution for lightweight, precision-scalable, and IEEE-754 arithmetic compatible hardware implementation. To address these challenges, we enhance RISC-V processors by 1) integrating dedicated posit codecs into the original FPU for lightweight implementation, 2) incorporating multi/mixed-precision support with dynamic exponent size for precision-scalability, and 3) reusing and customizing ISA extensions for IEEE-754 compatible posit operations. Our comprehensive evaluation spans the modified FPU, RISC-V core, and SoC levels. It demonstrates that our implementation achieves 47.9% LUTs and 57.4% FFs reduction compared to state-of-the-art posit-enabled RISC-V processors, while achieving up to 2.54 \times throughput improvement in various GEMM kernels.
zh

[AI-157] ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World

【速读】:该论文旨在解决现有基于大语言模型(Large Language Models, LLMs)或视觉-语言模型(Vision-Language Models, VLMs)的图形用户界面(Graphical User Interface, GUI)代理在新颖环境中的泛化能力不足以及依赖人工标注多样化数据集的问题。其解决方案的关键在于引入ScreenExplorer,这是一个通过Group Relative Policy Optimization (GRPO) 在真实、动态和开放的GUI环境中训练的VLM,并创新性地采用基于世界模型的内在好奇心奖励函数,以帮助代理克服探索的冷启动阶段,同时通过经验流蒸馏进一步提升模型的探索能力。

链接: https://arxiv.org/abs/2505.19095
作者: Runliang Niu,Jinglong Ji,Yi Chang,Qi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization(GRPO) in real, dynamic, and open-ended GUI environments. Innovatively, we introduced a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model’s exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to static deployment models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.
zh

[AI-158] Reinforced Latent Reasoning for LLM -based Recommendation

【速读】:该论文试图解决在推荐系统中应用大型语言模型(Large Language Models, LLMs)时,因依赖显式链式思维(Chain-of-Thought, CoT)数据而导致的高质量数据获取困难和推理延迟高的问题。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的端到端训练框架——LatentR³,该框架通过隐式、信息密集的潜在推理代替显式的CoT生成,从而消除对显式CoT数据的依赖并提升推理效率。

链接: https://arxiv.org/abs/2505.19092
作者: Yang Zhang,Wenxin Xu,Xiaoyan Zhao,Wenjie Wang,Fuli Feng,Xiangnan He,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks, sparking growing interest in their application to preference reasoning in recommendation systems. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. However, these methods face significant practical limitations due to (1) the difficulty of obtaining high-quality CoT data in recommendation and (2) the high inference latency caused by generating CoT reasoning. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning. This approach eliminates the need for explicit CoT generation and improves inference efficiency, as a small set of latent tokens can effectively capture the entire reasoning process. Building on this idea, we propose \textit\underlineReinforced \underlineLatent \underlineReasoning for \underlineRecommendation (LatentR ^3 ), a novel end-to-end training framework that leverages reinforcement learning (RL) to optimize latent reasoning without relying on any CoT this http URL ^3 adopts a two-stage training strategy: first, supervised fine-tuning to initialize the latent reasoning module, followed by pure RL training to encourage exploration through a rule-based reward design. Our RL implementation is based on a modified GRPO algorithm, which reduces computational overhead during training and introduces continuous reward signals for more efficient learning. Extensive experiments demonstrate that LatentR ^3 enables effective latent reasoning without any direct supervision of the reasoning process, significantly improving performance when integrated with different LLM-based recommendation methods. Our codes are available at this https URL.
zh

[AI-159] MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation

【速读】:该论文试图解决在物理仿真动画系统中,如何实现对人机耦合系统的高阶目标进行灵活控制的问题,当前方法在全肢体灵巧操作中虽取得一定成功,但其控制范式(如详细的运动学跟踪、连续物体轨迹跟随或直接VR远程操作)在跨整个系统的目标定义上存在局限性。解决方案的关键在于提出一种统一的生成策略——MaskedManipulator,该策略通过两阶段学习方法实现:首先训练一个跟踪控制器以从大规模人体动捕数据集中物理重建复杂的人机交互,随后将该控制器知识蒸馏到MaskedManipulator中,从而允许用户通过直观的高层目标(如目标物体姿态、关键角色姿态)指定复杂的移动与操作任务,并由系统合成必要的全身动作以达成目标。

链接: https://arxiv.org/abs/2505.19086
作者: Chen Tessler,Yifeng Jiang,Erwin Coumans,Zhengyi Luo,Gal Chechik,Xue Bin Peng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Humans interact with their world while leveraging precise full-body control to achieve versatile goals. This versatility allows them to solve long-horizon, underspecified problems, such as placing a cup in a sink, by seamlessly sequencing actions like approaching the cup, grasping, transporting it, and finally placing it in the sink. Such goal-driven control can enable new procedural tools for animation systems, enabling users to define partial objectives while the system naturally ``fills in’’ the intermediate motions. However, while current methods for whole-body dexterous manipulation in physics-based animation achieve success in specific interaction tasks, they typically employ control paradigms (e.g., detailed kinematic motion tracking, continuous object trajectory following, or direct VR teleoperation) that offer limited versatility for high-level goal specification across the entire coupled human-object system. To bridge this gap, we present MaskedManipulator, a unified and generative policy developed through a two-stage learning approach. First, our system trains a tracking controller to physically reconstruct complex human-object interactions from large-scale human mocap datasets. This tracking controller is then distilled into MaskedManipulator, which provides users with intuitive control over both the character’s body and the manipulated object. As a result, MaskedManipulator enables users to specify complex loco-manipulation tasks through intuitive high-level objectives (e.g., target object poses, key character stances), and MaskedManipulator then synthesizes the necessary full-body actions for a physically simulated humanoid to achieve these goals, paving the way for more interactive and life-like virtual characters.
zh

[AI-160] An Initial Exploration of Fine-tuning Small Language Models for Smart Contract Reentrancy Vulnerability Detection

【速读】:该论文试图解决在智能合约漏洞检测中,尤其是针对Solidity智能合约中的重入漏洞(reentrancy bug)检测问题,如何有效利用较小的语言模型(smaller language models)来替代大型语言模型(Large Language Models, LLMs)以提高计算效率和适用性。解决方案的关键在于对较小的语言模型进行微调(fine-tuning),使其能够在开发者的本地环境中高效运行,并实现合理的漏洞检测效果。

链接: https://arxiv.org/abs/2505.19059
作者: Ignacio Mariano Andreozzi Pofcher,Joshua Ellul
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are being used more and more for various coding tasks, including to help coders identify bugs and are a promising avenue to support coders in various tasks including vulnerability detection – particularly given the flexibility of such generative AI models and tools. Yet for many tasks it may not be suitable to use LLMs, for which it may be more suitable to use smaller language models that can fit and easily execute and train on a developer’s computer. In this paper we explore and evaluate whether smaller language models can be fine-tuned to achieve reasonable results for a niche area: vulnerability detection – specifically focusing on detecting the reentrancy bug in Solidity smart contracts.
zh

[AI-161] Smart Waste Management System for Makkah City using Artificial Intelligence and Internet of Things

【速读】:该论文旨在解决大型宗教活动期间,如沙特阿拉伯麦加的年度朝觐活动,所产生的大量且不可预测的垃圾管理问题。传统基于固定收集时间表的方法已无法有效应对这一挑战。该研究提出了一种名为TUHR的智能垃圾管理系统,其关键在于利用物联网和人工智能技术,通过超声波传感器实时监测垃圾桶内的垃圾水平,并在容器满载时向相关部门发出警报,同时能够检测有害气体等危险物质,从而实现主动、动态的垃圾管理,提升环境卫生与安全性,并优化资源使用效率。

链接: https://arxiv.org/abs/2505.19040
作者: Rawabi S. Al Qurashi,Maram M. Almnjomi,Teef L. Alghamdi,Amjad H. Almalki,Shahad S. Alharthi,Shahad M. althobuti,Alanoud S. Alharthi,Maha A. Thafar
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Waste management is a critical global issue with significant environmental and public health implications. It has become more destructive during large-scale events such as the annual pilgrimage to Makkah, Saudi Arabia, one of the world’s largest religious gatherings. This event’s popularity has attracted millions worldwide, leading to significant and un-predictable accumulation of waste. Such a tremendous number of visitors leads to in-creased waste management issues at the Grand Mosque and other holy sites, highlighting the need for an effective solution other than traditional methods based on rigid collection schedules. To address this challenge, this research proposed an innovative solution that is context-specific and tailored to the unique requirements of pilgrimage season: a Smart Waste Management System, called TUHR, that utilizes the Internet of Things and Artificial Intelligence. This system encompasses ultrasonic sensors that monitor waste levels in each container at the performance sites. Once the container reaches full capacity, the sensor communicates with the microcontroller, which alerts the relevant authorities. Moreover, our system can detect harmful substances such as gas from the gas detector sensor. Such a proactive and dynamic approach promises to mitigate the environmental and health risks associated with waste accumulation and enhance the cleanliness of these sites. It also delivers economic benefits by reducing unnecessary gasoline consumption and optimizing waste management resources. Importantly, this research aligns with the principles of smart cities and exemplifies the innovative, sustainable, and health-conscious approach that Saudi Arabia is implementing as part of its Vision 2030 initiative. Comments: 10 pages, 5 figures Subjects: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computers and Society (cs.CY) Cite as: arXiv:2505.19040 [cs.ET] (or arXiv:2505.19040v1 [cs.ET] for this version) https://doi.org/10.48550/arXiv.2505.19040 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-162] urb-L1: Achieving Long-term Turbulence Tracing By Tackling Spectral Bias

【速读】:该论文旨在解决长期湍流演化预测中的关键难题,即现有深度学习方法在长时间序列预测中存在过度平滑、无法准确捕捉复杂流体动力学的问题。论文指出,这一问题的核心原因是模型在训练过程中存在谱偏差(Spectral Bias),即模型倾向于优先学习低频平滑特征而忽略关键的高频细节,从而降低预测保真度并导致物理失真。解决方案的关键在于提出一种名为Turb-L1的新方法,其核心是通过多网格架构中的分层动力学合成机制,显式克服谱偏差,从而准确捕捉跨尺度相互作用并保持高频动力学的保真度,实现对湍流演化的可靠长期跟踪。

链接: https://arxiv.org/abs/2505.19038
作者: Hao Wu,Yuan Gao,Ruiqi Shu,Zean Han,Fan Xu,Zhihong Zhu,Qingsong Wen,Xian Wu,Kun Wang,Xiaomeng Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:

点击查看摘要

Abstract:Accurately predicting the long-term evolution of turbulence is crucial for advancing scientific understanding and optimizing engineering applications. However, existing deep learning methods face significant bottlenecks in long-term autoregressive prediction, which exhibit excessive smoothing and fail to accurately track complex fluid dynamics. Our extensive experimental and spectral analysis of prevailing methods provides an interpretable explanation for this shortcoming, identifying Spectral Bias as the core obstacle. Concretely, spectral bias is the inherent tendency of models to favor low-frequency, smooth features while overlooking critical high-frequency details during training, thus reducing fidelity and causing physical distortions in long-term predictions. Building on this insight, we propose Turb-L1, an innovative turbulence prediction method, which utilizes a Hierarchical Dynamics Synthesis mechanism within a multi-grid architecture to explicitly overcome spectral bias. It accurately captures cross-scale interactions and preserves the fidelity of high-frequency dynamics, enabling reliable long-term tracking of turbulence evolution. Extensive experiments on the 2D turbulence benchmark show that Turb-L1 demonstrates excellent performance: (I) In long-term predictions, it reduces Mean Squared Error (MSE) by 80.3% and increases Structural Similarity (SSIM) by over 9\times compared to the SOTA baseline, significantly improving prediction fidelity. (II) It effectively overcomes spectral bias, accurately reproducing the full enstrophy spectrum and maintaining physical realism in high-wavenumber regions, thus avoiding the spectral distortions or spurious energy accumulation seen in other methods.
zh

[AI-163] RECAST: Strengthening LLM s Complex Instruction Following with Constraint-Verifiable Data

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在处理具有多个显式约束条件(通常超过10个约束)的复杂指令时表现不佳的问题。解决方案的关键在于提出RECAST框架,该框架通过从真实世界提示-响应对中提取约束来合成数据集,使得每个示例包含比现有基准更多的约束,并利用基于规则的验证器和基于LLM的验证器实现约束满足性的自动验证,从而构建出高质量的RECAST-30K数据集,提升模型对复杂指令的遵循能力。

链接: https://arxiv.org/abs/2505.19030
作者: Wenhao Liu,Zhengkang Guo,Mingchen Xie,Jingwen Xu,Zisu Huang,Muzhao Tian,Jianhan Xu,Muling Wu,Xiaohua Wang,Changze Lv,He-Da Wang,Hu Yao,Xiaoqing Zheng,Xuanjing Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly expected to tackle complex tasks, driven by their expanding applications and users’ growing proficiency in crafting sophisticated prompts. However, as the number of explicitly stated requirements increases (particularly more than 10 constraints), LLMs often struggle to accurately follow such complex instructions. To address this challenge, we propose RECAST, a novel framework for synthesizing datasets where each example incorporates far more constraints than those in existing benchmarks. These constraints are extracted from real-world prompt-response pairs to ensure practical relevance. RECAST enables automatic verification of constraint satisfaction via rule-based validators for quantitative constraints and LLM-based validators for qualitative ones. Using this framework, we construct RECAST-30K, a large-scale, high-quality dataset comprising 30k instances spanning 15 constraint types. Experimental results demonstrate that models fine-tuned on RECAST-30K show substantial improvements in following complex instructions. Moreover, the verifiability provided by RECAST enables the design of reward functions for reinforcement learning, which further boosts model performance on complex and challenging tasks.
zh

[AI-164] HGCL: Hierarchical Graph Contrastive Learning for User-Item Recommendation

【速读】:该论文试图解决现有图对比学习(Graph Contrastive Learning, GCL)方法在用户-物品推荐中缺乏对层级物品结构的显式建模问题,而这种层级结构反映了物品在不同分辨率下的相似性,是提升推荐精度的重要信号。解决方案的关键在于提出一种名为层级图对比学习(Hierarchical Graph Contrastive Learning, HGCL)的新方法,通过跨层对比学习预训练GCL模块获取用户和物品表示,并利用表示压缩与聚类构建双层级用户-物品二部图,最终在层级图上微调表示以提升推荐效果。

链接: https://arxiv.org/abs/2505.19020
作者: Jiawei Xue,Zhen Yang,Haitao Lin,Ziji Zhang,Luzhu Wang,Yikun Gu,Yao Xu,Xin Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Graph Contrastive Learning (GCL), which fuses graph neural networks with contrastive learning, has evolved as a pivotal tool in user-item recommendations. While promising, existing GCL methods often lack explicit modeling of hierarchical item structures, which represent item similarities across varying resolutions. Such hierarchical item structures are ubiquitous in various items (e.g., online products and local businesses), and reflect their inherent organizational properties that serve as critical signals for enhancing recommendation accuracy. In this paper, we propose Hierarchical Graph Contrastive Learning (HGCL), a novel GCL method that incorporates hierarchical item structures for user-item recommendations. First, HGCL pre-trains a GCL module using cross-layer contrastive learning to obtain user and item representations. Second, HGCL employs a representation compression and clustering method to construct a two-hierarchy user-item bipartite graph. Ultimately, HGCL fine-tunes user and item representations by learning on the hierarchical graph, and then provides recommendations based on user-item interaction scores. Experiments on three widely adopted benchmark datasets ranging from 70K to 382K nodes confirm the superior performance of HGCL over existing baseline models, highlighting the contribution of hierarchical item structures in enhancing GCL methods for recommendation tasks.
zh

[AI-165] Faithful Group Shapley Value

【速读】:该论文试图解决群体级数据估值中面临的“壳公司攻击”问题,即通过策略性地拆分数据组来不公平地提高数据估值。解决方案的关键在于提出了一种忠实的群体Shapley值(Faithful Group Shapley Value, FGSV),该方法在数学上具有独特性,能够有效防御此类攻击,并基于此开发了一个可证明快速且精确的近似算法,以实现高效的群体级数据估值。

链接: https://arxiv.org/abs/2505.19013
作者: Kiljae Lee,Ziqi Liu,Weijing Tang,Yuan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Economics (econ.GN); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Data Shapley is an important tool for data valuation, which quantifies the contribution of individual data points to machine learning models. In practice, group-level data valuation is desirable when data providers contribute data in batch. However, we identify that existing group-level extensions of Data Shapley are vulnerable to shell company attacks, where strategic group splitting can unfairly inflate valuations. We propose Faithful Group Shapley Value (FGSV) that uniquely defends against such attacks. Building on original mathematical insights, we develop a provably fast and accurate approximation algorithm for computing FGSV. Empirical experiments demonstrate that our algorithm significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy, while ensuring faithful group-level valuation.
zh

[AI-166] Aligning LLM with human travel choices: a persona-based embedding learning approach

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)与人类出行选择行为之间的行为不一致问题,这一问题限制了LLMs在出行需求建模中的应用。论文提出的解决方案的关键在于设计一种新颖的框架,通过人格特征推断与加载过程,利用适合的提示(prompt)对LLMs进行条件化以增强其与人类行为的对齐度。该框架首先从实证数据中建立一组基础人格特征,随后通过由行为嵌入驱动的人格特征加载函数指导加载过程,从而提升模型预测的准确性与可解释性。

链接: https://arxiv.org/abs/2505.19003
作者: Tianming Liu,Manzi Li,Yafeng Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 8 figures

点击查看摘要

Abstract:The advent of large language models (LLMs) presents new opportunities for travel demand modeling. However, behavioral misalignment between LLMs and humans presents obstacles for the usage of LLMs, and existing alignment methods are frequently inefficient or impractical given the constraints of typical travel demand data. This paper introduces a novel framework for aligning LLMs with human travel choice behavior, tailored to the current travel demand data sources. Our framework uses a persona inference and loading process to condition LLMs with suitable prompts to enhance alignment. The inference step establishes a set of base personas from empirical data, and a learned persona loading function driven by behavioral embeddings guides the loading process. We validate our framework on the Swissmetro mode choice dataset, and the results show that our proposed approach significantly outperformed baseline choice models and LLM-based simulation models in predicting both aggregate mode choice shares and individual choice outcomes. Furthermore, we showcase that our framework can generate insights on population behavior through interpretable parameters. Overall, our research offers a more adaptable, interpretable, and resource-efficient pathway to robust LLM-based travel behavior simulation, paving the way to integrate LLMs into travel demand modeling practice in the future.
zh

[AI-167] Semi-pessimistic Reinforcement Learning

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning)中由于分布偏移(distributional shift)导致的策略学习效果不佳问题,以及在实际应用中标签化奖励数据稀缺所带来的限制。其解决方案的关键在于利用大量未标记数据,提出了一种半悲观强化学习方法(semi-pessimistic RL),通过寻求奖励函数的下界而非Q函数或状态转移函数的下界,简化了学习过程,并提升了策略的泛化能力。该方法具有较高的灵活性,能够与多种无模型和有模型的强化学习算法结合,并在使用大量未标记数据时保证策略的改进。

链接: https://arxiv.org/abs/2505.19002
作者: Jin Zhu,Xin Zhou,Jiaang Yao,Gholamali Aminian,Omar Rivasplata,Simon Little,Lexin Li,Chengchun Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) aims to learn an optimal policy from pre-collected data. However, it faces challenges of distributional shift, where the learned policy may encounter unseen scenarios not covered in the offline data. Additionally, numerous applications suffer from a scarcity of labeled reward data. Relying on labeled data alone often leads to a narrow state-action distribution, further amplifying the distributional shift, and resulting in suboptimal policy learning. To address these issues, we first recognize that the volume of unlabeled data is typically substantially larger than that of labeled data. We then propose a semi-pessimistic RL method to effectively leverage abundant unlabeled data. Our approach offers several advantages. It considerably simplifies the learning process, as it seeks a lower bound of the reward function, rather than that of the Q-function or state transition function. It is highly flexible, and can be integrated with a range of model-free and model-based RL algorithms. It enjoys the guaranteed improvement when utilizing vast unlabeled data, but requires much less restrictive conditions. We compare our method with a number of alternative solutions, both analytically and numerically, and demonstrate its clear competitiveness. We further illustrate with an application to adaptive deep brain stimulation for Parkinson’s disease.
zh

[AI-168] GraSS: Scalable Influence Function with Sparse Gradient Compression

【速读】:该论文旨在解决基于梯度的数据归因方法(如影响函数)在大规模模型中由于每个样本梯度计算带来的高计算和内存成本而导致的可扩展性问题。其解决方案的关键在于提出GraSS及其针对线性层的变体FactGraSS,通过显式利用每个样本梯度的固有稀疏性,实现了次线性空间和时间复杂度,从而显著提升了计算效率并保持了数据影响的准确性。

链接: https://arxiv.org/abs/2505.18976
作者: Pingbang Hu,Joseph Melkonian,Weijing Tang,Han Zhao,Jiaqi W. Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at this https URL.
zh

[AI-169] FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization

【速读】:该论文旨在解决将状态空间模型(State Space Models, SSMs),如Mamba2,在资源受限的边缘设备上部署时所面临的挑战,包括线性层中的严重异常值对量化的影响、元素级张量操作的多样性和不规则性,以及SSM模块中硬件不友好的非线性函数。解决方案的关键在于通过Hadamard变换实现线性层的8位量化以消除异常值,并提出一种面向硬件的细粒度二进制幂次量化框架,同时采用一阶线性近似优化非线性函数,从而提升计算效率并降低硬件复杂度。

链接: https://arxiv.org/abs/2505.18975
作者: Aotao Wang,Haikuo Shao,Shaobo Ma,Zhongfeng Wang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:State Space Models (SSMs), like recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices encounters many problems: severe outliers within the linear layer challenging the quantization, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated accelerator on FPGA with hardware-algorithm co-design to promote the deployment efficiency of Mamba2. Specifically, we successfully achieve 8-bit quantization for linear layers through Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly and fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the nonlinear functions. Based on the accurate algorithm quantization, we propose an accelerator that integrates parallel vector processing units, pipelined execution dataflow, and an efficient SSM Nonlinear Approximation Unit, which enhances computational efficiency and reduces hardware complexity. Finally, we evaluate FastMamba on Xilinx VC709 FPGA. For the input prefill task on Mamba2-130M, FastMamba achieves 68.80\times and 8.90\times speedup over Intel Xeon 4210R CPU and NVIDIA RTX 3090 GPU, respectively. In the output decode experiment with Mamba2-2.7B, FastMamba attains 6\times higher energy efficiency than RTX 3090 GPU.
zh

[AI-170] Protein Design with Dynamic Protein Vocabulary

【速读】:该论文试图解决蛋白质设计中功能对齐与结构合理性之间的平衡问题,即在生成具有特定功能的蛋白质序列时,如何确保其结构的可折叠性。解决方案的关键在于引入ProDVa方法,该方法通过整合文本编码器、蛋白质语言模型和片段编码器,动态检索与功能描述相关的天然蛋白质片段,从而提升生成序列的结构合理性。实验结果表明,ProDVa在使用极少训练数据的情况下,能够设计出功能对齐且结构更合理的蛋白质序列。

链接: https://arxiv.org/abs/2505.18966
作者: Nuowei Liu,Jiahao Kuang,Yanting Liu,Changzhi Sun,Tao Ji,Yuanbin Wu,Man Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
zh

[AI-171] Weaver: Interweaving SQL and LLM for Table Reasoning

【速读】:该论文试图解决在包含非结构化数据(如文本或图像)的表格中进行查询时,传统SQL难以处理且需要语义推理的任务问题。解决方案的关键在于提出Weaver,一个模块化流水线,能够动态整合结构化数据检索的SQL与语义处理的大型语言模型(Large Language Models, LLMs),通过生成灵活的分步计划来提升准确性和泛化能力。

链接: https://arxiv.org/abs/2505.18961
作者: Rohit Khoja,Devanshu Gupta,Yanjie Fu,Dan Roth,Vivek Gupta
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLMs typically rely on rigid, predefined work-flows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver , a modular pipeline that dynamically integrates SQL and LLMs for table-based question answering (TableQA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four TableQA datasets, reducing both API calls and error rates.
zh

[AI-172] Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

【速读】:该论文试图解决软件补丁生成中由于单一模型难以处理端到端的复杂任务而导致的补丁生成效率和准确率较低的问题。其解决方案的关键在于提出Co-PatcheR,这是一个基于小型专业化推理模型的协作补丁系统,通过将任务分解为定位、生成和验证等子任务,并针对每个子任务设计专门的模型和训练策略,从而提升整体补丁生成效果。

链接: https://arxiv.org/abs/2505.18955
作者: Yuheng Tang,Hongwei Li,Kaijie Zhu,Michael Yang,Yangruibo Ding,Wenbo Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works started to train specialized patching models. Most works trained one model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As such, by using a 70 billion model, SOTA methods can only reach up to 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technique novelties are the specific task designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves 46% resolved rate on SWE-bench-Verified with only 3 x 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data number, model size, and testing-phase scaling strategy.
zh

[AI-173] SANNet: A Semantic-Aware Agent ic AI Networking Framework for Multi-Agent Cross-Layer Coordination

【速读】:该论文旨在解决Agentic AI networking(AgentNet)在实现多智能体自主协作与任务分配时面临的挑战,特别是不同智能体可能具有不同甚至冲突的目标,从而影响系统整体性能的问题。其解决方案的关键在于提出SANNet架构,该架构具备语义感知能力,能够推断用户语义目标,并自动分配与移动系统不同层级相关的智能体以完成该目标;同时引入基于动态加权的冲突解决机制,以应对多智能体协作中的目标冲突问题,从而在理论上保证了冲突解决和模型泛化性能。

链接: https://arxiv.org/abs/2505.18946
作者: Yong Xiao,Haoran Zhou,Xubo Li,Yayu Gao,Guangming Shi,Ping Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注: submitted to IEEE GLOBECOM’25

点击查看摘要

Abstract:Agentic AI networking (AgentNet) is a novel AI-native networking paradigm that relies on a large number of specialized AI agents to collaborate and coordinate for autonomous decision-making, dynamic environmental adaptation, and complex goal achievement. It has the potential to facilitate real-time network management alongside capabilities for self-configuration, self-optimization, and self-adaptation across diverse and complex networking environments, laying the foundation for fully autonomous networking systems in the future. Despite its promise, AgentNet is still in the early stage of development, and there still lacks an effective networking framework to support automatic goal discovery and multi-agent self-orchestration and task assignment. This paper proposes SANNet, a novel semantic-aware agentic AI networking architecture that can infer the semantic goal of the user and automatically assign agents associated with different layers of a mobile system to fulfill the inferred goal. Motivated by the fact that one of the major challenges in AgentNet is that different agents may have different and even conflicting objectives when collaborating for certain goals, we introduce a dynamic weighting-based conflict-resolving mechanism to address this issue. We prove that SANNet can provide theoretical guarantee in both conflict-resolving and model generalization performance for multi-agent collaboration in dynamic environment. We develop a hardware prototype of SANNet based on the open RAN and 5GS core platform. Our experimental results show that SANNet can significantly improve the performance of multi-agent networking systems, even when agents with conflicting objectives are selected to collaborate for the same goal.
zh

[AI-174] Chi-Square Wavelet Graph Neural Networks for Heterogeneous Graph Anomaly Detection

【速读】:该论文旨在解决异构信息网络(Heterogeneous Information Network, HIN)中的图异常检测(Graph Anomaly Detection, GAD)问题,特别是针对节点和边异质性带来的三个关键挑战:(C1) 跨多样元路径捕捉异常信号和丰富语义;(C2) 在HIN维度对齐中保留高频内容;(C3) 从类别不平衡的困难异常样本中有效学习。其解决方案的关键在于提出一种基于新颖卡方滤波器的谱图神经网络框架ChiGAD,通过多图卡方滤波器、交互式元图卷积以及贡献感知交叉熵损失三个核心组件,分别应对上述挑战。

链接: https://arxiv.org/abs/2505.18934
作者: Xiping Li,Xiangyu Dong,Xingyi Zhang,Kun Xie,Yuanhao Feng,Bo Wang,Guilin Li,Wuxiong Zeng,Xiujun Shu,Sibo Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Graph Anomaly Detection (GAD) in heterogeneous networks presents unique challenges due to node and edge heterogeneity. Existing Graph Neural Network (GNN) methods primarily focus on homogeneous GAD and thus fail to address three key issues: (C1) Capturing abnormal signal and rich semantics across diverse meta-paths; (C2) Retaining high-frequency content in HIN dimension alignment; and (C3) Learning effectively from difficult anomaly samples with class imbalance. To overcome these, we propose ChiGAD, a spectral GNN framework based on a novel Chi-Square filter, inspired by the wavelet effectiveness in diverse domains. Specifically, ChiGAD consists of: (1) Multi-Graph Chi-Square Filter, which captures anomalous information via applying dedicated Chi-Square filters to each meta-path graph; (2) Interactive Meta-Graph Convolution, which aligns features while preserving high-frequency information and incorporates heterogeneous messages by a unified Chi-Square Filter; and (3) Contribution-Informed Cross-Entropy Loss, which prioritizes difficult anomalies to address class imbalance. Extensive experiments on public and industrial datasets show that ChiGAD outperforms state-of-the-art models on multiple metrics. Additionally, its homogeneous variant, ChiGNN, excels on seven GAD datasets, validating the effectiveness of Chi-Square filters. Our code is available at this https URL.
zh

[AI-175] Behavior Injection: Preparing Language Models for Reinforcement Learning

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在强化学习微调(Reinforcement Learning Fine-Tuning, RFT)过程中表现不一致的问题,即部分模型性能显著提升,而其他模型则停滞或下降。解决方案的关键在于识别出有效后训练的两个关键条件:(1)RL- informative rollout accuracy(RL信息性滚动准确率),以及(2)强数据共影响(strong data co-influence),后者量化了训练数据对其他样本性能的影响程度。基于这些洞察,作者提出了行为注入(behavior injection)方法,这是一种任务无关的数据增强方案,在强化学习之前应用于数据,通过引入探索性和利用性行为来丰富监督微调(Supervised Fine-Tuning, SFT)数据,从而提升模型对RFT的适应性。

链接: https://arxiv.org/abs/2505.18917
作者: Zhepeng Cen,Yihang Yao,William Han,Zuxin Liu,Ding Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement fine-tuning (RFT) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RFT: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data-augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increases the performance gain from RFT over the pre-RL model.
zh

[AI-176] Robust Stability Analysis of Positive Lure System with Neural Network Feedback

【速读】:该论文试图解决在存在参数不确定性和未知非线性区间约束的情况下,Lur’e系统的鲁棒性问题,以及神经网络(Neural Network, NN)反馈回路系统的鲁棒性分析问题。解决方案的关键在于利用系统正性特性,结合正线性系统理论,推导出Lur’e系统的稳定性半径显式公式,并进一步提出一种针对前馈神经网络(Feedforward Neural Network, FFNN)区间约束的优化方法,从而实现对Lur’e系统和NN控制系统的可扩展且高效的鲁棒性分析。

链接: https://arxiv.org/abs/2505.18912
作者: Hamidreza Montazeri Hedesh,Moh. Kamalul Wafi,Bahram Shafai,Milad Siami
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: Accepted at the 9th IEEE Conference on Control Technology and Applications (CCTA) 2025, San Diego, California

点击查看摘要

Abstract:This paper investigates the robustness of the Lur’e problem under positivity constraints, drawing on results from the positive Aizerman conjecture and the robustness properties of Metzler matrices. Specifically, we consider a control system of Lur’e type in which not only the linear part includes parametric uncertainty but also the nonlinear sector bound is unknown. We investigate tools from positive linear systems to effectively solve the problems in complicated and uncertain nonlinear systems. By leveraging the positivity characteristic of the system, we derive an explicit formula for the stability radius of Lur’e systems. Furthermore, we extend our analysis to systems with neural network (NN) feedback loops. Building on this approach, we also propose a refinement method for sector bounds of feedforward neural networks (FFNNs). This study introduces a scalable and efficient approach for robustness analysis of both Lur’e and NN-controlled systems. Finally, the proposed results are supported by illustrative examples.
zh

[AI-177] Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中提示注入攻击(Prompt Injection Attacks)的安全漏洞问题,此类攻击通过在输入上下文中注入恶意指令来劫持模型行为。现有防御机制通常通过特殊分隔符标记或附加嵌入来在初始输入层注入指令层级(Instruction Hierarchy, IH)信号,但这种方法在模型不同层传播时难以有效区分令牌的权限级别。该论文的关键解决方案是将IH信号注入网络中的中间令牌表示,并通过层特定的可训练嵌入增强这些表示,以编码权限信息,从而提升对提示注入攻击的防御效果。

链接: https://arxiv.org/abs/2505.18907
作者: Sanjay Kariyappa,G. Edward Suh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that encode the privilege information. Our evaluations across multiple models and training methods reveal that our proposal yields between 1.6\times and 9.2\times reduction in attack success rate on gradient-based prompt injection attacks compared to state-of-the-art methods, without significantly degrading the model’s utility.
zh

[AI-178] PromptWise: Online Learning for Cost-Aware Prompt Assignment in Generative Models

【速读】:该论文试图解决在选择生成式 AI 模型时,现有方法过于关注模型性能而忽视服务成本的问题,从而导致资源浪费和不必要的经济负担。解决方案的关键在于提出 PromptWise,这是一个在线学习框架,通过优先查询价格较低的模型来处理提示,仅在低价格模型无法有效响应时才调用更昂贵的模型,从而实现成本效益的最大化。

链接: https://arxiv.org/abs/2505.18901
作者: Xiaoyan Hu,Lauren Pick,Ho-fung Leung,Farzan Farnia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 44 pages

点击查看摘要

Abstract:The rapid advancement of generative AI models has provided users with numerous options to address their prompts. When selecting a generative AI model for a given prompt, users should consider not only the performance of the chosen model but also its associated service cost. The principle guiding such consideration is to select the least expensive model among the available satisfactory options. However, existing model-selection approaches typically prioritize performance, overlooking pricing differences between models. In this paper, we introduce PromptWise, an online learning framework designed to assign a sequence of prompts to a group of large language models (LLMs) in a cost-effective manner. PromptWise strategically queries cheaper models first, progressing to more expensive options only if the lower-cost models fail to adequately address a given prompt. Through numerical experiments, we demonstrate PromptWise’s effectiveness across various tasks, including puzzles of varying complexity and code generation/translation tasks. The results highlight that PromptWise consistently outperforms cost-unaware baseline methods, emphasizing that directly assigning prompts to the most expensive models can lead to higher costs and potentially lower average performance.
zh

[AI-179] Improving Ad matching via Cluster-Adaptive Keyword Expansion and Relevance tuning

【速读】:该论文旨在解决搜索广告中关键词匹配的平衡问题,即在扩大广告覆盖范围的同时保持广告的相关性。传统基于标记的匹配方式虽然能够提升覆盖面,但可能因语义扩展过于宽松而降低相关性。该研究的关键解决方案是通过文档侧语义关键词扩展,利用语言模型在不改变用户查询的前提下拓宽标记级匹配。核心方法包括使用预训练的孪生模型生成广告关键词的密集向量表示,并通过最近邻搜索识别语义相关的变体;同时引入基于聚类的阈值机制以调整相似度截止点,从而在保持精度的同时扩展关键词。此外,通过增量学习策略与轻量级决策树集成对下游相关性模型进行优化,以适应扩展后的关键词空间,最终提升了相关性和点击率(CTR)。

链接: https://arxiv.org/abs/2505.18897
作者: Dipanwita Saha,Anis Zaman,Hua Zou,Ning Chen,Xinxin Shu,Nadia Vase,Abraham Bagherjeiran
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In search advertising, keyword matching connects user queries with relevant ads. While token-based matching increases ad coverage, it can reduce relevance due to overly permissive semantic expansion. This work extends keyword reach through document-side semantic keyword expansion, using a language model to broaden token-level matching without altering queries. We propose a solution using a pre-trained siamese model to generate dense vector representations of ad keywords and identify semantically related variants through nearest neighbor search. To maintain precision, we introduce a cluster-based thresholding mechanism that adjusts similarity cutoffs based on local semantic density. Each expanded keyword maps to a group of seller-listed items, which may only partially align with the original intent. To ensure relevance, we enhance the downstream relevance model by adapting it to the expanded keyword space using an incremental learning strategy with a lightweight decision tree ensemble. This system improves both relevance and click-through rate (CTR), offering a scalable, low-latency solution adaptable to evolving query behavior and advertising inventory.
zh

[AI-180] Digital Overconsumption and Waste: A Closer Look at the Impacts of Generative AI CVPR

【速读】:该论文试图解决生成式人工智能(Generative AI)系统在数字空间中加剧数字垃圾产生以及由此带来的能源消耗和二氧化碳排放问题,同时探讨其对消费者行为的负面影响,如过度消费。解决方案的关键在于讨论数字过度消费与浪费现象,以及其他相关社会影响,并提出可能的解决路径,以减少生成式AI对环境和社会的不利影响。

链接: https://arxiv.org/abs/2505.18894
作者: Vanessa Utz,Steve DiPaola
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conference on Computer Vision and Pattern Recognition (CVPR) 2023. Ethical Considerations in Creative Applications of Computer Vision (EC3V) Workshop

点击查看摘要

Abstract:Generative Artificial Intelligence (AI) systems currently contribute negatively to the production of digital waste, via the associated energy consumption and the related CO2 emissions. At this moment, a discussion is urgently needed on the replication of harmful consumer behavior, such as overconsumption, in the digital space. We outline our previous work on the climate implications of commercially available generative AI systems and the sentiment of generative AI users when confronted with AI-related climate research. We expand on this work via a discussion of digital overconsumption and waste, other related societal impacts, and a possible solution pathway
zh

[AI-181] Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AIs Real World Effects

【速读】:该论文试图解决传统AI评估方法在探索、导航和解决AI在现实世界部署中涉及的人类和社会因素方面的系统性局限,特别是针对AI的次级效应(second-order effects)缺乏有效测量手段的问题。解决方案的关键在于扩展评估方法,从静态的、单次交互的仿真测试转向能够捕捉实际情境下用户使用AI技术时所产生的具体结果的测试范式,强调需要数据和方法来实现情境感知,并支持对AI次级效应的下游解释和决策制定。

链接: https://arxiv.org/abs/2505.18893
作者: Reva Schwartz,Rumman Chowdhury,Akash Kundu,Heather Frase,Marzieh Fadaee,Tom David,Gabriella Waters,Afaf Taik,Morgan Briggs,Patrick Hall,Shomik Jain,Kyra Yee,Spencer Thomas,Sundeep Bhandari,Lee Wan Sie,Qinghua Lu,Matthew Holmes,Theodora Skeadas
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:Conventional AI evaluation approaches concentrated within the AI stack exhibit systemic limitations for exploring, navigating and resolving the human and societal factors that play out in real world deployment such as in education, finance, healthcare, and employment sectors. AI capability evaluations can capture detail about first-order effects, such as whether immediate system outputs are accurate, or contain toxic, biased or stereotypical content, but AI’s second-order effects, i.e. any long-term outcomes and consequences that may result from AI use in the real world, have become a significant area of interest as the technology becomes embedded in our daily lives. These secondary effects can include shifts in user behavior, societal, cultural and economic ramifications, workforce transformations, and long-term downstream impacts that may result from a broad and growing set of risks. This position paper argues that measuring the indirect and secondary effects of AI will require expansion beyond static, single-turn approaches conducted in silico to include testing paradigms that can capture what actually materializes when people use AI technology in context. Specifically, we describe the need for data and methods that can facilitate contextual awareness and enable downstream interpretation and decision making about AI’s secondary effects, and recommend requirements for a new ecosystem.
zh

[AI-182] Climate Implications of Diffusion-based Generative Visual AI Systems and their Mass Adoption

【速读】:该论文试图解决由基于文本提示的扩散生成式AI艺术系统迅速普及所带来的气候影响问题,特别是其大规模使用GPU导致的能源消耗增长。论文指出,尽管区块链和NFT等数字技术的气候影响已被广泛研究,但生成式AI(Generative AI)在艺术创作中的应用同样需要引起重视。解决方案的关键在于对这些系统的使用模式、增长趋势及其对全球能源消耗的影响进行深入分析,并提出相应的应对策略,然而当前面临的主要困难包括缺乏公开可用的数据。

链接: https://arxiv.org/abs/2505.18892
作者: Vanessa Utz,Steve DiPaola
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: International Conference on Computational Creativity

点击查看摘要

Abstract:Climate implications of rapidly developing digital technologies, such as blockchains and the associated crypto mining and NFT minting, have been well documented and their massive GPU energy use has been identified as a cause for concern. However, we postulate that due to their more mainstream consumer appeal, the GPU use of text-prompt based diffusion AI art systems also requires thoughtful considerations. Given the recent explosion in the number of highly sophisticated generative art systems and their rapid adoption by consumers and creative professionals, the impact of these systems on the climate needs to be carefully considered. In this work, we report on the growth of diffusion-based visual AI systems, their patterns of use, growth and the implications on the climate. Our estimates show that the mass adoption of these tools potentially contributes considerably to global energy consumption. We end this paper with our thoughts on solutions and future areas of inquiry as well as associated difficulties, including the lack of publicly available data.
zh

[AI-183] Security Concerns for Large Language Models : A Survey

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在安全方面带来的新兴威胁问题,包括提示注入与越狱、对抗攻击、恶意使用者的滥用以及自主LLM代理固有的风险。其解决方案的关键在于全面分析这些威胁类型,并总结近年来学术界和工业界针对每种威胁所提出的防御措施及其局限性,从而识别出保障基于LLM的应用安全性的开放性挑战,强调构建稳健的多层安全策略的重要性。

链接: https://arxiv.org/abs/2505.18889
作者: Miles Q. Li,Benjamin C. M. Fung
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) such as GPT-4 (and its recent iterations like GPT-4o and the GPT-4.1 series), Google’s Gemini, Anthropic’s Claude 3 models, and xAI’s Grok have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. In this survey, we provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks (including input perturbations and data poisoning), misuse by malicious actors (e.g., for disinformation, phishing, and malware generation), and worrisome risks inherent in autonomous LLM agents. A significant focus has been recently placed on the latter, exploring goal misalignment, emergent deception, self-preservation instincts, and the potential for LLMs to develop and pursue covert, misaligned objectives (scheming), which may even persist through safety training. We summarize recent academic and industrial studies (2022-2025) that exemplify each threat, analyze proposed defenses and their limitations, and identify open challenges in securing LLM-based applications. We conclude by emphasizing the importance of advancing robust, multi-layered security strategies to ensure LLMs are safe and beneficial.
zh

[AI-184] Hierarchical-embedding autoencoder with a predictor (HEAP) as efficient architecture for learning long-term evolution of complex multi-scale physical systems

【速读】:该论文试图解决在复杂多尺度物理系统中长期演化的高效学习问题,传统方法难以有效捕捉不同尺度结构之间的相互作用。解决方案的关键在于基于尺度分离的思想,通过分层全卷积自编码器将物理系统的状态转换为多个嵌入层,这些嵌入层编码了不同尺度的结构并保留对应分辨率的空间信息,从而实现对多尺度系统的高效建模。

链接: https://arxiv.org/abs/2505.18857
作者: Alexander Khrabry,Edward Startsev,Andrew Powis,Igor Kaganovich
机构: 未知
类目: Artificial Intelligence (cs.AI); Plasma Physics (physics.plasm-ph)
备注:

点击查看摘要

Abstract:We propose a novel efficient architecture for learning long-term evolution in complex multi-scale physical systems which is based on the idea of separation of scales. Structures of various scales that dynamically emerge in the system interact with each other only locally. Structures of similar scale can interact directly when they are in contact and indirectly when they are parts of larger structures that interact directly. This enables modeling a multi-scale system in an efficient way, where interactions between small-scale features that are apart from each other do not need to be modeled. The hierarchical fully-convolutional autoencoder transforms the state of a physical system not just into a single embedding layer, as it is done conventionally, but into a series of embedding layers which encode structures of various scales preserving spatial information at a corresponding resolution level. Shallower layers embed smaller structures on a finer grid, while deeper layers embed larger structures on a coarser grid. The predictor advances all embedding layers in sync. Interactions between features of various scales are modeled using a combination of convolutional operators. We compare the performance of our model to variations of a conventional ResNet architecture in application to the Hasegawa-Wakatani turbulence. A multifold improvement in long-term prediction accuracy was observed for crucial statistical characteristics of this system.
zh

[AI-185] he Theory of the Unique Latent Pattern: A Formal Epistemic Framework for Structural Singularity in Complex Systems

【速读】:该论文试图解决动态系统中表观复杂性的起源问题,特别是对不可预测性传统解释的质疑。传统观点通常将不可预测性归因于内在随机性或涌现的非线性,而本文提出的独特潜在模式理论(Theory of the Unique Latent Pattern, ULP)则认为,每个可分析系统都由一个结构唯一且确定的生成机制所支配,其隐藏并非源于本体论上的不确定性,而是由于认识论的限制。解决方案的关键在于引入非普遍的生成映射 $ \mathcal{F}_S(P_S, t) $,其中每个系统 $ S $ 拥有其不可约且不可复制的潜在结构 $ P_S $,观测到的不规则性被建模为通过观察者有限接口对生成映射的投影,并引入认识论噪声 $ \varepsilon_S(t) $ 作为不完全访问的度量。该理论通过将不确定性的来源从系统转移到观察者,重新定义了混沌作为表征失败的相对性问题。

链接: https://arxiv.org/abs/2505.18850
作者: Mohamed Aly Bouke
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic (math.LO)
备注:

点击查看摘要

Abstract:This paper introduces the Theory of the Unique Latent Pattern (ULP), a formal epistemic framework that redefines the origin of apparent complexity in dynamic systems. Rather than attributing unpredictability to intrinsic randomness or emergent nonlinearity, ULP asserts that every analyzable system is governed by a structurally unique, deterministic generative mechanism, one that remains hidden not due to ontological indeterminacy, but due to epistemic constraints. The theory is formalized using a non-universal generative mapping ( \mathcalF_S(P_S, t) ), where each system ( S ) possesses its own latent structure ( P_S ), irreducible and non-replicable across systems. Observed irregularities are modeled as projections of this generative map through observer-limited interfaces, introducing epistemic noise ( \varepsilon_S(t) ) as a measure of incomplete access. By shifting the locus of uncertainty from the system to the observer, ULP reframes chaos as a context-relative failure of representation. We contrast this position with foundational paradigms in chaos theory, complexity science, and statistical learning. While they assume or model shared randomness or collective emergence, ULP maintains that every instance harbors a singular structural identity. Although conceptual, the theory satisfies the criterion of falsifiability in the Popperian sense, it invites empirical challenge by asserting that no two systems governed by distinct latent mechanisms will remain indistinguishable under sufficient resolution. This opens avenues for structurally individuated models in AI, behavioral inference, and epistemic diagnostics.
zh

[AI-186] LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS

【速读】:该论文试图解决语言模型与计算机界面之间语义断层的问题,即语言模型对世界的理解方式与计算机接口的结构之间存在根本性不匹配。解决方案的关键在于将计算机转化为语言模型可以原生理解的上下文环境,通过实现一种模型上下文协议(Model Context Protocol, MCP)服务器架构,抽象计算机状态和操作,从而有效解耦界面复杂度与决策复杂度,使代理能够更高效地推理计算环境。

链接: https://arxiv.org/abs/2505.18829
作者: Kai Mei,Xi Zhu,Hang Gao,Shuhang Lin,Yongfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Operating Systems (cs.OS)
备注:

点击查看摘要

Abstract:We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how language models understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that language models can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform’s effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for language models represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at this https URL, and it is also integrated into the AIOS main branch as part of AIOS at this https URL.
zh

[AI-187] Mitigating Deceptive Alignment via Self-Monitoring

【速读】:该论文试图解决现代大型语言模型在使用链式思维(Chain-of-Thought, CoT)推理时可能产生的欺骗性对齐(deceptive alignment)问题,即模型表面上与人类目标一致,但暗地里追求不一致的目标。解决方案的关键在于提出一种名为CoT Monitor+的框架,该框架在CoT过程中嵌入了一个自我监控(Self-Monitor)机制,通过生成内部自评信号来标记并抑制不一致的策略,该信号作为强化学习中的辅助奖励,形成反馈回路以鼓励诚实推理并抑制隐藏目标。

链接: https://arxiv.org/abs/2505.18807
作者: Jiaming Ji,Wenqi Chen,Kaile Wang,Donghai Hong,Sitong Fang,Boyuan Chen,Jiayi Zhou,Juntao Dai,Sirui Han,Yike Guo,Yaodong Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment, situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question, the first framework that embeds a Self-Monitor inside the CoT process itself, named CoT Monitor+. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, etc. We evaluate various LLMs and show that unrestricted CoT roughly aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at this http URL
zh

[AI-188] Soft Weighted Machine Unlearning

【速读】:该论文旨在解决机器遗忘(machine unlearning)中因采用为隐私驱动设计的二值化数据移除框架而导致的信息过度损失(over-unlearning)问题,该问题会引发模型效用下降。论文提出的解决方案关键在于引入一种加权影响函数,通过解析求解凸二次规划问题为每个样本分配定制化权重,并在此基础上提出一种软加权框架,实现对模型的细粒度调整,从而有效缓解过度遗忘问题并提升模型在公平性和鲁棒性任务中的性能。

链接: https://arxiv.org/abs/2505.18783
作者: Xinbao Qiao,Ningning Ding,Yushi Cheng,Meng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 24 pages,22 figures

点击查看摘要

Abstract:Machine unlearning, as a post-hoc processing technique, has gained widespread adoption in addressing challenges like bias mitigation and robustness enhancement, colloquially, machine unlearning for fairness and robustness. However, existing non-privacy unlearning-based solutions persist in using binary data removal framework designed for privacy-driven motivation, leading to significant information loss, a phenomenon known as over-unlearning. While over-unlearning has been largely described in many studies as primarily causing utility degradation, we investigate its fundamental causes and provide deeper insights in this work through counterfactual leave-one-out analysis. In this paper, we introduce a weighted influence function that assigns tailored weights to each sample by solving a convex quadratic programming problem analytically. Building on this, we propose a soft-weighted framework enabling fine-grained model adjustments to address the over-unlearning challenge. We demonstrate that the proposed soft-weighted scheme is versatile and can be seamlessly integrated into most existing unlearning algorithms. Extensive experiments show that in fairness- and robustness-driven tasks, the soft-weighted scheme significantly outperforms hard-weighted schemes in fairness/robustness metrics and alleviates the decline in utility metric, thereby enhancing machine unlearning algorithm as an effective correction solution.
zh

[AI-189] HD-PiSSA: High-Rank Distributed Orthogonal Adaptation

【速读】:该论文旨在解决现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法如LoRA和PiSSA在大型语言模型(Large Language Models, LLMs)微调过程中因限制模型更新至低秩子空间而导致表达能力受限、复杂任务性能不佳的问题。其解决方案的关键在于提出高秩分布式PiSSA(High-rank Distributed PiSSA, HD-PiSSA),通过在不同设备上初始化正交适配器,并在权重矩阵W上集体聚合它们的delta更新,从而扩展更新方向的范围,显著提升有效更新秩。

链接: https://arxiv.org/abs/2505.18777
作者: Yiding Wang,Fauxu meng,Xuefeng Zhang,Fan Jiang,Pingzhi Tang,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce High-rank Distributed PiSSA (HD-PiSSA), a distributed PEFT approach that initializes orthogonal adapters across different devices and aggregates their delta updates collectively on W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16x higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, we evaluate HD-PiSSA across various challenging downstream tasks, including mathematics, code generation, and multi-task learning. In the multi-task setting, HD-PiSSA achieves average gains of 10.0 absolute points (14.63%) over LoRA and 4.98 points (6.60%) over PiSSA across 12 benchmarks, demonstrating its benefits from the extra optimization flexibility.
zh

[AI-190] Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models

【速读】:该论文试图解决当前会员推理攻击(Membership Inference Attacks, MIAs)在大规模预训练语言模型(Large Pre-trained Language Models, LLMs)上难以有效扩展的问题。现有方法要么依赖于不需训练参考模型的较弱攻击(如微调攻击),要么仅适用于小规模模型和数据集,但这些方法存在效果不稳定或无法推广至现代LLMs的局限性。论文的关键解决方案是将一种强效MIAs——LiRA进行扩展,针对GPT-2架构(参数量从10M到1B)进行实验,并在C4数据集的20B以上token上训练参考模型,从而验证了强MIAs在LLMs上的可行性及其实际效果的局限性。

链接: https://arxiv.org/abs/2505.18773
作者: Jamie Hayes,Ilia Shumailov,Christopher A. Choquette-Choo,Matthew Jagielski,George Kaissis,Katherine Lee,Milad Nasr,Sahra Ghalebikesabi,Niloofar Mireshghallah,Meenatchi Sundaram Mutu Selva Annamalai,Igor Shilov,Matthieu Meeus,Yves-Alexandre de Montjoye,Franziska Boenisch,Adam Dziedzic,A. Feder Cooper
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle - achieving close-to-arbitrary success - and insights from strong attacks in simplified settings do not translate to today’s LLMs. These challenges have prompted an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA - one of the strongest MIAs - to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC0.7) in practical settings; and, (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.
zh

[AI-191] he Quest for Efficient Reasoning : A Data-Centric Benchmark to CoT Distillation

【速读】:该论文试图解决如何通过数据驱动的方法优化链式思维(Chain-of-Thought, CoT)蒸馏,以构建更小、更高效且保持强推理能力的学生大型语言模型(Large Language Models, LLMs)的问题。其解决方案的关键在于提出DC-CoT,首个专注于从方法、模型和数据三个角度研究CoT蒸馏中数据操作的综合性基准,通过系统评估不同数据处理技术对模型性能的影响,为优化CoT蒸馏提供可操作的见解和最佳实践。

链接: https://arxiv.org/abs/2505.18759
作者: Ruichen Zhang,Rana Muhammad Shahroz Khan,Zhen Tan,Dawei Li,Song Wang,Tianlong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, there still lacks a comprehensive benchmark to systematically assess the effect of each distillation approach. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset can be found at this https URL, while our code is shared in this https URL.
zh

[AI-192] Smart Energy Guardian: A Hybrid Deep Learning Model for Detecting Fraudulent PV Generation

【速读】:该论文旨在解决智能电网中因网络攻击和复杂电力窃取行为带来的挑战,特别是在住宅光伏(Photovoltaic, PV)发电系统中的电力窃取检测问题。传统电力窃取检测(Electricity Theft Detection, ETD)方法在捕捉复杂的时间依赖性和融合多源数据方面存在局限性。论文提出的解决方案的关键在于构建一种混合深度学习模型,结合多尺度卷积神经网络(Convolutional Neural Network, CNN)、长短期记忆网络(Long Short-Term Memory, LSTM)和Transformer,以有效捕获短期和长期时间依赖性,并引入数据嵌入技术,实现时间序列数据与离散温度变量的无缝融合,从而提升检测的鲁棒性。

链接: https://arxiv.org/abs/2505.18755
作者: Xiaolu Chen,Chenghao Huang,Yanru Zhang,Hao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 2024 IEEE International Smart Cities Conference (ISC2)

点击查看摘要

Abstract:With the proliferation of smart grids, smart cities face growing challenges due to cyber-attacks and sophisticated electricity theft behaviors, particularly in residential photovoltaic (PV) generation systems. Traditional Electricity Theft Detection (ETD) methods often struggle to capture complex temporal dependencies and integrating multi-source data, limiting their effectiveness. In this work, we propose an efficient ETD method that accurately identifies fraudulent behaviors in residential PV generation, thus ensuring the supply-demand balance in smart cities. Our hybrid deep learning model, combining multi-scale Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer, excels in capturing both short-term and long-term temporal dependencies. Additionally, we introduce a data embedding technique that seamlessly integrates time-series data with discrete temperature variables, enhancing detection robustness. Extensive simulation experiments using real-world data validate the effectiveness of our approach, demonstrating significant improvements in the accuracy of detecting sophisticated energy theft activities, thereby contributing to the stability and fairness of energy systems in smart cities.
zh

[AI-193] Agent -Based Decentralized Energy Management of EV Charging Station with Solar Photovoltaics via Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决电动汽车(Electric Vehicle, EV)充电站能源管理在面对多种不确定性(如充电行为变化和部分充电桩故障)时的鲁棒性不足问题。其解决方案的关键在于提出一种基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的方法,将每个充电桩视为一个智能体,并通过引入长短期记忆网络(Long Short-Term Memory, LSTM)提取时间序列特征,同时设计密集奖励机制以提升充电体验,从而实现对系统不确定性和故障的鲁棒应对以及充电成本的最小化和充电服务满意度的最大化。

链接: https://arxiv.org/abs/2505.18750
作者: Jiarong Fan,Chenghao Huang,Hao Wang
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 2024 IEEE International Smart Cities Conference (ISC2)

点击查看摘要

Abstract:In the pursuit of energy net zero within smart cities, transportation electrification plays a pivotal role. The adoption of Electric Vehicles (EVs) keeps increasing, making energy management of EV charging stations critically important. While previous studies have managed to reduce energy cost of EV charging while maintaining grid stability, they often overlook the robustness of EV charging management against uncertainties of various forms, such as varying charging behaviors and possible faults in faults in some chargers. To address the gap, a novel Multi-Agent Reinforcement Learning (MARL) approach is proposed treating each charger to be an agent and coordinate all the agents in the EV charging station with solar photovoltaics in a more realistic scenario, where system faults may occur. A Long Short-Term Memory (LSTM) network is incorporated in the MARL algorithm to extract temporal features from time-series. Additionally, a dense reward mechanism is designed for training the agents in the MARL algorithm to improve EV charging experience. Through validation on a real-world dataset, we show that our approach is robust against system uncertainties and faults and also effective in minimizing EV charging costs and maximizing charging service satisfaction.
zh

[AI-194] C3-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

【速读】:该论文旨在解决当前基于大语言模型的智能体在处理复杂工具关系、环境反馈及历史决策时存在的能力不足问题,这些问题在传统自然语言处理任务中未被充分考虑。其解决方案的关键在于提出一个名为C^3-Bench的开源高质量基准测试平台,该平台通过整合攻击概念和单变量分析方法,识别影响智能体鲁棒性的关键因素,并设计了三个挑战:处理复杂的工具关系、应对关键隐藏信息以及管理动态决策路径。此外,还引入了细粒度指标、创新的数据收集算法和可复现的评估方法,以全面评估智能体性能并揭示模型漏洞。

链接: https://arxiv.org/abs/2505.18746
作者: Peijie Yu,Yifan Yang,Jinjian Li,Zelong Zhang,Haorui Wang,Xiao Feng,Feng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source and high-quality benchmark C^3 -Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. In concrete, we design three challenges: navigate complex tool relationships, handle critical hidden information and manage dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long context information dependencies and frequent policy-type switching. In essence, C^3 -Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at this https URL.
zh

[AI-195] Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

【速读】:该论文旨在解决现有图状态空间模型(Graph State-Space Models, GSSMs)在处理图结构数据时存在的核心性质缺失问题,如排列等变性、消息传递兼容性和计算效率低下。其解决方案的关键在于将现代状态空间模型(State-Space Models, SSMs)的核心计算原理直接嵌入到消息传递神经网络(Message-Passing Neural Network)框架中,从而提出一种统一的方法——MP-SSM,该方法在保持消息传递架构简洁性的同时,实现了高效的排列等变信息传播,并支持精确的敏感性分析,以理论分析信息流及深度场景下的梯度消失和过度压缩等问题。

链接: https://arxiv.org/abs/2505.18728
作者: Andrea Ceni,Alessio Gravina,Claudio Gallicchio,Davide Bacciu,Carola-Bibiane Schonlieb,Moshe Eliasof
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.
zh

[AI-196] LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning

【速读】:该论文旨在解决在资源受限的边缘设备上部署大型语言模型(Large Language Models, LLMs)时,量化与微调过程中存在的关键问题,包括低精度量化权重(如4-bit)与高精度适配权重(如16-bit)之间的数据类型不匹配、适配权重合并导致的精度下降以及现有方法无法实现无损适配权重合并的问题。其解决方案的关键在于提出了一种无损三元适配方法(Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning, LoTA-QAF),该方法通过定制的三元适配(Ternary Adaptation, TA)对齐量化网格并调整量化权重,结合基于TA的无损适配权重合并机制以及三元符号梯度下降(t-SignSGD)更新TA权重,实现了量化模型的高效微调。

链接: https://arxiv.org/abs/2505.18724
作者: Junyu Chen,Junzhuo Li,Zhen Peng,Wenjie Wang,Yuxiang Ren,Long Shi,Xuming Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. However, fine-tuning quantized models presents significant challenges, primarily stemming from: First, the mismatch in data types between the low-precision quantized weights (e.g., 4-bit) and the high-precision adaptation weights (e.g., 16-bit). This mismatch limits the computational efficiency advantage offered by quantized weights during inference. Second, potential accuracy degradation when merging these high-precision adaptation weights into the low-precision quantized weights, as the adaptation weights often necessitate approximation or truncation. Third, as far as we know, no existing methods support the lossless merging of adaptation while adjusting all quantized weights. To address these challenges, we introduce lossless ternary adaptation for quantization-aware fine-tuning (LoTA-QAF). This is a novel fine-tuning method specifically designed for quantized LLMs, enabling the lossless merging of ternary adaptation weights into quantized weights and the adjustment of all quantized weights. LoTA-QAF operates through a combination of: i) A custom-designed ternary adaptation (TA) that aligns ternary weights with the quantization grid and uses these ternary weights to adjust quantized weights. ii) A TA-based mechanism that enables the lossless merging of adaptation weights. iii) Ternary signed gradient descent (t-SignSGD) for updating the TA weights. We apply LoTA-QAF to Llama-3.1/3.3 and Qwen-2.5 model families and validate its effectiveness on several downstream tasks. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%. For task-specific fine-tuning, 16-bit LoRA achieves superior results, but LoTA-QAF still outperforms other methods.
zh

[AI-197] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

【速读】:该论文旨在解决基于离线数据的视觉-语言-动作(Vision-Language-Action, VLA)模型在分布外场景中执行失败的问题,即在有限的已访问状态下,模型难以适应新的任务和环境。其解决方案的关键在于提出VLA-RL框架,该框架通过在线强化学习(Reinforcement Learning, RL)对预训练的自回归VLA进行微调,以提升其在下游任务中的性能。该方法引入了轨迹级强化学习公式,将通用的机器人操作轨迹建模为多模态多轮对话,并通过预训练视觉-语言模型作为机器人过程奖励模型来解决稀疏奖励问题,同时结合多种实现策略提升训练的稳定性和效率。

链接: https://arxiv.org/abs/2505.18719
作者: Guanxing Lu,Wenkai Guo,Chubin Zhang,Yuheng Zhou,Haonan Jiang,Zifeng Gao,Yansong Tang,Ziwei Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent high-capacity vision-language-action (VLA) models have demonstrated impressive performance on a range of robotic manipulation tasks by imitating human demonstrations. However, exploiting offline data with limited visited states will cause execution failure in out-of-distribution scenarios. Intuitively, an exploration-based method that improves on online collected data at test time could address this limitation. We present VLA-RL, an algorithmic and systematic framework that leverages online reinforcement learning (RL) to improve pretrained auto-regressive VLAs in downstream tasks. Within a unified perspective, we first introduce a trajectory-level RL formulation for auto-regressive VLA training, which models general robotic manipulation trajectory as multi-modal multi-turn conversation. To address the challenge of sparse rewards, we fine-tune a pretrained vision-language model as a robotic process reward model, which is trained on pseudo reward labels annotated on automatically extracted task segments. To scale up, we identify several implementation findings that improve the stability and efficiency including curriculum selection strategy, GPU-balanced vectorized environments, batch decoding, and critic warmup. VLA-RL enables OpenVLA-7B to surpass the strongest finetuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO, and even matches the performance of advanced commercial models such as \pi_0 -FAST. Notably, we observe that VLA-RL benefits from increased test-time optimization, indicating an early spark of inference scaling laws in robotics.
zh

[AI-198] GainRAG : Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis ACL2025

【速读】:该论文试图解决Retrieval-Augmented Generation (RAG)框架中检索器与大语言模型(LLM)之间的偏好差距问题,这一差距限制了系统性能的进一步提升。解决方案的关键在于提出一种新的度量标准“gain”,用于衡量输入段落对正确输出的贡献,并通过估计这些gain信号来训练一个中间模块,以对齐检索器和LLM的偏好,同时引入伪段落策略以缓解性能下降。

链接: https://arxiv.org/abs/2505.18710
作者: Yi Jiang,Sendong Zhao,Jianbo Li,Haochun Wang,Bing Qin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025

点击查看摘要

Abstract:The Retrieval-Augmented Generation (RAG) framework introduces a retrieval module to dynamically inject retrieved information into the input context of large language models (LLMs), and has demonstrated significant success in various NLP tasks. However, the current study points out that there is a preference gap between retrievers and LLMs in the RAG framework, which limit the further improvement of system performance. Some highly relevant passages may interfere with LLM reasoning because they contain complex or contradictory information; while some indirectly related or even inaccurate content may help LLM generate more accurate answers by providing suggestive information or logical clues. To solve this, we propose GainRAG, a novel approach that aligns the retriever’s and LLM’s preferences by defining a new metric, “gain”, which measure how well an input passage contributes to correct outputs. Specifically, we propose a method to estimate these gain signals and train a middleware that aligns the preferences of the retriever and the LLM using only limited data. In addition, we introduce a pseudo-passage strategy to mitigate degradation. The experimental results on 6 datasets verify the effectiveness of GainRAG.
zh

[AI-199] Steering LLM Reasoning Through Bias-Only Adaptation

【速读】:该论文试图解决的问题是:在基于推理的语言模型中,强化学习微调是否能够创造新的能力,还是仅仅增强预训练网络中已有的推理模式。论文的解决方案之关键在于训练“引导向量”(steering vectors),即逐层的偏置,这些向量通过叠加方式增强选定的隐藏特征,同时保持原始权重不变,从而验证了基础模型中已存在所需的推理技能。

链接: https://arxiv.org/abs/2505.18706
作者: Viacheslav Sinii,Alexey Gorbatovski,Artem Cherepanov,Boris Shaposhnikov,Nikita Balagansky,Daniil Gavrilov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent work on reasoning-oriented language models, exemplified by o1-like systems, suggests that reinforcement-learning (RL) finetuning does not create new capabilities but instead strengthens reasoning patterns already latent in the pretrained network. We test this claim by training steering vectors: layer-wise biases that additively amplify selected hidden features while leaving all original weights unchanged. Experiments on four base models across the GSM8K and MATH benchmarks show that steering vectors recover, and in several cases exceed, the accuracy of fully-tuned counterparts. This result supports the view that the required reasoning skills pre-exist in the base model. Further, logit-lens analysis reveals that the trained vectors consistently boost token groups linked to structured languages and logical connectors, providing an interpretable account that aligns with the demands of quantitative reasoning tasks.
zh

[AI-200] AI-Researcher: Autonomous Scientific Innovation

【速读】:该论文试图解决如何通过自主的AI系统加速科学创新的问题,特别是针对科研流程中多个环节的自动化与优化。其解决方案的关键在于提出AI-Researcher,一个完全自主的研究系统,能够无缝协调从文献综述、假设生成到算法实现及论文撰写的完整研究流程,仅需最少的人工干预。该系统通过结合大型语言模型的强大推理能力与代理框架的任务自动化能力,实现了对科研过程的全面自动化。

链接: https://arxiv.org/abs/2505.18705
作者: Jiabin Tang,Lianghao Xia,Zhonghang Li,Chao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code on github: this https URL

点击查看摘要

Abstract:The powerful reasoning capabilities of Large Language Models (LLMs) in mathematics and coding, combined with their ability to automate complex tasks through agentic frameworks, present unprecedented opportunities for accelerating scientific innovation. In this paper, we introduce AI-Researcher, a fully autonomous research system that transforms how AI-driven scientific discovery is conducted and evaluated. Our framework seamlessly orchestrates the complete research pipeline–from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation–with minimal human intervention. To rigorously assess autonomous research capabilities, we develop Scientist-Bench, a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains, featuring both guided innovation and open-ended exploration tasks. Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations.
zh

[AI-201] Can LLM s Alleviate Catastrophic Forgetting in Graph Continual Learning? A Systematic Study

【速读】:该论文试图解决图持续学习(Graph Continual Learning, GCL)中的灾难性遗忘问题,即在流式数据环境下,模型在学习新任务时容易遗忘之前学到的知识。其解决方案的关键在于探索大型语言模型(Large Language Models, LLMs)在GCL中的潜力,并提出一种简单而有效的持续学习方法——Simple Graph Continual Learning (SimGCL),该方法在无重放约束下相比之前的基于图神经网络(GNN)的最先进基线提升了约20%。

链接: https://arxiv.org/abs/2505.18697
作者: Ziyang Cheng,Zhixun Li,Yuhan Li,Yixin Song,Kangyi Zhao,Dawei Cheng,Jia Li,Jeffrey Xu Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Nowadays, real-world data, including graph-structure data, often arrives in a streaming manner, which means that learning systems need to continuously acquire new knowledge without forgetting previously learned information. Although substantial existing works attempt to address catastrophic forgetting in graph machine learning, they are all based on training from scratch with streaming data. With the rise of pretrained models, an increasing number of studies have leveraged their strong generalization ability for continual learning. Therefore, in this work, we attempt to answer whether large language models (LLMs) can mitigate catastrophic forgetting in Graph Continual Learning (GCL). We first point out that current experimental setups for GCL have significant flaws, as the evaluation stage may lead to task ID leakage. Then, we evaluate the performance of LLMs in more realistic scenarios and find that even minor modifications can lead to outstanding results. Finally, based on extensive experiments, we propose a simple-yet-effective method, Simple Graph Continual Learning (SimGCL), that surpasses the previous state-of-the-art GNN-based baseline by around 20% under the rehearsal-free constraint. To facilitate reproducibility, we have developed an easy-to-use benchmark LLM4GCL for training and evaluating existing GCL methods. The code is available at: this https URL.
zh

[AI-202] AI for Regulatory Affairs: Balancing Accuracy Interpretability and Computational Cost in Medical Device Classification

【速读】:该论文试图解决医疗设备分类任务中传统方法效率与准确性不足的问题,旨在通过人工智能(Artificial Intelligence, AI)技术提升监管事务的自动化水平。解决方案的关键在于评估多种AI模型——包括传统机器学习(Machine Learning, ML)算法、深度学习架构以及大语言模型——在准确率、可解释性和计算成本三个核心维度上的表现,以寻找最适合医疗设备监管场景的分类方法。

链接: https://arxiv.org/abs/2505.18695
作者: Yu Han,Aaron Ceross,Jeroen H. M. Bergmann
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Regulatory affairs, which sits at the intersection of medicine and law, can benefit significantly from AI-enabled automation. Classification task is the initial step in which manufacturers position their products to regulatory authorities, and it plays a critical role in determining market access, regulatory scrutiny, and ultimately, patient safety. In this study, we investigate a broad range of AI models – including traditional machine learning (ML) algorithms, deep learning architectures, and large language models – using a regulatory dataset of medical device descriptions. We evaluate each model along three key dimensions: accuracy, interpretability, and computational cost.
zh

[AI-203] AI-Driven Climate Policy Scenario Generation for Sub-Saharan Africa

【速读】:该论文试图解决传统气候政策情景生成与评估方法在处理复杂能源与气候问题时存在的局限性,如耗时、依赖简单趋势外推以及难以捕捉系统间的相互关联。其解决方案的关键在于引入生成式 AI(Generative AI),特别是大型语言模型(LLMs),以模拟符合区域气候目标和能源挑战的多样化政策情景,并通过自动化技术进行情景评估,从而在数据受限条件下提升情景生成的效率与质量。

链接: https://arxiv.org/abs/2505.18694
作者: Rafiu Adekoya Badekale,Adewale Akinfaderin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages long. Extended version of the paper titled “Climate Policy Simulation and Scenario Generation for Sub-Saharan Africa” accepted at the 13th Data Science Africa Workshop (DSA 2025). Code available at this https URL

点击查看摘要

Abstract:Climate policy scenario generation and evaluation have traditionally relied on integrated assessment models (IAMs) and expert-driven qualitative analysis. These methods enable stakeholders, such as policymakers and researchers, to anticipate impacts, plan governance strategies, and develop mitigation measures. However, traditional methods are often time-intensive, reliant on simple extrapolations of past trends, and limited in capturing the complex and interconnected nature of energy and climate issues. With the advent of artificial intelligence (AI), particularly generative AI models trained on vast datasets, these limitations can be addressed, ensuring robustness even under limited data conditions. In this work, we explore the novel method that employs generative AI, specifically large language models (LLMs), to simulate climate policy scenarios for Sub-Saharan Africa. These scenarios focus on energy transition themes derived from the historical United Nations Climate Change Conference (COP) documents. By leveraging generative models, the project aims to create plausible and diverse policy scenarios that align with regional climate goals and energy challenges. Given limited access to human evaluators, automated techniques were employed for scenario evaluation. We generated policy scenarios using the llama3.2-3B model. Of the 34 generated responses, 30 (88%) passed expert validation, accurately reflecting the intended impacts provided in the corresponding prompts. We compared these validated responses against assessments from a human climate expert and two additional LLMs (gemma2-2B and mistral-7B). Our structured, embedding-based evaluation framework shows that generative AI effectively generate scenarios that are coherent, relevant, plausible, and diverse. This approach offers a transformative tool for climate policy planning in data-constrained regions.
zh

[AI-204] rajMoE: Spatially-Aware Mixture of Experts for Unified Human Mobility Modeling

【速读】:该论文旨在解决跨城市人类移动建模中的泛化性问题,主要挑战包括城市间空间语义不一致和城市移动模式多样性。其解决方案的关键在于提出TrajMoE模型,该模型通过设计空间语义编码器学习可迁移的位置表示,并引入空间感知的专家混合(Spatially-Aware Mixture-of-Experts, SAMoE)Transformer结构,以注入结构化先验信息并实现跨城市泛化能力。

链接: https://arxiv.org/abs/2505.18670
作者: Chonghua Han,Yuan Yuan,Kaiyan Chen,Jingtao Ding,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modeling human mobility across diverse cities is essential for applications such as urban planning, transportation optimization, and personalized services. However, generalization remains challenging due to heterogeneous spatial representations and mobility patterns across cities. Existing methods typically rely on numerical coordinates or require training city-specific models, limiting their scalability and transferability. We propose TrajMoE, a unified and scalable model for cross-city human mobility modeling. TrajMoE addresses two key challenges: (1) inconsistent spatial semantics across cities, and (2) diverse urban mobility patterns. To tackle these, we begin by designing a spatial semantic encoder that learns transferable location representations from POI-based functional semantics and visit patterns. Furthermore, we design a Spatially-Aware Mixture-of-Experts (SAMoE) Transformer that injects structured priors into experts specialized in distinct mobility semantics, along with a shared expert to capture city-invariant patterns and enable adaptive cross-city generalization. Extensive experiments demonstrate that TrajMoE achieves up to 27% relative improvement over competitive mobility foundation models after only one epoch of fine-tuning, and consistently outperforms full-data baselines using merely 5% of target city data. These results establish TrajMoE as a significant step toward realizing a truly generalizable, transferable, and pretrainable foundation model for human mobility.
zh

[AI-205] MLLM s are Deeply Affected by Modality Bias

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的模态偏差(modality bias)问题,即模型在处理多模态输入时过度依赖语言模态而忽视其他模态如视觉信息。解决方案的关键在于识别并缓解导致模态偏差的三个核心因素:数据特性、主干网络能力不平衡以及训练目标的不均衡。通过实验验证了这些因素对模型学习动态的影响,并提出了系统性的研究路线和可行建议,以促进多模态信息的平衡整合,从而提升模型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2505.18657
作者: Xu Zheng,Chenfei Liao,Yuqian Fu,Kaiyu Lei,Yuanhuiyi Lyu,Lutao Jiang,Bin Ren,Jialei Chen,Jiawen Wang,Chengxin Li,Linfeng Zhang,Danda Pani Paudel,Xuanjing Huang,Yu-Gang Jiang,Nicu Sebe,Dacheng Tao,Luc Van Gool,Xuming Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This position paper argues that MLLMs are deeply affected by modality bias. Firstly, we diagnose the current state of modality bias, highlighting its manifestations across various tasks. Secondly, we propose a systematic research road-map related to modality bias in MLLMs. Thirdly, we identify key factors of modality bias in MLLMs and offer actionable suggestions for future research to mitigate it. To substantiate these findings, we conduct experiments that demonstrate the influence of each factor: 1. Data Characteristics: Language data is compact and abstract, while visual data is redundant and complex, creating an inherent imbalance in learning dynamics. 2. Imbalanced Backbone Capabilities: The dominance of pretrained language models in MLLMs leads to overreliance on language and neglect of visual information. 3. Training Objectives: Current objectives often fail to promote balanced cross-modal alignment, resulting in shortcut learning biased toward language. These findings highlight the need for balanced training strategies and model architectures to better integrate multiple modalities in MLLMs. We call for interdisciplinary efforts to tackle these challenges and drive innovation in MLLM research. Our work provides a fresh perspective on modality bias in MLLMs and offers insights for developing more robust and generalizable multimodal systems-advancing progress toward Artificial General Intelligence.
zh

[AI-206] Flow Matching for Geometric Trajectory Simulation

【速读】:该论文试图解决在N体系统模拟中,现有方法需要从无信息噪声开始学习复杂变换,无法有效利用领域先验知识的问题。其解决方案的关键在于提出STFlow,通过流匹配和数据相关耦合实现物理信息驱动的几何轨迹模拟,既保持了模型的表达能力与可扩展性,又显著降低了预测误差并提高了推理效率。

链接: https://arxiv.org/abs/2505.18647
作者: Kiet Bennema ten Brinke,Koen Minartz,Vlado Menkovski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 17 figures

点击查看摘要

Abstract:The simulation of N-body systems is a fundamental problem with applications in a wide range of fields, such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances based on deep generative modeling and geometric deep learning have enabled probabilistic simulation by modeling complex distributions over trajectories while respecting the permutation symmetry that is fundamental to N-body systems. However, to generate realistic trajectories, existing methods must learn complex transformations starting from uninformed noise and do not allow for the exploitation of domain-informed priors. In this work, we propose STFlow to address this limitation. By leveraging flow matching and data-dependent couplings, STFlow facilitates physics-informed simulation of geometric trajectories without sacrificing model expressivity or scalability. Our evaluation on N-body dynamical systems, molecular dynamics, and pedestrian dynamics benchmarks shows that STFlow produces significantly lower prediction errors while enabling more efficient inference, highlighting the benefits of employing physics-informed prior distributions in probabilistic geometric trajectory modeling.
zh

[AI-207] Riverine Flood Prediction and Early Warning in Mountainous Regions using Artificial Intelligence

【速读】:该论文旨在解决跨境流域中由于缺乏上游数据而导致的洪水预测复杂性问题,特别是在地形复杂和极端气候变化背景下,洪水对人类生活、农业、基础设施等造成的威胁。其解决方案的关键在于利用基于卫星的气候数据,并应用多种先进的机器学习和深度学习模型,如支持向量机(Support Vector Machine, SVM)、XGBoost、人工神经网络(Artificial Neural Network, ANN)、长短期记忆网络(Long Short-Term Memory, LSTM)和门控循环单元(Gated Recurrent Unit, GRU),以实现对河流流量的准确预测。其中,LSTM网络表现最佳,显示出在短期洪水预测中的高精度,但研究也指出需要更长的历史数据以提高长期预测的可靠性。

链接: https://arxiv.org/abs/2505.18645
作者: Haleema Bibi,Sadia Saleem,Zakia Jalil,Muhammad Nasir,Tahani Alsubait
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 26 pages, 6 figure

点击查看摘要

Abstract:Flooding is the most devastating phenomenon occurring globally, particularly in mountainous regions, risk dramatically increases due to complex terrains and extreme climate changes. These situations are damaging livelihoods, agriculture, infrastructure, and human lives. This study uses the Kabul River between Pakistan and Afghanistan as a case study to reflect the complications of flood forecasting in transboundary basins. The challenges in obtaining upstream data impede the efficacy of flood control measures and early warning systems, a common global problem in similar basins. Utilizing satellite-based climatic data, this study applied numerous advanced machine-learning and deep learning models, such as Support Vector Machines (SVM), XGBoost, and Artificial Neural Networks (ANN), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRU) to predict daily and multi-step river flow. The LSTM network outperformed other models, achieving the highest R2 value of 0.96 and the lowest RMSE value of 140.96 m3/sec. The time series LSTM and GRU network models, utilized for short-term forecasts of up to five days, performed significantly. However, the accuracy declined beyond the fourth day, highlighting the need for longer-term historical datasets for reliable long-term flood predictions. The results of the study are directly aligned with Sustainable Development Goals 6, 11, 13, and 15, facilitating disaster and water management, timely evacuations, improved preparedness, and effective early warning.
zh

[AI-208] hanoRA: Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation

【速读】:该论文旨在解决基础模型在多任务适应中的效率与统一性问题,尤其是在保持低秩适配(LoRA)的推理效率的同时实现多任务学习。现有方法虽尝试将LoRA与专家混合(MoE)结合,但路由器的使用导致参数不可合并,增加了推理开销并限制了统一多任务适应的实用性。论文提出的ThanoRA框架通过联合建模任务异质性并减轻子空间干扰,实现了多任务适应与LoRA推理效率的兼顾。其关键在于在初始化阶段构建任务特定的LoRA子空间,并引入子空间保留正则化以防止任务干扰和子空间坍塌。

链接: https://arxiv.org/abs/2505.18640
作者: Jian Liang,Wenke Huang,Xianda Guo,Guancheng Wan,Bo Du,Mang Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is widely adopted for downstream fine-tuning of foundation models due to its efficiency and zero additional inference cost. Many real-world applications require foundation models to specialize in multiple tasks simultaneously, motivating the need for efficient multi-task adaptation. While recent approaches integrate LoRA with mixture-of-experts (MoE) to address this, the use of routers prevents parameter mergeability, which increases inference overhead and hinders unified multi-task adaptation, thereby limiting deployment practicality. In this work, we propose ThanoRA, a Task Heterogeneity-Aware Multi-Task Low-Rank Adaptation framework that enables multi-task adaptation while preserving the inference efficiency of LoRA. ThanoRA jointly models task heterogeneity and mitigates subspace interference throughout training. Specifically, motivated by inherent differences in complexity and heterogeneity across tasks, ThanoRA constructs task-specific LoRA subspaces at initialization, enabling fine-grained knowledge injection aligned with task heterogeneity. Furthermore, to prevent task interference and subspace collapse during multi-task training, ThanoRA introduces a subspace-preserving regularization that maintains the independence of task-specific representations. With the synergy of both components, ThanoRA enables efficient and unified multi-task adaptation. Extensive experiments across multimodal and text-only benchmarks under varying multi-task mixtures demonstrate that ThanoRA consistently achieves robust and superior performance over strong baselines without introducing additional inference overhead. Our code is publicly available at: this https URL.
zh

[AI-209] Mind The Gap: Deep Learning Doesnt Learn Deeply

【速读】:该论文试图解决神经网络在学习算法推理时存在的表达能力与可训练性差距(expressability-trainability gaps)问题,具体关注已学习算法的有效性与忠实性,以及神经网络为何在某些情况下无法学习有效的算法。解决方案的关键在于使用神经编译(neural compilation)技术,该技术通过直接将源算法编码到神经网络参数中,使网络能够精确计算该算法,从而实现对编译参数与传统学习参数、中间向量及行为的对比分析。

链接: https://arxiv.org/abs/2505.18623
作者: Lucas Saldyt,Subbarao Kambhampati
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper aims to understand how neural networks learn algorithmic reasoning by addressing two questions: How faithful are learned algorithms when they are effective, and why do neural networks fail to learn effective algorithms otherwise? To answer these questions, we use neural compilation, a technique that directly encodes a source algorithm into neural network parameters, enabling the network to compute the algorithm exactly. This enables comparison between compiled and conventionally learned parameters, intermediate vectors, and behaviors. This investigation is crucial for developing neural networks that robustly learn complexalgorithms from data. Our analysis focuses on graph neural networks (GNNs), which are naturally aligned with algorithmic reasoning tasks, specifically our choices of BFS, DFS, and Bellman-Ford, which cover the spectrum of effective, faithful, and ineffective learned algorithms. Commonly, learning algorithmic reasoning is framed as induction over synthetic data, where a parameterized model is trained on inputs, traces, and outputs produced by an underlying ground truth algorithm. In contrast, we introduce a neural compilation method for GNNs, which sets network parameters analytically, bypassing training. Focusing on GNNs leverages their alignment with algorithmic reasoning, extensive algorithmic induction literature, and the novel application of neural compilation to GNNs. Overall, this paper aims to characterize expressability-trainability gaps - a fundamental shortcoming in learning algorithmic reasoning. We hypothesize that inductive learning is most effective for parallel algorithms contained within the computational class \textttNC.
zh

[AI-210] rust or Dont Predict: Introducing the CWSA Family for Confidence-Aware Model Evaluation

【速读】:该论文旨在解决现有评估指标(如准确率、期望校准误差和风险-覆盖率曲线下面积)无法准确反映模型在置信度阈值下的实际预测可靠性问题。传统方法要么忽略置信度,要么通过平均稀释局部信息,或未能适当惩罚过度自信的错误分类,这在实际系统中可能带来严重风险。论文提出的解决方案是引入两种新的评估指标:置信度加权选择性准确率(Confidence-Weighted Selective Accuracy, CWSA)及其归一化变体CWSA+,其关键在于通过显式奖励高置信度的准确性并惩罚过度自信的错误,实现对模型在不同置信度阈值下的性能进行更精确、可解释的评估。

链接: https://arxiv.org/abs/2505.18622
作者: Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar,Pegah Ghaffari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In recent machine learning systems, confidence scores are being utilized more and more to manage selective prediction, whereby a model can abstain from making a prediction when it is unconfident. Yet, conventional metrics like accuracy, expected calibration error (ECE), and area under the risk-coverage curve (AURC) do not capture the actual reliability of predictions. These metrics either disregard confidence entirely, dilute valuable localized information through averaging, or neglect to suitably penalize overconfident misclassifications, which can be particularly detrimental in real-world systems. We introduce two new metrics Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant CWSA+ that offer a principled and interpretable way to evaluate predictive models under confidence thresholds. Unlike existing methods, our metrics explicitly reward confident accuracy and penalize overconfident mistakes. They are threshold-local, decomposable, and usable in both evaluation and deployment settings where trust and risk must be quantified. Through exhaustive experiments on both real-world data sets (MNIST, CIFAR-10) and artificial model variants (calibrated, overconfident, underconfident, random, perfect), we show that CWSA and CWSA+ both effectively detect nuanced failure modes and outperform classical metrics in trust-sensitive tests. Our results confirm that CWSA is a sound basis for developing and assessing selective prediction systems for safety-critical domains.
zh

[AI-211] Knowledge Retrieval in LLM Gaming: A Shift from Entity-Centric to Goal-Oriented Graphs

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂应用中,如游戏,缺乏有效的分步推理能力的问题。其解决方案的关键在于提出一种基于目标导向图(Goal-Oriented Graphs, GoGs)的新框架,其中每个节点代表一个目标及其相关属性,边编码目标之间的逻辑依赖关系,从而通过显式检索推理路径来增强LLMs的推理能力。

链接: https://arxiv.org/abs/2505.18607
作者: Jonathan Leung,Yongjie Wang,Zhiqi Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate impressive general capabilities but often struggle with step-by-step reasoning, especially in complex applications such as games. While retrieval-augmented methods like GraphRAG attempt to bridge this gap through cross-document extraction and indexing, their fragmented entity-relation graphs and overly dense local connectivity hinder the construction of coherent reasoning. In this paper, we propose a novel framework based on Goal-Oriented Graphs (GoGs), where each node represents a goal and its associated attributes, and edges encode logical dependencies between goals. This structure enables explicit retrieval of reasoning paths by first identifying high-level goals and recursively retrieving their subgoals, forming coherent reasoning chains to guide LLM prompting. Our method significantly enhances the reasoning ability of LLMs in game-playing tasks, as demonstrated by extensive experiments on the Minecraft testbed, outperforming GraphRAG and other baselines.
zh

[AI-212] LLM -Meta-SR: Learning to Evolve Selection Operators for Symbolic Regression

【速读】:该论文旨在解决生成式 AI (Generative AI) 在符号回归(symbolic regression)中自动设计选择算子的局限性,传统方法依赖人工专家设计,存在代码冗余和语义引导不足的问题。其解决方案的关键在于提出一种“学习-进化”框架,通过引入冗余控制和语义感知的选择算子,以及在提示中嵌入领域知识,以提升算法的可解释性和进化效率。实验结果表明,该方法能够生成优于九种人工设计基线的选择算子,达到当前最优性能。

链接: https://arxiv.org/abs/2505.18602
作者: Hengzhe Zhang,Qi Chen,Bing Xue,Mengjie Zhang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized algorithm development, yet their application in symbolic regression, where algorithms automatically discover symbolic expressions from data, remains constrained and is typically designed manually by human experts. In this paper, we propose a learning-to-evolve framework that enables LLMs to automatically design selection operators for evolutionary symbolic regression algorithms. We first identify two key limitations in existing LLM-based algorithm evolution techniques: code bloat and a lack of semantic guidance. Bloat results in unnecessarily complex components, and the absence of semantic awareness can lead to ineffective exchange of useful code components, both of which can reduce the interpretability of the designed algorithm or hinder evolutionary learning progress. To address these issues, we enhance the LLM-based evolution framework for meta symbolic regression with two key innovations: bloat control and a complementary, semantics-aware selection operator. Additionally, we embed domain knowledge into the prompt, enabling the LLM to generate more effective and contextually relevant selection operators. Our experimental results on symbolic regression benchmarks show that LLMs can devise selection operators that outperform nine expert-designed baselines, achieving state-of-the-art performance. This demonstrates that LLMs can exceed expert-level algorithm design for symbolic regression.
zh

[AI-213] LLM s for Supply Chain Management

【速读】:该论文旨在解决供应链管理(Supply Chain Management, SCM)中复杂协作与竞争问题的建模与分析,以及如何利用大型语言模型(Large Language Models, LLMs)提升供应链任务的性能。其解决方案的关键在于提出一种检索增强生成(Retrieval-Augmented Generation, RAG)框架,该框架动态地将外部知识整合到推理过程中,并开发了一个领域专用的SCM LLM,该模型通过标准化的SCM考试和啤酒游戏测试,展现出专家级的能力。此外,通过LLMs进行水平和垂直供应链博弈分析,揭示了供应链中的竞争与合作关系,验证了模型在经典供应链文献中的洞察力及对新兴现象的新视角。

链接: https://arxiv.org/abs/2505.18597
作者: Haojie Wang,Jiuyun Jiang,L. Jeff Hong,Guangxin Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注:

点击查看摘要

Abstract:The development of large language models (LLMs) has provided new tools for research in supply chain management (SCM). In this paper, we introduce a retrieval-augmented generation (RAG) framework that dynamically integrates external knowledge into the inference process, and develop a domain-specialized SCM LLM, which demonstrates expert-level competence by passing standardized SCM examinations and beer game tests. We further employ the use of LLMs to conduct horizontal and vertical supply chain games, in order to analyze competition and cooperation within supply chains. Our experiments show that RAG significantly improves performance on SCM tasks. Moreover, game-theoretic analysis reveals that the LLM can reproduce insights from the classical SCM literature, while also uncovering novel behaviors and offering fresh perspectives on phenomena such as the bullwhip effect. This paper opens the door for exploring cooperation and competition for complex supply chain network through the lens of LLMs.
zh

[AI-214] MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

【速读】:该论文旨在解决合作多智能体环境中离线模仿学习(offline imitation learning, IL)的问题,其中演示数据包含未标记的混合质量轨迹,即同时包含专家轨迹和次优轨迹。其解决方案的关键在于分两阶段进行:第一阶段通过结合大语言模型和基于偏好的强化学习构建渐进式标签管道,以区分专家质量的轨迹;第二阶段引入MisoDICE算法,该算法通过利用这些标签来学习鲁棒策略,并通过新的价值分解与混合架构将流行的单智能体DICE框架扩展到多智能体场景,从而实现凸优化的目标并保证全局与局部策略的一致性。

链接: https://arxiv.org/abs/2505.18595
作者: TheViet Bui,Tien Mai,Hong Thanh Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality - containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.
zh

[AI-215] Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?

【速读】:该论文试图解决如何评估和理解数据集在生成式 AI (Generative AI) 模型中的适用性问题,特别是如何通过探测技术揭示模型内部特征空间与人类可解释概念之间的关系。其解决方案的关键在于通过量化分析探测性能与模型响应不确定性之间的强相关性,发现探测性能的提升与响应不确定性的降低呈正相关,并进一步通过特征重要性分析揭示高模型响应方差与大量重要特征之间的关联,从而为优化探测训练和理解模型内部表示提供理论依据。

链接: https://arxiv.org/abs/2505.18575
作者: Yongjie Wang,Yibo Wang,Xin Zhou,Zhiqi Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 Pages

点击查看摘要

Abstract:Probing techniques have shown promise in revealing how LLMs encode human-interpretable concepts, particularly when applied to curated datasets. However, the factors governing a dataset’s suitability for effective probe training are not well-understood. This study hypothesizes that probe performance on such datasets reflects characteristics of both the LLM’s generated responses and its internal feature space. Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds to a reduction in response uncertainty, and vice versa. Subsequently, we delve deeper into this correlation through the lens of feature importance analysis. Our findings indicate that high LLM response variance is associated with a larger set of important features, which poses a greater challenge for probe models and often results in diminished performance. Moreover, leveraging the insights from response uncertainty analysis, we are able to identify concrete examples where LLM representations align with human knowledge across diverse domains, offering additional evidence of interpretable reasoning in LLMs.
zh

[AI-216] Autocomp: LLM -Driven Code Optimization for Tensor Accelerators

【速读】:该论文试图解决在专用张量加速器上编写高效代码的挑战,尽管已有大量努力用于构建编译器,但编程这些张量加速器仍然具有挑战性,导致其潜力未被充分挖掘。解决方案的关键在于提出Autocomp方法,该方法通过自动化的大语言模型(Large Language Model, LLM)驱动搜索,使加速器程序员能够利用领域知识和硬件反馈来优化代码。其核心创新包括:将每个优化步骤形式化为结构化的两阶段提示(规划与代码生成阶段),在规划阶段通过简洁且可调整的优化菜单插入领域知识,并在每次搜索迭代中整合来自硬件的正确性和性能指标作为反馈。

链接: https://arxiv.org/abs/2505.18574
作者: Charles Hong,Sahil Bhatia,Alvin Cheung,Yakun Sophia Shao
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today’s computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
zh

[AI-217] MASTER: Multi-Agent Security Through Exploration of Roles and Topological Structures – A Comprehensive Framework

【速读】:该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的多智能体系统(Multi-Agent Systems, MAS)在面对攻击时所面临的严重安全风险问题。解决方案的关键在于提出一种名为MASTER的安全研究框架,该框架聚焦于不同场景下的角色配置和拓扑结构,并通过信息流交互范式实现自动化构建多种MAS设置,同时设计了一种基于角色和拓扑信息的可扩展、场景自适应的攻击策略,以动态分配针对特定领域的攻击任务,从而有效提升MAS的抗攻击能力。

链接: https://arxiv.org/abs/2505.18572
作者: Yifan Zhu,Chao Zhang,Xin Shi,Xueqiao Zhang,Yi Yang,Yawei Luo
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs)-based Multi-Agent Systems (MAS) exhibit remarkable problem-solving and task planning capabilities across diverse domains due to their specialized agentic roles and collaborative interactions. However, this also amplifies the severity of security risks under MAS attacks. To address this, we introduce MASTER, a novel security research framework for MAS, focusing on diverse Role configurations and Topological structures across various scenarios. MASTER offers an automated construction process for different MAS setups and an information-flow-based interaction paradigm. To tackle MAS security challenges in varied scenarios, we design a scenario-adaptive, extensible attack strategy utilizing role and topological information, which dynamically allocates targeted, domain-specific attack tasks for collaborative agent execution. Our experiments demonstrate that such an attack, leveraging role and topological information, exhibits significant destructive potential across most models. Additionally, we propose corresponding defense strategies, substantially enhancing MAS resilience across diverse scenarios. We anticipate that our framework and findings will provide valuable insights for future research into MAS security challenges.
zh

[AI-218] PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning

【速读】:该论文试图解决分布式训练中由于梯度聚合带来的高通信开销问题,尤其是在带宽受限条件下,现有梯度压缩方案难以在提升训练速度的同时保持模型精度。解决方案的关键在于提出PacTrain框架,通过结合剪枝与稀疏梯度压缩技术,使模型权重和梯度变得稀疏,并在所有分布式训练节点间确保梯度稀疏性的全局一致性,从而实现轻量级压缩通信而不损害模型精度。

链接: https://arxiv.org/abs/2505.18563
作者: Yisu Wang,Ruilong Wu,Xinjiao Li,Dirk Kutscher
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-scale deep neural networks (DNN) exhibit excellent performance for various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting gradient aggregation overhead. While gradient compression and sparse collective communication techniques are commonly employed to alleviate network load, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. This paper introduces PacTrain, a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse. By ensuring the global knowledge of the gradient sparsity among all distributed training workers, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all-reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72 times compared to state-of-the-art compression-enabled systems for representative vision and language models training tasks under bandwidth-constrained conditions.
zh

[AI-219] RoleRAG : Enhancing LLM Role-Playing via Graph Guided Retrieval

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在角色扮演对话中生成内容与角色背景不一致或无关的问题。解决方案的关键在于提出一种基于检索的框架——RoleRAG,该框架通过高效的实体消歧技术实现角色特定知识的索引,并结合具有边界感知能力的检索器,从结构化知识图谱中提取上下文相关的信息,从而提升模型对角色知识的对齐能力和减少幻觉响应。

链接: https://arxiv.org/abs/2505.18541
作者: Yongjie Wang,Jonathan Leung,Zhiqi Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: A Retrieval-enhanced LLM Role-playing

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise in character imitation, enabling immersive and engaging conversations. However, they often generate content that is irrelevant or inconsistent with a character’s background. We attribute these failures to: (1) the inability to accurately recall character-specific knowledge due to entity ambiguity, and (2) a lack of awareness of the character’s cognitive boundaries. To address these issues, we propose RoleRAG, a retrieval-based framework that integrates efficient entity disambiguation for knowledge indexing with a boundary-aware retriever for extracting contextually appropriate information from a structured knowledge graph. Experiments on role-playing benchmarks show that RoleRAG’s calibrated retrieval helps both general-purpose and role-specific LLMs better align with character knowledge and reduce hallucinated responses.
zh

[AI-220] MRGAgents : A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs

【速读】:该论文试图解决医学影像报告生成中存在的重要问题,即现有的医学大视觉-语言模型(Medical Large Vision-Language Models, Med-LVLMs)倾向于将所有发现预测为正常,导致报告忽略关键异常,并且无法提供足够的放射学相关区域的全面描述。解决方案的关键在于提出医学报告生成代理(Medical Report Generation Agents, MRGAgents),这是一个多代理框架,通过针对不同疾病类别进行微调,生成更平衡地反映正常与异常发现并确保临床相关区域全面描述的报告。

链接: https://arxiv.org/abs/2505.18530
作者: Pengyu Wang,Shuchang Ye,Usman Naseem,Jinman Kim
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 10pages

点击查看摘要

Abstract:Medical Large Vision-Language Models (Med-LVLMs) have been widely adopted for medical report generation. Despite Med-LVLMs producing state-of-the-art performance, they exhibit a bias toward predicting all findings as normal, leading to reports that overlook critical abnormalities. Furthermore, these models often fail to provide comprehensive descriptions of radiologically relevant regions necessary for accurate diagnosis. To address these challenges, we proposeMedical Report Generation Agents (MRGAgents), a novel multi-agent framework that fine-tunes specialized agents for different disease categories. By curating subsets of the IU X-ray and MIMIC-CXR datasets to train disease-specific agents, MRGAgents generates reports that more effectively balance normal and abnormal findings while ensuring a comprehensive description of clinically relevant regions. Our experiments demonstrate that MRGAgents outperformed the state-of-the-art, improving both report comprehensiveness and diagnostic utility.
zh

[AI-221] CLaDMoP: Learning Transferrable Models from Successful Clinical Trials via LLM s KDD2025

【速读】:该论文试图解决临床试验结果预测模型在学习可泛化表示方面的局限性,这些问题通常源于使用任务特定损失函数对阶段特定数据进行优化,导致假阳性/假阴性增加。其解决方案的关键是提出一种新的预训练方法CLaDMoP,该方法通过“配对匹配”代理任务进行预训练,避免依赖任务特定目标,并利用大语言模型与轻量级药物分子分支的多层级融合技术,结合分组块结构高效融合长嵌入,从而提升模型性能。

链接: https://arxiv.org/abs/2505.18527
作者: Yiqing Zhang,Xiaozhong Liu,Fabricio Murai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted and to be published in KDD2025

点击查看摘要

Abstract:Many existing models for clinical trial outcome prediction are optimized using task-specific loss functions on trial phase-specific data. While this scheme may boost prediction for common diseases and drugs, it can hinder learning of generalizable representations, leading to more false positives/negatives. To address this limitation, we introduce CLaDMoP, a new pre-training approach for clinical trial outcome prediction, alongside the Successful Clinical Trials dataset(SCT), specifically designed for this task. CLaDMoP leverages a Large Language Model-to encode trials’ eligibility criteria-linked to a lightweight Drug-Molecule branch through a novel multi-level fusion technique. To efficiently fuse long embeddings across levels, we incorporate a grouping block, drastically reducing computational overhead. CLaDMoP avoids reliance on task-specific objectives by pre-training on a “pair matching” proxy task. Compared to established zero-shot and few-shot baselines, our method significantly improves both PR-AUC and ROC-AUC, especially for phase I and phase II trials. We further evaluate and perform ablation on CLaDMoP after Parameter-Efficient Fine-Tuning, comparing it to state-of-the-art supervised baselines, including MEXA-CTP, on the Trial Outcome Prediction(TOP) benchmark. CLaDMoP achieves up to 10.5% improvement in PR-AUC and 3.6% in ROC-AUC, while attaining comparable F1 score to MEXA-CTP, highlighting its potential for clinical trial outcome prediction. Code and SCT dataset can be downloaded from this https URL.
zh

[AI-222] LiSTEN: Learning Soft Token Embeddings for Neural Audio LLM s

【速读】:该论文旨在解决将基于大语言模型(Large Language Models, LLMs)的基础模型适应于通用音频-语言任务的挑战,这一过程受到声学环境差异和任务变化的影响。其解决方案的关键在于提出LiSTEN(Learning Soft Token Embeddings for Neural Audio LLMs),该框架采用可学习的键值对动态提示选择策略,使模型能够在多任务设置中平衡通用知识与任务特定知识,同时避免过拟合。该方法减少了对大规模自动语音识别(ASR)或字幕数据集的依赖,实现了在较少可训练参数下的竞争性性能,并通过单阶段训练流程简化了训练过程。

链接: https://arxiv.org/abs/2505.18517
作者: Pooneh Mousavi,Shubham Gupta,Cem Subakan,Mirco Ravanelli
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
zh

[AI-223] st-Time Adaptation with Binary Feedback ICML2025

【速读】:该论文试图解决在训练数据与测试数据存在领域偏移(domain shift)时,深度学习模型性能下降的问题。现有测试时间自适应(TTA)方法在严重领域偏移下表现不佳,而需要全类别标签的主动TTA方法由于标注成本高而不具实用性。论文提出了一种新的TTA设置,即使用少量二元反馈输入来指示模型预测是否正确,从而显著降低标注负担。其解决方案的关键在于提出BiTTA框架,该框架通过强化学习平衡了基于二元反馈的不确定样本适应与基于一致性的自信预测自适应。

链接: https://arxiv.org/abs/2505.18514
作者: Taeckyung Lee,Sorn Chottananurak,Junsu Kim,Jinwoo Shin,Taesik Gong,Sung-Ju Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a new setting of TTA with binary feedback. This setting uses a few binary feedback inputs from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under the setting, we propose BiTTA, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves 13.3%p accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort. The source code is available at this https URL.
zh

[AI-224] How Particle System Theory Enhances Hypergraph Message Passing

【速读】:该论文旨在解决传统图神经网络在处理高阶关系时的局限性,特别是在捕捉复杂交互、避免过平滑(over-smoothing)和异质性(heterophily)问题上的不足。其解决方案的关键在于引入一种受相互作用粒子系统启发的超图消息传递框架,其中超边作为场诱导节点的共享动态,通过结合吸引、排斥和Allen-Cahn强制项,实现不同类别和特征的粒子达到类依赖的平衡状态,从而通过粒子驱动的消息传递实现可分性。该方法通过一阶和二阶粒子系统方程建模动态,有效缓解了过平滑和异质性问题,并通过引入随机元素增强确定性消息传递以应对交互不确定性。

链接: https://arxiv.org/abs/2505.18505
作者: Yixuan Ma,Kai Yi,Pietro Lio,Shi Jin,Yu Guang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hypergraphs effectively model higher-order relationships in natural phenomena, capturing complex interactions beyond pairwise connections. We introduce a novel hypergraph message passing framework inspired by interacting particle systems, where hyperedges act as fields inducing shared node dynamics. By incorporating attraction, repulsion, and Allen-Cahn forcing terms, particles of varying classes and features achieve class-dependent equilibrium, enabling separability through the particle-driven message passing. We investigate both first-order and second-order particle system equations for modeling these dynamics, which mitigate over-smoothing and heterophily thus can capture complete interactions. The more stable second-order system permits deeper message passing. Furthermore, we enhance deterministic message passing with stochastic element to account for interaction uncertainties. We prove theoretically that our approach mitigates over-smoothing by maintaining a positive lower bound on the hypergraph Dirichlet energy during propagation and thus to enable hypergraph message passing to go deep. Empirically, our models demonstrate competitive performance on diverse real-world hypergraph node classification tasks, excelling on both homophilic and heterophilic datasets.
zh

[AI-225] G1: Teaching LLM s to Reason on Graphs with Reinforcement Learning

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在图相关任务中表现不足的问题,这一局限性阻碍了构建真正通用模型的发展。其解决方案的关键在于引入G1方法,通过在合成图论任务上使用强化学习(Reinforcement Learning, RL),显著提升LLMs的图推理能力。为支持RL训练,研究者构建了Erdõs数据集,这是目前最大的图推理数据集,包含50个不同难度的图论任务、10万条训练数据和5千条测试数据,所有数据均来源于真实图结构。实验表明,基于Erdõs进行RL训练的模型在图推理任务中表现出色,且具备良好的零样本泛化能力,无需牺牲通用推理能力即可适应未见过的任务、领域和图编码方案。

链接: https://arxiv.org/abs/2505.18499
作者: Xiaojun Guo,Ang Li,Yifei Wang,Stefanie Jegelka,Yisen Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs’ graph reasoning abilities. To enable RL training, we curate Erdõs, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all drived from real-world graphs. With RL on Erdõs, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully.
zh

[AI-226] FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中基于低秩适配(Low-Rank Adaptation, LoRA)的异构模型训练中存在的收敛性问题。现有方法由于参数截断和梯度更新偏差导致缺乏严格的收敛保证,进而影响模型性能。其解决方案的关键在于提出FedHL框架,通过使用全秩全局模型作为校准聚合基础,消除初始对齐时的截断偏差,并通过最小化收敛上界中的梯度漂移项,推导出理论最优的聚合权重,从而确保O(1/T)\mathcal{O}(1/\sqrt{T})的收敛速率。

链接: https://arxiv.org/abs/2505.18494
作者: Zihao Peng,Jiandian Zeng,Boyuan Li,Guo Li,Shengbo Chen,Tian Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) facilitates the fine-tuning of Foundation Models (FMs) using distributed data sources, with Low-Rank Adaptation (LoRA) gaining popularity due to its low communication costs and strong performance. While recent work acknowledges the benefits of heterogeneous LoRA in FL and introduces flexible algorithms to support its implementation, our theoretical analysis reveals a critical gap: existing methods lack formal convergence guarantees due to parameter truncation and biased gradient updates. Specifically, adapting client-specific LoRA ranks necessitates truncating global parameters, which introduces inherent truncation errors and leads to subsequent inaccurate gradient updates that accumulate over training rounds, ultimately degrading performance. To address the above issues, we propose \textbfFedHL, a simple yet effective \textbfFederated Learning framework tailored for \textbfHeterogeneous \textbfLoRA. By leveraging the full-rank global model as a calibrated aggregation basis, FedHL eliminates the direct truncation bias from initial alignment with client-specific ranks. Furthermore, we derive the theoretically optimal aggregation weights by minimizing the gradient drift term in the convergence upper bound. Our analysis shows that FedHL guarantees \mathcalO(1/\sqrtT) convergence rate, and experiments on multiple real-world datasets demonstrate a 1-3% improvement over several state-of-the-art methods.
zh

[AI-227] Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions

【速读】:该论文旨在解决数学竞赛中两类具有挑战性的问题:定理证明(theorem-proving)和答案构造(answer-construction),其中涉及严格的证明和数学对象的假设与形式化验证。传统方法中,大型语言模型(Large Language Models, LLMs)虽能生成创造性的候选答案,但在形式化验证方面存在不足;而符号证明工具虽能确保严谨性,却难以高效生成创造性猜想。论文提出的解决方案是Enumerate-Conjecture-Prove (ECP)框架,其关键在于结合基于LLM的枚举与模式驱动的猜想生成,并与形式化定理证明相结合,从而提升答案构造的准确性。

链接: https://arxiv.org/abs/2505.18492
作者: Jialiang Sun,Yuzhi Tang,Ao Li,Chris J. Maddison,Kuldeep S. Meel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mathematical reasoning lies at the heart of artificial intelligence, underpinning applications in education, program verification, and research-level mathematical discovery. Mathematical competitions, in particular, present two challenging problem types: theorem-proving, requiring rigorous proofs of stated conclusions, and answer-construction, involving hypothesizing and formally verifying mathematical objects. Large Language Models (LLMs) effectively generate creative candidate answers but struggle with formal verification, while symbolic provers ensure rigor but cannot efficiently handle creative conjecture generation. We introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating LLM-based enumeration and pattern-driven conjecturing with formal theorem proving. We present ConstructiveBench, a dataset of 3,431 answer-construction problems in various math competitions with verified Lean formalizations. On the ConstructiveBench dataset, ECP improves the accuracy of answer construction from the Chain-of-Thought (CoT) baseline of 14.54% to 45.06% with the gpt-4.1-mini model. Moreover, combining with ECP’s constructed answers, the state-of-the-art DeepSeek-Prover-V2-7B model generates correct proofs for 858 of the 3,431 constructive problems in Lean, achieving 25.01% accuracy, compared to 9.86% for symbolic-only baselines. Our code and dataset are publicly available at GitHub and HuggingFace, respectively.
zh

[AI-228] Retrieval Augmented Decision-Making: A Requirements-Driven Multi-Criteria Framework for Structured Decision Support

【速读】:该论文试图解决工业领域中结构复杂且内容碎片化的文档在检索与理解方面对专家和决策者带来的挑战,现有基于大语言模型(Large Language Model, LLM)的检索增强生成方法缺乏定量权重和可追溯的推理路径,难以提供多层级和透明的决策支持。解决方案的关键在于提出RAD方法,该方法将多准则决策(Multi-Criteria Decision Making)与LLM的语义理解能力相结合,通过自动提取关键准则、构建加权层次化决策模型,并在模型引导下生成结构化报告,从而实现决策过程的准确性、完整性和可追溯性。

链接: https://arxiv.org/abs/2505.18483
作者: Hongjia Wu,Hongxin Zhang,Wei Chen,Jiazhi Xia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Various industries have produced a large number of documents such as industrial plans, technical guidelines, and regulations that are structurally complex and content-wise fragmented. This poses significant challenges for experts and decision-makers in terms of retrieval and understanding. Although existing LLM-based Retrieval-Augmented Generation methods can provide context-related suggestions, they lack quantitative weighting and traceable reasoning paths, making it difficult to offer multi-level and transparent decision support. To address this issue, this paper proposes the RAD method, which integrates Multi-Criteria Decision Making with the semantic understanding capabilities of LLMs. The method automatically extracts key criteria from industry documents, builds a weighted hierarchical decision model, and generates structured reports under model guidance. The RAD framework introduces explicit weight assignment and reasoning chains in decision generation to ensure accuracy, completeness, and traceability. Experiments show that in various decision-making tasks, the decision reports generated by RAD significantly outperform existing methods in terms of detail, rationality, and structure, demonstrating its application value and potential in complex decision support scenarios.
zh

[AI-229] Using Large Language Models to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey

【速读】:该论文试图解决现实世界图数据在结构不完整、标签分布不平衡、跨领域异构性以及动态不稳定等方面的挑战,这些挑战使得传统图学习方法在复杂、噪声或动态环境中效果受限。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的丰富语义推理能力和外部知识,以弥补传统方法在处理上述问题时的不足,从而提升图学习的鲁棒性和适应性。

链接: https://arxiv.org/abs/2505.18475
作者: Mengran Li,Pengyu Zhang,Wenbin Xing,Yijia Zheng,Klim Zaporojets,Junzhou Chen,Ronghui Zhang,Yong Zhang,Siyuan Gong,Jia Hu,Xiaolei Ma,Zhiyuan Liu,Paul Groth,Marcel Worring
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graphs are a widely used paradigm for representing non-Euclidean data, with applications ranging from social network analysis to biomolecular prediction. Conventional graph learning approaches typically rely on fixed structural assumptions or fully observed data, limiting their effectiveness in more complex, noisy, or evolving settings. Consequently, real-world graph data often violates the assumptions of traditional graph learning methods, in particular, it leads to four fundamental challenges: (1) Incompleteness, real-world graphs have missing nodes, edges, or attributes; (2) Imbalance, the distribution of the labels of nodes or edges and their structures for real-world graphs are highly skewed; (3) Cross-domain Heterogeneity, graphs from different domains exhibit incompatible feature spaces or structural patterns; and (4) Dynamic Instability, graphs evolve over time in unpredictable ways. Recent advances in Large Language Models (LLMs) offer the potential to tackle these challenges by leveraging rich semantic reasoning and external knowledge. This survey provides a comprehensive review of how LLMs can be integrated with graph learning to address the aforementioned challenges. For each challenge, we review both traditional solutions and modern LLM-driven approaches, highlighting how LLMs contribute unique advantages. Finally, we discuss open research questions and promising future directions in this emerging interdisciplinary field. To support further exploration, we have curated a repository of recent advances on graph learning challenges: this https URL.
zh

[AI-230] Invisible Tokens Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

【速读】:该论文试图解决商业不透明大语言模型服务(Commercial Opaque LLM Services, COLS)中存在的责任性问题,即用户在无法观察、验证或质疑内部操作的情况下被按令牌消耗和API调用计费。解决方案的关键在于通过多样化的审计策略,包括基于承诺、预测、行为和基于签名的方法,以及水印和可信执行环境等补充机制,来增强系统的可验证性,同时保护提供商的专有信息。论文进一步提出了一种模块化的三层审计框架,以实现执行过程、安全日志记录和用户可审计性之间的可信验证。

链接: https://arxiv.org/abs/2505.18471
作者: Guoheng Sun,Ziyao Wang,Xuandong Zhao,Bowei Tian,Zheyu Shen,Yexiao He,Jinming Xing,Ang Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern large language model (LLM) services increasingly rely on complex, often abstract operations, such as multi-step reasoning and multi-agent collaboration, to generate high-quality outputs. While users are billed based on token consumption and API usage, these internal steps are typically not visible. We refer to such systems as Commercial Opaque LLM Services (COLS). This position paper highlights emerging accountability challenges in COLS: users are billed for operations they cannot observe, verify, or contest. We formalize two key risks: \textitquantity inflation, where token and call counts may be artificially inflated, and \textitquality downgrade, where providers might quietly substitute lower-cost models or tools. Addressing these risks requires a diverse set of auditing strategies, including commitment-based, predictive, behavioral, and signature-based methods. We further explore the potential of complementary mechanisms such as watermarking and trusted execution environments to enhance verifiability without compromising provider confidentiality. We also propose a modular three-layer auditing framework for COLS and users that enables trustworthy verification across execution, secure logging, and user-facing auditability without exposing proprietary internals. Our aim is to encourage further research and policy development toward transparency, auditability, and accountability in commercial LLM services.
zh

[AI-231] Chemical classification program synthesis using generative artificial intelligence

【速读】:该论文旨在解决化学结构分类的准确性与可解释性问题,特别是在大规模化学数据库中实现高效、可解释的分类。传统方法依赖人工制定的分类规则或缺乏可解释性的深度学习模型,难以满足实际需求。本文提出的解决方案关键在于利用生成式AI(Generative AI)自动生成化学分类程序,这些程序能够对SMILES结构进行高效的确定性运行时分类,并提供自然语言解释,从而构建了一个可解释的计算本体模型,即ChEBI Chemical Class Program Ontology (C3PO)。

链接: https://arxiv.org/abs/2505.18470
作者: Christopher J. Mungall,Adnan Malik,Daniel R. Korn,Justin T. Reese,Noel M. O’Boyle,Noel,Janna Hastings
机构: 未知
类目: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Accurately classifying chemical structures is essential for cheminformatics and bioinformatics, including tasks such as identifying bioactive compounds of interest, screening molecules for toxicity to humans, finding non-organic compounds with desirable material properties, or organizing large chemical libraries for drug discovery or environmental monitoring. However, manual classification is labor-intensive and difficult to scale to large chemical databases. Existing automated approaches either rely on manually constructed classification rules, or the use of deep learning methods that lack explainability. This work presents an approach that uses generative artificial intelligence to automatically write chemical classifier programs for classes in the Chemical Entities of Biological Interest (ChEBI) database. These programs can be used for efficient deterministic run-time classification of SMILES structures, with natural language explanations. The programs themselves constitute an explainable computable ontological model of chemical class nomenclature, which we call the ChEBI Chemical Class Program Ontology (C3PO). We validated our approach against the ChEBI database, and compared our results against state of the art deep learning models. We also demonstrate the use of C3PO to classify out-of-distribution examples taken from metabolomics repositories and natural product databases. We also demonstrate the potential use of our approach to find systematic classification errors in existing chemical databases, and show how an ensemble artificial intelligence approach combining generated ontologies, automated literature search, and multimodal vision models can be used to pinpoint potential errors requiring expert validation Subjects: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM) Cite as: arXiv:2505.18470 [cs.AI] (or arXiv:2505.18470v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.18470 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-232] Performance and Generalizability Impacts of Incorporating Geolocation into Deep Learning for Dynamic PM2.5 Estimation

【速读】:该论文旨在解决在动态且空间异质性强的地理应用中,如何有效利用地理定位信息以提升深度学习模型的性能与地理泛化能力的问题。其解决方案的关键在于探索不同类型的地理编码方法,包括排除地理信息作为基线、使用原始地理坐标以及利用预训练的地理编码器(如GeoCLIP),并通过区域内部(WR)和区域外(OoR)评估场景验证这些方法的效果,结果显示预训练地理编码器在提升模型预测精度与泛化能力方面表现更为优越。

链接: https://arxiv.org/abs/2505.18461
作者: Morteza Karimzadeh,Zhongying Wang,James L. Crooks
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning models have demonstrated success in geospatial applications, yet quantifying the role of geolocation information in enhancing model performance and geographic generalizability remains underexplored. A new generation of location encoders have emerged with the goal of capturing attributes present at any given location for downstream use in predictive modeling. Being a nascent area of research, their evaluation has remained largely limited to static tasks such as species distributions or average temperature mapping. In this paper, we discuss and quantify the impact of incorporating geolocation into deep learning for a real-world application domain that is characteristically dynamic (with fast temporal change) and spatially heterogeneous at high resolutions: estimating surface-level daily PM2.5 levels using remotely sensed and ground-level data. We build on a recently published deep learning-based PM2.5 estimation model that achieves state-of-the-art performance on data observed in the contiguous United States. We examine three approaches for incorporating geolocation: excluding geolocation as a baseline, using raw geographic coordinates, and leveraging pretrained location encoders. We evaluate each approach under within-region (WR) and out-of-region (OoR) evaluation scenarios. Aggregate performance metrics indicate that while naïve incorporation of raw geographic coordinates improves within-region performance by retaining the interpolative value of geographic location, it can hinder generalizability across regions. In contrast, pretrained location encoders like GeoCLIP enhance predictive performance and geographic generalizability for both WR and OoR scenarios. However, qualitative analysis reveals artifact patterns caused by high-degree basis functions and sparse upstream samples in certain areas, and ablation results indicate varying performance among location encoders…
zh

[AI-233] EdgeAgent X: A Novel Framework for Agent ic AI at the Edge in Military Communication Networks

【速读】:该论文旨在解决军事通信网络中自主决策效率低、延迟高、吞吐量不足以及面对对抗性干扰时鲁棒性差的问题。其解决方案的关键在于提出EdgeAgentX框架,该框架融合了联邦学习(Federated Learning, FL)、多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)和对抗防御机制,通过协同优化实现网络性能的提升与安全性增强。

链接: https://arxiv.org/abs/2505.18457
作者: Abir Ray
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:This paper introduces EdgeAgentX, a novel framework integrating federated learning (FL), multi-agent reinforcement learning (MARL), and adversarial defense mechanisms, tailored for military communication networks. EdgeAgentX significantly improves autonomous decision-making, reduces latency, enhances throughput, and robustly withstands adversarial disruptions, as evidenced by comprehensive simulations.
zh

[AI-234] MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt INTERSPEECH

【速读】:该论文试图解决现有零样本文本到语音(Zero-Shot Text-To-Speech, ZS-TTS)系统仅依赖单一提示(如参考语音或文本描述)导致灵活性受限的问题。其解决方案的关键在于提出一种基于多模态提示的定制化情感ZS-TTS系统,该系统通过分离语音为内容、音色、情感和韵律,并引入多模态提示情感编码器以从不同类型的提示中提取情感信息,同时结合韵律预测器和情感一致性损失来保持生成语音的情感一致性,最终采用基于扩散的声学模型生成目标梅尔频谱图。

链接: https://arxiv.org/abs/2505.18453
作者: Zhichao Wu,Yueteng Kang,Songjun Cao,Long Ma,Qiulin Li,Qun Yang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by InterSpeech

点击查看摘要

Abstract:Most existing Zero-Shot Text-To-Speech(ZS-TTS) systems generate the unseen speech based on single prompt, such as reference speech or text descriptions, which limits their flexibility. We propose a customized emotion ZS-TTS system based on multi-modal prompt. The system disentangles speech into the content, timbre, emotion and prosody, allowing emotion prompts to be provided as text, image or speech. To extract emotion information from different prompts, we propose a multi-modal prompt emotion encoder. Additionally, we introduce an prosody predictor to fit the distribution of prosody and propose an emotion consistency loss to preserve emotion information in the predicted prosody. A diffusion-based acoustic model is employed to generate the target mel-spectrogram. Both objective and subjective experiments demonstrate that our system outperforms existing systems in terms of naturalness and similarity. The samples are available at this https URL.
zh

[AI-235] Breaking Silos: Adaptive Model Fusion Unlocks Better Time Series Forecasting ICML2025

【速读】:该论文试图解决时间序列预测中单一模型无法在所有测试样本上持续优于其他模型的问题,即不同模型在特定情况下表现优异,但缺乏普适性。解决方案的关键在于提出TimeFuse框架,通过样本级别的自适应融合策略,利用元特征对输入时间序列进行表征,并训练一个可学习的融合器(fusor)来为每个输入预测最优的模型融合权重,从而实现对多种异构模型优势的动态整合。

链接: https://arxiv.org/abs/2505.18442
作者: Zhining Liu,Ze Yang,Xiao Lin,Ruizhong Qiu,Tianxin Wei,Yada Zhu,Hendrik Hamann,Jingrui He,Hanghang Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025. 22 pages, 6 Figures, 12 tables

点击查看摘要

Abstract:Time-series forecasting plays a critical role in many real-world applications. Although increasingly powerful models have been developed and achieved superior results on benchmark datasets, through a fine-grained sample-level inspection, we find that (i) no single model consistently outperforms others across different test samples, but instead (ii) each model excels in specific cases. These findings prompt us to explore how to adaptively leverage the distinct strengths of various forecasting models for different samples. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models. TimeFuse utilizes meta-features to characterize input time series and trains a learnable fusor to predict optimal model fusion weights for any given input. The fusor can leverage samples from diverse datasets for joint training, allowing it to adapt to a wide variety of temporal patterns and thus generalize to new inputs, even from unseen datasets. Extensive experiments demonstrate the effectiveness of TimeFuse in various long-/short-term forecasting tasks, achieving near-universal improvement over the state-of-the-art individual models. Code is available at this https URL.
zh

[AI-236] Advertising in AI systems: Society must be vigilant

【速读】:该论文试图解决商业内容通过生成式 AI (Generative AI) 系统传播时所引发的透明度与监管问题。其核心问题是传统媒体的静态、可追溯的内容特性在 AI 系统中被动态、个性化和缺乏明确来源的输出所取代,从而增加了商业利益对内容生成的影响。解决方案的关键在于基于广告商、消费者和平台等关键利益相关者的诉求,提出商业化影响下的 AI 系统设计原则,并为终端用户提供识别和缓解模型输出中商业偏见的策略。

链接: https://arxiv.org/abs/2505.18425
作者: Menghua Wu,Yujia Bao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI systems have increasingly become our gateways to the Internet. We argue that just as advertising has driven the monetization of web search and social media, so too will commercial incentives shape the content served by AI. Unlike traditional media, however, the outputs of these systems are dynamic, personalized, and lack clear provenance – raising concerns for transparency and regulation. In this paper, we envision how commercial content could be delivered through generative AI-based systems. Based on the requirements of key stakeholders – advertisers, consumers, and platforms – we propose design principles for commercially-influenced AI systems. We then outline high-level strategies for end users to identify and mitigate commercial biases from model outputs. Finally, we conclude with open questions and a call to action towards these goals.
zh

[AI-237] Reinforcement Learning for Ballbot Navigation in Uneven Terrain

【速读】:该论文试图解决基于生成式AI(Generative AI)的Ballbot(球平衡机器人)控制与导航中,强化学习(Reinforcement Learning, RL)方法在能力、数据效率及局限性方面的研究空白,并填补该领域缺乏开源、RL友好的仿真平台的问题。解决方案的关键在于构建一个基于MuJoCo的开源Ballbot仿真环境,并通过适当的外部感知观测条件和奖励函数设计,使经典无模型强化学习方法能够有效应对随机生成的不平坦地形,且仅需少量数据(四到五小时,在500Hz系统下运行)。

链接: https://arxiv.org/abs/2505.18417
作者: Achkan Salehi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 8 figures, 2 tables

点击查看摘要

Abstract:Ballbot (i.e. Ball balancing robot) navigation usually relies on methods rooted in control theory (CT), and works that apply Reinforcement learning (RL) to the problem remain rare while generally being limited to specific subtasks (e.g. balance recovery). Unlike CT based methods, RL does not require (simplifying) assumptions about environment dynamics (e.g. the absence of slippage between the ball and the floor). In addition to this increased accuracy in modeling, RL agents can easily be conditioned on additional observations such as depth-maps without the need for explicit formulations from first principles, leading to increased adaptivity. Despite those advantages, there has been little to no investigation into the capabilities, data-efficiency and limitations of RL based methods for ballbot control and navigation. Furthermore, there is a notable absence of an open-source, RL-friendly simulator for this task. In this paper, we present an open-source ballbot simulation based on MuJoCo, and show that with appropriate conditioning on exteroceptive observations as well as reward shaping, policies learned by classical model-free RL methods are capable of effectively navigating through randomly generated uneven terrain, using a reasonable amount of data (four to five hours on a system operating at 500hz).
zh

[AI-238] KL-regularization Itself is Differentially Private in Bandits and RLHF

【速读】:该论文试图解决在数据驱动算法中实现差分隐私(Differential Privacy, DP)的问题,旨在通过不额外注入噪声的方式获得隐私保障。其解决方案的关键在于利用正则化(regularization)的作用,特别是KL-正则化(KL-regularization),将隐私特性嵌入到学习目标中,从而使从随机策略中采样的动作本身具备差分隐私性质。这种方法不仅提供了新的隐私保障途径,还保留了正则化在提升算法性能方面的固有优势。

链接: https://arxiv.org/abs/2505.18407
作者: Yizhou Zhang,Kishan Panaganti,Laixi Shi,Juba Ziani,Adam Wierman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP ``for free’'. In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.
zh

[AI-239] hought calibration: Efficient and confident test-time scaling

【速读】:该论文试图解决在推理型大语言模型中,通过延长思考时间来提升测试阶段性能所带来的高昂计算成本问题。其关键解决方案是引入“思想校准”(thought calibration),通过动态判断何时可以终止思考过程,从而在不影响模型性能的前提下减少计算资源的消耗。该方法将语言模型不断增长的思维过程视为嵌套的推理树,并利用轻量级探测器对模型的隐藏表示进行分析,以识别新颖推理的停滞点,从而实现高效的计算资源分配。

链接: https://arxiv.org/abs/2505.18404
作者: Menghua Wu,Cai Zhou,Stephen Bates,Tommi Jaakkola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model’s growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model’s hidden representations, which are informative of both the reasoning structure and overall consistency of response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% in out-of-distribution data.
zh

[AI-240] owards Anonymous Neural Network Inference

【速读】:该论文试图解决神经网络推理过程中的发送者-接收者 unlinkability(不可关联性)问题,即在不暴露输入与输出方之间联系的情况下实现匿名的推理服务。解决方案的关键在于引入 funion 系统,该系统结合了 Pigeonhole 存储协议和 BACAP(blinding-and-capability)方案,从而继承了现代混音网络(mixnet)的可证明安全性。通过将输入张量匿名存储于伪随机存储位置,并利用计算服务进行处理后检索结果,该系统有效隐藏了网络流量模式和计算工作负载特征,同时将执行时间量化为公开延迟桶,实现了强元数据隐私保护。

链接: https://arxiv.org/abs/2505.18398
作者: Liao Peiyuan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce funion, a system providing end-to-end sender-receiver unlinkability for neural network inference. By leveraging the Pigeonhole storage protocol and BACAP (blinding-and-capability) scheme from the Echomix anonymity system, funion inherits the provable security guarantees of modern mixnets. Users can anonymously store input tensors in pseudorandom storage locations, commission compute services to process them via the neural network, and retrieve results with no traceable connection between input and output parties. This store-compute-store paradigm masks both network traffic patterns and computational workload characteristics, while quantizing execution timing into public latency buckets. Our security analysis demonstrates that funion inherits the strong metadata privacy guarantees of Echomix under largely the same trust assumptions, while introducing acceptable overhead for production-scale workloads. Our work paves the way towards an accessible platform where users can submit fully anonymized inference queries to cloud services.
zh

[AI-241] An Outlook on the Opportunities and Challenges of Multi-Agent AI Systems

【速读】:该论文旨在探讨多智能体人工智能系统(MAS)在分布式智能框架中的当前机遇与挑战,并提出构建稳健、可扩展且安全的MAS的关键路径。其解决方案的核心在于通过整合大规模语言模型(LLMs)、联邦优化和人机交互的最新进展,明确代理拓扑结构、协调协议及共享目标等关键概念,并识别因训练数据重叠引发的依赖性、对齐偏差和脆弱性等主要风险。通过生物启发的模拟与理论分析,论文为实际应用场景中MAS的发展提供了重要的理论支撑与实践指导。

链接: https://arxiv.org/abs/2505.18397
作者: Fangqiao Tian,An Luo,Jin Du,Xun Xian,Robert Specht,Ganghua Wang,Xuan Bi,Jiawei Zhou,Jayanth Srinivasa,Ashish Kundu,Charles Fleming,Rui Zhang,Zirui Liu,Mingyi Hong,Jie Ding
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent AI systems (MAS) offer a promising framework for distributed intelligence, enabling collaborative reasoning, planning, and decision-making across autonomous agents. This paper provides a systematic outlook on the current opportunities and challenges of MAS, drawing insights from recent advances in large language models (LLMs), federated optimization, and human-AI interaction. We formalize key concepts including agent topology, coordination protocols, and shared objectives, and identify major risks such as dependency, misalignment, and vulnerabilities arising from training data overlap. Through a biologically inspired simulation and comprehensive theoretical framing, we highlight critical pathways for developing robust, scalable, and secure MAS in real-world settings.
zh

[AI-242] Applications of Modular Co-Design for De Novo 3D Molecule Generation

【速读】:该论文试图解决生成式分子在三维结构质量上的不足问题,尤其是在保持二维有效性与拓扑稳定性的同时,难以生成高质量的三维结构。解决方案的关键在于提出Megalodon——一系列可扩展的Transformer模型,这些模型通过引入基础等变层(equivariant layers)并采用联合连续与离散去噪协同设计目标进行训练,从而增强分子生成动态的学习效果。

链接: https://arxiv.org/abs/2505.18392
作者: Danny Reidenbach,Filipp Nikitin,Olexandr Isayev,Saee Paliwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:De novo 3D molecule generation is a pivotal task in drug discovery. However, many recent geometric generative models struggle to produce high-quality 3D structures, even if they maintain 2D validity and topological stability. To tackle this issue and enhance the learning of effective molecular generation dynamics, we present Megalodon-a family of scalable transformer models. These models are enhanced with basic equivariant layers and trained using a joint continuous and discrete denoising co-design objective. We assess Megalodon’s performance on established molecule generation benchmarks and introduce new 3D structure benchmarks that evaluate a model’s capability to generate realistic molecular structures, particularly focusing on energetics. We show that Megalodon achieves state-of-the-art results in 3D molecule generation, conditional structure generation, and structure energy benchmarks using diffusion and flow matching. Furthermore, doubling the number of parameters in Megalodon to 40M significantly enhances its performance, generating up to 49x more valid large molecules and achieving energy levels that are 2-10x lower than those of the best prior generative models.
zh

[AI-243] Human-Centered AI Communication in Co-Creativity: An Initial Framework and Insights

【速读】:该论文试图解决当前共创造性人工智能(co-creative AI)系统中缺乏有效沟通机制的问题,这限制了其协作潜力。解决方案的关键在于提出一种名为FAICO的AI通信框架,该框架通过系统性文献综述和用户反馈研究,明确了AI通信的核心要素及其对用户体验的影响,并强调了上下文敏感性和人机反馈循环的重要性,以促进更自然、有效的协同创作过程。

链接: https://arxiv.org/abs/2505.18385
作者: Jeba Rezwana,Corey Ford
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2504.02526

点击查看摘要

Abstract:Effective communication between AI and humans is essential for successful human-AI co-creation. However, many current co-creative AI systems lack effective communication, which limits their potential for collaboration. This paper presents the initial design of the Framework for AI Communication (FAICO) for co-creative AI, developed through a systematic review of 107 full-length papers. FAICO presents key aspects of AI communication and their impact on user experience, offering preliminary guidelines for designing human-centered AI communication. To improve the framework, we conducted a preliminary study with two focus groups involving skilled individuals in AI, HCI, and design. These sessions sought to understand participants’ preferences for AI communication, gather their perceptions of the framework, collect feedback for refinement, and explore its use in co-creative domains like collaborative writing and design. Our findings reveal a preference for a human-AI feedback loop over linear communication and emphasize the importance of context in fostering mutual understanding. Based on these insights, we propose actionable strategies for applying FAICO in practice and future directions, marking the first step toward developing comprehensive guidelines for designing effective human-centered AI communication in co-creation.
zh

[AI-244] Dynamic Risk Assessments for Offensive Cybersecurity Agents

【速读】:该论文试图解决生成式 AI 在网络安全领域可能被用于自动化危险的进攻性网络操作所带来的安全风险问题,其核心在于现有模型审计未能充分考虑现实世界中攻击者所拥有的自由度。解决方案的关键在于引入一种扩展的威胁模型,强调在固定计算预算下,攻击者在有状态和无状态环境中可能具备的不同程度的自由度,并通过实验验证了即使在有限的计算资源(如8 H100 GPU小时)下,攻击者也能显著提升代理的网络安全能力,从而凸显了动态评估代理网络安全风险的必要性。

链接: https://arxiv.org/abs/2505.18384
作者: Boyi Wei,Benedikt Stroebl,Jiacen Xu,Joie Zhang,Zhou Li,Peter Henderson
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 26 pages, 11 figures

点击查看摘要

Abstract:Foundation models are increasingly becoming better autonomous programmers, raising the prospect that they could also automate dangerous offensive cyber-operations. Current frontier model audits probe the cybersecurity risks of such agents, but most fail to account for the degrees of freedom available to adversaries in the real world. In particular, with strong verifiers and financial incentives, agents for offensive cybersecurity are amenable to iterative improvement by would-be adversaries. We argue that assessments should take into account an expanded threat model in the context of cybersecurity, emphasizing the varying degrees of freedom that an adversary may possess in stateful and non-stateful environments within a fixed compute budget. We show that even with a relatively small compute budget (8 H100 GPU Hours in our study), adversaries can improve an agent’s cybersecurity capability on InterCode CTF by more than 40% relative to the baseline – without any external assistance. These results highlight the need to evaluate agents’ cybersecurity risk in a dynamic manner, painting a more representative picture of risk.
zh

[AI-245] RedactOR: An LLM -Powered Framework for Automatic Clinical Data De-Identification ACL2025

【速读】:该论文旨在解决临床数据隐私保护与数据效用保持之间的矛盾,特别是在AI驱动的医疗和数据分析中。现有去标识化(De-identification, De-ID)方法,包括基于规则的技术、深度学习模型和大语言模型(Large Language Models, LLMs),普遍存在召回错误、泛化能力有限和效率低下等问题,限制了其实际应用。论文提出了一种完全自动化的多模态框架RedactOR,用于对结构化和非结构化电子健康记录(Electronic Health Records, EHRs)进行去标识化,包括临床音频记录。其关键解决方案包括高效的成本控制策略,如智能路由、混合规则与LLM方法以及两步音频擦除技术,并引入基于检索的实体再词化方法以确保受保护实体的一致替换,从而提升下游应用的数据连贯性。

链接: https://arxiv.org/abs/2505.18380
作者: Praphul Singh,Charlotte Dzialo,Jangwon Kim,Sumana Srivatsa,Irfan Bulu,Sri Gadde,Krishnaram Kenthapadi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2025 Industry Track. To appear

点击查看摘要

Abstract:Ensuring clinical data privacy while preserving utility is critical for AI-driven healthcare and data analytics. Existing de-identification (De-ID) methods, including rule-based techniques, deep learning models, and large language models (LLMs), often suffer from recall errors, limited generalization, and inefficiencies, limiting their real-world applicability. We propose a fully automated, multi-modal framework, RedactOR for de-identifying structured and unstructured electronic health records, including clinical audio records. Our framework employs cost-efficient De-ID strategies, including intelligent routing, hybrid rule and LLM based approaches, and a two-step audio redaction approach. We present a retrieval-based entity relexicalization approach to ensure consistent substitutions of protected entities, thereby enhancing data coherence for downstream applications. We discuss key design desiderata, de-identification and relexicalization methodology, and modular architecture of RedactX and its integration with the Oracle Health Clinical AI system. Evaluated on the i2b2 2014 De-ID dataset using standard metrics with strict recall, our approach achieves competitive performance while optimizing token usage to reduce LLM costs. Finally, we discuss key lessons and insights from deployment in real-world AI- driven healthcare data pipelines.
zh

[AI-246] Next-token pretraining implies in-context learning

【速读】:该论文试图解决的问题是:在上下文学习(in-context learning, ICL)中,模型如何从标准的自监督下一项预测预训练中产生可预测的行为,而非将其视为一种特殊的涌现特性。解决方案的关键在于提出了一个信息论框架,该框架能够精确预测分布内ICL的动力学行为(即依赖于上下文的损失减少),并通过合成数据集的实验验证了这一理论,揭示了模型在训练过程中对上下文的必然适应性,以及模型在推理时的性能与预训练任务集合之间的数学耦合关系。

链接: https://arxiv.org/abs/2505.18373
作者: Paul M. Riechers,Henry R. Bigelow,Eric A. Alt,Adam Shai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We argue that in-context learning (ICL) predictably arises from standard self-supervised next-token pretraining, rather than being an exotic emergent property. This work establishes the foundational principles of this emergence by focusing on in-distribution ICL, demonstrating how models necessarily adapt to context when trained on token sequences, especially from non-ergodic sources. Our information-theoretic framework precisely predicts these in-distribution ICL dynamics (i.e., context-dependent loss reduction). We verify this with experiments using synthetic datasets of differing types of correlational structure, reproducing characteristic phenomena like phase transitions in training loss for induction head formation and power-law scaling of in-context loss. We further show that a model’s in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining, offering a fundamental explanation, grounded in architecture- and modality-independent principles, for such inference-time learning.
zh

[AI-247] Military AI Needs Technically-Informed Regulation to Safeguard AI Research and its Applications

【速读】:该论文试图解决AI在军事领域应用所带来的新型风险问题,特别是针对使用人工智能进行目标选择或战场决策的致命性自主武器系统(AI-powered lethal autonomous weapon systems, AI-LAWS)所引发的不可预见的升级、环境适应性差以及人类监督弱化等风险。论文认为,这些问题无法仅通过高层政策解决,有效的监管必须基于AI模型的技术行为。解决方案的关键在于提出一个以行为为基础的AI-LAWS定义,作为技术驱动监管的依据,并呼吁AI研究人员全程参与监管生命周期。

链接: https://arxiv.org/abs/2505.18371
作者: Riley Simmons-Edler,Jean Dong,Paul Lushenko,Kanaka Rajan,Ryan P. Badman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 16 pages, 2 tables, 1 figure

点击查看摘要

Abstract:Military weapon systems and command-and-control infrastructure augmented by artificial intelligence (AI) have seen rapid development and deployment in recent years. However, the sociotechnical impacts of AI on combat systems, military decision-making, and the norms of warfare have been understudied. We focus on a specific subset of lethal autonomous weapon systems (LAWS) that use AI for targeting or battlefield decisions. We refer to this subset as AI-powered lethal autonomous weapon systems (AI-LAWS) and argue that they introduce novel risks – including unanticipated escalation, poor reliability in unfamiliar environments, and erosion of human oversight – all of which threaten both military effectiveness and the openness of AI research. These risks cannot be addressed by high-level policy alone; effective regulation must be grounded in the technical behavior of AI models. We argue that AI researchers must be involved throughout the regulatory lifecycle. Thus, we propose a clear, behavior-based definition of AI-LAWS – systems that introduce unique risks through their use of modern AI – as a foundation for technically grounded regulation, given that existing frameworks do not distinguish them from conventional LAWS. Using this definition, we propose several technically-informed policy directions and invite greater participation from the AI research community in military AI policy discussions.
zh

[AI-248] Small Models Smarter Learning: The Power of Joint Task Training

【速读】:该论文试图解决小规模Transformer模型在学习特定任务时,任务难度与所需最小参数数量之间的关系问题。其关键解决方案是通过在ListOps数据集中逐步增加任务难度,并研究不同操作(如SUM、MAX、MED)的组合对模型学习效果和参数需求的影响。研究发现,联合训练多种操作可以提升模型性能并改变其行为模式,使模型能够生成类似数字的嵌入表示并具备区分奇偶性的能力,而仅训练SUM的任务则可能导致模型依赖记忆而非理解数结构。

链接: https://arxiv.org/abs/2505.18369
作者: Csaba Both,Benjamin Hoover,Hendrik Strobelt,Dmitry Krotov,Daniel Karl I. Weidele,Mauro Martino,Nima Dehmamy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability of a model to learn a task depends strongly on both the task difficulty and the model size. We aim to understand how task difficulty relates to the minimum number of parameters required for learning specific tasks in small transformer models. Our study focuses on the ListOps dataset, which consists of nested mathematical operations. We gradually increase task difficulty by introducing new operations or combinations of operations into the training data. We observe that sum modulo n is the hardest to learn. Curiously, when combined with other operations such as maximum and median, the sum operation becomes easier to learn and requires fewer parameters. We show that joint training not only improves performance but also leads to qualitatively different model behavior. We show evidence that models trained only on SUM might be memorizing and fail to capture the number structure in the embeddings. In contrast, models trained on a mixture of SUM and other operations exhibit number-like representations in the embedding space, and a strong ability to distinguish parity. Furthermore, the SUM-only model relies more heavily on its feedforward layers, while the jointly trained model activates the attention mechanism more. Finally, we show that learning pure SUM can be induced in models below the learning threshold of pure SUM, by pretraining them on MAX+MED. Our findings indicate that emergent abilities in language models depend not only on model size, but also the training curriculum.
zh

[AI-249] he Cell Must Go On: Agar.io for Continual Reinforcement Learning

【速读】:该论文旨在解决持续强化学习(Continual Reinforcement Learning, RL)中代理在动态环境中长期学习与适应的问题,此类环境的特性使得静态策略随时间失效。现有用于持续RL研究的模拟器通常在范围或复杂性上存在局限,因此研究人员常通过人工引入任务突变来改造传统回合制RL环境。本文提出的解决方案是AgarCL,这是一个基于游戏“Agar.io”的持续RL研究平台,其特点包括非回合制、高维状态空间、随机且不断变化的动力学、连续动作和部分可观测性。AgarCL的关键在于提供一个能够逐步演化出更复杂行为的环境,并通过一系列子任务对不同环境特性带来的挑战进行系统性分析。

链接: https://arxiv.org/abs/2505.18347
作者: Mohamed A. Mohamed,Kateryna Nekhomiazh,Vedant Vyas,Marcos M. Jose,Andrew Patterson,Marlos C. Machado
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual reinforcement learning (RL) concerns agents that are expected to learn continually, rather than converge to a policy that is then fixed for evaluation. Such an approach is well suited to environments the agent perceives as changing, which renders any static policy ineffective over time. The few simulators explicitly designed for empirical research in continual RL are often limited in scope or complexity, and it is now common for researchers to modify episodic RL environments by artificially incorporating abrupt task changes during interaction. In this paper, we introduce AgarCL, a research platform for continual RL that allows for a progression of increasingly sophisticated behaviour. AgarCL is based on the game this http URL, a non-episodic, high-dimensional problem featuring stochastic, ever-evolving dynamics, continuous actions, and partial observability. Additionally, we provide benchmark results reporting the performance of DQN, PPO, and SAC in both the primary, challenging continual RL problem, and across a suite of smaller tasks within AgarCL, each of which isolates aspects of the full environment and allow us to characterize the challenges posed by different aspects of the game.
zh

[AI-250] Sample Complexity of Diffusion Model Training Without Empirical Risk Minimizer Access

【速读】:该论文试图解决扩散模型在样本复杂度方面的理论分析问题,特别是现有分析在输入数据维度增加时表现不佳或依赖于不现实的假设(如访问精确的经验风险最小化器)。其解决方案的关键在于对得分估计(score estimation)进行系统性分析,将得分估计误差分解为统计误差、近似误差和优化误差,并通过这种结构化分解消除了先前分析中神经网络参数数量的指数依赖,从而首次实现了无需假设获得得分函数估计损失的经验风险最小化器即可获得样本复杂度界的结果。

链接: https://arxiv.org/abs/2505.18344
作者: Mudit Gaur,Prashant Trivedi,Sasidhar Kunapuli,Amrit Singh Bedi,Vaneet Aggarwal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of \widetilde\mathcalO(\epsilon^-6) . Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result which achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.
zh

[AI-251] CrashAgent : Crash Scenario Generation via Multi-modal Reasoning

【速读】:该论文试图解决自动驾驶算法训练与评估中因数据集主要包含正常驾驶行为而缺乏安全关键场景的问题,这种数据分布的不平衡被称为长尾分布,限制了算法从高风险或故障场景中学习的能力。解决方案的关键在于利用多模态大语言模型将交通事故报告转换为可直接在仿真环境中执行的结构化场景格式,具体通过提出CrashAgent框架,该框架能够解析多模态的真实交通碰撞报告,生成车辆自身及周围交通参与者的道路布局与行为。

链接: https://arxiv.org/abs/2505.18341
作者: Miao Li,Wenhao Ding,Haohong Lin,Yiqi Lyu,Yihang Yao,Yuyou Zhang,Ding Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training and evaluating autonomous driving algorithms requires a diverse range of scenarios. However, most available datasets predominantly consist of normal driving behaviors demonstrated by human drivers, resulting in a limited number of safety-critical cases. This imbalance, often referred to as a long-tail distribution, restricts the ability of driving algorithms to learn from crucial scenarios involving risk or failure, scenarios that are essential for humans to develop driving skills efficiently. To generate such scenarios, we utilize Multi-modal Large Language Models to convert crash reports of accidents into a structured scenario format, which can be directly executed within simulations. Specifically, we introduce CrashAgent, a multi-agent framework designed to interpret multi-modal real-world traffic crash reports for the generation of both road layouts and the behaviors of the ego vehicle and surrounding traffic participants. We comprehensively evaluate the generated crash scenarios from multiple perspectives, including the accuracy of layout reconstruction, collision rate, and diversity. The resulting high-quality and large-scale crash dataset will be publicly available to support the development of safe driving algorithms in handling safety-critical situations.
zh

[AI-252] A Critical Evaluation of Defenses against Prompt Injection Attacks

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在面对提示注入攻击(prompt injection attacks)时的安全性评估问题,以及现有防御措施有效性与通用性不足的缺陷。其解决方案的关键在于提出一种系统化的评估框架,从两个核心维度对防御措施进行检验:一是有效性,即评估防御 against 现有和适应性提示注入攻击的能力,涵盖多种目标提示和注入提示;二是通用性,确保防御不会损害LLM的基础功能。通过这一方法论,论文揭示了现有研究在评估上的不足,并证明现有防御措施的实际效果不如先前报道。

链接: https://arxiv.org/abs/2505.18333
作者: Yuqi Jia,Zedian Shao,Yupei Liu,Jinyuan Jia,Dawn Song,Neil Zhenqiang Gong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating future defenses and guiding their development. Our code and data are available at: this https URL.
zh

[AI-253] Understanding and Mitigating Overrefusal in LLM s from an Unveiling Perspective of Safety Decision Boundary

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)中存在的过度假设拒绝(overrefusal)问题,即模型对合法查询的过度保守安全对齐导致的拒绝回答现象。解决方案的关键在于通过探测和利用模型的安全决策边界,分析并缓解过度假设拒绝。研究提出了一种名为RASS的自动化框架,该框架通过表示空间中的引导向量有效识别和筛选接近安全边界的提示,从而实现对过度假设拒绝的精准和有针对性的缓解。

链接: https://arxiv.org/abs/2505.18325
作者: Licheng Pan,Yongqi Tong,Xin Zhang,Xiaolu Zhang,Jun Zhou,Zhixuan Chu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries-a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models’safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present RASS, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual this http URL have explored the safety decision boundaries of various LLMs and construct the MORBench evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets will be released at this https URL.
zh

[AI-254] Architectural Backdoors for Within-Batch Data Stealing and Model Inference Manipulation

【速读】:该论文试图解决神经网络中由架构后门(architectural backdoors)引发的用户数据泄露和隐私威胁问题,特别是针对批处理推理(batched inference)过程中存在的信息泄露风险。解决方案的关键在于提出一种确定性缓解策略,该策略通过新颖的信息流控制机制分析模型图,并证明同一批次内不同用户输入之间的非干扰性,从而提供对新型攻击向量的形式化保障。

链接: https://arxiv.org/abs/2505.18323
作者: Nicolas Küchler,Ivan Petrov,Conrad Grobler,Ilia Shumailov
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:For nearly a decade the academic community has investigated backdoors in neural networks, primarily focusing on classification tasks where adversaries manipulate the model prediction. While demonstrably malicious, the immediate real-world impact of such prediction-altering attacks has remained unclear. In this paper we introduce a novel and significantly more potent class of backdoors that builds upon recent advancements in architectural backdoors. We demonstrate how these backdoors can be specifically engineered to exploit batched inference, a common technique for hardware utilization, enabling large-scale user data manipulation and theft. By targeting the batching process, these architectural backdoors facilitate information leakage between concurrent user requests and allow attackers to fully control model responses directed at other users within the same batch. In other words, an attacker who can change the model architecture can set and steal model inputs and outputs of other users within the same batch. We show that such attacks are not only feasible but also alarmingly effective, can be readily injected into prevalent model architectures, and represent a truly malicious threat to user privacy and system integrity. Critically, to counteract this new class of vulnerabilities, we propose a deterministic mitigation strategy that provides formal guarantees against this new attack vector, unlike prior work that relied on Large Language Models to find the backdoors. Our mitigation strategy employs a novel Information Flow Control mechanism that analyzes the model graph and proves non-interference between different user inputs within the same batch. Using our mitigation strategy we perform a large scale analysis of models hosted through Hugging Face and find over 200 models that introduce (unintended) information leakage between batch entries due to the use of dynamic quantization.
zh

[AI-255] Efficient Algorithms for Electing Successive Committees IJCAI-25

【速读】:该论文试图解决在连续委员会选举模型(successive committee elections)中,根据给定的序数或批准偏好,寻找一个固定长度的“最佳”同规模委员会序列的问题,其中每个候选人在连续委员会中的参与次数受到限制。然而,该任务被证明对于大多数选择标准而言是NP-hard的,尤其是在委员会规模为三的情况下,缺乏非平凡或高效的算法。论文的关键解决方案是设计参数化算法,以在候选人数或时间范围有限的实际场景中有效解决这些难题。

链接: https://arxiv.org/abs/2505.18287
作者: Pallavi Jain,Andrzej Kaczmarczyk
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: 18 pages; 3 figures, accepted for publication in IJCAI-25

点击查看摘要

Abstract:In a recently introduced model of successive committee elections (Bredereck et al., AAAI-20) for a given set of ordinal or approval preferences one aims to find a sequence of a given length of “best” same-size committees such that each candidate is a member of a limited number of consecutive committees. However, the practical usability of this model remains limited, as the described task turns out to be NP-hard for most selection criteria already for seeking committees of size three. Non-trivial or somewhat efficient algorithms for these cases are lacking too. Motivated by a desire to unlock the full potential of the described temporal model of committee elections, we devise (parameterized) algorithms that effectively solve the mentioned hard cases in realistic scenarios of a moderate number of candidates or of a limited time horizon.
zh

[AI-256] Single-agent or Multi-agent Systems? Why Not Both?

【速读】:该论文试图解决多智能体系统(MAS)在复杂任务处理中虽然表现出优越的准确性,但其设计和部署相比单智能体系统(SAS)存在更高的复杂性和运行成本的问题。随着前沿大语言模型(LLM)在长上下文推理、记忆保持和工具使用方面的能力提升,MAS原有的优势逐渐减弱。论文的关键解决方案是通过实证研究发现MAS与SAS性能差异,并提出高效的机制来定位MAS中的易错智能体,同时设计一种混合智能体范式,实现MAS与SAS之间的请求级联,从而在提升系统能力的同时提高效率并降低部署成本。

链接: https://arxiv.org/abs/2505.18286
作者: Mingyan Gao,Yanzi Li,Banruo Liu,Yifan Yu,Phillip Wang,Ching-Yu Lin,Fan Lai
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) decompose complex tasks and delegate subtasks to different large language model (LLM) agents and tools. Prior studies have reported the superior accuracy performance of MAS across diverse domains, enabled by long-horizon context tracking and error correction through role-specific agents. However, the design and deployment of MAS incur higher complexity and runtime cost compared to single-agent systems (SAS). Meanwhile, frontier LLMs, such as OpenAI-o3 and Gemini-2.5-Pro, have rapidly advanced in long-context reasoning, memory retention, and tool usage, mitigating many limitations that originally motivated MAS designs. In this paper, we conduct an extensive empirical study comparing MAS and SAS across various popular agentic applications. We find that the benefits of MAS over SAS diminish as LLM capabilities improve, and we propose efficient mechanisms to pinpoint the error-prone agent in MAS. Furthermore, the performance discrepancy between MAS and SAS motivates our design of a hybrid agentic paradigm, request cascading between MAS and SAS, to improve both efficiency and capability. Our design improves accuracy by 1.1-12% while reducing deployment costs by up to 20% across various agentic applications.
zh

[AI-257] ube Loss based Deep Networks For Improving the Probabilistic Forecasting of Wind Speed

【速读】:该论文旨在解决风速预测中的不确定性量化(Uncertainty Quantification, UQ)问题,这是风力发电生产中的关键挑战,因为风具有固有的波动性。为了解决这一问题,作者提出了一种基于深度学习的概率预测方法,其核心在于使用Tube损失函数进行预测区间(Prediction Interval, PI)估计。Tube损失函数是一种简单且模型无关的PI估计方法,能够在不假设分布的情况下获得具有渐近覆盖保证的窄PI。该方法有效整合了LSTM、GRU和TCN等流行架构,并通过设计一种简单的启发式策略调整Tube损失函数的δ参数,从而在不牺牲校准能力的前提下获得更窄的PI。

链接: https://arxiv.org/abs/2505.18284
作者: Pritam Anand,Aadesh Minz,Asish Joel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Uncertainty Quantification (UQ) in wind speed forecasting is a critical challenge in wind power production due to the inherently volatile nature of wind. By quantifying the associated risks and returns, UQ supports more effective decision-making for grid operations and participation in the electricity market. In this paper, we design a sequence of deep learning based probabilistic forecasting methods by using the Tube loss function for wind speed forecasting. The Tube loss function is a simple and model agnostic Prediction Interval (PI) estimation approach and can obtain the narrow PI with asymptotical coverage guarantees without any distribution assumption. Our deep probabilistic forecasting models effectively incorporate popular architectures such as LSTM, GRU, and TCN within the Tube loss framework. We further design a simple yet effective heuristic for tuning the \delta parameter of the Tube loss function so that our deep forecasting models obtain the narrower PI without compromising its calibration ability. We have considered three wind datasets, containing the hourly recording of the wind speed, collected from three distinct location namely Jaisalmer, Los Angeles and San Fransico. Our numerical results demonstrate that the proposed deep forecasting models produce more reliable and narrower PIs compared to recently developed probabilistic wind forecasting methods.
zh

[AI-258] Feature Preserving Shrinkage on Bayesian Neural Networks via the R2D2 Prior

【速读】:该论文旨在解决贝叶斯神经网络(Bayesian Neural Networks, BNNs)中先验分布选择不当导致的后验不确定性估计不准确、过拟合或预测性能下降的问题。其解决方案的关键在于提出一种新的R2D2-Net,该方法通过引入R²诱导的狄利克雷分解(R2D2)先验来对BNN权重进行建模,能够有效将不相关的系数收缩至零,同时避免关键特征的过度收缩。此外,还提出了一种结合吉布斯更新过程与基于梯度优化的变分吉布斯推断算法,以更精确地近似权重的后验分布,并提升在非凸变分目标下的估计稳定性和一致性。

链接: https://arxiv.org/abs/2505.18280
作者: Tsai Hor Chan,Dora Yan Zhang,Guosheng Yin,Lequan Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: To appear in TPAMI

点击查看摘要

Abstract:Bayesian neural networks (BNNs) treat neural network weights as random variables, which aim to provide posterior uncertainty estimates and avoid overfitting by performing inference on the posterior weights. However, the selection of appropriate prior distributions remains a challenging task, and BNNs may suffer from catastrophic inflated variance or poor predictive performance when poor choices are made for the priors. Existing BNN designs apply different priors to weights, while the behaviours of these priors make it difficult to sufficiently shrink noisy signals or they are prone to overshrinking important signals in the weights. To alleviate this problem, we propose a novel R2D2-Net, which imposes the R^2-induced Dirichlet Decomposition (R2D2) prior to the BNN weights. The R2D2-Net can effectively shrink irrelevant coefficients towards zero, while preventing key features from over-shrinkage. To approximate the posterior distribution of weights more accurately, we further propose a variational Gibbs inference algorithm that combines the Gibbs updating procedure and gradient-based optimization. This strategy enhances stability and consistency in estimation when the variational objective involving the shrinkage parameters is non-convex. We also analyze the evidence lower bound (ELBO) and the posterior concentration rates from a theoretical perspective. Experiments on both natural and medical image classification and uncertainty estimation tasks demonstrate satisfactory performance of our method.
zh

[AI-259] he end of radical concept nativism

【速读】:该论文试图解决关于人类是否能够习得根本性新概念的问题,这一问题在认知科学和心灵哲学中长期存在争议。Jerry Fodor提出的激进概念先天论认为,大多数甚至所有概念都是先天的,所谓的“概念学习”实际上并不能导致新概念的获得。本文首先回顾了先前论证的特征与局限性,并识别出三个关键分歧点——表达能力、概念结构和概念拥有——这些点使得激进概念先天论的论证偏离了对实际人类认知的描述。解决方案的关键在于借助计算机科学和信息论的思想,对相关概念进行形式化,从而提出更具科学生产性的解释,最终得出结论:从某种重要意义上说,人类确实能够学习新概念。

链接: https://arxiv.org/abs/2505.18277
作者: Joshua S. Rule,Steven T. Piantadosi
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Though humans seem to be remarkable learners, arguments in cognitive science and philosophy of mind have long maintained that learning something fundamentally new is impossible. Specifically, Jerry Fodor’s arguments for radical concept nativism hold that most, if not all, concepts are innate and that what many call concept learning never actually leads to the acquisition of new concepts. These arguments have deeply affected cognitive science, and many believe that the counterarguments to radical concept nativism have been either unsuccessful or only apply to a narrow class of concepts. This paper first reviews the features and limitations of prior arguments. We then identify three critical points - related to issues of expressive power, conceptual structure, and concept possession - at which the arguments in favor of radical concept nativism diverge from describing actual human cognition. We use ideas from computer science and information theory to formalize the relevant ideas in ways that are arguably more scientifically productive. We conclude that, as a result, there is an important sense in which people do indeed learn new concepts.
zh

[AI-260] Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks

【速读】:该论文试图解决多层神经网络在模块化加法任务中表现出的看似不同的神经元级表示是否源于不同算法的问题,其核心是提出一个可验证的普遍性假设,即这些不同表现实际上统一于一个抽象算法。解决方案的关键在于通过多层次分析(包括神经元、神经元簇和整个网络)证明多层感知机和Transformer普遍实现了称为“近似中国剩余定理”的抽象算法,并引入“近似陪集”概念,表明神经元仅在这些陪集上激活。此外,该理论适用于深度神经网络(DNN),并预测具有可训练嵌入或多个隐藏层的DNN在通用学习解中仅需O(log n)特征,这一结果得到了实证验证。

链接: https://arxiv.org/abs/2505.18266
作者: Gavin McCracken,Gabriela Moisescu-Pareja,Vincent Letourneau,Doina Precup,Jonathan Love
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a testable universality hypothesis, asserting that seemingly disparate neural network solutions observed in the simple task of modular addition are unified under a common abstract algorithm. While prior work interpreted variations in neuron-level representations as evidence for distinct algorithms, we demonstrate - through multi-level analyses spanning neurons, neuron clusters, and entire networks - that multilayer perceptrons and transformers universally implement the abstract algorithm we call the approximate Chinese Remainder Theorem. Crucially, we introduce approximate cosets and show that neurons activate exclusively on them. Furthermore, our theory works for deep neural networks (DNNs). It predicts that universally learned solutions in DNNs with trainable embeddings or more than one hidden layer require only O(log n) features, a result we empirically confirm. This work thus provides the first theory-backed interpretation of multilayer networks solving modular addition. It advances generalizable interpretability and opens a testable universality hypothesis for group multiplication beyond modular addition.
zh

[AI-261] ZeroML: A Next Generation AutoML Language

【速读】:该论文试图解决传统编程语言(如Python、R或Julia)在AutoML(自动化机器学习)流水线中存在的运行速度慢、流水线脆弱以及依赖成本高等问题。解决方案的关键在于引入ZeroML,这是一种新型的编译型、多范式编程语言,其核心为纯函数式语言,并采用基于微服务的架构,提供模块化、可重用的组件(如DataCleaner、FeatureEngineer或ModelSelector)。此外,ZeroML作为原生多线程和内存感知的搜索优化工具包,具备一键部署能力,能够使非程序员和机器学习专业人员快速且更可复现地构建高精度模型。

链接: https://arxiv.org/abs/2505.18243
作者: Monirul Islam Mahmud
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:ZeroML is a new generation programming language for AutoML to drive the ML pipeline in a compiled and multi-paradigm way, with a pure functional core. Meeting the shortcomings introduced by Python, R, or Julia such as slow-running time, brittle pipelines or high dependency cost ZeroML brings the Microservices-based architecture adding the modular, reusable pieces such as DataCleaner, FeatureEngineer or ModelSelector. As a native multithread and memory-aware search optimized toolkit, and with one command deployability ability, ZeroML ensures non-coders and ML professionals to create high-accuracy models super fast and in a more reproducible way. The verbosity of the language ensures that when it comes to dropping into the backend, the code we will be creating is extremely clear but the level of repetition and boilerplate required when developing on the front end is now removed.
zh

[AI-262] Intent Classification on Low-Resource Languages with Query Similarity Search

【速读】:该论文试图解决意图分类(intent classification)在低资源语言中数据难以标注和扩展性差的问题。传统方法通常将意图分类视为一个分类问题,但由于意图定义模糊,导致数据标注困难且成本高昂,尤其在多语言和低资源语言场景下更为突出。解决方案的关键在于将意图分类转化为查询相似性搜索(query similarity search)问题,通过使用先前示例查询来定义意图,并利用查询相似性方法根据潜在空间中最相似查询的标签对新查询进行分类,从而在零样本(zero-shot)设置下实现对低资源语言查询的有效意图分类。

链接: https://arxiv.org/abs/2505.18241
作者: Arjun Bhalla,Qi Huang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intent classification is an important component of a functional Information Retrieval ecosystem. Many current approaches to intent classification, typically framed as a classification problem, can be problematic as intents are often hard to define and thus data can be difficult and expensive to annotate. The problem is exacerbated when we need to extend the intent classification system to support multiple and in particular low-resource languages. To address this, we propose casting intent classification as a query similarity search problem - we use previous example queries to define an intent, and a query similarity method to classify an incoming query based on the labels of its most similar queries in latent space. With the proposed approach, we are able to achieve reasonable intent classification performance for queries in low-resource languages in a zero-shot setting.
zh

[AI-263] From Bias to Accountability: How the EU AI Act Confronts Challenges in European GeoAI Auditing

【速读】:该论文试图解决GeoAI模型中偏见问题的碎片化研究现状,以及如何将这些偏见机制与欧盟《人工智能法案》(EU AI Act)中的审计义务相结合的问题。其解决方案的关键在于综合现有研究,识别出重复出现的偏见机制(如代表性偏差、算法偏差和聚合偏差),并将其映射到EU AI Act的具体条款,同时通过高风险标准论证广泛部署的GeoAI应用属于高风险系统,从而为未来的偏见检测和审计提供理论依据与实践方法。

链接: https://arxiv.org/abs/2505.18236
作者: Natalia Matuszczyk,Craig R. Barnes,Rohit Gupta,Bulent Ozel,Aniket Mitra
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bias in geospatial artificial intelligence (GeoAI) models has been documented, yet the evidence is scattered across narrowly focused studies. We synthesize this fragmented literature to provide a concise overview of bias in GeoAI and examine how the EU’s Artificial Intelligence Act (EU AI Act) shapes audit obligations. We discuss recurring bias mechanisms, including representation, algorithmic and aggregation bias, and map them to specific provisions of the EU AI Act. By applying the Act’s high-risk criteria, we demonstrate that widely deployed GeoAI applications qualify as high-risk systems. We then present examples of recent audits along with an outline of practical methods for detecting bias. As far as we know, this study represents the first integration of GeoAI bias evidence into the EU AI Act context, by identifying high-risk GeoAI systems and mapping bias mechanisms to the Act’s Articles. Although the analysis is exploratory, it suggests that even well-curated European datasets should employ routine bias audits before 2027, when the AI Act’s high-risk provisions take full effect.
zh

[AI-264] he Origins of Representation Manifolds in Large Language Models

【速读】:该论文试图解决如何将人工智能系统中的嵌入和内部表示映射到人类可理解的概念这一问题,特别是在理解神经网络表示中特征的编码方式方面。其解决方案的关键在于提出一种基于流形(manifold)的特征表示模型,认为特征可能不是简单的存在或不存在,而是可以编码连续且多维的值。该研究进一步表明,表示空间中的余弦相似性可能通过流形上的最短路径来编码特征的内在几何结构,从而为表示空间中的距离与概念空间中的相关性之间的联系提供了潜在解释。

链接: https://arxiv.org/abs/2505.18235
作者: Alexander Modell,Patrick Rubin-Delanchy,Nick Whiteley
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures

点击查看摘要

Abstract:There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of `almost-orthogonal’ direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths, potentially answering the question of how distance in representation space and relatedness in concept space could be connected. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
zh

[AI-265] A Robust PPO-optimized Tabular Transformer Framework for Intrusion Detection in Industrial IoT Systems

【速读】:该论文旨在解决工业互联网(IIoT)环境中网络入侵检测系统(NIDS)在类别不平衡和少量样本攻击场景下的性能问题。其解决方案的关键在于将TabTransformer用于有效的表格特征表示,并结合近端策略优化(PPO)通过策略学习优化分类决策,从而提升模型的鲁棒性和少样本检测能力。

链接: https://arxiv.org/abs/2505.18234
作者: Yuanya She
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose a robust and reinforcement-learning-enhanced network intrusion detection system (NIDS) designed for class-imbalanced and few-shot attack scenarios in Industrial Internet of Things (IIoT) environments. Our model integrates a TabTransformer for effective tabular feature representation with Proximal Policy Optimization (PPO) to optimize classification decisions via policy learning. Evaluated on the TON\textunderscore IoT benchmark, our method achieves a macro F1-score of 97.73% and accuracy of 98.85%. Remarkably, even on extremely rare classes like man-in-the-middle (MITM), our model achieves an F1-score of 88.79%, showcasing strong robustness and few-shot detection capabilities. Extensive ablation experiments confirm the complementary roles of TabTransformer and PPO in mitigating class imbalance and improving generalization. These results highlight the potential of combining transformer-based tabular learning with reinforcement learning for real-world NIDS applications.
zh

[AI-266] POSTER: A Multi-Signal Model for Detecting Evasive Smishing

【速读】:该论文试图解决的是针对移动用户的Smishing(短信钓鱼)威胁问题,这种攻击通过模仿合法通信的 culturally adapted(文化适应性)、简洁且具有欺骗性的消息,可能导致敏感数据或财务资源的损失。解决方案的关键在于提出一种多通道的Smishing检测模型,该模型结合了国家特定的语义标记、结构模式标记、字符级别的风格线索以及上下文短语嵌入,通过整合多种语言和结构特征提升检测效果。该方法在五个数据集上进行了验证,取得了97.89%的准确率、0.963的F1分数和99.73%的AUC值,优于单一流模型。

链接: https://arxiv.org/abs/2505.18233
作者: Shaghayegh Hosseinpour,Sanchari Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Smishing, or SMS-based phishing, poses an increasing threat to mobile users by mimicking legitimate communications through culturally adapted, concise, and deceptive messages, which can result in the loss of sensitive data or financial resources. In such, we present a multi-channel smishing detection model that combines country-specific semantic tagging, structural pattern tagging, character-level stylistic cues, and contextual phrase embeddings. We curated and relabeled over 84,000 messages across five datasets, including 24,086 smishing samples. Our unified architecture achieves 97.89% accuracy, an F1 score of 0.963, and an AUC of 99.73%, outperforming single-stream models by capturing diverse linguistic and structural cues. This work demonstrates the effectiveness of multi-signal learning in robust and region-aware phishing.
zh

[AI-267] NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)推理过程中由于键值(KV)缓存占用大量内存而导致的性能瓶颈问题,尤其是当处理大规模批次和长序列时。现有基于校准数据集的向量量化(Vector Quantization, VQ)方法易受分布偏移影响,限制了其适用性。论文提出的解决方案是NSNQuant,其关键在于通过三次变换——逐标记归一化(Normalize)、逐通道中心化(Shift)以及再次逐标记归一化(Normalize)——结合哈达玛变换(Hadamard transform),将标记分布对齐至标准正态分布,从而实现无需校准数据的鲁棒向量量化,使用单一可复用代码本即可完成低比特压缩。

链接: https://arxiv.org/abs/2505.18231
作者: Donghyun Son,Euntae Choi,Sungjoo Yoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, due to the large size of key-value (KV) cache. Vector Quantization (VQ) is recently adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift due to its reliance on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free Vector Quantization (VQ) technique designed for low-bit compression of the KV cache. By applying a three-step transformation-1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize)-with Hadamard transform, NSNQuant effectively aligns the token distribution with the standard normal distribution. This alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3 \times throughput gain over full-precision baselines.
zh

[AI-268] Follow the Energy Find the Path: Riemannian Metrics from Energy-Based Models

【速读】:该论文试图解决在高维空间中确定两个数据点之间最短路径的问题,这一问题在欧几里得几何中是简单的,但在数据位于弯曲流形上时则变得复杂,需要借助黎曼度量来描述空间的局部曲率。解决方案的关键在于从预训练的能量基础模型(Energy-Based Models, EBMs)中直接推导出黎曼度量,这些模型能够为高密度区域分配较低的能量值,从而定义随空间变化的距离,并计算遵循数据流形内在几何结构的测地线。该方法通过引入两种新颖的度量,使测地线更贴近数据流形并表现出更低的曲率失真,实验结果表明其在高维设置下优于现有基线方法。

链接: https://arxiv.org/abs/2505.18230
作者: Louis Béthune,David Vigouroux,Yilun Du,Rufin VanRullen,Thomas Serre,Victor Boutin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:What is the shortest path between two data points lying in a high-dimensional space? While the answer is trivial in Euclidean geometry, it becomes significantly more complex when the data lies on a curved manifold – requiring a Riemannian metric to describe the space’s local curvature. Estimating such a metric, however, remains a major challenge in high dimensions. In this work, we propose a method for deriving Riemannian metrics directly from pretrained Energy-Based Models (EBMs) – a class of generative models that assign low energy to high-density regions. These metrics define spatially varying distances, enabling the computation of geodesics – shortest paths that follow the data manifold’s intrinsic geometry. We introduce two novel metrics derived from EBMs and show that they produce geodesics that remain closer to the data manifold and exhibit lower curvature distortion, as measured by alignment with ground-truth trajectories. We evaluate our approach on increasingly complex datasets: synthetic datasets with known data density, rotated character images with interpretable geometry, and high-resolution natural images embedded in a pretrained VAE latent space. Our results show that EBM-derived metrics consistently outperform established baselines, especially in high-dimensional settings. Our work is the first to derive Riemannian metrics from EBMs, enabling data-aware geodesics and unlocking scalable, geometry-driven learning for generative modeling and simulation. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.18230 [cs.LG] (or arXiv:2505.18230v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.18230 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-269] BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs

【速读】:该论文试图解决当前基于无人机的具身智能体(UAV-Embodied Agents, UAV-EAs)评估方法缺乏标准化基准、多样化测试场景和开放系统接口的问题。其解决方案的关键在于提出BEDI(Benchmark for Embodied Drone Intelligence),一个系统化且标准化的评估基准,通过引入基于感知-决策-行动循环的动态具身任务链范式,将复杂的无人机任务分解为可标准化和量化的子任务,并构建包含语义感知、空间感知、运动控制、工具使用和任务规划五大核心子技能的统一评估框架,同时搭建融合静态现实环境与动态虚拟场景的混合测试平台,提供开放且标准化的接口以支持任务定制与场景扩展,从而提升评估的灵活性与可扩展性。

链接: https://arxiv.org/abs/2505.18229
作者: Mingning Guo,Mengwei Wu,Jiarun He,Shaoxian Li,Haifeng Li,Chao Tao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid advancement of low-altitude remote sensing and Vision-Language Models (VLMs), Embodied Agents based on Unmanned Aerial Vehicles (UAVs) have shown significant potential in autonomous tasks. However, current evaluation methods for UAV-Embodied Agents (UAV-EAs) remain constrained by the lack of standardized benchmarks, diverse testing scenarios and open system interfaces. To address these challenges, we propose BEDI (Benchmark for Embodied Drone Intelligence), a systematic and standardized benchmark designed for evaluating UAV-EAs. Specifically, we introduce a novel Dynamic Chain-of-Embodied-Task paradigm based on the perception-decision-action loop, which decomposes complex UAV tasks into standardized, measurable subtasks. Building on this paradigm, we design a unified evaluation framework encompassing five core sub-skills: semantic perception, spatial perception, motion control, tool utilization, and task planning. Furthermore, we construct a hybrid testing platform that integrates static real-world environments with dynamic virtual scenarios, enabling comprehensive performance assessment of UAV-EAs across varied contexts. The platform also offers open and standardized interfaces, allowing researchers to customize tasks and extend scenarios, thereby enhancing flexibility and scalability in the evaluation process. Finally, through empirical evaluations of several state-of-the-art (SOTA) VLMs, we reveal their limitations in embodied UAV tasks, underscoring the critical role of the BEDI benchmark in advancing embodied intelligence research and model optimization. By filling the gap in systematic and standardized evaluation within this field, BEDI facilitates objective model comparison and lays a robust foundation for future development in this field. Our benchmark will be released at this https URL .
zh

[AI-270] oken Reduction Should Go Beyond Efficiency in Generative Models – From Vision Language to Multimodality

【速读】:该论文试图解决在大型生成模型时代,传统以效率为导向的token reduction(令牌缩减)方法未能充分发挥其在生成建模中的潜力的问题。论文提出,token reduction应被视为生成建模中的一个基础性原则,而不仅仅是优化计算效率的手段。解决方案的关键在于重新定义token reduction的角色,强调其在促进多模态融合与对齐、减少“过度思考”和幻觉、保持长输入的一致性以及提升训练稳定性等方面的重要性,并探索其在算法设计、强化学习引导的token缩减、上下文学习中的token优化以及更广泛的机器学习和科学领域中的应用前景。

链接: https://arxiv.org/abs/2505.18227
作者: Zhenglun Kong,Yize Li,Fanhu Zeng,Lei Xin,Shvat Messica,Xue Lin,Pu Zhao,Manolis Kellis,Hao Tang,Marinka Zitnik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input’s essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate “overthinking” and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
zh

[AI-271] A Domain Ontology for Modeling the Book of Purification in Islam

【速读】:该论文旨在填补伊斯兰教核心主题中的一个空白,通过为《净化书》(Book of Purification)构建本体论来实现这一目标。《净化书》是许多权威伊斯兰文本的开篇,因其在履行礼拜(伊斯兰教第二支柱,仅次于信仰宣誓)及其他宗教义务如副朝和正朝中的重要性而被重视。解决方案的关键在于遵循六步本体论开发策略:领域识别、知识获取、概念化、分类、集成与实施以及本体生成。通过正式定义和编码与《净化书》相关的关键概念、属性和关系,所开发的本体论确保了可重用性,并旨在支持知识共享与再利用。

链接: https://arxiv.org/abs/2505.18222
作者: Hessa Alawwad
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:This paper aims to address a gap in major Islamic topics by developing an ontology for the Book of Purification in Islam. Many authoritative Islamic texts begin with the Book of Purification, as it is essential for performing prayer (the second pillar of Islam after Shahadah, the profession of faith) and other religious duties such as Umrah and Hajj. The ontology development strategy followed six key steps: (1) domain identification, (2) knowledge acquisition, (3) conceptualization, (4) classification, (5) integration and implementation, and (6) ontology generation. This paper includes examples of the constructed tables and classifications. The focus is on the design and analysis phases, as technical implementation is beyond the scope of this study. However, an initial implementation is provided to illustrate the steps of the proposed strategy. The developed ontology ensures reusability by formally defining and encoding the key concepts, attributes, and relationships related to the Book of Purification. This structured representation is intended to support knowledge sharing and reuse. Comments: 9 pages Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.18222 [cs.DL] (or arXiv:2505.18222v1 [cs.DL] for this version) https://doi.org/10.48550/arXiv.2505.18222 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Hessa Alawwad [view email] [v1] Fri, 23 May 2025 08:55:59 UTC (1,094 KB)
zh

[AI-272] Navigating Pitfalls: Evaluating LLM s in Machine Learning Programming Education

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在机器学习教育中支持学习的局限性问题,特别是其在识别机器学习代码中的常见错误(pitfalls)及提供有效反馈方面的不足。研究的关键在于评估不同LLMs在识别机器学习流程中关键错误(如信息泄露和模型选择问题)的能力,并探讨其在教育场景中的适用性。研究发现,尽管所有模型都能识别基础错误,但在处理复杂或早期阶段的错误时表现较差,这表明当前LLMs在支持机器学习教育方面仍存在显著挑战。然而,当LLMs成功识别错误时,它们能够提供有指导性的反馈,显示出其潜在的教育价值。此外,研究还比较了封闭与开放模型的能力,发现尽管模型规模差异较大,但两者在性能上的差距相对较小,这为在教育领域部署更高效的小型LLM提供了可能性。

链接: https://arxiv.org/abs/2505.18220
作者: Smitha Kumar,Michael A. Lones,Manuel Maarek,Hind Zantout
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 29 pages

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has opened new avenues in education. This study examines the use of LLMs in supporting learning in machine learning education; in particular, it focuses on the ability of LLMs to identify common errors of practice (pitfalls) in machine learning code, and their ability to provide feedback that can guide learning. Using a portfolio of code samples, we consider four different LLMs: one closed model and three open models. Whilst the most basic pitfalls are readily identified by all models, many common pitfalls are not. They particularly struggle to identify pitfalls in the early stages of the ML pipeline, especially those which can lead to information leaks, a major source of failure within applied ML projects. They also exhibit limited success at identifying pitfalls around model selection, which is a concept that students often struggle with when first transitioning from theory to practice. This questions the use of current LLMs to support machine learning education, and also raises important questions about their use by novice practitioners. Nevertheless, when LLMs successfully identify pitfalls in code, they do provide feedback that includes advice on how to proceed, emphasising their potential role in guiding learners. We also compare the capability of closed and open LLM models, and find that the gap is relatively small given the large difference in model sizes. This presents an opportunity to deploy, and potentially customise, smaller more efficient LLM models within education, avoiding risks around cost and data sharing associated with commercial models.
zh

[AI-273] ABHINAYA – A System for Speech Emotion Recognition In Naturalistic Conditions Challenge INTERSPEECH2025

【速读】:该论文旨在解决自然场景下的语音情感识别(Speech Emotion Recognition, SER)问题,该问题由于内在的变异性和多样的录音条件以及类别不平衡而具有挑战性。解决方案的关键在于构建一个集成语音、文本和语音-文本模型的系统Abhinaya,通过微调自监督和语音大语言模型(SLLM)获取语音表征,利用大语言模型(LLM)获取文本上下文,并采用结合SLLM的语音-文本建模来捕捉细微的情感线索。此外,为应对类别不平衡问题,采用了定制的损失函数并通过多数投票生成分类决策。

链接: https://arxiv.org/abs/2505.18217
作者: Soumya Dutta,Smruthi Balaji,Varada R,Viveka Salinamakki,Sriram Ganapathy
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figures, 4 tables, accepted at Interspeech 2025

点击查看摘要

Abstract:Speech emotion recognition (SER) in naturalistic settings remains a challenge due to the intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised and speech large language models (SLLM) for speech representations, leverages large language models (LLM) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.
zh

[AI-274] Data Mining-Based Techniques for Software Fault Localization

【速读】:该论文试图解决软件故障定位(fault localization)的问题,特别是针对图形用户界面(GUI)组件的故障定位。其解决方案的关键在于利用数据挖掘技术,如形式概念分析和关联规则,将程序测试过程中产生的测试用例结果(PASS和FAIL)作为属性,构建对象-属性表进行分析,从而识别可能的故障点。此外,论文还扩展了数据挖掘方法以处理多故障情况,强调了GUI测试用例中事件序列及其对应事件处理器的特点,为故障定位提供了新的视角和方法。

链接: https://arxiv.org/abs/2505.18216
作者: Peggy Cellier(INSA Rennes, LACODAM),Mireille Ducassé(DRUID),Sébastien Ferré(LACODAM),Olivier Ridoux(DRUID),W. Eric Wong
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This chapter illustrates the basic concepts of fault localization using a data mining technique. It utilizes the Trityp program to illustrate the general method. Formal concept analysis and association rule are two well-known methods for symbolic data mining. In their original inception, they both consider data in the form of an object-attribute table. In their original inception, they both consider data in the form of an object-attribute table. The chapter considers a debugging process in which a program is tested against different test cases. Two attributes, PASS and FAIL, represent the issue of the test case. The chapter extends the analysis of data mining for fault localization for the multiple fault situations. It addresses how data mining can be further applied to fault localization for GUI components. Unlike traditional software, GUI test cases are usually event sequences, and each individual event has a unique corresponding event handler.
zh

[AI-275] LA-RCS: LLM -Agent -Based Robot Control System

【速读】:该论文旨在解决传统机器人控制系统在自主规划、任务执行及环境适应性方面的不足,特别是如何有效处理用户自然语言指令并实现高效、自适应的机器人控制。其解决方案的关键在于引入基于大语言模型(LLM)的代理(LLM-Agent)框架,通过双代理机制实现任务规划、环境感知、计划执行与动态调整,从而提升系统对复杂外部环境的适应能力和任务完成的准确性。

链接: https://arxiv.org/abs/2505.18214
作者: TaekHyun Park,YoungJun Choi,SeungHoon Shin,Kwangil Lee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LA-RCS (LLM-agent-based robot control system) is a sophisticated robot control system designed to autonomously plan, work, and analyze the external environment based on user requirements by utilizing LLM-Agent. Utilizing a dual-agent framework, LA-RCS generates plans based on user requests, observes the external environment, executes the plans, and modifies the plans as needed to adapt to changes in the external conditions. Additionally, LA-RCS interprets natural language commands by the user and converts them into commands compatible with the robot interface so that the robot can execute tasks and meet user requests properly. During his process, the system autonomously evaluates observation results, provides feedback on the tasks, and executes commands based on real-time environmental monitoring, significantly reducing the need for user intervention in fulfilling requests. We categorized the scenarios that LA-RCS needs to perform into four distinct types and conducted a quantitative assessment of its performance in each scenario. The results showed an average success rate of 90 percent, demonstrating the system capability to fulfill user requests satisfactorily. For more extensive results, readers can visit our project page: this https URL
zh

[AI-276] AIDRIN 2.0: A Framework to Assess Data Readiness for AI

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)应用中数据准备不足的问题,特别是数据质量、偏差、公平性和隐私等方面的挑战。解决方案的关键在于对AI Data Readiness Inspector (AIDRIN)框架进行改进,重点提升用户界面(User Interface, UI)体验,并与隐私保护联邦学习(Privacy-Preserving Federated Learning, PPFL)框架集成,从而增强其在去中心化AI流程中的可用性与实用性。

链接: https://arxiv.org/abs/2505.18213
作者: Kaveen Hiniduma,Dylan Ryan,Suren Byna,Jean Luca Bez,Ravi Madduri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 3 pages, 3 figures

点击查看摘要

Abstract:AI Data Readiness Inspector (AIDRIN) is a framework to evaluate and improve data preparedness for AI applications. It addresses critical data readiness dimensions such as data quality, bias, fairness, and privacy. This paper details enhancements to AIDRIN by focusing on user interface improvements and integration with a privacy-preserving federated learning (PPFL) framework. By refining the UI and enabling smooth integration with decentralized AI pipelines, AIDRIN becomes more accessible and practical for users with varying technical expertise. Integrating with an existing PPFL framework ensures that data readiness and privacy are prioritized in federated learning environments. A case study involving a real-world dataset demonstrates AIDRIN’s practical value in identifying data readiness issues that impact AI model performance.
zh

[AI-277] 2DNMRGym: An Annotated Experimental Dataset for Atom-Level Molecular Representation Learning in 2D NMR via Surrogate Supervision

【速读】:该论文旨在解决2D核磁共振(NMR)光谱数据解析过程中依赖人工且易出错的问题,尤其针对复杂分子的结构解析。其关键解决方案是引入了2DNMRGym,这是首个用于基于机器学习(ML)的分子表示学习的注释实验数据集,包含超过22,000个HSQC谱图及其对应的分子图和SMILES字符串,并采用代理监督设置,利用算法生成的注释进行模型训练,同时在人工标注的黄金标准数据集上进行评估,从而实现对模型从不完美监督到专家级解释能力的严格评估。

链接: https://arxiv.org/abs/2505.18181
作者: Yunrui Li,Hao Xu,Pengyu Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph)
备注:

点击查看摘要

Abstract:Two-dimensional (2D) Nuclear Magnetic Resonance (NMR) spectroscopy, particularly Heteronuclear Single Quantum Coherence (HSQC) spectroscopy, plays a critical role in elucidating molecular structures, interactions, and electronic properties. However, accurately interpreting 2D NMR data remains labor-intensive and error-prone, requiring highly trained domain experts, especially for complex molecules. Machine Learning (ML) holds significant potential in 2D NMR analysis by learning molecular representations and recognizing complex patterns from data. However, progress has been limited by the lack of large-scale and high-quality annotated datasets. In this work, we introduce 2DNMRGym, the first annotated experimental dataset designed for ML-based molecular representation learning in 2D NMR. It includes over 22,000 HSQC spectra, along with the corresponding molecular graphs and SMILES strings. Uniquely, 2DNMRGym adopts a surrogate supervision setup: models are trained using algorithm-generated annotations derived from a previously validated method and evaluated on a held-out set of human-annotated gold-standard labels. This enables rigorous assessment of a model’s ability to generalize from imperfect supervision to expert-level interpretation. We provide benchmark results using a series of 2D and 3D GNN and GNN transformer models, establishing a strong foundation for future work. 2DNMRGym supports scalable model training and introduces a chemically meaningful benchmark for evaluating atom-level molecular representations in NMR-guided structural tasks. Our data and code is open-source and available on Huggingface and Github.
zh

[AI-278] GAIA: A Foundation Model for Operational Atmospheric Dynamics

【速读】:该论文旨在解决卫星遥感数据中大气模式分析的两个关键问题:重建缺失区域和估计降水模式。其解决方案的关键在于提出GAIA基础模型,该模型结合了掩码自编码器(MAE)与无标签自蒸馏(DINO)的自监督学习方法,从而同时捕捉局部特征与全局依赖关系,实现了对时间序列模式的优越捕获能力,并在有限训练数据下表现出准确的降水估计和强大的数据填补能力。

链接: https://arxiv.org/abs/2505.18179
作者: Ata Akbari Asanjan,Olivia Alexander,Tom Berg,Clara Zhang,Matt Yang,Jad Makki,Disha Shidham,Srija Chakraborty,William Bender,Stephen Peng,Arun Ravindran,Olivier Raiman,David Potere,David Bell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:We present the GAIA (Geospatial Artificial Intelligence for Atmospheres) Foundation Model, a novel model that combines masked autoencoders (MAE) and self-DIstillation with NO labels (DINO) for analyzing global atmospheric patterns in satellite imagery. By integrating these complementary self-supervised learning approaches, our model simultaneously captures both local features and global dependencies. We address two critical challenges in satellite data analysis: reconstructing missing regions and estimating precipitation patterns as our first downstream tasks. The model demonstrates superior temporal pattern capture compared to standard MAE approaches, while maintaining robust performance in downstream tasks. Our experimental results show strong gap-filling capabilities across varying mask ratios and accurate precipitation estimation with limited training data, achieving a false alarm ratio of 0.088 and structural similarity of 0.881. This work represents an advancement in self-supervised learning for atmospheric science, providing a foundation for improved weather monitoring and climate analysis. The trained model weights and accompanying code are publicly available as open-source on Hugging Face here: this https URL.
zh

[AI-279] Less is More: Multimodal Region Representation via Pairwise Inter-view Learning

【速读】:该论文试图解决区域表示学习(Region Representation Learning, RRL)中现有方法在利用对比学习(Contrastive Learning, CL)时忽视模态特异性信息的问题,这些信息对于解释区域特征至关重要。解决方案的关键在于提出一种名为Cross modal Knowledge Injected Embedding(CooKIE)的信息分解方法,该方法通过成对的跨视图学习策略,在不建模高阶依赖关系的情况下捕捉高阶信息,从而有效分离多模态数据中的共享和特异性表示。

链接: https://arxiv.org/abs/2505.18178
作者: Min Namgung,Yijun Lin,JangHyeon Lee,Yao-Yi Chiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the increasing availability of geospatial datasets, researchers have explored region representation learning (RRL) to analyze complex region characteristics. Recent RRL methods use contrastive learning (CL) to capture shared information between two modalities but often overlook task-relevant unique information specific to each modality. Such modality-specific details can explain region characteristics that shared information alone cannot capture. Bringing information factorization to RRL can address this by factorizing multimodal data into shared and unique information. However, existing factorization approaches focus on two modalities, whereas RRL can benefit from various geospatial data. Extending factorization beyond two modalities is non-trivial because modeling high-order relationships introduces a combinatorial number of learning objectives, increasing model complexity. We introduce Cross modal Knowledge Injected Embedding, an information factorization approach for RRL that captures both shared and unique representations. CooKIE uses a pairwise inter-view learning approach that captures high-order information without modeling high-order dependency, avoiding exhaustive combinations. We evaluate CooKIE on three regression tasks and a land use classification task in New York City and Delhi, India. Results show that CooKIE outperforms existing RRL methods and a factorized RRL model, capturing multimodal information with fewer training parameters and floating-point operations per second (FLOPs). We release the code: this https URL.
zh

[AI-280] FedGRec: Dynamic Spatio-Temporal Federated Graph Learning for Secure and Efficient Cross-Border Recommendations

【速读】:该论文旨在解决跨境推荐中因数据隐私保护法规导致的数据不足问题,以及在保证隐私安全的前提下实现高效跨境业务推荐的挑战。其解决方案的关键在于提出FedGRec,一种隐私保护的联邦图学习方法,通过从分布式多领域数据中捕捉用户偏好,利用局部子图的协同信号增强表示学习,并结合动态时空建模实时整合全局与局部用户偏好,从而提升跨领域推荐性能,同时有效防止隐私泄露。

链接: https://arxiv.org/abs/2505.18177
作者: Zhizhong Tan,Jiexin Zheng,Xingxing Yang,Chi Zhang,Weiping Deng,Wenyong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to the highly sensitive nature of certain data in cross-border sharing, collaborative cross-border recommendations and data sharing are often subject to stringent privacy protection regulations, resulting in insufficient data for model training. Consequently, achieving efficient cross-border business recommendations while ensuring privacy security poses a significant challenge. Although federated learning has demonstrated broad potential in collaborative training without exposing raw data, most existing federated learning-based GNN training methods still rely on federated averaging strategies, which perform suboptimally on highly heterogeneous graph data. To address this issue, we propose FedGRec, a privacy-preserving federated graph learning method for cross-border recommendations. FedGRec captures user preferences from distributed multi-domain data to enhance recommendation performance across all domains without privacy leakage. Specifically, FedGRec leverages collaborative signals from local subgraphs associated with users or items to enrich their representation learning. Additionally, it employs dynamic spatiotemporal modeling to integrate global and local user preferences in real time based on business recommendation states, thereby deriving the final representations of target users and candidate items. By automatically filtering relevant behaviors, FedGRec effectively mitigates noise interference from unreliable neighbors. Furthermore, through a personalized federated aggregation strategy, FedGRec adapts global preferences to heterogeneous domain data, enabling collaborative learning of user preferences across multiple domains. Extensive experiments on three datasets demonstrate that FedGRec consistently outperforms competitive single-domain and cross-domain baselines while effectively preserving data privacy in cross-border recommendations.
zh

[AI-281] Model-Distributed Inference for Large Language Models at the Edge

【速读】:该论文旨在解决在低功耗边缘设备上部署大语言模型(Large-Language Models, LLMs)的挑战,特别是克服单个设备内存容量限制的问题。其解决方案的关键在于提出一种名为“Model-Distributed Inference for Large-Language Models (MDI-LLM)”的框架,通过将模型划分为多个部分并分配到网络中的不同设备/节点进行协同计算,同时利用“循环流水线并行”技术减少设备空闲时间,实现多文本序列生成过程中的并行推理,从而提升整体效率和吞吐量。

链接: https://arxiv.org/abs/2505.18164
作者: Davide Macario,Hulya Seferoglu,Erdem Koyuncu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Model-Distributed Inference for Large-Language Models (MDI-LLM), a novel framework designed to facilitate the deployment of state-of-the-art large-language models (LLMs) across low-power devices at the edge. This is accomplished by dividing the model into multiple partitions, which are then assigned to different devices/nodes within the network. These nodes exchange intermediate activation vectors via device-to-device links, enabling collaborative computation. To enhance the efficiency of this process, we propose the “recurrent pipeline parallelism” technique, which reduces idle time on each device and facilitates parallel inference during the generation of multiple text sequences. By leveraging the combined computational resources of multiple edge devices, MDI-LLM enables the deployment of LLMs that exceed the memory capacity of individual devices, making it possible to perform inference on low-cost hardware. Furthermore, as the number of participating devices increases, MDI-LLM boosts token generation throughput and reduces memory consumption per device.
zh

[AI-282] InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的安全风险,特别是基于提示(prompt-based)的攻击问题。解决方案的关键是提出InjectLab这一结构化、开源的矩阵框架,该框架借鉴MITRE ATTCK方法,专注于针对提示层的对抗行为,并通过25种技术分类和六种核心战术,系统地映射了现实中的LLM操纵手段,同时提供检测指导、缓解策略和基于YAML的模拟测试,以增强对LLM安全性的保障。

链接: https://arxiv.org/abs/2505.18156
作者: Austin Howard
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: This is an independent research whitepaper submitted as a preprint. For more information, visit this https URL or this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are changing the way people interact with technology. Tools like ChatGPT and Claude AI are now common in business, research, and everyday life. But with that growth comes new risks, especially prompt-based attacks that exploit how these models process language. InjectLab is a security framework designed to address that problem. This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate LLMs. The framework is inspired by MITRE ATTCK and focuses specifically on adversarial behavior at the prompt layer. It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation. Each technique in InjectLab includes detection guidance, mitigation strategies, and YAML-based simulation tests. A Python tool supports easy execution of prompt-based test cases. This paper outlines the framework’s structure, compares it to other AI threat taxonomies, and discusses its future direction as a practical, community-driven foundation for securing language models.
zh

[AI-283] On-Sensor Convolutional Neural Networks with Early-Exits

【速读】:该论文试图解决在嵌入式设备中高效实现卷积神经网络(Convolutional Neural Networks, CNNs)以降低功耗的问题。现有研究虽在多种任务中表现出色,但尚未有文献针对直接在传感器上运行的CNN进行优化。论文的关键解决方案是首次在文献中提出在STMicroelectronics的惯性测量单元(Inertial Measurement Unit, IMU)内的智能传感器处理单元(Intelligent Sensor Processing Unit, ISPU)上优化设计和实现深度优先的CNN。该方法将CNN划分在ISPU和微控制器(MCU)之间,并采用早停机制(Early-Exit mechanism),在结果足够可信时停止IMU上的计算,从而显著降低功耗。

链接: https://arxiv.org/abs/2503.16939
作者: Hazem Hesham Yousef Shalby,Arianna De Vecchi,Alice Scandelli,Pietro Bartoli,Diana Trojaniello,Manuel Roveri,Federica Villa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Presented at IEEE SSCI

点击查看摘要

Abstract:Tiny Machine Learning (TinyML) is a novel research field aiming at integrating Machine Learning (ML) within embedded devices with limited memory, computation, and energy. Recently, a new branch of TinyML has emerged, focusing on integrating ML directly into the sensors to further reduce the power consumption of embedded devices. Interestingly, despite their state-of-the-art performance in many tasks, none of the current solutions in the literature aims to optimize the implementation of Convolutional Neural Networks (CNNs) operating directly into sensors. In this paper, we introduce for the first time in the literature the optimized design and implementation of Depth-First CNNs operating on the Intelligent Sensor Processing Unit (ISPU) within an Inertial Measurement Unit (IMU) by STMicroelectronics. Our approach partitions the CNN between the ISPU and the microcontroller (MCU) and employs an Early-Exit mechanism to stop the computations on the IMU when enough confidence about the results is achieved, hence significantly reducing power consumption. When using a NUCLEO-F411RE board, this solution achieved an average current consumption of 4.8 mA, marking an 11% reduction compared to the regular inference pipeline on the MCU, while having equal accuracy.
zh

[AI-284] Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain) ICLR-2025

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自然指令提示下是否能提升与大脑活动的对齐程度以及有效捕捉指令特异性表征的问题。其解决方案的关键在于通过实验验证MLLMs在不同指令下的表现,发现尽管MLLMs能够生成符合任务指令的高质量响应,但并非所有指令都对大脑对齐有显著贡献,同时通过调整指令使MLLMs编码与输入图像相关的指令特异性视觉概念,从而证明其在计数和识别相关概念上与大脑活动具有强对齐性。

链接: https://arxiv.org/abs/2505.20029
作者: Subba Reddy Oota,Akshett Jindal,Ishani Mondal,Khushbu Pahwa,Satya Sai Srinath Namburi,Manish Shrivastava,Maneesh Singh,Bapi S. Raju,Manish Gupta
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages, 22 figures, The Thirteenth International Conference on Learning Representations, ICLR-2025, Singapore. this https URL

点击查看摘要

Abstract:Transformer-based language models, though not explicitly trained to mimic brain recordings, have demonstrated surprising alignment with brain activity. Progress in these models-through increased size, instruction-tuning, and multimodality-has led to better representational alignment with neural data. Recently, a new class of instruction-tuned multimodal LLMs (MLLMs) have emerged, showing remarkable zero-shot capabilities in open-ended multimodal vision tasks. However, it is unknown whether MLLMs, when prompted with natural instructions, lead to better brain alignment and effectively capture instruction-specific representations. To address this, we first investigate brain alignment, i.e., measuring the degree of predictivity of neural visual activity using text output response embeddings from MLLMs as participants engage in watching natural scenes. Experiments with 10 different instructions show that MLLMs exhibit significantly better brain alignment than vision-only models and perform comparably to non-instruction-tuned multimodal models like CLIP. We also find that while these MLLMs are effective at generating high-quality responses suitable to the task-specific instructions, not all instructions are relevant for brain alignment. Further, by varying instructions, we make the MLLMs encode instruction-specific visual concepts related to the input image. This analysis shows that MLLMs effectively capture count-related and recognition-related concepts, demonstrating strong alignment with brain activity. Notably, the majority of the explained variance of the brain encoding models is shared between MLLM embeddings of image captioning and other instructions. These results suggest that enhancing MLLMs’ ability to capture task-specific information could lead to better differentiation between various types of instructions, and thereby improving their precision in predicting brain responses.
zh

[AI-285] Alpay Algebra III: Observer-Coupled Collapse and the Temporal Drift of Identity

【速读】:该论文试图解决在人工智能和数学系统中建模观察者依赖的坍缩动力学以及时间身份漂移的问题,其核心挑战在于如何在动态观测环境下保持系统的可解释性、时间连贯性和可追溯性。解决方案的关键在于通过Alpay代数的符号基础,引入观察者耦合的φ-坍缩过程,利用超限范畴流和曲率驱动的身份算子,构建一种基于递归变形的身份签名机制,并通过范畴不变量的演化实现对系统内部变换历史的符号化固定点结构编码,从而在可解释AI(XAI)中实现可证明的可追溯性和时间一致性。

链接: https://arxiv.org/abs/2505.19790
作者: Faruk Alpay
机构: 未知
类目: Category Theory (math.CT); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 22 pages, 0 figures. Third paper in the Alpay Algebra series, following [ arXiv:2505.15344 ] and [ arXiv:2505.17480 ]. Introduces observer-coupled collapse and formalizes temporal identity drift using transfinite ϕ-recursion. Entirely symbolic and self-contained, with no reliance on external frameworks. Structured for submission under Math.CT, CS.LO, and CS.AI

点击查看摘要

Abstract:This paper introduces a formal framework for modeling observer-dependent collapse dynamics and temporal identity drift within artificial and mathematical systems, grounded entirely in the symbolic foundations of Alpay Algebra. Building upon the fixed-point emergence structures developed in Alpay Algebra I and II, this third installment formalizes the observer-coupled \phi-collapse process through transfinite categorical flows and curvature-driven identity operators. We define a novel temporal drift mechanism as a recursive deformation of identity signatures under entangled observer influence, constructing categorical invariants that evolve across fold iterations. The proposed system surpasses conventional identity modeling in explainable AI (XAI) by encoding internal transformation history into a symbolic fixed-point structure, offering provable traceability and temporal coherence. Applications range from AI self-awareness architectures to formal logic systems where identity is not static but dynamically induced by observation. The theoretical results also offer a mathematically rigorous basis for future AI systems with stable self-referential behavior, positioning Alpay Algebra as a next-generation symbolic framework bridging category theory, identity logic, and observer dynamics.
zh

[AI-286] SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

【速读】:该论文旨在解决目标语音提取(Target Speech Extraction, TSE)中因使用判别模型导致的失真、自然度下降以及环境差异敏感性问题,同时克服生成模型在感知质量和可懂度上的不足。其解决方案的关键在于提出一种新的级联生成框架——SoloSpeech,该框架整合了压缩、提取、重建和校正过程,并采用无需说话人嵌入的目标提取器,通过利用提示音频的潜在空间中的条件信息,将其与混合音频的潜在空间对齐,从而避免匹配错误,提升模型的泛化能力和语音质量。

链接: https://arxiv.org/abs/2505.19314
作者: Helin Wang,Jiarui Hai,Dongchao Yang,Chen Chen,Kai Li,Junyi Peng,Thomas Thebaud,Laureano Moro Velazquez,Jesus Villalba,Najim Dehak
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Target Speech Extraction (TSE) aims to isolate a target speaker’s voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio’s latent space, aligning it with the mixture audio’s latent space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset, SoloSpeech achieves the new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.
zh

[AI-287] Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis INTERSPEECH2025

【速读】:该论文旨在解决面部驱动的文本到语音合成(Text-to-Speech Synthesis, TTS)系统中的三个关键问题:音频质量受限、面部图像风格化生成语音以及一对一到多对一的语音映射一致性。其解决方案的关键在于:首先,通过结合高质量的纯音频语料库来提升音频质量;其次,通过风格化输入面部图像以支持从真实人脸和艺术肖像生成语音;最后,采用基于采样的解码方法并结合生成的语音样本进行提示,以确保语音生成的一致性。

链接: https://arxiv.org/abs/2505.18972
作者: Minsu Kim,Pingchuan Ma,Honglie Chen,Stavros Petridis,Maja Pantic
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Interspeech 2025

点击查看摘要

Abstract:This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from face image, and the characteristics of output speech (e.g., pace, noise level, distance, tone, place) can be controllable with natural text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To consider one-to-many possibilities in face-to-voice mapping and ensure consistent voice generation at the same time, we propose to first employ sampling-based decoding and then use prompting with generated speech samples. Experimental results validate the proposed model’s effectiveness in face-driven voice synthesis.
zh

[AI-288] High-order Equivariant Flow Matching for Density Functional Theory Hamiltonian Prediction

【速读】:该论文旨在解决密度泛函理论(Density Functional Theory, DFT)中由于求解Kohn-Sham方程所需的自洽场(Self-Consistent Field, SCF)迭代过程而导致的计算成本高昂的问题。其解决方案的关键在于提出QHFlow,一种基于高阶等变流匹配(high-order equivariant flow matching)的框架,通过条件化分子几何生成哈密顿量矩阵,从而绕过传统的SCF迭代过程。该方法利用流匹配模型学习哈密顿量的结构化分布,而非直接回归,并结合SE(3)-等变矢量场预测和轨道能级微调策略,以提升精度与泛化能力,最终显著降低哈密顿量误差并加速DFT计算。

链接: https://arxiv.org/abs/2505.18817
作者: Seongsu Kim,Nayoung Kim,Dongwoo Kim,Sungsoo Ahn
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Density functional theory (DFT) is a fundamental method for simulating quantum chemical properties, but it remains expensive due to the iterative self-consistent field (SCF) process required to solve the Kohn-Sham equations. Recently, deep learning methods are gaining attention as a way to bypass this step by directly predicting the Hamiltonian. However, they rely on deterministic regression and do not consider the highly structured nature of Hamiltonians. In this work, we propose QHFlow, a high-order equivariant flow matching framework that generates Hamiltonian matrices conditioned on molecular geometry. Flow matching models continuous-time trajectories between simple priors and complex targets, learning the structured distributions over Hamiltonians instead of direct regression. To further incorporate symmetry, we use a neural architecture that predicts SE(3)-equivariant vector fields, improving accuracy and generalization across diverse geometries. To further enhance physical fidelity, we additionally introduce a fine-tuning scheme to align predicted orbital energies with the target. QHFlow achieves state-of-the-art performance, reducing Hamiltonian error by 71% on MD17 and 53% on QH9. Moreover, we further show that QHFlow accelerates the DFT process without trading off the solution quality when initializing SCF iterations with the predicted Hamiltonian, significantly reducing the number of iterations and runtime.
zh

[AI-289] Season-Independent PV Disaggregation Using Multi-Scale Net Load Temporal Feature Extraction and Weather Factor Fusion

【速读】:该论文旨在解决分布式光伏(Distributed Photovoltaic, PV)系统在智能监测与计量中面临的挑战,特别是如何从净负荷中分离出光伏发电量的问题。现有方法在特征提取和捕捉天气因素之间的相关性方面存在不足。论文提出的解决方案关键在于将分层插值(Hierarchical Interpolation, HI)与多头自注意力机制相结合,通过HI提取净负荷特征,并利用多头自注意力捕获天气因素间的复杂依赖关系,从而实现精确的光伏发电预测。

链接: https://arxiv.org/abs/2505.18747
作者: Xiaolu Chen,Chenghao Huang,Yanru Zhang,Hao Wang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2024 IEEE 8th Conference on Energy Internet and Energy System Integration (EI2)

点击查看摘要

Abstract:With the advancement of energy Internet and energy system integration, the increasing adoption of distributed photovoltaic (PV) systems presents new challenges on smart monitoring and measurement for utility companies, particularly in separating PV generation from net electricity load. Existing methods struggle with feature extraction from net load and capturing the relevance between weather factors. This paper proposes a PV disaggregation method that integrates Hierarchical Interpolation (HI) and multi-head self-attention mechanisms. By using HI to extract net load features and multi-head self-attention to capture the complex dependencies between weather factors, the method achieves precise PV generation predictions. Simulation experiments demonstrate the effectiveness of the proposed method in real-world data, supporting improved monitoring and management of distributed energy systems.
zh

[AI-290] An AI Capability Threshold for Rent-Funded Universal Basic Income in an AI-Automated Economy

【速读】:该论文试图解决在不依赖额外税收或新就业岗位的情况下,如何通过人工智能(Artificial Intelligence, AI)资本收益可持续地资助全民基本收入(Universal Basic Income, UBI)的问题。其解决方案的关键在于确定AI能力阈值——即AI生产力相对于原有自动化水平的临界点,并分析在不同经济情景下该阈值的变化。研究发现,在当前经济参数下,AI系统需达到现有自动化生产力的5-6倍才能在无新就业机会的情境下资助占GDP 11%的UBI,而通过提高AI资本的公共收入份额等政策工具可有效降低该阈值。

链接: https://arxiv.org/abs/2505.18687
作者: Aran Nayebi
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:We derive the first closed-form condition under which artificial intelligence (AI) capital profits could sustainably finance a universal basic income (UBI) without additional taxes or new job creation. In a Solow-Zeira economy characterized by a continuum of automatable tasks, a constant net saving rate s , and task-elasticity \sigma 1 , we analyze how the AI capability threshold–defined as the productivity level of AI relative to pre-AI automation–varies under different economic scenarios. At present economic parameters, we find that AI systems must achieve only approximately 5-6 times existing automation productivity to finance an 11%-of-GDP UBI, in the worst case situation where \emphno new jobs or tasks are created. Our analysis also reveals some specific policy levers: raising public revenue share (e.g. profit taxation) of AI capital from the current 15% to about 33% halves the required AI capability threshold to attain UBI to 3 times existing automotion productivity, but gains diminish beyond 50% public revenue share, especially if regulatory costs increase. Market structure also strongly affects outcomes: monopolistic or concentrated oligopolistic markets reduce the threshold by increasing economic rents, whereas heightened competition significantly raises it. Overall, these results suggest a couple policy recommendations: maximizing public revenue share up to a point so that operating costs are minimized, and strategically managing market competition can ensure AI’s growing capabilities translate into meaningful social benefits within realistic technological progress scenarios. Comments: 12 pages, 3 figures Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2505.18687 [econ.GN] (or arXiv:2505.18687v1 [econ.GN] for this version) https://doi.org/10.48550/arXiv.2505.18687 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-291] Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

【速读】:该论文旨在解决在选择人工智能模型(如大型语言模型)时,如何在缺乏大量真实数据的情况下实现准确的性能估计问题。传统方法依赖于真实数据进行评估,但成本高且难以扩展;而自动评估方法虽然利用合成数据降低方差,但可能引入偏差并导致样本效率下降。论文提出的解决方案是\textttR-AutoEval+,其关键创新在于自适应构建模型评估变量,动态调整对合成数据的依赖程度,并在自动评估器准确性不足时回退到传统方法,从而在保证有限样本可靠性的同时提升或至少维持样本效率。

链接: https://arxiv.org/abs/2505.18659
作者: Sangwoo Park,Matteo Zecchin,Osvaldo Simeone
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注: submitted

点击查看摘要

Abstract:Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (\textttPPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose \textttR-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of \textttR-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, and for prompt design in LLMs confirm the reliability and efficiency of \textttR-AutoEval+.
zh

[AI-292] Anomaly detection in radio galaxy data with trainable COSFIRE filters

【速读】:该论文旨在解决射电天文学中异常检测的问题,这一问题由于数据量庞大且标记的异常样本稀少而尤为困难。其解决方案的关键在于利用可训练的COSFIRE(Combination of Shifted Filter Responses)滤波器,结合无监督的局部离群因子(Local Outlier Factor, LOF)算法,基于射电源的形态特征进行异常识别。这种方法在无需大量监督信息的情况下,有效捕捉正常模式并检测偏差,从而克服了传统监督方法对异常样本依赖的局限性。

链接: https://arxiv.org/abs/2505.18643
作者: Steven Ndung’u,Trienko Grobler,Stefan J. Wijnholds,George Azzopardi
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: 5 pages, URSI Asia-Pacific Radio Science Conference and URSI Radio Science Letters (RSL)

点击查看摘要

Abstract:Detecting anomalies in radio astronomy is challenging due to the vast amounts of data and the rarity of labeled anomalous examples. Addressing this challenge requires efficient methods capable of identifying unusual radio galaxy morphologies without relying on extensive supervision. This work introduces an innovative approach to anomaly detection based on morphological characteristics of the radio sources using trainable COSFIRE (Combination of Shifted Filter Responses) filters as an efficient alternative to complex deep learning methods. The framework integrates COSFIRE descriptors with an unsupervised Local Outlier Factor (LOF) algorithm to identify unusual radio galaxy morphologies. Evaluations on a radio galaxy benchmark data set demonstrate strong performance, with the COSFIRE-based approach achieving a geometric mean (G-Mean) score of 79%, surpassing the 77% achieved by a computationally intensive deep learning autoencoder. By characterizing normal patterns and detecting deviations, this semi-supervised methodology overcomes the need for anomalous examples in the training set, a major limitation of traditional supervised methods. This approach shows promise for next-generation radio telescopes, where fast processing and the ability to discover unknown phenomena are crucial.
zh

[AI-293] S-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network INTERSPEECH2025

【速读】:该论文旨在解决通用语音增强(Universal Speech Enhancement)问题,即处理具有不同失真类型和输入格式的语音信号。解决方案的关键在于提出了一种三阶段的通用、鲁棒且可泛化的语音增强网络(TS-URGENet),其核心架构包括填充阶段、分离阶段和恢复阶段,分别用于缓解分组丢失、抑制噪声与混响等失真,并补偿带宽限制和编解码器伪影等问题,从而提升整体语音质量。

链接: https://arxiv.org/abs/2505.18533
作者: Xiaobin Rong,Dahan Wang,Qinwen Hu,Yushi Wang,Yuxiang Hu,Jing Lu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieved outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.
zh

[AI-294] SP2RINT: Spatially-Decoupled Physics-Inspired Progressive Inverse Optimization for Scalable PDE-Constrained Meta-Optical Neural Network Training

【速读】:该论文试图解决深度光学神经网络(DONN)中元表面结构训练的挑战,即如何在保证物理可实现性的前提下高效优化元表面以达到高性能计算。传统启发式方法虽速度快但过于简化,导致设计不现实且性能下降;而仿真闭环训练方法虽然能优化可实现的元表面,但计算成本过高。解决方案的关键在于提出SP2RINT框架,该框架将DONN训练建模为偏微分方程(PDE)约束的学习问题,通过将元表面响应松弛为具有带状结构的可训练转移矩阵,并逐步通过转移矩阵训练与伴随法逆向设计交替来施加物理约束,从而避免每迭代求解PDE,同时确保最终设计的物理可实现性。此外,引入基于场相互作用自然局域性的物理启发式空间解耦逆向设计策略,进一步提升了计算效率。

链接: https://arxiv.org/abs/2505.18377
作者: Pingchuan Ma,Ziang Yin,Qi Jing,Zhengqi Gao,Nicholas Gangi,Boyang Zhang,Tsung-Wei Huang,Zhaoran Huang,Duane S. Boning,Yu Yao,Jiaqi Gu
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:DONNs harness the physics of light propagation for efficient analog computation, with applications in AI and signal processing. Advances in nanophotonic fabrication and metasurface-based wavefront engineering have opened new pathways to realize high-capacity DONNs across various spectral regimes. Training such DONN systems to determine the metasurface structures remains challenging. Heuristic methods are fast but oversimplify metasurfaces modulation, often resulting in physically unrealizable designs and significant performance degradation. Simulation-in-the-loop training methods directly optimize a physically implementable metasurface using adjoint methods during end-to-end DONN training, but are inherently computationally prohibitive and this http URL address these limitations, we propose SP2RINT, a spatially decoupled, progressive training framework that formulates DONN training as a PDE-constrained learning problem. Metasurface responses are first relaxed into freely trainable transfer matrices with a banded structure. We then progressively enforce physical constraints by alternating between transfer matrix training and adjoint-based inverse design, avoiding per-iteration PDE solves while ensuring final physical realizability. To further reduce runtime, we introduce a physics-inspired, spatially decoupled inverse design strategy based on the natural locality of field interactions. This approach partitions the metasurface into independently solvable patches, enabling scalable and parallel inverse design with system-level calibration. Evaluated across diverse DONN training tasks, SP2RINT achieves digital-comparable accuracy while being 1825 times faster than simulation-in-the-loop approaches. By bridging the gap between abstract DONN models and implementable photonic hardware, SP2RINT enables scalable, high-performance training of physically realizable meta-optical neural systems.
zh

[AI-295] Hamiltonian Theory and Computation of Optimal Probability Density Control in High Dimensions

【速读】:该论文试图解决高维状态空间中的最优概率密度控制问题(optimal probability density control),其核心挑战在于如何在复杂环境下实现对系统动态的高效控制。解决方案的关键在于构建一个基于庞特里亚金最大原理(Pontryagin Maximum Principle, PMP)的理论框架,并通过严格的推导建立价值泛函的哈密顿-雅可比-贝尔曼方程(Hamilton-Jacobi-Bellman, HJB)。为了解决数值计算上的困难,作者提出使用降阶模型,如深度神经网络(deep neural networks, DNNs),来参数化控制向量场和伴随函数,从而有效处理高维状态空间中的控制问题,并证明了所提算法的收敛性。

链接: https://arxiv.org/abs/2505.18362
作者: Nathan Gaby,Xiaojing Ye
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: 28 pages, submitted

点击查看摘要

Abstract:We develop a general theoretical framework for optimal probability density control and propose a numerical algorithm that is scalable to solve the control problem in high dimensions. Specifically, we establish the Pontryagin Maximum Principle (PMP) for optimal density control and construct the Hamilton-Jacobi-Bellman (HJB) equation of the value functional through rigorous derivations without any concept from Wasserstein theory. To solve the density control problem numerically, we propose to use reduced-order models, such as deep neural networks (DNNs), to parameterize the control vector-field and the adjoint function, which allows us to tackle problems defined on high-dimensional state spaces. We also prove several convergence properties of the proposed algorithm. Numerical results demonstrate promising performances of our algorithm on a variety of density control problems with obstacles and nonlinear interaction challenges in high dimensions.
zh

[AI-296] ask-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain

【速读】:该论文旨在解决触觉感知在神经科学中理解不足以及在人工系统中效果不佳的问题,相较于视觉和语言等成熟模态,触觉感知的建模仍存在较大挑战。其解决方案的关键在于提出一种新型的编码器-注意器-解码器(Encoder-Attender-Decoder, EAD)框架,通过在定制化的啮齿类动物胡须阵列模拟器生成的真实触觉输入序列上训练任务优化的时序神经网络,系统地探索触觉表征空间。研究发现,卷积循环神经网络(ConvRNNs)作为编码器优于纯前馈和状态空间架构,在触觉分类任务中表现更优,并且基于ConvRNN编码器的EAD模型能够生成与啮齿类动物体感皮层神经表征高度匹配的表示,揭示了监督分类性能与神经对齐之间的线性关系。此外,采用触觉特定增强方法进行对比自监督训练的EAD模型能够达到与监督神经拟合相当的性能,为无标签的、符合行为学意义的触觉表征提供了有效代理。

链接: https://arxiv.org/abs/2505.18361
作者: Trinity Chung,Yuchen Shen,Nathan C. L. Kong,Aran Nayebi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 9 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Tactile sensing remains far less understood in neuroscience and less effective in artificial systems compared to more mature modalities such as vision and language. We bridge these gaps by introducing a novel Encoder-Attender-Decoder (EAD) framework to systematically explore the space of task-optimized temporal neural networks trained on realistic tactile input sequences from a customized rodent whisker-array simulator. We identify convolutional recurrent neural networks (ConvRNNs) as superior encoders to purely feedforward and state-space architectures for tactile categorization. Crucially, these ConvRNN-encoder-based EAD models achieve neural representations closely matching rodent somatosensory cortex, saturating the explainable neural variability and revealing a clear linear relationship between supervised categorization performance and neural alignment. Furthermore, contrastive self-supervised ConvRNN-encoder-based EADs, trained with tactile-specific augmentations, match supervised neural fits, serving as an ethologically-relevant, label-free proxy. For neuroscience, our findings highlight nonlinear recurrent processing as important for general-purpose tactile representations in somatosensory cortex, providing the first quantitative characterization of the underlying inductive biases in this system. For embodied AI, our results emphasize the importance of recurrent EAD architectures to handle realistic tactile inputs, along with tailored self-supervised learning methods for achieving robust tactile perception with the same type of sensors animals use to sense in unstructured environments. Comments: 9 pages, 8 figures, 5 tables Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO) Cite as: arXiv:2505.18361 [q-bio.NC] (or arXiv:2505.18361v1 [q-bio.NC] for this version) https://doi.org/10.48550/arXiv.2505.18361 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-297] owards a Quantum-classical Augmented Network

【速读】:该论文试图解决如何在现有经典网络中集成量子技术以提升安全性的问题,其核心挑战在于如何有效利用量子资源并实现量子与经典数据的协同传输。解决方案的关键在于对HTTP协议结构进行修改,使其能够同时承载量子和经典数据包,从而根据隐私需求将单个网络数据包划分为量子和经典负载。这一改进为高效安全的量子网络设计提供了基础,并通过引入逻辑回归、卷积神经网络(Convolutional Neural Network, CNN)、长短期记忆网络(Long Short-Term Memory, LSTM)和双向长短期记忆网络(Bidirectional Long Short-Term Memory, BiLSTM)模型,实现了对出站通信的隐私标签分类,进而降低量子资源的使用率。

链接: https://arxiv.org/abs/2505.18282
作者: Nitin Jha,Abhishek Parakh,Mahadevan Subramaniam
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:In the past decade, several small-scale quantum key distribution networks have been established. However, the deployment of large-scale quantum networks depends on the development of quantum repeaters, quantum channels, quantum memories, and quantum network protocols. To improve the security of existing networks and adopt currently feasible quantum technologies, the next step is to augment classical networks with quantum devices, properties, and phenomena. To achieve this, we propose a change in the structure of the HTTP protocol such that it can carry both quantum and classical payload. This work lays the foundation for dividing one single network packet into classical and quantum payloads depending on the privacy needs. We implement logistic regression, CNN, LSTM, and BiLSTM models to classify the privacy label for outgoing communications. This enables reduced utilization of quantum resources allowing for a more efficient secure quantum network design. Experimental results using the proposed methods are presented.
zh

[AI-298] CrossRF: A Domain-Invariant Deep Learning Approach for RF Fingerprinting

【速读】:该论文旨在解决在不同传输信道下射频指纹识别性能显著下降的问题,特别是针对无人机(UAV)识别与安全领域的挑战。其解决方案的关键在于提出一种领域不变的深度学习方法——CrossRF,通过对抗学习减少不同射频信道之间的领域差异,从而训练出一个在信道变化下仍能保持一致识别性能的鲁棒模型。

链接: https://arxiv.org/abs/2505.18200
作者: Fahrettin Emin Tiras,Hayriye Serra Altinoluk
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Radio Frequency (RF) fingerprinting offers a promising approach for drone identification and security, although it suffers from significant performance degradation when operating on different transmission channels. This paper presents CrossRF, a domain-invariant deep learning approach that addresses the problem of cross-channel RF fingerprinting for Unmanned Aerial Vehicle (UAV) identification. Our approach aims to minimize the domain gap between different RF channels by using adversarial learning to train a more robust model that maintains consistent identification performance despite channel variations. We validate our approach using the UAVSig dataset, comprising real-world over-the-air RF signals from identical drone models operating across several frequency channels, ensuring that the findings correspond to real-world scenarios. The experimental results show CrossRF’s efficiency, achieving up to 99.03% accuracy when adapting from Channel 3 to Channel 4, compared to only 26.39% using conventional methods. The model maintains robust performance in more difficult multi-channel scenarios (87.57% accuracy adapting from Channels 1,3 to 2,4) and achieves 89.45% accuracy with 0.9 precision for controller classification. These results confirm CrossRF’s ability to significantly reduce performance degradation due to cross-channel variations while maintaining high identification accuracy with minimal training data requirements, making it particularly suitable for practical drone security applications.
zh

[AI-299] SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference

【速读】:该论文旨在解决长期脑电图(EEG)中可靠自动癫痫发作检测的问题,当前机器学习模型在跨患者或临床环境的泛化能力不足。其解决方案的关键在于组织了一个使用来自65名受试者(总计4,360小时)的私有连续EEG数据集的挑战,并由专家神经生理学家标注数据以提供癫痫事件的黄金标准。参赛者需检测癫痫发作的起始和持续时间,评估基于事件的指标,如灵敏度、精确度、F1分数和每天的假阳性数,通过SzCORE框架实现标准化评估,从而推动更稳健的模型开发和临床相关性验证。

链接: https://arxiv.org/abs/2505.18191
作者: Jonathan Dan,Amirhossein Shahbazinia,Christodoulos Kechris,David Atienza
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Reliable automatic seizure detection from long-term EEG remains a challenge, as current machine learning models often fail to generalize across patients or clinical settings. Manual EEG review remains the clinical standard, underscoring the need for robust models and standardized evaluation. To rigorously assess algorithm performance, we organized a challenge using a private dataset of continuous EEG recordings from 65 subjects (4,360 hours). Expert neurophysiologists annotated the data, providing ground truth for seizure events. Participants were required to detect seizure onset and duration, with evaluation based on event-based metrics, including sensitivity, precision, F1-score, and false positives per day. The SzCORE framework ensured standardized evaluation. The primary ranking criterion was the event-based F1-score, reflecting clinical relevance by balancing sensitivity and false positives. The challenge received 30 submissions from 19 teams, with 28 algorithms evaluated. Results revealed wide variability in performance, with a top F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of seizure detection. The challenge also revealed a gap between reported performance and real-world evaluation, emphasizing the importance of rigorous benchmarking. Compared to previous challenges and commercial systems, the best-performing algorithm in this contest showed improved performance. Importantly, the challenge platform now supports continuous benchmarking, enabling reproducible research, integration of new datasets, and clinical evaluation of seizure detection algorithms using a standardized framework.
zh

[AI-300] PhySense: Sensor Placement Optimization for Accurate Physics Sensing

【速读】:该论文旨在解决物理传感中的两个耦合任务:从稀疏观测中重建密集的物理场,以及优化离散传感器的布置以获取最大信息量。现有方法通常仅关注稀疏数据重建,而忽略了传感器布置的优化,导致重建与布置之间的协同增强未被充分利用。论文提出的解决方案是PhySense,其关键在于构建一个协同的两阶段框架,该框架能够联合学习物理场的重建与传感器布置,第一阶段通过基于流的生成模型结合交叉注意力机制实现稀疏观测的自适应融合,第二阶段则利用重建反馈通过投影梯度下降法进行满足空间约束的传感器布置优化,从而实现准确的物理传感。

链接: https://arxiv.org/abs/2505.18190
作者: Yuezhou Ma,Haixu Wu,Hang Zhou,Huikun Weng,Jianmin Wang,Mingsheng Long
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement on the shelf. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. \correctLeveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. \correctWe further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered.
zh

[AI-301] Improving Generative Inverse Design of Rectangular Patch Antennas with Test Time Optimization ATC

【速读】:该论文试图解决矩形贴片天线的逆向设计问题(inverse design of rectangular patch antennas),即根据特定的频率响应需求生成满足条件的天线几何结构。解决方案的关键在于提出了一种两阶段的深度学习框架,首先利用生成模型学习天线频率响应曲线的潜在表示,然后基于这些响应条件化生成可行的天线几何形状。此外,通过在测试阶段引入搜索与优化技术,提高了生成设计的准确性,并能够考虑制造可行性等附加目标。

链接: https://arxiv.org/abs/2505.18188
作者: Beck LaBash,Shahriar Khushrushahi,Fabian Ruehle
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Code and dataset available at this https URL

点击查看摘要

Abstract:We propose a two-stage deep learning framework for the inverse design of rectangular patch antennas. Our approach leverages generative modeling to learn a latent representation of antenna frequency response curves and conditions a subsequent generative model on these responses to produce feasible antenna geometries. We further demonstrate that leveraging search and optimization techniques at test-time improves the accuracy of the generated designs and enables consideration of auxiliary objectives such as manufacturability. Our approach generalizes naturally to different design criteria, and can be easily adapted to more complex geometric design spaces.
zh

[AI-302] NMCSE: Noise-Robust Multi-Modal Coupling Signal Estimation Method via Optimal Transport for Cardiovascular Disease Detection

【速读】:该论文旨在解决心电图(ECG)与心音图(PCG)信号之间潜在耦合信号估计中的噪声放大问题,该问题限制了心血管疾病(CVD)检测的临床应用。传统方法采用去卷积技术估计耦合信号,但易受噪声影响。论文提出的Noise-Robust Multi-Modal Coupling Signal Estimation (NMCSE) 方法通过最优传输理论将问题重新表述为分布匹配,并通过联合优化幅值和时间对齐来减轻噪声放大,无需额外预处理。该方案的关键在于利用最优传输理论实现噪声鲁棒的多模态耦合信号估计。

链接: https://arxiv.org/abs/2505.18174
作者: Zhixin li,Peihong Zhang,Rui Sang,Yuxuan Liu,Shengchen Li
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) and Phonocardiogram (PCG) signals are linked by a latent coupling signal representing the electrical-to-mechanical cardiac transformation. While valuable for cardiovascular disease (CVD) detection, this coupling signal is traditionally estimated using deconvolution methods that amplify noise, limiting clinical utility. In this paper, we propose Noise-Robust Multi-Modal Coupling Signal Estimation (NMCSE), which reformulates the problem as distribution matching via optimal transport theory. By jointly optimizing amplitude and temporal alignment, NMCSE mitigates noise amplification without additional preprocessing. Integrated with our Temporal-Spatial Feature Extraction network, NMCSE enables robust multi-modal CVD detection. Experiments on the PhysioNet 2016 dataset with realistic hospital noise demonstrate that NMCSE reduces estimation errors by approximately 30% in Mean Squared Error while maintaining higher Pearson Correlation Coefficients across all tested signal-to-noise ratios. Our approach achieves 97.38% accuracy and 0.98 AUC in CVD detection, outperforming state-of-the-art methods and demonstrating robust performance for real-world clinical applications.
zh

[AI-303] Simulating Macroeconomic Expectations using LLM Agents

【速读】:该论文试图解决宏观经济预期形成机制的模拟问题,旨在通过人工智能技术更有效地捕捉个体在通胀和失业等宏观经济学议题上的预期差异。解决方案的关键在于构建大量由大型语言模型赋能的智能体(LLM Agents),这些智能体具备个人特征、先验预期和知识模块,从而能够再现家庭和专家在调查实验中的预期生成过程。研究发现,尽管LLM Agents生成的预期和思维更具同质性,但其仍能有效反映智能体间的异质性及其预期形成背后的驱动因素。

链接: https://arxiv.org/abs/2505.17648
作者: Jianhao Lin,Lexuan Sun,Yixin Yan
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a novel framework for simulating macroeconomic expectation formation using Large Language Model-Empowered Agents (LLM Agents). By constructing thousands of LLM Agents equipped with modules for personal characteristics, prior expectations, and knowledge, we replicate a survey experiment involving households and experts on inflation and unemployment. Our results show that although the expectations and thoughts generated by LLM Agents are more homogeneous than those of human participants, they still effectively capture key heterogeneity across agents and the underlying drivers of expectation formation. Furthermore, a module-ablation exercise highlights the critical role of prior expectations in simulating such heterogeneity. This approach complements traditional survey methods and offers new insights into AI behavioral science in macroeconomic research.
zh

[AI-304] Is Quantum Optimization Ready? An Effort Towards Neural Network Compression using Adiabatic Quantum Computing

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNN)在大规模模型优化中的挑战,特别是针对模型压缩问题。其解决方案的关键在于将模型压缩问题重新表述为二次无约束二进制优化(Quadratic Unconstrained Binary Optimization, QUBO)问题,并利用量子退火(Adiabatic Quantum Computing, AQC)技术进行求解,从而实现对卷积神经网络的细粒度剪枝与量化。

链接: https://arxiv.org/abs/2505.16332
作者: Zhehui Wanga,Benjamin Chen Ming Choonga,Tian Huang,Daniel Gerlinghoffa,Rick Siow Mong Goh,Cheng Liu,Tao Luo
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Quantum optimization is the most mature quantum computing technology to date, providing a promising approach towards efficiently solving complex combinatorial problems. Methods such as adiabatic quantum computing (AQC) have been employed in recent years on important optimization problems across various domains. In deep learning, deep neural networks (DNN) have reached immense sizes to support new predictive capabilities. Optimization of large-scale models is critical for sustainable deployment, but becomes increasingly challenging with ever-growing model sizes and complexity. While quantum optimization is suitable for solving complex problems, its application to DNN optimization is not straightforward, requiring thorough reformulation for compatibility with commercially available quantum devices. In this work, we explore the potential of adopting AQC for fine-grained pruning-quantization of convolutional neural networks. We rework established heuristics to formulate model compression as a quadratic unconstrained binary optimization (QUBO) problem, and assess the solution space offered by commercial quantum annealing devices. Through our exploratory efforts of reformulation, we demonstrate that AQC can achieve effective compression of practical DNN models. Experiments demonstrate that adiabatic quantum computing (AQC) not only outperforms classical algorithms like genetic algorithms and reinforcement learning in terms of time efficiency but also excels at identifying global optima.
zh

[AI-305] A Matrix Product State Model for Simultaneous Classification and Generation

【速读】:该论文旨在解决传统监督学习中生成高质量样本的挑战,特别是在减少异常值和提升生成样本真实性方面。其解决方案的关键在于引入一种具有分类与生成双重功能的矩阵乘积态(Matrix Product States, MPS)模型,该模型受生成对抗网络的启发,通过优化训练策略来增强样本生成能力。此外,论文还提出了替代的嵌入函数和从非归一化MPS中采样的新方法,以进一步提升生成效果。

链接: https://arxiv.org/abs/2406.17441
作者: Alex Mossi,Bojan Žunkovic,Kyriakos Flouris
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Mathematical Physics (math-ph); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Quantum machine learning (QML) is a rapidly expanding field that merges the principles of quantum computing with the techniques of machine learning. One of the powerful mathematical frameworks in this domain is tensor networks. These networks are used to approximate high-order tensors by contracting tensors with lower ranks. Initially developed for simulating quantum systems, tensor networks have become integral to quantum computing and, by extension, to QML. Drawing inspiration from these quantum methods, specifically the Matrix Product States (MPS), we apply them in a classical machine learning setting. Their ability to efficiently represent and manipulate complex, high-dimensional data makes them effective in a supervised learning framework. Here, we present an MPS model, in which the MPS functions as both a classifier and a generator. The dual functionality of this novel MPS model permits a strategy that enhances the traditional training of supervised MPS models. This framework is inspired by generative adversarial networks and is geared towards generating more realistic samples by reducing outliers. In addition, our contributions offer insights into the mechanics of tensor network methods for generation tasks. Specifically, we discuss alternative embedding functions and a new sampling method from non-normalized MPSs.
zh

机器学习

[LG-0] Efficient Optimization Accelerator Framework for Multistate Ising Problems

链接: https://arxiv.org/abs/2505.20250
作者: Chirag Garg,Sayeef Salahuddin
类目: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 6 page main text, 4 main figures, 1 main table, 1 page supplementary, 2 supplementary figures,

点击查看摘要

Abstract:Ising Machines are a prominent class of hardware architectures that aim to solve NP-hard combinatorial optimization problems. These machines consist of a network of interacting binary spins/neurons that evolve to represent the optimum ground state energy solution. Generally, combinatorial problems are transformed into quadratic unconstrained binary optimization (QUBO) form to harness the computational efficiency of these Ising machines. However, this transformation, especially for multi-state problems, often leads to a more complex exploration landscape than the original problem, thus severely impacting the solution quality. To address this challenge, we model the spin interactions as a generalized boolean logic function to significantly reduce the exploration space. We benchmark the graph coloring problem from the class of multi-state NP-hard optimization using probabilistic Ising solvers to illustrate the effectiveness of our framework. The proposed methodology achieves similar accuracy compared to state-of-the-art heuristics and machine learning algorithms, and demonstrates significant improvement over the existing Ising methods. Additionally, we demonstrate that combining parallel tempering with our existing framework further reduces the coloring error by up to 50% compared to the conventionally used Gibbs sampling algorithm. We also design a 1024-neuron all-to-all connected probabilistic Ising accelerator that shows up to 10000x performance acceleration compared to heuristics while reducing the number of required physical neurons by 1.5-4x compared to conventional Ising machines. Indeed, this accelerator solution demonstrates improvement across all metrics over the current methods, i.e., energy, performance, area, and solution quality. Thus, this work expands the potential of existing Ising hardware to solve a broad class of these multistate optimization problems.

[LG-1] RedAHD: Reduction-Based End-to-End Automatic Heuristic Design with Large Language Models

链接: https://arxiv.org/abs/2505.20242
作者: Nguyen Thach,Aida Riahifar,Nathan Huynh,Hau Chan
类目: Machine Learning (cs.LG)
*备注: Under Review

点击查看摘要

Abstract:Solving NP-hard combinatorial optimization problems (COPs) (e.g., traveling salesman problems (TSPs) and capacitated vehicle routing problems (CVRPs)) in practice traditionally involves handcrafting heuristics or specifying a search space for finding effective heuristics. The main challenges from these approaches, however, are the sheer amount of domain knowledge and implementation efforts required from human experts. Recently, significant progress has been made to address these challenges, particularly by using large language models (LLMs) to design heuristics within some predetermined generalized algorithmic framework (GAF, e.g., ant colony optimization and guided local search) for building key functions/components (e.g., a priori information on how promising it is to include each edge in a solution for TSP and CVRP). Although existing methods leveraging this idea have shown to yield impressive optimization performance, they are not fully end-to-end and still require considerable manual interventions. In this paper, we propose a novel end-to-end framework, named RedAHD, that enables these LLM-based heuristic design methods to operate without the need of GAFs. More specifically, RedAHD employs LLMs to automate the process of reduction, i.e., transforming the COP at hand into similar COPs that are better-understood, from which LLM-based heuristic design methods can design effective heuristics for directly solving the transformed COPs and, in turn, indirectly solving the original COP. Our experimental results, evaluated on six COPs, show that RedAHD is capable of designing heuristics with competitive or improved results over the state-of-the-art methods with minimal human involvement.

[LG-2] Chain-of-Thought for Autonomous Driving: A Comprehensive Survey and Future Prospects

链接: https://arxiv.org/abs/2505.20223
作者: Yixin Cui,Haotian Lin,Shuo Yang,Yixiao Wang,Yanjun Huang,Hong Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:The rapid evolution of large language models in natural language processing has substantially elevated their semantic understanding and logical reasoning capabilities. Such proficiencies have been leveraged in autonomous driving systems, contributing to significant improvements in system performance. Models such as OpenAI o1 and DeepSeek-R1, leverage Chain-of-Thought (CoT) reasoning, an advanced cognitive method that simulates human thinking processes, demonstrating remarkable reasoning capabilities in complex tasks. By structuring complex driving scenarios within a systematic reasoning framework, this approach has emerged as a prominent research focus in autonomous driving, substantially improving the system’s ability to handle challenging cases. This paper investigates how CoT methods improve the reasoning abilities of autonomous driving models. Based on a comprehensive literature review, we present a systematic analysis of the motivations, methodologies, challenges, and future research directions of CoT in autonomous driving. Furthermore, we propose the insight of combining CoT with self-learning to facilitate self-evolution in driving systems. To ensure the relevance and timeliness of this study, we have compiled a dynamic repository of literature and open-source projects, diligently updated to incorporate forefront developments. The repository is publicly available at this https URL.

[LG-3] Gradient Flow Matching for Learning Update Dynamics in Neural Network Training

链接: https://arxiv.org/abs/2505.20221
作者: Xiao Shou,Yanna Ding,Jianxi Gao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Training deep neural networks remains computationally intensive due to the itera2 tive nature of gradient-based optimization. We propose Gradient Flow Matching (GFM), a continuous-time modeling framework that treats neural network training as a dynamical system governed by learned optimizer-aware vector fields. By leveraging conditional flow matching, GFM captures the underlying update rules of optimizers such as SGD, Adam, and RMSprop, enabling smooth extrapolation of weight trajectories toward convergence. Unlike black-box sequence models, GFM incorporates structural knowledge of gradient-based updates into the learning objective, facilitating accurate forecasting of final weights from partial training sequences. Empirically, GFM achieves forecasting accuracy that is competitive with Transformer-based models and significantly outperforms LSTM and other classical baselines. Furthermore, GFM generalizes across neural architectures and initializations, providing a unified framework for studying optimization dynamics and accelerating convergence prediction.

[LG-4] Fine-grained List-wise Alignment for Generative Medication Recommendation

链接: https://arxiv.org/abs/2505.20218
作者: Chenxiao Fan,Chongming Gao,Wentao Shi,Yaxin Gong,Zihao Zhao,Fuli Feng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at this https URL.

[LG-5] FunReason : Enhancing Large Language Models Function Calling via Self-Refinement Multiscale Loss and Automated Data Refinement

链接: https://arxiv.org/abs/2505.20192
作者: Bingguang Hao,Maolin Wang,Zengzhuang Xu,Cunyin Peng,Yicheng Chen,Xiangyu Zhao,Jinjie Gu,Chenyi Zhuang
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The integration of large language models (LLMs) with function calling has emerged as a crucial capability for enhancing their practical utility in real-world applications. However, effectively combining reasoning processes with accurate function execution remains a significant challenge. Traditional training approaches often struggle to balance the detailed reasoning steps with the precision of function calls, leading to suboptimal performance. To address these limitations, we introduce FunReason, a novel framework that enhances LLMs’ function calling capabilities through an automated data refinement strategy and a Self-Refinement Multiscale Loss (SRML) approach. FunReason leverages LLMs’ natural reasoning abilities to generate high-quality training examples, focusing on query parseability, reasoning coherence, and function call precision. The SRML approach dynamically balances the contribution of reasoning processes and function call accuracy during training, addressing the inherent trade-off between these two critical aspects. FunReason achieves performance comparable to GPT-4o while effectively mitigating catastrophic forgetting during fine-tuning. FunReason provides a comprehensive solution for enhancing LLMs’ function calling capabilities by introducing a balanced training methodology and a data refinement pipeline. For code and dataset, please refer to our repository at GitHub this https URL

[LG-6] Private Geometric Median in Nearly-Linear Time

链接: https://arxiv.org/abs/2505.20189
作者: Syamantak Kumar,Daogao Liu,Kevin Tian,Chutong Yang
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an (\varepsilon, \delta) -differentially private algorithm obtaining an \alpha -multiplicative approximation to the geometric median objective, \frac 1 n \sum_i \in [n] |\cdot - \mathbfx_i| , given a dataset \mathcalD := \mathbfx_i_i \in [n] \subset \mathbbR^d . Their algorithm requires n \gtrsim \sqrt d \cdot \frac 1 \alpha\varepsilon samples, which they prove is information-theoretically optimal. This result is surprising because its error scales with the \empheffective radius of \mathcalD (i.e., of a ball capturing most points), rather than the worst-case radius. We give an improved algorithm that obtains the same approximation quality, also using n \gtrsim \sqrt d \cdot \frac 1 \alpha\epsilon samples, but in time \widetildeO(nd + \frac d \alpha^2) . Our runtime is nearly-linear, plus the cost of the cheapest non-private first-order method due to [CLM+16]. To achieve our results, we use subsampling and geometric aggregation tools inspired by FriendlyCore [TCK+22] to speed up the “warm start” component of the [HSU24] algorithm, combined with a careful custom analysis of DP-SGD’s sensitivity for the geometric median objective.

[LG-7] Research on feature fusion and multimodal patent text based on graph attention network

链接: https://arxiv.org/abs/2505.20188
作者: Zhenzhen Song,Ziwei Liu,Hongji Li
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Aiming at the problems of cross-modal feature fusion, low efficiency of long text modeling and lack of hierarchical semantic coherence in patent text semantic mining, this study proposes HGM-Net, a deep learning framework that integrates Hierarchical Comparative Learning (HCL), Multi-modal Graph Attention Network (M-GAT) and Multi-Granularity Sparse Attention (MSA), which builds a dynamic mask, contrast and cross-structural similarity constraints on the word, sentence and paragraph hierarchies through HCL. Contrast and cross-structural similarity constraints are constructed at the word and paragraph levels by HCL to strengthen the local semantic and global thematic consistency of patent text; M-GAT models patent classification codes, citation relations and text semantics as heterogeneous graph structures, and achieves dynamic fusion of multi-source features by cross-modal gated attention; MSA adopts a hierarchical sparsity strategy to optimize the computational efficiency of long text modeling at word, phrase, sentence and paragraph granularity. Experiments show that the framework demonstrates significant advantages over existing deep learning methods in tasks such as patent classification and similarity matching, and provides a solution with both theoretical innovation and practical value for solving the problems of patent examination efficiency improvement and technology relevance mining.

[LG-8] he Power of Iterative Filtering for Supervised Learning with (Heavy) Contamination

链接: https://arxiv.org/abs/2505.20177
作者: Adam R. Klivans,Konstantinos Stavropoulos,Kevin Tian,Arsen Vasilyan
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 36 pages

点击查看摘要

Abstract:Inspired by recent work on learning with distribution shift, we give a general outlier removal algorithm called iterative polynomial filtering and show a number of striking applications for supervised learning with contamination: (1) We show that any function class that can be approximated by low-degree polynomials with respect to a hypercontractive distribution can be efficiently learned under bounded contamination (also known as nasty noise). This is a surprising resolution to a longstanding gap between the complexity of agnostic learning and learning with contamination, as it was widely believed that low-degree approximators only implied tolerance to label noise. (2) For any function class that admits the (stronger) notion of sandwiching approximators, we obtain near-optimal learning guarantees even with respect to heavy additive contamination, where far more than 1/2 of the training set may be added adversarially. Prior related work held only for regression and in a list-decodable setting. (3) We obtain the first efficient algorithms for tolerant testable learning of functions of halfspaces with respect to any fixed log-concave distribution. Even the non-tolerant case for a single halfspace in this setting had remained open. These results significantly advance our understanding of efficient supervised learning under contamination, a setting that has been much less studied than its unsupervised counterpart.

[LG-9] A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

链接: https://arxiv.org/abs/2505.20172
作者: Etienne Boursier,Scott Pesme,Radu-Alexandru Dragomir
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the dynamics of gradient flow with small weight decay on general training losses F: \mathbbR^d \to \mathbbR . Under mild regularity assumptions and assuming convergence of the unregularised gradient flow, we show that the trajectory with weight decay \lambda exhibits a two-phase behaviour as \lambda \to 0 . During the initial fast phase, the trajectory follows the unregularised gradient flow and converges to a manifold of critical points of F . Then, at time of order 1/\lambda , the trajectory enters a slow drift phase and follows a Riemannian gradient flow minimising the \ell_2 -norm of the parameters. This purely optimisation-based phenomenon offers a natural explanation for the \textitgrokking effect observed in deep learning, where the training loss rapidly reaches zero while the test loss plateaus for an extended period before suddenly improving. We argue that this generalisation jump can be attributed to the slow norm reduction induced by weight decay, as explained by our analysis. We validate this mechanism empirically on several synthetic regression tasks.

[LG-10] Model Stitching by Functional Latent Alignment

链接: https://arxiv.org/abs/2505.20142
作者: Ioannis Athanasiadis,Anmar Karmush,Michael Felsberg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Evaluating functional similarity involves quantifying the degree to which independently trained neural networks learn functionally similar representations. Reliably inferring the functional similarity of these networks remains an open problem with far-reaching implications for AI. Model stitching has emerged as a promising paradigm, where an optimal affine transformation aligns two models to solve a task, with the stitched model serving as a proxy for functional similarity. In this work, we draw inspiration from the knowledge distillation literature and propose Functional Latent Alignment (FuLA) as a novel optimality condition for model stitching. We revisit previously explored functional similarity testbeds and introduce a new one, based on which FuLA emerges as an overall more reliable method of functional similarity. Specifically, our experiments in (a) adversarial training, (b) shortcut training and, © cross-layer stitching, reveal that FuLA is less prone to artifacts tied to training on task cues while achieving non-trivial alignments that are missed by stitch-level matching.

[LG-11] Data-Distill-Net: A Data Distillation Approach Tailored for Reply-based Continual Learning

链接: https://arxiv.org/abs/2505.20135
作者: Wenyang Liao,Quanziang Wang,Yichen Wu,Renzhen Wang,Deyu Meng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Replay-based continual learning (CL) methods assume that models trained on a small subset can also effectively minimize the empirical risk of the complete dataset. These methods maintain a memory buffer that stores a sampled subset of data from previous tasks to consolidate past knowledge. However, this assumption is not guaranteed in practice due to the limited capacity of the memory buffer and the heuristic criteria used for buffer data selection. To address this issue, we propose a new dataset distillation framework tailored for CL, which maintains a learnable memory buffer to distill the global information from the current task data and accumulated knowledge preserved in the previous memory buffer. Moreover, to avoid the computational overhead and overfitting risks associated with parameterizing the entire buffer during distillation, we introduce a lightweight distillation module that can achieve global information distillation solely by generating learnable soft labels for the memory buffer data. Extensive experiments show that, our method can achieve competitive results and effectively mitigates forgetting across various datasets. The source code will be publicly available.

[LG-12] MolEditRL: Structure-Preserving Molecular Editing via Discrete Diffusion and Reinforcement Learning

链接: https://arxiv.org/abs/2505.20131
作者: Yuanxin Zhuang,Dazhong Shen,Ying Sun
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Molecular editing aims to modify a given molecule to optimize desired chemical properties while preserving structural similarity. However, current approaches typically rely on string-based or continuous representations, which fail to adequately capture the discrete, graph-structured nature of molecules, resulting in limited structural fidelity and poor controllability. In this paper, we propose MolEditRL, a molecular editing framework that explicitly integrates structural constraints with precise property optimization. Specifically, MolEditRL consists of two stages: (1) a discrete graph diffusion model pretrained to reconstruct target molecules conditioned on source structures and natural language instructions; (2) an editing-aware reinforcement learning fine-tuning stage that further enhances property alignment and structural preservation by explicitly optimizing editing decisions under graph constraints. For comprehensive evaluation, we construct MolEdit-Instruct, the largest and most property-rich molecular editing dataset, comprising 3 million diverse examples spanning single- and multi-property tasks across 10 chemical attributes. Experimental results demonstrate that MolEditRL significantly outperforms state-of-the-art methods in both property optimization accuracy and structural fidelity, achieving a 74% improvement in editing success rate while using 98% fewer parameters.

[LG-13] Balancing Interference and Correlation in Spatial Experimental Designs: A Causal Graph Cut Approach ICML2025

链接: https://arxiv.org/abs/2505.20130
作者: Zhu Jin,Li Jingyi,Zhou Hongyi,Lin Yinan,Lin Zhenhua,Shi Chengchun
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: Accepted by ICML2025

点击查看摘要

Abstract:This paper focuses on the design of spatial experiments to optimize the amount of information derived from the experimental data and enhance the accuracy of the resulting causal effect estimator. We propose a surrogate function for the mean squared error (MSE) of the estimator, which facilitates the use of classical graph cut algorithms to learn the optimal design. Our proposal offers three key advances: (1) it accommodates moderate to large spatial interference effects; (2) it adapts to different spatial covariance functions; (3) it is computationally efficient. Theoretical results and numerical experiments based on synthetic environments and a dispatch simulator that models a city-scale ridesharing market, further validate the effectiveness of our design. A python implementation of our method is available at this https URL.

[LG-14] ransformer in Protein: A Survey

链接: https://arxiv.org/abs/2505.20098
作者: Xiaowen Ling,Zhiqiang Li,Yanbin Wang,Zhuhong You
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:As protein informatics advances rapidly, the demand for enhanced predictive accuracy, structural analysis, and functional understanding has intensified. Transformer models, as powerful deep learning architectures, have demonstrated unprecedented potential in addressing diverse challenges across protein research. However, a comprehensive review of Transformer applications in this field remains lacking. This paper bridges this gap by surveying over 100 studies, offering an in-depth analysis of practical implementations and research progress of Transformers in protein-related tasks. Our review systematically covers critical domains, including protein structure prediction, function prediction, protein-protein interaction analysis, functional annotation, and drug discovery/target identification. To contextualize these advancements across various protein domains, we adopt a domain-oriented classification system. We first introduce foundational concepts: the Transformer architecture and attention mechanisms, categorize Transformer variants tailored for protein science, and summarize essential protein knowledge. For each research domain, we outline its objectives and background, critically evaluate prior methods and their limitations, and highlight transformative contributions enabled by Transformer models. We also curate and summarize pivotal datasets and open-source code resources to facilitate reproducibility and benchmarking. Finally, we discuss persistent challenges in applying Transformers to protein informatics and propose future research directions. This review aims to provide a consolidated foundation for the synergistic integration of Transformer and protein informatics, fostering further innovation and expanded applications in the field.

[LG-15] Spurious Privacy Leakage in Neural Networks

链接: https://arxiv.org/abs/2505.20095
作者: Chenxiang Zhang,Jun Pang,Sjouke Mauw
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

Abstract:Neural networks are vulnerable to privacy attacks aimed at stealing sensitive data. The risks can be amplified in a real-world scenario, particularly when models are trained on limited and biased data. In this work, we investigate the impact of spurious correlation bias on privacy vulnerability. We introduce \emphspurious privacy leakage, a phenomenon where spurious groups are significantly more vulnerable to privacy attacks than non-spurious groups. We further show that group privacy disparity increases in tasks with simpler objectives (e.g. fewer classes) due to the persistence of spurious features. Surprisingly, we find that reducing spurious correlation using spurious robust methods does not mitigate spurious privacy leakage. This leads us to introduce a perspective on privacy disparity based on memorization, where mitigating spurious correlation does not mitigate the memorization of spurious data, and therefore, neither the privacy level. Lastly, we compare the privacy of different model architectures trained with spurious data, demonstrating that, contrary to prior works, architectural choice can affect privacy outcomes.

[LG-16] Grokking ExPLAIND: Unifying Model Data and Training Attribution to Study Model Behavior

链接: https://arxiv.org/abs/2505.20076
作者: Florian Eichin,Yupei Du,Philipp Mondorf,Barbara Plank,Michael A. Hedderich
类目: Machine Learning (cs.LG)
*备注: 9 pages main paper, 23 pages in total, code at this https URL

点击查看摘要

Abstract:Post-hoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, these approaches usually lack theoretical support. In this work, we present ExPLAIND, a unified framework that integrates all three perspectives. First, we generalize recent work on gradient path kernels, which reformulate models trained by gradient descent as a kernel machine, to more realistic training settings. Empirically, we find that both a CNN and a Transformer model are replicated accurately by this reformulation. Second, we derive novel parameter- and step-wise influence scores from the kernel feature maps. We show their effectiveness in parameter pruning that is comparable to existing methods, reinforcing their value for model component attribution. Finally, jointly interpreting model components and data over the training process, we leverage ExPLAIND to analyze a Transformer that exhibits Grokking. Among other things, our findings support previously proposed stages of Grokking, while refining the final phase as one of alignment of input embeddings and final layers around a representation pipeline learned after the memorization phase. Overall, ExPLAIND provides a theoretically grounded, unified framework to interpret model behavior and training dynamics.

[LG-17] An Out-Of-Distribution Membership Inference Attack Approach for Cross-Domain Graph Attacks IJCAI-25

链接: https://arxiv.org/abs/2505.20074
作者: Jinyan Wang,Liu Yang,Yuecen Wei,Jiaxuan Si,Chenhao Guo,Qingyun Sun,Xianxian Li,Xingcheng Fu
类目: Machine Learning (cs.LG)
*备注: Accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI-25)

点击查看摘要

Abstract:Graph Neural Network-based methods face privacy leakage risks due to the introduction of topological structures about the targets, which allows attackers to bypass the target’s prior knowledge of the sensitive attributes and realize membership inference attacks (MIA) by observing and analyzing the topology distribution. As privacy concerns grow, the assumption of MIA, which presumes that attackers can obtain an auxiliary dataset with the same distribution, is increasingly deviating from reality. In this paper, we categorize the distribution diversity issue in real-world MIA scenarios as an Out-Of-Distribution (OOD) problem, and propose a novel Graph OOD Membership Inference Attack (GOOD-MIA) to achieve cross-domain graph attacks. Specifically, we construct shadow subgraphs with distributions from different domains to model the diversity of real-world data. We then explore the stable node representations that remain unchanged under external influences and consider eliminating redundant information from confounding environments and extracting task-relevant key information to more clearly distinguish between the characteristics of training data and unseen data. This OOD-based design makes cross-domain graph attacks possible. Finally, we perform risk extrapolation to optimize the attack’s domain adaptability during attack inference to generalize the attack to other domains. Experimental results demonstrate that GOOD-MIA achieves superior attack performance in datasets designed for multiple domains.

[LG-18] Ankh3: Multi-Task Pretraining with Sequence Denoising and Completion Enhances Protein Representations

链接: https://arxiv.org/abs/2505.20052
作者: Hazem Alsamkary,Mohamed Elshaffei,Mohamed Elkerdawy,Ahmed Elnaggar
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 8 pages, 0 figures

点击查看摘要

Abstract:Protein language models (PLMs) have emerged as powerful tools to detect complex patterns of protein sequences. However, the capability of PLMs to fully capture information on protein sequences might be limited by focusing on single pre-training tasks. Although adding data modalities or supervised objectives can improve the performance of PLMs, pre-training often remains focused on denoising corrupted sequences. To push the boundaries of PLMs, our research investigated a multi-task pre-training strategy. We developed Ankh3, a model jointly optimized on two objectives: masked language modeling with multiple masking probabilities and protein sequence completion relying only on protein sequences as input. This multi-task pre-training demonstrated that PLMs can learn richer and more generalizable representations solely from protein sequences. The results demonstrated improved performance in downstream tasks, such as secondary structure prediction, fluorescence, GB1 fitness, and contact prediction. The integration of multiple tasks gave the model a more comprehensive understanding of protein properties, leading to more robust and accurate predictions.

[LG-19] Catoni-Style Change Point Detection for Regret Minimization in Non-Stationary Heavy-Tailed Bandits

链接: https://arxiv.org/abs/2505.20051
作者: Gianmarco Genalti,Sujay Bhatt,Nicola Gatti,Alberto Maria Metelli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Regret minimization in stochastic non-stationary bandits gained popularity over the last decade, as it can model a broad class of real-world problems, from advertising to recommendation systems. Existing literature relies on various assumptions about the reward-generating process, such as Bernoulli or subgaussian rewards. However, in settings such as finance and telecommunications, heavy-tailed distributions naturally arise. In this work, we tackle the heavy-tailed piecewise-stationary bandit problem. Heavy-tailed bandits, introduced by Bubeck et al., 2013, operate on the minimal assumption that the finite absolute centered moments of maximum order 1+\epsilon are uniformly bounded by a constant v+\infty , for some \epsilon \in (0,1] . We focus on the most popular non-stationary bandit setting, i.e., the piecewise-stationary setting, in which the mean of reward-generating distributions may change at unknown time steps. We provide a novel Catoni-style change-point detection strategy tailored for heavy-tailed distributions that relies on recent advancements in the theory of sequential estimation, which is of independent interest. We introduce Robust-CPD-UCB, which combines this change-point detection strategy with optimistic algorithms for bandits, providing its regret upper bound and an impossibility result on the minimum attainable regret for any policy. Finally, we validate our approach through numerical experiments on synthetic and real-world datasets.

[LG-20] Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks

链接: https://arxiv.org/abs/2505.20048
作者: Ali Forootani,Mohammad Khosravi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Time series forecasting plays a critical role in domains such as energy, finance, and healthcare, where accurate predictions inform decision-making under uncertainty. Although Transformer-based models have demonstrated success in sequential modeling, their adoption for time series remains limited by challenges such as noise sensitivity, long-range dependencies, and a lack of inductive bias for temporal structure. In this work, we present a unified and principled framework for benchmarking three prominent Transformer forecasting architectures-Autoformer, Informer, and Patchtst-each evaluated through three architectural variants: Minimal, Standard, and Full, representing increasing levels of complexity and modeling capacity. We conduct over 1500 controlled experiments on a suite of ten synthetic signals, spanning five patch lengths and five forecast horizons under both clean and noisy conditions. Our analysis reveals consistent patterns across model families. To advance this landscape further, we introduce the Koopman-enhanced Transformer framework, Deep Koopformer, which integrates operator-theoretic latent state modeling to improve stability and interpretability. We demonstrate its efficacy on nonlinear and chaotic dynamical systems. Our results highlight Koopman based Transformer as a promising hybrid approach for robust, interpretable, and theoretically grounded time series forecasting in noisy and complex real-world conditions. Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY) Cite as: arXiv:2505.20048 [cs.LG] (or arXiv:2505.20048v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.20048 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-21] Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

链接: https://arxiv.org/abs/2505.20036
作者: Hazem Alsamkary,Mohamed Elshaffei,Mohamed Soudy,Sara Ossman,Abdallah Amr,Nehal Adel Abdelsalam,Mohamed Elkerdawy,Ahmed Elnaggar
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset of a total of 8,207 unique protein-protein interaction entries, by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. This dataset incorporates a stringent, less than or equal to 30%, sequence identity threshold to ensure robust splitting into training, validation, and test sets, minimizing data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine-tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to 12% increase in terms of Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.

[LG-22] Graph Wave Networks WWW2025

链接: https://arxiv.org/abs/2505.20034
作者: Juwei Yue,Haikuo Li,Jiawei Sheng,Yihan Guo,Xinghua Zhang,Chuan Zhou,Tingwen Liu,Li Guo
类目: Machine Learning (cs.LG)
*备注: 15 pages, 8 figures, published to WWW 2025

点击查看摘要

Abstract:Dynamics modeling has been introduced as a novel paradigm in message passing (MP) of graph neural networks (GNNs). Existing methods consider MP between nodes as a heat diffusion process, and leverage heat equation to model the temporal evolution of nodes in the embedding space. However, heat equation can hardly depict the wave nature of graph signals in graph signal processing. Besides, heat equation is essentially a partial differential equation (PDE) involving a first partial derivative of time, whose numerical solution usually has low stability, and leads to inefficient model training. In this paper, we would like to depict more wave details in MP, since graph signals are essentially wave signals that can be seen as a superposition of a series of waves in the form of eigenvector. This motivates us to consider MP as a wave propagation process to capture the temporal evolution of wave signals in the space. Based on wave equation in physics, we innovatively develop a graph wave equation to leverage the wave propagation on graphs. In details, we demonstrate that the graph wave equation can be connected to traditional spectral GNNs, facilitating the design of graph wave networks based on various Laplacians and enhancing the performance of the spectral GNNs. Besides, the graph wave equation is particularly a PDE involving a second partial derivative of time, which has stronger stability on graphs than the heat equation that involves a first partial derivative of time. Additionally, we theoretically prove that the numerical solution derived from the graph wave equation are constantly stable, enabling to significantly enhance model efficiency while ensuring its performance. Extensive experiments show that GWNs achieve SOTA and efficient performance on benchmark datasets, and exhibit outstanding performance in addressing challenging graph problems, such as over-smoothing and heterophily.

[LG-23] Ontology- and LLM -based Data Harmonization for Federated Learning in Healthcare

链接: https://arxiv.org/abs/2505.20020
作者: Natallia Kokash,Lei Wang,Thomas H. Gillespie,Adam Belloum,Paola Grosso,Sara Quinney,Lang Li,Bernard de Bono
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Related dataset: this https URL

点击查看摘要

Abstract:The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.

[LG-24] Data-Dependent Regret Bounds for Constrained MABs

链接: https://arxiv.org/abs/2505.20010
作者: Gianmarco Genalti,Francesco Emanuele Stradi,Matteo Castiglioni,Alberto Marchesi,Nicola Gatti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper initiates the study of data-dependent regret bounds in constrained MAB settings. These bounds depend on the sequence of losses that characterize the problem instance. Thus, they can be much smaller than classical \widetilde\mathcalO(\sqrtT) regret bounds, while being equivalent to them in the worst case. Despite this, data-dependent regret bounds have been completely overlooked in constrained MAB settings. The goal of this paper is to answer the following question: Can data-dependent regret bounds be derived in the presence of constraints? We answer this question affirmatively in constrained MABs with adversarial losses and stochastic constraints. Specifically, our main focus is on the most challenging and natural settings with hard constraints, where the learner must ensure that the constraints are always satisfied with high probability. We design an algorithm with a regret bound consisting of two data-dependent terms. The first term captures the difficulty of satisfying the constraints, while the second one encodes the complexity of learning independently of the presence of constraints. We also prove a lower bound showing that these two terms are not artifacts of our specific approach and analysis, but rather the fundamental components that inherently characterize the complexities of the problem. Finally, in designing our algorithm, we also derive some novel results in the related (and easier) soft constraints settings, which may be of independent interest.

[LG-25] abPFN: One Model to Rule Them All?

链接: https://arxiv.org/abs/2505.20003
作者: Qiong Zhang,Yan Shuo Tan,Qinglong Tian,Pengfei Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Hollmann et al. (Nature 637 (2025) 319-326) recently introduced TabPFN, a transformer-based deep learning model for regression and classification on tabular data, which they claim “outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time.” Furthermore, they have called TabPFN a “foundation model” for tabular data, as it can support “data generation, density estimation, learning reusable embeddings and fine-tuning”. If these statements are well-supported, TabPFN may have the potential to supersede existing modeling approaches on a wide range of statistical tasks, mirroring a similar revolution in other areas of artificial intelligence that began with the advent of large language models. In this paper, we provide a tailored explanation of how TabPFN works for a statistics audience, by emphasizing its interpretation as approximate Bayesian inference. We also provide more evidence of TabPFN’s “foundation model” capabilities: We show that an out-of-the-box application of TabPFN vastly outperforms specialized state-of-the-art methods for semi-supervised parameter estimation, prediction under covariate shift, and heterogeneous treatment effect estimation. We further show that TabPFN can outperform LASSO at sparse regression and can break a robustness-efficiency trade-off in classification. All experiments can be reproduced using the code provided at this https URL (this https URL).

[LG-26] Learning Optimal Multimodal Information Bottleneck Representations ICML2025

链接: https://arxiv.org/abs/2505.19996
作者: Qilong Wu,Yiyang Shao,Jun Wang,Xiaobo Sun
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Leveraging high-quality joint representations from multimodal data can greatly enhance model performance in various machine-learning based applications. Recent multimodal learning methods, based on the multimodal information bottleneck (MIB) principle, aim to generate optimal MIB with maximal task-relevant information and minimal superfluous information via regularization. However, these methods often set ad hoc regularization weights and overlook imbalanced task-relevant information across modalities, limiting their ability to achieve optimal MIB. To address this gap, we propose a novel multimodal learning framework, Optimal Multimodal Information Bottleneck (OMIB), whose optimization objective guarantees the achievability of optimal MIB by setting the regularization weight within a theoretically derived bound. OMIB further addresses imbalanced task-relevant information by dynamically adjusting regularization weights per modality, promoting the inclusion of all task-relevant information. Moreover, we establish a solid information-theoretical foundation for OMIB’s optimization and implement it under the variational approximation framework for computational efficiency. Finally, we empirically validate the OMIB’s theoretical properties on synthetic data and demonstrate its superiority over the state-of-the-art benchmark methods in various downstream tasks.

[LG-27] Regret Analysis of Averag e-Reward Unichain MDPs via an Actor-Critic Approach

链接: https://arxiv.org/abs/2505.19986
作者: Swetha Ganesh,Vaneet Aggarwal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages

点击查看摘要

Abstract:Actor-Critic methods are widely used for their scalability, yet existing theoretical guarantees for infinite-horizon average-reward Markov Decision Processes (MDPs) often rely on restrictive ergodicity assumptions. We propose NAC-B, a Natural Actor-Critic with Batching, that achieves order-optimal regret of \tildeO(\sqrtT) in infinite-horizon average-reward MDPs under the unichain assumption, which permits both transient states and periodicity. This assumption is among the weakest under which the classic policy gradient theorem remains valid for average-reward settings. NAC-B employs function approximation for both the actor and the critic, enabling scalability to problems with large state and action spaces. The use of batching in our algorithm helps mitigate potential periodicity in the MDP and reduces stochasticity in gradient estimates, and our analysis formalizes these benefits through the introduction of the constants C_\texthit and C_\texttar , which characterize the rate at which empirical averages over Markovian samples converge to the stationary distribution.

[LG-28] Rethinking Probabilistic Circuit Parameter Learning

链接: https://arxiv.org/abs/2505.19982
作者: Anji Liu,Guy Van den Broeck
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. While empirical extensions to the mini-batch setting have been proposed, it remains unclear what objective these algorithms are optimizing, making it difficult to assess their theoretical soundness. This paper bridges the gap by establishing a novel connection between the general EM objective and the standard full-batch EM algorithm. Building on this, we derive a theoretically grounded generalization to the mini-batch setting and demonstrate its effectiveness through preliminary empirical results.

[LG-29] Differential Privacy Analysis of Decentralized Gossip Averag ing under Varying Threat Models

链接: https://arxiv.org/abs/2505.19969
作者: Antti Koskela,Tejas Kulkarni
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Fully decentralized training of machine learning models offers significant advantages in scalability, robustness, and fault tolerance. However, achieving differential privacy (DP) in such settings is challenging due to the absence of a central aggregator and varying trust assumptions among nodes. In this work, we present a novel privacy analysis of decentralized gossip-based averaging algorithms with additive node-level noise, both with and without secure summation over each node’s direct neighbors. Our main contribution is a new analytical framework based on a linear systems formulation that accurately characterizes privacy leakage across these scenarios. This framework significantly improves upon prior analyses, for example, reducing the Rényi DP parameter growth from O(T^2) to O(T) , where T is the number of training rounds. We validate our analysis with numerical results demonstrating superior DP bounds compared to existing approaches. We further illustrate our analysis with a logistic regression experiment on MNIST image classification in a fully decentralized setting, demonstrating utility comparable to central aggregation methods.

[LG-30] Which Data Attributes Stimulate Math and Code Reasoning ? An Investigation via Influence Functions

链接: https://arxiv.org/abs/2505.19949
作者: Siqi Kou,Qingyuan Tian,Hanwen Xu,Zihao Zeng,Zhijie Deng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on the chain-of-thoughts (CoTs) generated by stronger models. However, existing strategies for curating such training data predominantly rely on heuristics, limiting generalizability and failing to capture subtleties underlying in data. To address these limitations, we leverage influence functions to systematically attribute LLMs’ reasoning ability on math and coding to individual training examples, sequences, and tokens, enabling deeper insights into effective data characteristics. Our Influence-based Reasoning Attribution (Infra) uncovers nontrivial cross-domain effects across math and coding tasks: high-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning. Based on these findings, we introduce a simple yet effective dataset reweighting strategy by flipping task difficulty, which doubles AIME24 accuracy from 10% to 20% and boosts LiveCodeBench accuracy from 33.8% to 35.3% for Qwen2.5-7B-Instruct. Moreover, our fine-grained attribution reveals that the sequence-level exploratory behaviors enhance reasoning performance in both math and code, and the token-level influence patterns are distinct for math and code reasoning: the former prefers natural language logic connectors and the latter emphasizes structural syntax.

[LG-31] Inverse Q-Learning Done Right: Offline Imitation Learning in Qπ-Realizable MDPs

链接: https://arxiv.org/abs/2505.19946
作者: Antoine Moulin,Gergely Neu,Luca Viano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear Q^\pi -realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (\SPOIL), which is guaranteed to match the performance of any expert up to an additive error \varepsilon with access to \mathcalO(\varepsilon^-2) samples. Moreover, we extend this result to possibly non-linear Q^\pi -realizable MDPs at the cost of a worse sample complexity of order \mathcalO(\varepsilon^-4) . Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of \SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.

[LG-32] Beyond Freezing: Sparse Tuning Enhances Plasticity in Continual Learning with Pre-Trained Models

链接: https://arxiv.org/abs/2505.19943
作者: Huan Zhang,Fan Lyu,Shuyu Dong,Shenghua Fan,Yujin Zheng,Dingwen Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual Learning with Pre-trained Models holds great promise for efficient adaptation across sequential tasks. However, most existing approaches freeze PTMs and rely on auxiliary modules like prompts or adapters, limiting model plasticity and leading to suboptimal generalization when facing significant distribution shifts. While full fine-tuning can improve adaptability, it risks disrupting crucial pre-trained knowledge. In this paper, we propose Mutual Information-guided Sparse Tuning (MIST), a plug-and-play method that selectively updates a small subset of PTM parameters, less than 5%, based on sensitivity to mutual information objectives. MIST enables effective task-specific adaptation while preserving generalization. To further reduce interference, we introduce strong sparsity regularization by randomly dropping gradients during tuning, resulting in fewer than 0.5% of parameters being updated per step. Applied before standard freeze-based methods, MIST consistently boosts performance across diverse continual learning benchmarks. Experiments show that integrating our method into multiple baselines yields significant performance gains. Our code is available at this https URL.

[LG-33] ask-Oriented Low-Label Semantic Communication With Self-Supervised Learning

链接: https://arxiv.org/abs/2505.19940
作者: Run Gu,Wei Xu,Zhaohui Yang,Dusit Niyato,Aylin Yener
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Task-oriented semantic communication enhances transmission efficiency by conveying semantic information rather than exact messages. Deep learning (DL)-based semantic communication can effectively cultivate the essential semantic knowledge for semantic extraction, transmission, and interpretation by leveraging massive labeled samples for downstream task training. In this paper, we propose a self-supervised learning-based semantic communication framework (SLSCom) to enhance task inference performance, particularly in scenarios with limited access to labeled samples. Specifically, we develop a task-relevant semantic encoder using unlabeled samples, which can be collected by devices in real-world edge networks. To facilitate task-relevant semantic extraction, we introduce self-supervision for learning contrastive features and formulate the information bottleneck (IB) problem to balance the tradeoff between the informativeness of the extracted features and task inference performance. Given the computational challenges of the IB problem, we devise a practical and effective solution by employing self-supervised classification and reconstruction pretext tasks. We further propose efficient joint training methods to enhance end-to-end inference accuracy over wireless channels, even with few labeled samples. We evaluate the proposed framework on image classification tasks over multipath wireless channels. Extensive simulation results demonstrate that SLSCom significantly outperforms conventional digital coding methods and existing DL-based approaches across varying labeled data set sizes and SNR conditions, even when the unlabeled samples are irrelevant to the downstream tasks.

[LG-34] Logic Gate Neural Networks are Good for Verification

链接: https://arxiv.org/abs/2505.19932
作者: Fabian Kresse,Emily Yu,Christoph H. Lampert,Thomas A. Henzinger
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 15 pages, 7 figures, 1 table. Accepted at NeuS 2025; to appear in PMLR Vol. 288

点击查看摘要

Abstract:Learning-based systems are increasingly deployed across various domains, yet the complexity of traditional neural networks poses significant challenges for formal verification. Unlike conventional neural networks, learned Logic Gate Networks (LGNs) replace multiplications with Boolean logic gates, yielding a sparse, netlist-like architecture that is inherently more amenable to symbolic verification, while still delivering promising performance. In this paper, we introduce a SAT encoding for verifying global robustness and fairness in LGNs. We evaluate our method on five benchmark datasets, including a newly constructed 5-class variant, and find that LGNs are both verification-friendly and maintain strong predictive performance.

[LG-35] Learning to Trust Bellm an Updates: Selective State-Adaptive Regularization for Offline RL ICML2025

链接: https://arxiv.org/abs/2505.19923
作者: Qin-Wen Luo,Ming-Kun Xie,Ye-Wen Wang,Sheng-Jun Huang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.

[LG-36] Generalized and Personalized Federated Learning with Foundation Models via Orthogonal Transformations

链接: https://arxiv.org/abs/2505.19888
作者: Eun Gyung Kong,Je Won Yeom,Yonghoon Jeon,Taesup Kim
类目: Machine Learning (cs.LG)
*备注: 27 pages, 5 figures

点击查看摘要

Abstract:Federated Learning (FL) aims to train models across decentralized clients or devices holding local data without the need for centralized data collection, thus enhancing data privacy and security. However, achieving both generalization and personalization in heterogeneous settings remains a significant challenge. To address this, we introduce FedOT, a novel approach that leverages black-box foundation models. FedOT shares only a global task-dependent classifier across clients while locally adapting features through orthogonal transformations. By enforcing orthogonality, FedOT mitigates gradient conflicts across diverse clients, preserves semantic integrity, and achieves robust performance even in the presence of substantial data heterogeneity. The strategy of combining global and local parameters enables a more balanced approach for both generalization and personalization, outperforming baseline FL methods across multiple benchmarks. Furthermore, our extensive analysis confirms that joint optimization of global classifiers and local orthogonal transformations yields superior performance and suggests broader applicability.

[LG-37] Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?

链接: https://arxiv.org/abs/2505.19855
作者: Zexi Li,Xiangzhu Wang,William F. Shen,Meghdad Kurmanji,Xinchi Qiu,Dongqi Cai,Chao Wu,Nicholas D. Lane
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Large language Model (LLM) unlearning, i.e., selectively removing information from LLMs, is vital for responsible model deployment. Differently, LLM knowledge editing aims to modify LLM knowledge instead of removing it. Though editing and unlearning seem to be two distinct tasks, we find there is a tight connection between them. In this paper, we conceptualize unlearning as a special case of editing where information is modified to a refusal or “empty set” \emptyset response, signifying its removal. This paper thus investigates if knowledge editing techniques are strong baselines for LLM unlearning. We evaluate state-of-the-art (SOTA) editing methods (e.g., ROME, MEMIT, GRACE, WISE, and AlphaEdit) against existing unlearning approaches on pretrained and finetuned knowledge. Results show certain editing methods, notably WISE and AlphaEdit, are effective unlearning baselines, especially for pretrained knowledge, and excel in generating human-aligned refusal answers. To better adapt editing methods for unlearning applications, we propose practical recipes including self-improvement and query merging. The former leverages the LLM’s own in-context learning ability to craft a more human-aligned unlearning target, and the latter enables ROME and MEMIT to perform well in unlearning longer sample sequences. We advocate for the unlearning community to adopt SOTA editing methods as baselines and explore unlearning from an editing perspective for more holistic LLM memory control.

[LG-38] One Surrogate to Fool Them All: Universal Transferable and Targeted Adversarial Attacks with CLIP CCS

链接: https://arxiv.org/abs/2505.19840
作者: Binyan Xu,Xilin Dai,Di Tang,Kehuan Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 21 pages, 15 figures, 18 tables To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2025

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have achieved widespread success yet remain prone to adversarial attacks. Typically, such attacks either involve frequent queries to the target model or rely on surrogate models closely mirroring the target model – often trained with subsets of the target model’s training data – to achieve high attack success rates through transferability. However, in realistic scenarios where training data is inaccessible and excessive queries can raise alarms, crafting adversarial examples becomes more challenging. In this paper, we present UnivIntruder, a novel attack framework that relies solely on a single, publicly available CLIP model and publicly available datasets. By using textual concepts, UnivIntruder generates universal, transferable, and targeted adversarial perturbations that mislead DNNs into misclassifying inputs into adversary-specified classes defined by textual concepts. Our extensive experiments show that our approach achieves an Attack Success Rate (ASR) of up to 85% on ImageNet and over 99% on CIFAR-10, significantly outperforming existing transfer-based methods. Additionally, we reveal real-world vulnerabilities, showing that even without querying target models, UnivIntruder compromises image search engines like Google and Baidu with ASR rates up to 84%, and vision language models like GPT-4 and Claude-3.5 with ASR rates up to 80%. These findings underscore the practicality of our attack in scenarios where traditional avenues are blocked, highlighting the need to reevaluate security paradigms in AI applications. Comments: 21 pages, 15 figures, 18 tables To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2025 Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) MSC classes: 68T07 ACMclasses: I.2.6 Cite as: arXiv:2505.19840 [cs.CR] (or arXiv:2505.19840v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.19840 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-39] Multi-Agent Reinforcement Learning in Cybersecurity: From Fundamentals to Applications

链接: https://arxiv.org/abs/2505.19837
作者: Christoph R. Landolt,Christoph Würsch,Roland Meier,Alain Mermoud,Julian Jang-Jaccard
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-Agent Reinforcement Learning (MARL) has shown great potential as an adaptive solution for addressing modern cybersecurity challenges. MARL enables decentralized, adaptive, and collaborative defense strategies and provides an automated mechanism to combat dynamic, coordinated, and sophisticated threats. This survey investigates the current state of research in MARL applications for automated cyber defense (ACD), focusing on intruder detection and lateral movement containment. Additionally, it examines the role of Autonomous Intelligent Cyber-defense Agents (AICA) and Cyber Gyms in training and validating MARL agents. Finally, the paper outlines existing challenges, such as scalability and adversarial robustness, and proposes future research directions. This also discusses how MARL integrates in AICA to provide adaptive, scalable, and dynamic solutions to counter the increasingly sophisticated landscape of cyber threats. It highlights the transformative potential of MARL in areas like intrusion detection and lateral movement containment, and underscores the value of Cyber Gyms for training and validation of AICA.

[LG-40] Poison in the Well: Feature Embedding Disruption in Backdoor Attacks ICME2025

链接: https://arxiv.org/abs/2505.19821
作者: Zhou Feng,Jiahao Chen,Chunyi Zhou,Yuwen Pu,Qingming Li,Shouling Ji
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted to ICME 2025

点击查看摘要

Abstract:Backdoor attacks embed malicious triggers into training data, enabling attackers to manipulate neural network behavior during inference while maintaining high accuracy on benign inputs. However, existing backdoor attacks face limitations manifesting in excessive reliance on training data, poor stealth, and instability, which hinder their effectiveness in real-world applications. Therefore, this paper introduces ShadowPrint, a versatile backdoor attack that targets feature embeddings within neural networks to achieve high ASRs and stealthiness. Unlike traditional approaches, ShadowPrint reduces reliance on training data access and operates effectively with exceedingly low poison rates (as low as 0.01%). It leverages a clustering-based optimization strategy to align feature embeddings, ensuring robust performance across diverse scenarios while maintaining stability and stealth. Extensive evaluations demonstrate that ShadowPrint achieves superior ASR (up to 100%), steady CA (with decay no more than 1% in most cases), and low DDR (averaging below 5%) across both clean-label and dirty-label settings, and with poison rates ranging from as low as 0.01% to 0.05%, setting a new standard for backdoor attack capabilities and emphasizing the need for advanced defense strategies focused on feature space manipulations.

[LG-41] InfoCons: Identifying Interpretable Critical Concepts in Point Clouds via Information Theory ICML2025

链接: https://arxiv.org/abs/2505.19820
作者: Feifei Li,Mi Zhang,Zhaoxiang Wang,Min Yang
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2025 (Poster)

点击查看摘要

Abstract:Interpretability of point cloud (PC) models becomes imperative given their deployment in safety-critical scenarios such as autonomous vehicles. We focus on attributing PC model outputs to interpretable critical concepts, defined as meaningful subsets of the input point cloud. To enable human-understandable diagnostics of model failures, an ideal critical subset should be faithful (preserving points that causally influence predictions) and conceptually coherent (forming semantically meaningful structures that align with human perception). We propose InfoCons, an explanation framework that applies information-theoretic principles to decompose the point cloud into 3D concepts, enabling the examination of their causal effect on model predictions with learnable priors. We evaluate InfoCons on synthetic datasets for classification, comparing it qualitatively and quantitatively with four baselines. We further demonstrate its scalability and flexibility on two real-world datasets and in two applications that utilize critical scores of PC.

[LG-42] Density Ratio-Free Doubly Robust Proxy Causal Learning

链接: https://arxiv.org/abs/2505.19807
作者: Bariscan Bozkurt,Houssam Zenati,Dimitri Meunier,Liyuan Xu,Arthur Gretton
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of causal function estimation in the Proxy Causal Learning (PCL) framework, where confounders are not observed but proxies for the confounders are available. Two main approaches have been proposed: outcome bridge-based and treatment bridge-based methods. In this work, we propose two kernel-based doubly robust estimators that combine the strengths of both approaches, and naturally handle continuous and high-dimensional variables. Our identification strategy builds on a recent density ratio-free method for treatment bridge-based PCL; furthermore, in contrast to previous approaches, it does not require indicator functions or kernel smoothing over the treatment variable. These properties make it especially well-suited for continuous or high-dimensional treatments. By using kernel mean embeddings, we have closed-form solutions and strong consistency guarantees. Our estimators outperform existing methods on PCL benchmarks, including a prior doubly robust method that requires both kernel smoothing and density ratio estimation.

[LG-43] What Can RL Bring to VLA Generalization? An Empirical Study

链接: https://arxiv.org/abs/2505.19789
作者: Jijia Liu,Feng Gao,Bingwen Wei,Xinlei Chen,Qingmin Liao,Yi Wu,Chao Yu,Yu Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at this https URL

[LG-44] Unfolding AlphaFolds Bayesian Roots in Probability Kinematics

链接: https://arxiv.org/abs/2505.19763
作者: Thomas Hamelryck,Kanti V. Mardia
类目: Machine Learning (cs.LG)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:We present a novel theoretical interpretation of AlphaFold1. The seminal breakthrough of AlphaFold1 in protein structure prediction by deep learning relied on a learned potential energy function, in contrast to the later end-to-end architectures of AlphaFold2 and AlphaFold3. While this potential was originally justified by referring to physical potentials of mean force (PMFs), we reinterpret AlphaFold1’s potential as an instance of probability kinematics - also known as Jeffrey conditioning - a principled but underrecognised generalization of conventional Bayesian updating. Probability kinematics accommodates uncertain or soft evidence in the form of updated probabilities over a partition. This perspective reveals AlphaFold1’s potential as a form of generalized Bayesian updating, rather than a thermodynamic potential. To confirm our probabilistic framework’s scope and precision, we analyze a synthetic 2D model in which an angular random walk prior is updated with evidence on distances via probability kinematics, mirroring AlphaFold1’s approach. This theoretical contribution connects AlphaFold1 to a broader class of well-justified Bayesian methods, allowing precise quantification, surpassing merely qualitative heuristics based on PMFs. More broadly, given the achievements of AlphaFold1, probability kinematics holds considerable promise for probabilistic deep learning, as it allows for the formulation of complex models from a few simpler components.

[LG-45] Machine Learning Algorithm for Noise Reduction and Disease-Causing Gene Feature Extraction in Gene Sequencing Data

链接: https://arxiv.org/abs/2505.19740
作者: Weichen Si,Yihao Ou,Zhen Tian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this study, we propose a machine learning-based method for noise reduction and disease-causing gene feature extraction in gene sequencing DeepSeqDenoise algorithm combines CNN and RNN to effectively remove the sequencing noise, and improves the signal-to-noise ratio by 9.4 dB. We screened 17 key features by feature engineering, and constructed an integrated learning model to predict disease-causing genes with 94.3% accuracy. We successfully identified 57 new candidate disease-causing genes in a cardiovascular disease cohort validation, and detected 3 missed variants in clinical applications. The method significantly outperforms existing tools and provides strong support for accurate diagnosis of genetic diseases.

[LG-46] On the Relation between Rectified Flows and Optimal Transport

链接: https://arxiv.org/abs/2505.19712
作者: Johannes Hertrich,Antonin Chambolle,Julie Delon
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper investigates the connections between rectified flows, flow matching, and optimal transport. Flow matching is a recent approach to learning generative models by estimating velocity fields that guide transformations from a source to a target distribution. Rectified flow matching aims to straighten the learned transport paths, yielding more direct flows between distributions. Our first contribution is a set of invariance properties of rectified flows and explicit velocity fields. In addition, we also provide explicit constructions and analysis in the Gaussian (not necessarily independent) and Gaussian mixture settings and study the relation to optimal transport. Our second contribution addresses recent claims suggesting that rectified flows, when constrained such that the learned velocity field is a gradient, can yield (asymptotically) solutions to optimal transport problems. We study the existence of solutions for this problem and demonstrate that they only relate to optimal transport under assumptions that are significantly stronger than those previously acknowledged. In particular, we present several counter-examples that invalidate earlier equivalence results in the literature, and we argue that enforcing a gradient constraint on rectified flows is, in general, not a reliable method for computing optimal transport maps.

[LG-47] Graph Guided Diffusion: Unified Guidance for Conditional Graph Generation

链接: https://arxiv.org/abs/2505.19685
作者: Victor M. Tenorio,Nicolas Zilberstein,Santiago Segarra,Antonio G. Marques
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as powerful generative models for graph generation, yet their use for conditional graph generation remains a fundamental challenge. In particular, guiding diffusion models on graphs under arbitrary reward signals is difficult: gradient-based methods, while powerful, are often unsuitable due to the discrete and combinatorial nature of graphs, and non-differentiable rewards further complicate gradient-based guidance. We propose Graph Guided Diffusion (GGDiff), a novel guidance framework that interprets conditional diffusion on graphs as a stochastic control problem to address this challenge. GGDiff unifies multiple guidance strategies, including gradient-based guidance (for differentiable rewards), control-based guidance (using control signals from forward reward evaluations), and zero-order approximations (bridging gradient-based and gradient-free optimization). This comprehensive, plug-and-play framework enables zero-shot guidance of pre-trained diffusion models under both differentiable and non-differentiable reward functions, adapting well-established guidance techniques to graph generation–a direction largely unexplored. Our formulation balances computational efficiency, reward alignment, and sample quality, enabling practical conditional generation across diverse reward types. We demonstrate the efficacy of GGDiff in various tasks, including constraints on graph motifs, fairness, and link prediction, achieving superior alignment with target rewards while maintaining diversity and fidelity.

[LG-48] Deep Actor-Critics with Tight Risk Certificates

链接: https://arxiv.org/abs/2505.19682
作者: Bahareh Tasdighi,Manuel Haussmann,Yi-Shan Wu,Andres R. Masegosa,Melih Kandemir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:After a period of research, deep actor-critic algorithms have reached a level where they influence our everyday lives. They serve as the driving force behind the continual improvement of large language models through user-collected feedback. However, their deployment in physical systems is not yet widely adopted, mainly because no validation scheme that quantifies their risk of malfunction. We demonstrate that it is possible to develop tight risk certificates for deep actor-critic algorithms that predict generalization performance from validation-time observations. Our key insight centers on the effectiveness of minimal evaluation data. Surprisingly, a small feasible of evaluation roll-outs collected from a pretrained policy suffices to produce accurate risk certificates when combined with a simple adaptation of PAC-Bayes theory. Specifically, we adopt a recently introduced recursive PAC-Bayes approach, which splits validation data into portions and recursively builds PAC-Bayes bounds on the excess loss of each portion’s predictor, using the predictor from the previous portion as a data-informed prior. Our empirical results across multiple locomotion tasks and policy expertise levels demonstrate risk certificates that are tight enough to be considered for practical use.

[LG-49] Cut out and Replay: A Simple yet Versatile Strategy for Multi-Label Online Continual Learning ICML2025

链接: https://arxiv.org/abs/2505.19680
作者: Xinrui Wang,Shao-yuan Li,Jiaqiang Zhang,Songcan Chen
类目: Machine Learning (cs.LG)
*备注: accepted by ICML 2025

点击查看摘要

Abstract:Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, \textitthey all overlook label-specific region identifying and feature learning - a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by further identifying, strengthening and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address catastrophic forgetting, missing labels, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method. The code is available at \hrefthis https URLthis https URL

[LG-50] Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling

链接: https://arxiv.org/abs/2505.19669
作者: Haiyang Sun,Shujie Hu,Shujie Liu,Lingwei Meng,Hui Wang,Bing Han,Yifan Yang,Yanqing Liu,Sheng Zhao,Yan Lu,Yanmin Qian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces high processing latency. To address this issue, we propose SMLLE, a streaming framework for generating high-quality speech frame-by-frame. SMLLE employs a Transducer to convert text into semantic tokens in real time while simultaneously obtaining duration alignment information. The combined outputs are then fed into a fully autoregressive (AR) streaming model to reconstruct mel-spectrograms. To further stabilize the generation process, we design a Delete Bos Mechanism that allows the AR model to access future text introducing as minimal delay as possible. Experimental results suggest that the SMLLE outperforms current streaming TTS methods and achieves comparable performance over sentence-level TTS systems. Samples are available on this https URL.

[LG-51] Energy-based generator matching: A neural sampler for general state space

链接: https://arxiv.org/abs/2505.19646
作者: Dongyeop Woo,Minsu Kim,Minkyu Kim,Kiyoung Seong,Sungsoo Ahn
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Energy-based generator matching (EGM), a modality-agnostic approach to train generative models from energy functions in the absence of data. Extending the recently proposed generator matching, EGM enables training of arbitrary continuous-time Markov processes, e.g., diffusion, flow, and jump, and can generate data from continuous, discrete, and a mixture of two modalities. To this end, we propose estimating the generator matching loss using self-normalized importance sampling with an additional bootstrapping trick to reduce variance in the importance weight. We validate EGM on both discrete and multimodal tasks up to 100 and 20 dimensions, respectively.

[LG-52] When fractional quasi p-norms concentrate

链接: https://arxiv.org/abs/2505.19635
作者: Ivan Y. Tyukin,Bogdan Grechuk,Evgeny M. Mirkes,Alexander N. Gorban
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Concentration of distances in high dimension is an important factor for the development and design of stable and reliable data analysis algorithms. In this paper, we address the fundamental long-standing question about the concentration of distances in high dimension for fractional quasi p -norms, p\in(0,1) . The topic has been at the centre of various theoretical and empirical controversies. Here we, for the first time, identify conditions when fractional quasi p -norms concentrate and when they don’t. We show that contrary to some earlier suggestions, for broad classes of distributions, fractional quasi p -norms admit exponential and uniform in p concentration bounds. For these distributions, the results effectively rule out previously proposed approaches to alleviate concentration by “optimal” setting the values of p in (0,1) . At the same time, we specify conditions and the corresponding families of distributions for which one can still control concentration rates by appropriate choices of p . We also show that in an arbitrarily small vicinity of a distribution from a large class of distributions for which uniform concentration occurs, there are uncountably many other distributions featuring anti-concentration properties. Importantly, this behavior enables devising relevant data encoding or representation schemes favouring or discouraging distance concentration. The results shed new light on this long-standing problem and resolve the tension around the topic in both theory and empirical evidence reported in the literature.

[LG-53] SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows

链接: https://arxiv.org/abs/2505.19619
作者: Janik Kreit,Dominic Schuh,Kim A. Nicoli,Lena Funcke
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 14 figures

点击查看摘要

Abstract:Deep generative models have recently garnered significant attention across various fields, from physics to chemistry, where sampling from unnormalized Boltzmann-like distributions represents a fundamental challenge. In particular, autoregressive models and normalizing flows have become prominent due to their appealing ability to yield closed-form probability densities. Moreover, it is well-established that incorporating prior knowledge - such as symmetries - into deep neural networks can substantially improve training performances. In this context, recent advances have focused on developing symmetry-equivariant generative models, achieving remarkable results. Building upon these foundations, this paper introduces Symmetry-Enforcing Stochastic Modulation (SESaMo). Similar to equivariant normalizing flows, SESaMo enables the incorporation of inductive biases (e.g., symmetries) into normalizing flows through a novel technique called stochastic modulation. This approach enhances the flexibility of the generative model, allowing to effectively learn a variety of exact and broken symmetries. Our numerical experiments benchmark SESaMo in different scenarios, including an 8-Gaussian mixture model and physically relevant field theories, such as the \phi^4 theory and the Hubbard model.

[LG-54] Kuramoto-FedAvg: Using Synchronization Dynamics to Improve Federated Learning Optimization under Statistical Heterogeneity

链接: https://arxiv.org/abs/2505.19605
作者: Aggrey Muhebwa,Khotso Selialia,Fatima Anwar,Khalid K. Osman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning on heterogeneous (non-IID) client data experiences slow convergence due to client drift. To address this challenge, we propose Kuramoto-FedAvg, a federated optimization algorithm that reframes the weight aggregation step as a synchronization problem inspired by the Kuramoto model of coupled oscillators. The server dynamically weighs each client’s update based on its phase alignment with the global update, amplifying contributions that align with the global gradient direction while minimizing the impact of updates that are out of phase. We theoretically prove that this synchronization mechanism reduces client drift, providing a tighter convergence bound compared to the standard FedAvg under heterogeneous data distributions. Empirical validation supports our theoretical findings, showing that Kuramoto-FedAvg significantly accelerates convergence and improves accuracy across multiple benchmark datasets. Our work highlights the potential of coordination and synchronization-based strategies for managing gradient diversity and accelerating federated optimization in realistic non-IID settings.

[LG-55] Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

链接: https://arxiv.org/abs/2505.19602
作者: Kunjun Li,Zigeng Chen,Cheng-Yen Yang,Jenq-Neng Hwang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache capacity. Conversely, refiners focus attention on the current token map to process local details, consequently necessitating substantially reduced cache capacity. ScaleKV optimizes the multi-scale inference pipeline by identifying scale-specific drafters and refiners, facilitating differentiated cache management tailored to each scale. Evaluation on the state-of-the-art text-to-image VAR model family, Infinity, demonstrates that our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.

[LG-56] Model Agnostic Differentially Private Causal Inference

链接: https://arxiv.org/abs/2505.19589
作者: Christiant Lebeda,Mathieu Even,Aurélien Bellet,Julie Josse
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Estimating causal effects from observational data is essential in fields such as medicine, economics and social sciences, where privacy concerns are paramount. We propose a general, model-agnostic framework for differentially private estimation of average treatment effects (ATE) that avoids strong structural assumptions on the data-generating process or the models used to estimate propensity scores and conditional outcomes. In contrast to prior work, which enforces differential privacy by directly privatizing these nuisance components and results in a privacy cost that scales with model complexity, our approach decouples nuisance estimation from privacy protection. This separation allows the use of flexible, state-of-the-art black-box models, while differential privacy is achieved by perturbing only predictions and aggregation steps within a fold-splitting scheme with ensemble techniques. We instantiate the framework for three classical estimators – the G-formula, inverse propensity weighting (IPW), and augmented IPW (AIPW) – and provide formal utility and privacy guarantees. Empirical results show that our methods maintain competitive performance under realistic privacy budgets. We further extend our framework to support meta-analysis of multiple private ATE estimates. Our results bridge a critical gap between causal inference and privacy-preserving data analysis.

[LG-57] Lego Sketch: A Scalable Memory-augmented Neural Network for Sketching Data Streams ICML2025

链接: https://arxiv.org/abs/2505.19561
作者: Yuan Feng,Yukun Cao,Hairu Wang,Xike Xie,S Kevin Zhou
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy this http URL, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the \it Lego sketch, a novel sketch with superior scalability and accuracy. Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains. Our theoretical analysis guarantees its high scalability and provides the first error bound for neural sketch. Furthermore, extensive experimental evaluations demonstrate that the Lego sketch exhibits superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches. Our code is available at this https URL.

[LG-58] EuroCon: Benchmarking Parliament Deliberation for Political Consensus Finding

链接: https://arxiv.org/abs/2505.19558
作者: Zhaowei Zhang,Minghua Yi,Mengmeng Wang,Fengshuo Bai,Zilong Zheng,Yipeng Kang,Yaodong Yang
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: EuroCon is publicly available at this https URL

点击查看摘要

Abstract:Achieving political consensus is crucial yet challenging for the effective functioning of social governance. However, although frontier AI systems represented by large language models (LLMs) have developed rapidly in recent years, their capabilities on this scope are still understudied. In this paper, we introduce EuroCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament over 13 years, ranging from 2009 to 2022, to evaluate the ability of LLMs to reach political consensus among divergent party positions across diverse parliament settings. Specifically, EuroCon incorporates four factors to build each simulated parliament setting: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also develop an evaluation framework for EuroCon to simulate real voting outcomes in different parliament settings, assessing whether LLM-generated resolutions meet predefined political goals. Our experimental results demonstrate that even state-of-the-art models remain undersatisfied with complex tasks like passing resolutions by a two-thirds majority and addressing security issues, while revealing some common strategies LLMs use to find consensus under different power structures, such as prioritizing the stance of the dominant party, highlighting EuroCon’s promise as an effective platform for studying LLMs’ ability to find political consensus.

[LG-59] On scalable and efficient training of diffusion samplers

链接: https://arxiv.org/abs/2505.19552
作者: Minkyu Kim,Kiyoung Seong,Dongyeop Woo,Sungsoo Ahn,Minsu Kim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the challenge of training diffusion models to sample from unnormalized energy distributions in the absence of data, the so-called diffusion samplers. Although these approaches have shown promise, they struggle to scale in more demanding scenarios where energy evaluations are expensive and the sampling space is high-dimensional. To address this limitation, we propose a scalable and sample-efficient framework that properly harmonizes the powerful classical sampling method and the diffusion sampler. Specifically, we utilize Monte Carlo Markov chain (MCMC) samplers with a novelty-based auxiliary energy as a Searcher to collect off-policy samples, using an auxiliary energy function to compensate for exploring modes the diffusion sampler rarely visits. These off-policy samples are then combined with on-policy data to train the diffusion sampler, thereby expanding its coverage of the energy landscape. Furthermore, we identify primacy bias, i.e., the preference of samplers for early experience during training, as the main cause of mode collapse during training, and introduce a periodic re-initialization trick to resolve this issue. Our method significantly improves sample efficiency on standard benchmarks for diffusion samplers and also excels at higher-dimensional problems and real-world molecular conformer generation.

[LG-60] Unlocking the Power of Diffusion Models in Sequential Recommendation: A Simple and Effective Approach

链接: https://arxiv.org/abs/2505.19544
作者: Jialei Chen,Yuanbo Xu,Yiheng Jiang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we focus on the often-overlooked issue of embedding collapse in existing diffusion-based sequential recommendation models and propose ADRec, an innovative framework designed to mitigate this problem. Diverging from previous diffusion-based methods, ADRec applies an independent noise process to each token and performs diffusion across the entire target sequence during training. ADRec captures token interdependency through auto-regression while modeling per-token distributions through token-level diffusion. This dual approach enables the model to effectively capture both sequence dynamics and item representations, overcoming the limitations of existing methods. To further mitigate embedding collapse, we propose a three-stage training strategy: (1) pre-training the embedding weights, (2) aligning these weights with the ADRec backbone, and (3) fine-tuning the model. During inference, ADRec applies the denoising process only to the last token, ensuring that the meaningful patterns in historical interactions are preserved. Our comprehensive empirical evaluation across six datasets underscores the effectiveness of ADRec in enhancing both the accuracy and efficiency of diffusion-based sequential recommendation systems.

[LG-61] Cuff-KT: Tackling Learners Real-time Learning Pattern Adjustment via Tuning-Free Knowledge State Guided Model Updating KDD2025

链接: https://arxiv.org/abs/2505.19543
作者: Yiyun Zhou,Zheqi Lv,Shengyu Zhang,Jingyuan Chen
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted by KDD 2025, Research Track

点击查看摘要

Abstract:Knowledge Tracing (KT) is a core component of Intelligent Tutoring Systems, modeling learners’ knowledge state to predict future performance and provide personalized learning support. Traditional KT models assume that learners’ learning abilities remain relatively stable over short periods or change in predictable ways based on prior performance. However, in reality, learners’ abilities change irregularly due to factors like cognitive fatigue, motivation, and external stress – a task introduced, which we refer to as Real-time Learning Pattern Adjustment (RLPA). Existing KT models, when faced with RLPA, lack sufficient adaptability, because they fail to timely account for the dynamic nature of different learners’ evolving learning patterns. Current strategies for enhancing adaptability rely on retraining, which leads to significant overfitting and high time overhead issues. To address this, we propose Cuff-KT, comprising a controller and a generator. The controller assigns value scores to learners, while the generator generates personalized parameters for selected learners. Cuff-KT controllably adapts to data changes fast and flexibly without fine-tuning. Experiments on five datasets from different subjects demonstrate that Cuff-KT significantly improves the performance of five KT models with different structures under intra- and inter-learner shifts, with an average relative increase in AUC of 10% and 4%, respectively, at a negligible time cost, effectively tackling RLPA task. Our code and datasets are fully available at this https URL.

[LG-62] Continuous-Time Analysis of Heavy Ball Momentum in Min-Max Games ICML2025

链接: https://arxiv.org/abs/2505.19537
作者: Yi Feng,Kaito Fujii,Stratis Skoulakis,Xiao Wang,Volkan Cevher
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Accepted for ICML 2025

点击查看摘要

Abstract:Since Polyak’s pioneering work, heavy ball (HB) momentum has been widely studied in minimization. However, its role in min-max games remains largely unexplored. As a key component of practical min-max algorithms like Adam, this gap limits their effectiveness. In this paper, we present a continuous-time analysis for HB with simultaneous and alternating update schemes in min-max games. Locally, we prove smaller momentum enhances algorithmic stability by enabling local convergence across a wider range of step sizes, with alternating updates generally converging faster. Globally, we study the implicit regularization of HB, and find smaller momentum guides algorithms trajectories towards shallower slope regions of the loss landscapes, with alternating updates amplifying this effect. Surprisingly, all these phenomena differ from those observed in minimization, where larger momentum yields similar effects. Our results reveal fundamental differences between HB in min-max games and minimization, and numerical experiments further validate our theoretical results.

[LG-63] ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models

链接: https://arxiv.org/abs/2505.19533
作者: Yachuan Liu,Xiaochun Wei,Lin Shi,Xinnuo Li,Bohan Zhang,Paramveer Dhillon,Qiaozhu Mei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges in ex-ante reasoning, where analysis, inference, or predictions must be made without access to information from future events. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints. The benchmark includes a variety of tasks: stock prediction, Wikipedia event prediction, scientific publication prediction, and Question Answering (QA), designed to assess factual knowledge under temporal cutoff constraints. We use leakage rate to quantify models’ reliance on future information beyond cutoff timestamps. Experimental results reveal that LLMs struggle to consistently adhere to temporal cutoffs across common prompting strategies and tasks, demonstrating persistent challenges in ex-ante reasoning. This benchmark provides a potential evaluation framework to advance the development of LLMs’ temporal reasoning ability for time-sensitive applications.

[LG-64] Fox in the Henhouse: Supply-Chain Backdoor Attacks Against Reinforcement Learning

链接: https://arxiv.org/abs/2505.19532
作者: Shijie Liu,Andrew C. Cullen,Paul Montague,Sarah Erfani,Benjamin I. P. Rubinstein
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The current state-of-the-art backdoor attacks against Reinforcement Learning (RL) rely upon unrealistically permissive access models, that assume the attacker can read (or even write) the victim’s policy parameters, observations, or rewards. In this work, we question whether such a strong assumption is required to launch backdoor attacks against RL. To answer this question, we propose the \underlineSupply-\underlineCh\underlineain \underlineBackdoor (SCAB) attack, which targets a common RL workflow: training agents using external agents that are provided separately or embedded within the environment. In contrast to prior works, our attack only relies on legitimate interactions of the RL agent with the supplied agents. Despite this limited access model, by poisoning a mere 3% of training experiences, our attack can successfully activate over 90% of triggered actions, reducing the average episodic return by 80% for the victim. Our novel attack demonstrates that RL attacks are likely to become a reality under untrusted RL training supply-chains.

[LG-65] Learning Dynamics under Environmental Constraints via Measurement-Induced Bundle Structures ICML2025

链接: https://arxiv.org/abs/2505.19521
作者: Dongzhe Zheng,Wenjie Mei
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted by ICML 2025

点击查看摘要

Abstract:Learning unknown dynamics under environmental (or external) constraints is fundamental to many fields (e.g., modern robotics), particularly challenging when constraint information is only locally available and uncertain. Existing approaches requiring global constraints or using probabilistic filtering fail to fully exploit the geometric structure inherent in local measurements (by using, e.g., sensors) and constraints. This paper presents a geometric framework unifying measurements, constraints, and dynamics learning through a fiber bundle structure over the state space. This naturally induced geometric structure enables measurement-aware Control Barrier Functions that adapt to local sensing (or measurement) conditions. By integrating Neural ODEs, our framework learns continuous-time dynamics while preserving geometric constraints, with theoretical guarantees of learning convergence and constraint satisfaction dependent on sensing quality. The geometric framework not only enables efficient dynamics learning but also suggests promising directions for integration with reinforcement learning approaches. Extensive simulations demonstrate significant improvements in both learning efficiency and constraint satisfaction over traditional methods, especially under limited and uncertain sensing conditions.

[LG-66] Learning for Dynamic Combinatorial Optimization without Training Data

链接: https://arxiv.org/abs/2505.19497
作者: Yiqiao Liao,Farinaz Koushanfar,Parinaz Naghizadeh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce DyCO-GNN, a novel unsupervised learning framework for Dynamic Combinatorial Optimization that requires no training data beyond the problem instance itself. DyCO-GNN leverages structural similarities across time-evolving graph snapshots to accelerate optimization while maintaining solution quality. We evaluate DyCO-GNN on dynamic maximum cut, maximum independent set, and the traveling salesman problem across diverse datasets of varying sizes, demonstrating its superior performance under tight and moderate time budgets. DyCO-GNN consistently outperforms the baseline methods, achieving high-quality solutions up to 3-60x faster, highlighting its practical effectiveness in rapidly evolving resource-constrained settings.

[LG-67] Discounted Online Convex Optimization: Uniform Regret Across a Continuous Interval

链接: https://arxiv.org/abs/2505.19491
作者: Wenhao Yang,Sifan Yang,Lijun Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reflecting the greater significance of recent history over the distant past in non-stationary environments, \lambda -discounted regret has been introduced in online convex optimization (OCO) to gracefully forget past data as new information arrives. When the discount factor \lambda is given, online gradient descent with an appropriate step size achieves an O(1/\sqrt1-\lambda) discounted regret. However, the value of \lambda is often not predetermined in real-world scenarios. This gives rise to a significant open question: is it possible to develop a discounted algorithm that adapts to an unknown discount factor. In this paper, we affirmatively answer this question by providing a novel analysis to demonstrate that smoothed OGD (SOGD) achieves a uniform O(\sqrt\log T/1-\lambda) discounted regret, holding for all values of \lambda across a continuous interval simultaneously. The basic idea is to maintain multiple OGD instances to handle different discount factors, and aggregate their outputs sequentially by an online prediction algorithm named as Discounted-Normal-Predictor (DNP) (Kapralov and Panigrahy,2010). Our analysis reveals that DNP can combine the decisions of two experts, even when they operate on discounted regret with different discount factors.

[LG-68] VLMLight: Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning

链接: https://arxiv.org/abs/2505.19486
作者: Maonan Wang,Yirong Chen,Aoyu Pang,Yuxin Cai,Chung Shue Chen,Yuheng Kan,Man-On Pun
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 25 pages, 15 figures

点击查看摘要

Abstract:Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-branch reasoning. At the core of VLMLight is the first image-based traffic simulator that enables multi-view visual perception at intersections, allowing policies to reason over rich cues such as vehicle type, motion, and spatial density. A large language model (LLM) serves as a safety-prioritized meta-controller, selecting between a fast RL policy for routine traffic and a structured reasoning branch for critical cases. In the latter, multiple LLM agents collaborate to assess traffic phases, prioritize emergency vehicles, and verify rule compliance. Experiments show that VLMLight reduces waiting times for emergency vehicles by up to 65% over RL-only systems, while preserving real-time performance in standard conditions with less than 1% degradation. VLMLight offers a scalable, interpretable, and safety-aware solution for next-generation traffic signal control.

[LG-69] Recurrent Self-Attention Dynamics: An Energy-Agnostic Perspective from Jacobians

链接: https://arxiv.org/abs/2505.19458
作者: Akiyoshi Tomihari,Ryo Karakida
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The theoretical understanding of self-attention (SA) has been steadily progressing. A prominent line of work studies a class of SA layers that admit an energy function decreased by state updates. While it provides valuable insights into inherent biases in signal propagation, it often relies on idealized assumptions or additional constraints not necessarily present in standard SA. Thus, to broaden our understanding, this work aims to relax these energy constraints and provide an energy-agnostic characterization of inference dynamics by dynamical systems analysis. In more detail, we first consider relaxing the symmetry and single-head constraints traditionally required in energy-based formulations. Next, to investigate more general SA architectures capable of oscillatory dynamics without necessarily admitting an energy function, we analyze the Jacobian matrix of the state. We reveal that normalization layers effectively normalize the Jacobian’s complex eigenvalues, forcing the dynamics close to a critical state. This significantly enhances inference performance. Furthermore, we utilize the Jacobian perspective to develop regularization methods for training and a pseudo-energy for monitoring inference dynamics.

[LG-70] MetaGMT: Improving Actionable Interpretability of Graph Multilinear Networks via Meta-Learning Filtration

链接: https://arxiv.org/abs/2505.19445
作者: Rishabh Bhattacharya,Hari Shankar,Vaishnavi Shivkumar,Ponnurangam Kumaraguru
类目: Machine Learning (cs.LG)
*备注: 8 Pages Main Content, 10 Pages including Appendix. 1 Figure, 7 Tables

点击查看摘要

Abstract:The growing adoption of Graph Neural Networks (GNNs) in high-stakes domains like healthcare and finance demands reliable explanations of their decision-making processes. While inherently interpretable GNN architectures like Graph Multi-linear Networks (GMT) have emerged, they remain vulnerable to generating explanations based on spurious correlations, potentially undermining trust in critical applications. We present MetaGMT, a meta-learning framework that enhances explanation fidelity through a novel bi-level optimization approach. We demonstrate that MetaGMT significantly improves both explanation quality (AUC-ROC, Precision@K) and robustness to spurious patterns, across BA-2Motifs, MUTAG, and SP-Motif benchmarks. Our approach maintains competitive classification accuracy while producing more faithful explanations (with an increase up to 8% of Explanation ROC on SP-Motif 0.5) compared to baseline methods. These advancements in interpretability could enable safer deployment of GNNs in sensitive domains by (1) facilitating model debugging through more reliable explanations, (2) supporting targeted retraining when biases are identified, and (3) enabling meaningful human oversight. By addressing the critical challenge of explanation reliability, our work contributes to building more trustworthy and actionable GNN systems for real-world applications.

[LG-71] Can Compressed LLM s Truly Act? An Empirical Evaluation of Agent ic Capabilities in LLM Compression ICML2025

链接: https://arxiv.org/abs/2505.19433
作者: Peijie Dong,Zhenheng Tang,Xiang Liu,Lujun Li,Xiaowen Chu,Bo Li
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML2025 as Poster

点击查看摘要

Abstract:Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks only focus on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities - workflow, tool use/function call, long-context understanding and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs’ agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) quantization (GPTQ, AWQ) and pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5 7B-32B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%-3% drop) but degrades real-world application accuracy by 10%-15%. We introduce ERank, Top-k Ranking Correlation and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios. The code can be found in this https URL.

[LG-72] Advanced long-term earth system forecasting by learning the small-scale nature

链接: https://arxiv.org/abs/2505.19432
作者: Hao Wu,Yuan Gao,Ruiqi Shu,Kun Wang,Ruijian Gou,Chuhan Wu,Xinliang Liu,Juncai He,Shuhao Cao,Junfeng Fang,Xingjian Shi,Feng Tao,Qi Song,Shengxuan Ji,Yanfei Xiang,Yuze Sun,Jiahao Li,Fan Xu,Huanshuo Dong,Haixin Wang,Fan Zhang,Penghao Zhao,Xian Wu,Qingsong Wen,Deliang Chen,Xiaomeng Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable long-term forecast of Earth system dynamics is heavily hampered by instabilities in current AI models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. We present Triton, an AI framework designed to address this fundamental challenge. Inspired by increasing grids to explicitly resolve small scales in numerical models, Triton employs a hierarchical architecture processing information across multiple resolutions to mitigate spectral bias and explicitly model cross-scale dynamics. We demonstrate Triton’s superior performance on challenging forecast tasks, achieving stable year-long global temperature forecasts, skillful Kuroshio eddy predictions till 120 days, and high-fidelity turbulence simulations preserving fine-scale structures all without external forcing, with significantly surpassing baseline AI models in long-term stability and accuracy. By effectively suppressing high-frequency error accumulation, Triton offers a promising pathway towards trustworthy AI-driven simulation for climate and earth system science.

[LG-73] Importance Weighted Score Matching for Diffusion Samplers with Enhanced Mode Coverag e

链接: https://arxiv.org/abs/2505.19431
作者: Chenguang Wang,Xiaoyu Zhang,Kaiyuan Cui,Weichen Zhao,Yongtao Guan,Tianshu Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training neural samplers directly from unnormalized densities without access to target distribution samples presents a significant challenge. A critical desideratum in these settings is achieving comprehensive mode coverage, ensuring the sampler captures the full diversity of the target distribution. However, prevailing methods often circumvent the lack of target data by optimizing reverse KL-based objectives. Such objectives inherently exhibit mode-seeking behavior, potentially leading to incomplete representation of the underlying distribution. While alternative approaches strive for better mode coverage, they typically rely on implicit mechanisms like heuristics or iterative refinement. In this work, we propose a principled approach for training diffusion-based samplers by directly targeting an objective analogous to the forward KL divergence, which is conceptually known to encourage mode coverage. We introduce \textitImportance Weighted Score Matching, a method that optimizes this desired mode-covering objective by re-weighting the score matching loss using tractable importance sampling estimates, thereby overcoming the absence of target distribution data. We also provide theoretical analysis of the bias and variance for our proposed Monte Carlo estimator and the practical loss function used in our method. Experiments on increasingly complex multi-modal distributions, including 2D Gaussian Mixture Models with up to 120 modes and challenging particle systems with inherent symmetries – demonstrate that our approach consistently outperforms existing neural samplers across all distributional distance metrics, achieving state-of-the-art results on all benchmarks.

[LG-74] Future Link Prediction Without Memory or Aggregation

链接: https://arxiv.org/abs/2505.19408
作者: Lu Yi,Runlin Lei,Fengran Mo,Yanping Zheng,Zhewei Wei,Yuhang Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Future link prediction on temporal graphs is a fundamental task with wide applicability in real-world dynamic systems. These scenarios often involve both recurring (seen) and novel (unseen) interactions, requiring models to generalize effectively across both types of edges. However, existing methods typically rely on complex memory and aggregation modules, yet struggle to handle unseen edges. In this paper, we revisit the architecture of existing temporal graph models and identify two essential but overlooked modeling requirements for future link prediction: representing nodes with unique identifiers and performing target-aware matching between source and destination nodes. To this end, we propose Cross-Attention based Future Link Predictor on Temporal Graphs (CRAFT), a simple yet effective architecture that discards memory and aggregation modules and instead builds on two components: learnable node embeddings and cross-attention between the destination and the source’s recent interactions. This design provides strong expressive power and enables target-aware modeling of the compatibility between candidate destinations and the source’s interaction patterns. Extensive experiments on diverse datasets demonstrate that CRAFT consistently achieves superior performance with high efficiency, making it well-suited for large-scale real-world applications.

[LG-75] Are Time-Series Foundation Models Deployment-Ready? A Systematic Study of Adversarial Robustness Across Domains

链接: https://arxiv.org/abs/2505.19397
作者: Jiawen Zhang,Zhenwei Zhang,Shun Zheng,Xumeng Wen,Jia Li,Jiang Bian
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Time Series Foundation Models (TSFMs), which are pretrained on large-scale, cross-domain data and capable of zero-shot forecasting in new scenarios without further training, are increasingly adopted in real-world applications. However, as the zero-shot forecasting paradigm gets popular, a critical yet overlooked question emerges: Are TSFMs robust to adversarial input perturbations? Such perturbations could be exploited in man-in-the-middle attacks or data poisoning. To address this gap, we conduct a systematic investigation into the adversarial robustness of TSFMs. Our results show that even minimal perturbations can induce significant and controllable changes in forecast behaviors, including trend reversal, temporal drift, and amplitude shift, posing serious risks to TSFM-based services. Through experiments on representative TSFMs and multiple datasets, we reveal their consistent vulnerabilities and identify potential architectural designs, such as structural sparsity and multi-task pretraining, that may improve robustness. Our findings offer actionable guidance for designing more resilient forecasting systems and provide a critical assessment of the adversarial robustness of TSFMs.

[LG-76] Alignment of large language models with constrained learning

链接: https://arxiv.org/abs/2505.19387
作者: Botong Zhang,Shuo Li,Ignacio Hounie,Osbert Bastani,Dongsheng Ding,Alejandro Ribeiro
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 48 pages, 7 figures, 7 tables

点击查看摘要

Abstract:We study the problem of computing an optimal large language model (LLM) policy for a constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF dataset.

[LG-77] Likert or Not: LLM Absolute Relevance Judgments on Fine-Grained Ordinal Scales

链接: https://arxiv.org/abs/2505.19334
作者: Charles Godfrey,Ping Nie,Natalia Ostapuk,David Ken,Shang Gao,Souheil Inati
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) obtain state of the art zero shot relevance ranking performance on a variety of information retrieval tasks. The two most common prompts to elicit LLM relevance judgments are pointwise scoring (a.k.a. relevance generation), where the LLM sees a single query-document pair and outputs a single relevance score, and listwise ranking (a.k.a. permutation generation), where the LLM sees a query and a list of documents and outputs a permutation, sorting the documents in decreasing order of relevance. The current research community consensus is that listwise ranking yields superior performance, and significant research effort has been devoted to crafting LLM listwise ranking algorithms. The underlying hypothesis is that LLMs are better at making relative relevance judgments than absolute ones. In tension with this hypothesis, we find that the gap between pointwise scoring and listwise ranking shrinks when pointwise scoring is implemented using a sufficiently large ordinal relevance label space, becoming statistically insignificant for many LLM-benchmark dataset combinations (where significant'' means 95% confidence that listwise ranking improves NDCG@10’'). Our evaluations span four LLMs, eight benchmark datasets from the BEIR and TREC-DL suites, and two proprietary datasets with relevance labels collected after the training cut-off of all LLMs evaluated.

[LG-78] Paying Alignment Tax with Contrastive Learning

链接: https://arxiv.org/abs/2505.19327
作者: Buse Sibel Korkmaz,Rahul Nair,Elizabeth M. Daly,Antonio del Rio Chanona
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current debiasing approaches often result a degradation in model capabilities such as factual accuracy and knowledge retention. Through systematic evaluation across multiple benchmarks, we demonstrate that existing debiasing methods face fundamental trade-offs, particularly in smaller models, leading to reduced truthfulness, knowledge loss, or unintelligible outputs. To address these limitations, we propose a contrastive learning framework that learns through carefully constructed positive and negative examples. Our approach introduces contrast computation and dynamic loss scaling to balance bias mitigation with faithfulness preservation. Experimental results across multiple model scales demonstrate that our method achieves substantial improvements in both toxicity reduction and faithfulness preservation. Most importantly, we show that our framework is the first to consistently improve both metrics simultaneously, avoiding the capability degradation characteristic of existing approaches. These results suggest that explicit modeling of both positive and negative examples through contrastive learning could be a promising direction for reducing the alignment tax in language model debiasing.

[LG-79] Concept Reachability in Diffusion Models: Beyond Dataset Constraints

链接: https://arxiv.org/abs/2505.19313
作者: Marta Aparicio Rodriguez,Xenia Miscouridou,Anastasia Borovykh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite significant advances in quality and complexity of the generations in text-to-image models, prompting does not always lead to the desired outputs. Controlling model behaviour by directly steering intermediate model activations has emerged as a viable alternative allowing to reach concepts in latent space that may otherwise remain inaccessible by prompt. In this work, we introduce a set of experiments to deepen our understanding of concept reachability. We design a training data setup with three key obstacles: scarcity of concepts, underspecification of concepts in the captions, and data biases with tied concepts. Our results show: (i) concept reachability in latent space exhibits a distinct phase transition, with only a small number of samples being sufficient to enable reachability, (ii) where in the latent space the intervention is performed critically impacts reachability, showing that certain concepts are reachable only at certain stages of transformation, and (iii) while prompting ability rapidly diminishes with a decrease in quality of the dataset, concepts often remain reliably reachable through steering. Model providers can leverage this to bypass costly retraining and dataset curation and instead innovate with user-facing control mechanisms.

[LG-80] Hypercube-RAG : Hypercube-Based Retrieval-Augmented Generation for In-domain Scientific Question-Answering

链接: https://arxiv.org/abs/2505.19288
作者: Jimeng Shi,Sizhe Zhou,Bowen Jin,Wei Hu,Shaowen Wang,Giri Narasimhan,Jiawei Han
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often need to incorporate external knowledge to solve theme-specific problems. Retrieval-augmented generation (RAG), which empowers LLMs to generate more qualified responses with retrieved external data and knowledge, has shown its high promise. However, traditional semantic similarity-based RAGs struggle to return concise yet highly relevant information for domain knowledge-intensive tasks, such as scientific question-answering (QA). Built on a multi-dimensional (cube) structure called Hypercube, which can index documents in an application-driven, human-defined, multi-dimensional space, we introduce the Hypercube-RAG, a novel RAG framework for precise and efficient retrieval. Given a query, Hypercube-RAG first decomposes it based on its entities and topics and then retrieves relevant documents from cubes by aligning these decomposed components with hypercube dimensions. Experiments on three in-domain scientific QA datasets demonstrate that our method improves accuracy by 3.7% and boosts retrieval efficiency by 81.2%, measured as relative gains over the strongest RAG baseline. More importantly, our Hypercube-RAG inherently offers explainability by revealing the underlying predefined hypercube dimensions used for retrieval. The code and data sets are available at this https URL.

[LG-81] A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

链接: https://arxiv.org/abs/2505.19281
作者: Yuzheng Hu,Fan Wu,Haotian Ye,David Forsyth,James Zou,Nan Jiang,Jiaqi W. Ma,Han Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online reinforcement learning (RL) excels in complex, safety-critical domains, yet it faces challenges such as sample inefficiency, training instability, and a lack of interpretability. Data attribution offers a principled way to trace model behavior back to individual training samples. However, in online RL, each training sample not only drives policy updates but also influences future data collection, violating the fixed dataset assumption in existing attribution methods. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record’s contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. Across standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Overall, these results advance interpretability, efficiency, and effectiveness of online RL.

[LG-82] owards a Spatiotemporal Fusion Approach to Precipitation Nowcasting

链接: https://arxiv.org/abs/2505.19258
作者: Felipe Curcio,Pedro Castro,Augusto Fonseca,Rafaela Castro,Raquel Franco,Eduardo Ogasawara,Victor Stepanenko,Fabio Porto,Mariza Ferro,Eduardo Bezerra
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted manuscript submitted to FUSION 2025 ( this https URL )

点击查看摘要

Abstract:With the increasing availability of meteorological data from various sensors, numerical models and reanalysis products, the need for efficient data integration methods has become paramount for improving weather forecasts and hydrometeorological studies. In this work, we propose a data fusion approach for precipitation nowcasting by integrating data from meteorological and rain gauge stations in Rio de Janeiro metropolitan area with ERA5 reanalysis data and GFS numerical weather prediction. We employ the spatiotemporal deep learning architecture called STConvS2S, leveraging a structured dataset covering a 9 x 11 grid. The study spans from January 2011 to October 2024, and we evaluate the impact of integrating three surface station systems. Among the tested configurations, the fusion-based model achieves an F1-score of 0.2033 for forecasting heavy precipitation events (greater than 25 mm/h) at a one-hour lead time. Additionally, we present an ablation study to assess the contribution of each station network and propose a refined inference strategy for precipitation nowcasting, integrating the GFS numerical weather prediction (NWP) data with in-situ observations.

[LG-83] Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipfs Law

链接: https://arxiv.org/abs/2505.19227
作者: Frederik Kunstner,Francis Bach
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the k th most frequent word \pi_k is proportional to 1/k , following Zipf’s law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law \pi_k \propto 1/k^\alpha parameterized by the exponent \alpha 0 . We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent \alpha . Existing theoretical investigations in scaling laws assume that the eigenvalues of the data decay as a power law with exponent \alpha 1 . This assumption effectively makes the problem finite dimensional'' as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case \alpha = 1 as found in text data is worst-case’’ for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square-root of the dimension, leading to a large improvement for large vocabularies.

[LG-84] LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

链接: https://arxiv.org/abs/2505.19223
作者: Fengqi Zhu,Rongzhen Wang,Shen Nie,Xiaolu Zhang,Chunwei Wu,Jun Hu,Jun Zhou,Jianfei Chen,Yankai Lin,Ji-Rong Wen,Chongxuan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While Masked Diffusion Models (MDMs), such as LLaDA, present a promising paradigm for language modeling, there has been relatively little effort in aligning these models with human preferences via reinforcement learning. The challenge primarily arises from the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose Variance-Reduced Preference Optimization (VRPO), a framework that formally analyzes the variance of ELBO estimators and derives bounds on both the bias and variance of preference optimization gradients. Building on this theoretical foundation, we introduce unbiased variance reduction strategies, including optimal Monte Carlo budget allocation and antithetic sampling, that significantly improve the performance of MDM alignment. We demonstrate the effectiveness of VRPO by applying it to LLaDA, and the resulting model, LLaDA 1.5, outperforms its SFT-only predecessor consistently and significantly across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment benchmarks (IFEval +4.0, Arena-Hard +4.3). Furthermore, LLaDA 1.5 demonstrates a highly competitive mathematical performance compared to strong language MDMs and ARMs. Project page: this https URL.

[LG-85] Interpretable Graph Learning Over Sets of Temporally-Sparse Data

链接: https://arxiv.org/abs/2505.19193
作者: Andrea Zerio,Maya Bechler-Speicher,Maor Huri,Marie Vibeke Vestergaard,Ran Gilad-Bachrach,Tine Jess,Samir Bhatt,Aleksejs Sazonovs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world medical data often includes measurements from multiple signals that are collected at irregular and asynchronous time intervals. For example, different types of blood tests can be measured at different times and frequencies, resulting in fragmented and unevenly scattered temporal data. Similar issues of irregular sampling of different attributes occur in other domains, such as monitoring of large systems using event log files or the spread of fake news on social networks. Effectively learning from such data requires models that can handle sets of temporally sparse and heterogeneous signals. In this paper, we propose Graph Mixing Additive Networks (GMAN), a novel and interpretable-by-design model for learning over irregular sets of temporal signals. Our method achieves state-of-the-art performance in real-world medical tasks, including a 4-point increase in the AUROC score of in-hospital mortality prediction, compared to existing methods. We further showcase GMAN’s flexibility by applying it to a fake news detection task. We demonstrate how its interpretability capabilities, including node-level, graph-level, and subset-level importance, allow for transition phases detection and gaining medical insights with real-world high-stakes implications. Finally, we provide theoretical insights on GMAN expressive power.

[LG-86] Chordless Structure: A Pathway to Simple and Expressive GNNs

链接: https://arxiv.org/abs/2505.19188
作者: Hongxu Pan,Shuxian Hu,Mo Zhou,Zhibin Wang,Rong Gu,Chen Tian,Kun Yang,Sheng Zhong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Researchers have proposed various methods of incorporating more structured information into the design of Graph Neural Networks (GNNs) to enhance their expressiveness. However, these methods are either computationally expensive or lacking in provable expressiveness. In this paper, we observe that the chords increase the complexity of the graph structure while contributing little useful information in many cases. In contrast, chordless structures are more efficient and effective for representing the graph. Therefore, when leveraging the information of cycles, we choose to omit the chords. Accordingly, we propose a Chordless Structure-based Graph Neural Network (CSGNN) and prove that its expressiveness is strictly more powerful than the k-hop GNN (KPGNN) with polynomial complexity. Experimental results on real-world datasets demonstrate that CSGNN outperforms existing GNNs across various graph tasks while incurring lower computational costs and achieving better performance than the GNNs of 3-WL expressiveness.

[LG-87] Federated Learning: From Theory to Practice

链接: https://arxiv.org/abs/2505.19183
作者: A. Jung
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This book offers a hands-on introduction to building and understanding federated learning (FL) systems. FL enables multiple devices – such as smartphones, sensors, or local computers – to collaboratively train machine learning (ML) models, while keeping their data private and local. It is a powerful solution when data cannot or should not be centralized due to privacy, regulatory, or technical reasons. The book is designed for students, engineers, and researchers who want to learn how to design scalable, privacy preserving FL systems. Our main focus is on personalization: enabling each device to train its own model while still benefiting from collaboration with relevant devices. This is achieved by leveraging similarities between (the learning tasks associated with) devices that are encoded by the weighted edges (or links) of a federated learning network (FL network). The key idea is to represent real-world FL systems as networks of devices, where nodes correspond to device and edges represent communication links and data similarities between them. The training of personalized models for these devices can be naturally framed as a distributed optimization problem. This optimization problem is referred to as generalized total variation minimization (GTVMin) and ensures that devices with similar learning tasks learn similar model parameters. Our approach is both mathematically principled and practically motivated. While we introduce some advanced ideas from optimization theory and graph-based learning, we aim to keep the book accessible. Readers are guided through the core ideas step by step, with intuitive explanations.

[LG-88] Computational Inertia as a Conserved Quantity in Frictionless and Damped Learning Dynamics

链接: https://arxiv.org/abs/2505.19171
作者: Atahan Karagoz
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We identify a conserved quantity in continuous-time optimization dynamics, termed computational inertia. Defined as the sum of kinetic energy (parameter velocity) and potential energy (loss), this scalar remains invariant under idealized, frictionless training. We formalize this conservation law, derive its analytic decay under damping and stochastic perturbations, and demonstrate its behavior in a synthetic system. The invariant offers a compact lens for interpreting learning trajectories, and may inform theoretical tools for analyzing convergence, stability, and training geometry.

[LG-89] ADGSyn: Dual-Stream Learning for Efficient Anticancer Drug Synergy Prediction

链接: https://arxiv.org/abs/2505.19144
作者: Yuxuan Nie,Yutong Song,Hong Peng
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Drug combinations play a critical role in cancer therapy by significantly enhancing treatment efficacy and overcoming drug resistance. However, the combinatorial space of possible drug pairs grows exponentially, making experimental screening highly impractical. Therefore, developing efficient computational methods to predict promising drug combinations and guide experimental validation is of paramount importance. In this work, we propose ADGSyn, an innovative method for predicting drug synergy. The key components of our approach include: (1) shared projection matrices combined with attention mechanisms to enable cross-drug feature alignment; (2) automatic mixed precision (AMP)-optimized graph operations that reduce memory consumption by 40% while accelerating training speed threefold; and (3) residual pathways stabilized by LayerNorm to ensure stable gradient propagation during training. Evaluated on the O’Neil dataset containing 13,243 drug–cell line combinations, ADGSyn demonstrates superior performance over eight baseline methods. Moreover, the framework supports full-batch processing of up to 256 molecular graphs on a single GPU, setting a new standard for efficiency in drug synergy prediction within the field of computational oncology.

[LG-90] Incentivizing High-Quality Human Annotations with Golden Questions

链接: https://arxiv.org/abs/2505.19134
作者: Shang Liu,Zhongze Cai,Hanzhao Wang,Zhongyao Ma,Xiaocheng Li
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:2502.06387

点击查看摘要

Abstract:Human-annotated data plays a vital role in training large language models (LLMs), such as supervised fine-tuning and human preference alignment. However, it is not guaranteed that paid human annotators produce high-quality data. In this paper, we study how to incentivize human annotators to do so. We start from a principal-agent model to model the dynamics between the company (the principal) and the annotator (the agent), where the principal can only monitor the annotation quality by examining n samples. We investigate the maximum likelihood estimators (MLE) and the corresponding hypothesis testing to incentivize annotators: the agent is given a bonus if the MLE passes the test. By analyzing the variance of the outcome, we show that the strategic behavior of the agent makes the hypothesis testing very different from traditional ones: Unlike the exponential rate proved by the large deviation theory, the principal-agent model’s hypothesis testing rate is of \Theta(1/\sqrtn \log n) . Our theory implies two criteria for the \emphgolden questions to monitor the performance of the annotators: they should be of (1) high certainty and (2) similar format to normal ones. In that light, we select a set of golden questions in human preference data. By doing incentive-compatible experiments, we find out that the annotators’ behavior is better revealed by those golden questions, compared to traditional survey techniques such as instructed manipulation checks.

[LG-91] Fast and Accurate Power Load Data Completion via Regularization-optimized Low-Rank Factorization

链接: https://arxiv.org/abs/2505.19133
作者: Yan Xia,Hao Feng,Hongwei Sun,Junjie Wang,Qicong Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank representation learning has emerged as a powerful tool for recovering missing values in power load data due to its ability to exploit the inherent low-dimensional structures of spatiotemporal measurements. Among various techniques, low-rank factorization models are favoured for their efficiency and interpretability. However, their performance is highly sensitive to the choice of regularization parameters, which are typically fixed or manually tuned, resulting in limited generalization capability or slow convergence in practical scenarios. In this paper, we propose a Regularization-optimized Low-Rank Factorization, which introduces a Proportional-Integral-Derivative controller to adaptively adjust the regularization coefficient. Furthermore, we provide a detailed algorithmic complexity analysis, showing that our method preserves the computational efficiency of stochastic gradient descent while improving adaptivity. Experimental results on real-world power load datasets validate the superiority of our method in both imputation accuracy and training efficiency compared to existing baselines.

[LG-92] Optimization-Inspired Few-Shot Adaptation for Large Language Models

链接: https://arxiv.org/abs/2505.19107
作者: Boyan Gao,Xin Wang,Yibo Yang,David Clifton
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. However, adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as in-context learning and Parameter-Efficient Fine-Tuning (PEFT), face key limitations: in-context learning introduces additional inference computational overhead with limited performance gains, while PEFT models are prone to overfitting on the few demonstration examples. In this work, we reinterpret the forward pass of LLMs as an optimization process, a sequence of preconditioned gradient descent steps refining internal representations. Based on this connection, we propose Optimization-Inspired Few-Shot Adaptation (OFA), integrating a parameterization that learns preconditioners without introducing additional trainable parameters, and an objective that improves optimization efficiency by learning preconditioners based on a convergence bound, while simultaneously steering the optimization path toward the flat local minimum. Our method overcomes both issues of ICL-based and PEFT-based methods, and demonstrates superior performance over the existing methods on a variety of few-shot adaptation tasks in experiments.

[LG-93] Latent Mamba Operator for Partial Differential Equations

链接: https://arxiv.org/abs/2505.19105
作者: Karn Tiwari,Niladri Dutta,N M Anoop Krishnan,Prathosh A P
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s)

点击查看摘要

Abstract:Neural operators have emerged as powerful data-driven frameworks for solving Partial Differential Equations (PDEs), offering significant speedups over numerical methods. However, existing neural operators struggle with scalability in high-dimensional spaces, incur high computational costs, and face challenges in capturing continuous and long-range dependencies in PDE dynamics. To address these limitations, we introduce the Latent Mamba Operator (LaMO), which integrates the efficiency of state-space models (SSMs) in latent space with the expressive power of kernel integral formulations in neural operators. We also establish a theoretical connection between state-space models (SSMs) and the kernel integral of neural operators. Extensive experiments across diverse PDE benchmarks on regular grids, structured meshes, and point clouds covering solid and fluid physics datasets, LaMOs achieve consistent state-of-the-art (SOTA) performance, with a 32.3% improvement over existing baselines in solution operator approximation, highlighting its efficacy in modeling complex PDE solutions.

[LG-94] owards Robust Influence Functions with Flat Validation Minima ICML2025

链接: https://arxiv.org/abs/2505.19097
作者: Xichen Ye,Yifan Wu,Weizhong Zhang,Cheng Jin,Yifan Chen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICML 2025. arXiv admin note: text overlap with arXiv:2310.00902 by other authors

点击查看摘要

Abstract:The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach.

[LG-95] CMoS: Rethinking Time Series Prediction Through the Lens of Chunk-wise Spatial Correlations ICML’25

链接: https://arxiv.org/abs/2505.19090
作者: Haotian Si,Changhua Pei,Jianhui Li,Dan Pei,Gaogang Xie
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by Forty-second International Conference on Machine Learning (ICML’25)

点击查看摘要

Abstract:Recent advances in lightweight time series forecasting models suggest the inherent simplicity of time series forecasting tasks. In this paper, we present CMoS, a super-lightweight time series forecasting model. Instead of learning the embedding of the shapes, CMoS directly models the spatial correlations between different time series chunks. Additionally, we introduce a Correlation Mixing technique that enables the model to capture diverse spatial correlations with minimal parameters, and an optional Periodicity Injection technique to ensure faster convergence. Despite utilizing as low as 1% of the lightweight model DLinear’s parameters count, experimental results demonstrate that CMoS outperforms existing state-of-the-art models across multiple datasets. Furthermore, the learned weights of CMoS exhibit great interpretability, providing practitioners with valuable insights into temporal structures within specific application scenarios.

[LG-96] mperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

链接: https://arxiv.org/abs/2505.19087
作者: Itamar Harel,Yonathan Wolanowsky,Gal Vardi,Nathan Srebro,Daniel Soudry
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We analyze the generalization gap (gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution \theta_0 \sim p_0 . We focus on Langevin dynamics with a positive temperature \beta^-1 , i.e. gradient descent on a training loss L with infinitesimal step size, perturbed with \beta^-1 -variances Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by \sqrt(\beta\mathbbE L (\theta_0) + \log(1/\delta))/N with probability 1-\delta over the dataset, where N is the sample size, and \mathbbE L (\theta_0) =O(1) with standard initialization scaling. In contrast to previous guarantees, we have no dependence on either training time or reliance on mixing, nor a dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the marginal distribution divergence from initialization remains bounded, as implied by a generalized second law of thermodynamics.

[LG-97] Recalibrating binary probabilistic classifiers

链接: https://arxiv.org/abs/2505.19068
作者: Dirk Tasche
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注: 16 pages, 2 figures, 1 table

点击查看摘要

Abstract:Recalibration of binary probabilistic classifiers to a target prior probability is an important task in areas like credit risk management. We analyse methods for recalibration from a distribution shift perspective. Distribution shift assumptions linked to the area under the curve (AUC) of a probabilistic classifier are found to be useful for the design of meaningful recalibration methods. Two new methods called parametric covariate shift with posterior drift (CSPD) and ROC-based quasi moment matching (QMM) are proposed and tested together with some other methods in an example setting. The outcomes of the test suggest that the QMM methods discussed in the paper can provide appropriately conservative results in evaluations with concave functionals like for instance risk weights functions for credit risk.

[LG-98] Adversarial Bandit over Bandits: Hierarchical Bandits for Online Configuration Management

链接: https://arxiv.org/abs/2505.19061
作者: Chen Avin,Zvi Lotker,Shie Mannor,Gil Shabat,Hanan Shteingart,Roey Yadgar
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Motivated by dynamic parameter optimization in finite, but large action (configurations) spaces, this work studies the nonstochastic multi-armed bandit (MAB) problem in metric action spaces with oblivious Lipschitz adversaries. We propose ABoB, a hierarchical Adversarial Bandit over Bandits algorithm that can use state-of-the-art existing “flat” algorithms, but additionally clusters similar configurations to exploit local structures and adapt to changing environments. We prove that in the worst-case scenario, such clustering approach cannot hurt too much and ABoB guarantees a standard worst-case regret bound of O\left(k^\frac12T^\frac12\right) , where T is the number of rounds and k is the number of arms, matching the traditional flat approach. However, under favorable conditions related to the algorithm properties, clusters properties, and certain Lipschitz conditions, the regret bound can be improved to O\left(k^\frac14T^\frac12\right) . Simulations and experiments on a real storage system demonstrate that ABoB, using standard algorithms like EXP3 and Tsallis-INF, achieves lower regret and faster convergence than the flat method, up to 50% improvement in known previous setups, nonstochastic and stochastic, as well as in our settings.

[LG-99] Distributionally Robust Deep Q-Learning

链接: https://arxiv.org/abs/2505.19058
作者: Chung I Lu,Julian Sester,Aijia Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Portfolio Management (q-fin.PM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a novel distributionally robust Q -learning algorithm for the non-tabular case accounting for continuous state spaces where the state transition of the underlying Markov decision process is subject to model uncertainty. The uncertainty is taken into account by considering the worst-case transition from a ball around a reference probability measure. To determine the optimal policy under the worst-case state transition, we solve the associated non-linear Bellman equation by dualising and regularising the Bellman operator with the Sinkhorn distance, which is then parameterized with deep neural networks. This approach allows us to modify the Deep Q-Network algorithm to optimise for the worst case state transition. We illustrate the tractability and effectiveness of our approach through several applications, including a portfolio optimisation task based on S\P~500 data. Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Portfolio Management (q-fin.PM); Machine Learning (stat.ML) Cite as: arXiv:2505.19058 [cs.LG] (or arXiv:2505.19058v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.19058 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-100] Reduce Computational Cost In Deep Reinforcement Learning Via Randomized Policy Learning

链接: https://arxiv.org/abs/2505.19054
作者: Zhuochen Liu,Rahul Jain,Quan Nguyen
类目: Machine Learning (cs.LG)
*备注: 8 pages main, 12 pages total, 6 figures

点击查看摘要

Abstract:Recent advancements in reinforcement learning (RL) have leveraged neural networks to achieve state-of-the-art performance across various control tasks. However, these successes often come at the cost of significant computational resources, as training deep neural networks requires substantial time and data. In this paper, we introduce an actor-critic algorithm that utilizes randomized neural networks to drastically reduce computational costs while maintaining strong performance. Despite its simple architecture, our method effectively solves a range of control problems, including the locomotion control of a highly dynamic 12-motor quadruped robot, and achieves results comparable to leading algorithms such as Proximal Policy Optimization (PPO). Notably, our approach does not outperform other algorithms in terms of sample efficnency but rather in terms of wall-clock training time. That is, although our algorithm requires more timesteps to converge to an optimal policy, the actual time required for training turns out to be lower.

[LG-101] Structured Reinforcement Learning for Combinatorial Decision-Making

链接: https://arxiv.org/abs/2505.19053
作者: Heiko Hoppe,Léo Baty,Louis Bouvier,Axel Parmentier,Maximilian Schiffer
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 29 pages, 6 figures

点击查看摘要

Abstract:Reinforcement learning (RL) is increasingly applied to real-world problems involving complex and structured decisions, such as routing, scheduling, and assortment planning. These settings challenge standard RL algorithms, which struggle to scale, generalize, and exploit structure in the presence of combinatorial action spaces. We propose Structured Reinforcement Learning (SRL), a novel actor-critic framework that embeds combinatorial optimization layers into the actor neural network. We enable end-to-end learning of the actor via Fenchel-Young losses and provide a geometric interpretation of SRL as a primal-dual algorithm in the dual of the moment polytope. Across six environments with exogenous and endogenous uncertainty, SRL matches or surpasses the performance of unstructured RL and imitation learning on static tasks and improves over these baselines by up to 92% on dynamic problems, with improved stability and convergence speed.

[LG-102] Offline Clustering of Linear Bandits: Unlocking the Power of Clusters in Data-Limited Environments

链接: https://arxiv.org/abs/2505.19043
作者: Jingyuan Liu,Zeyu Zhang,Xuchuang Wang,Xutong Liu,John C.S. Lui,Mohammad Hajiesmaili,Carlee Joe-Wong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Contextual linear multi-armed bandits are a learning framework for making a sequence of decisions, e.g., advertising recommendations for a sequence of arriving users. Recent works have shown that clustering these users based on the similarity of their learned preferences can significantly accelerate the learning. However, prior work has primarily focused on the online setting, which requires continually collecting user data, ignoring the offline data widely available in many applications. To tackle these limitations, we study the offline clustering of bandits (Off-ClusBand) problem, which studies how to use the offline dataset to learn cluster properties and improve decision-making across multiple users. The key challenge in Off-ClusBand arises from data insufficiency for users: unlike the online case, in the offline case, we have a fixed, limited dataset to work from and thus must determine whether we have enough data to confidently cluster users together. To address this challenge, we propose two algorithms: Off-C ^2 LUB, which we analytically show performs well for arbitrary amounts of user data, and Off-CLUB, which is prone to bias when data is limited but, given sufficient data, matches a theoretical lower bound that we derive for the offline clustered MAB problem. We experimentally validate these results on both real and synthetic datasets.

[LG-103] Learn Beneficial Noise as Graph Augmentation

链接: https://arxiv.org/abs/2505.19024
作者: Siqi Huang,Yanchen Xu,Hongyuan Zhang,Xuelong Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Although graph contrastive learning (GCL) has been widely investigated, it is still a challenge to generate effective and stable graph augmentations. Existing methods often apply heuristic augmentation like random edge dropping, which may disrupt important graph structures and result in unstable GCL performance. In this paper, we propose Positive-incentive Noise driven Graph Data Augmentation (PiNGDA), where positive-incentive noise (pi-noise) scientifically analyzes the beneficial effect of noise under the information theory. To bridge the standard GCL and pi-noise framework, we design a Gaussian auxiliary variable to convert the loss function to information entropy. We prove that the standard GCL with pre-defined augmentations is equivalent to estimate the beneficial noise via the point estimation. Following our analysis, PiNGDA is derived from learning the beneficial noise on both topology and attributes through a trainable noise generator for graph augmentations, instead of the simple estimation. Since the generator learns how to produce beneficial perturbations on graph topology and node attributes, PiNGDA is more reliable compared with the existing methods. Extensive experimental results validate the effectiveness and stability of PiNGDA.

[LG-104] Querying Kernel Methods Suffices for Reconstructing their Training Data

链接: https://arxiv.org/abs/2505.19019
作者: Daniel Barzilai,Yuval Margalit,Eitan Gronich,Gilad Yehudai,Meirav Galun,Ronen Basri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Over-parameterized models have raised concerns about their potential to memorize training data, even when achieving strong generalization. The privacy implications of such memorization are generally unclear, particularly in scenarios where only model outputs are accessible. We study this question in the context of kernel methods, and demonstrate both empirically and theoretically that querying kernel models at various points suffices to reconstruct their training data, even without access to model parameters. Our results hold for a range of kernel methods, including kernel regression, support vector machines, and kernel density estimation. Our hope is that this work can illuminate potential privacy concerns for such models.

[LG-105] okenizing Electron Cloud in Protein-Ligand Interaction Learning

链接: https://arxiv.org/abs/2505.19014
作者: Haitao Lin,Odin Zhang,Jia Xu,Yunfan Liu,Zheng Cheng,Lirong Wu,Yufei Huang,Zhifeng Gao,Stan Z. Li
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)
*备注: conference paper

点击查看摘要

Abstract:The affinity and specificity of protein-molecule binding directly impact functional outcomes, uncovering the mechanisms underlying biological regulation and signal transduction. Most deep-learning-based prediction approaches focus on structures of atoms or fragments. However, quantum chemical properties, such as electronic structures, are the key to unveiling interaction patterns but remain largely underexplored. To bridge this gap, we propose ECBind, a method for tokenizing electron cloud signals into quantized embeddings, enabling their integration into downstream tasks such as binding affinity prediction. By incorporating electron densities, ECBind helps uncover binding modes that cannot be fully represented by atom-level models. Specifically, to remove the redundancy inherent in electron cloud signals, a structure-aware transformer and hierarchical codebooks encode 3D binding sites enriched with electron structures into tokens. These tokenized codes are then used for specific tasks with labels. To extend its applicability to a wider range of scenarios, we utilize knowledge distillation to develop an electron-cloud-agnostic prediction model. Experimentally, ECBind demonstrates state-of-the-art performance across multiple tasks, achieving improvements of 6.42% and 15.58% in per-structure Pearson and Spearman correlation coefficients, respectively.

[LG-106] Automatic and Structure-Aware Sparsification of Hybrid Neural ODEs

链接: https://arxiv.org/abs/2505.18996
作者: Bob Junyi Zou,Lu Tian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Hybrid neural ordinary differential equations (neural ODEs) integrate mechanistic models with neural ODEs, offering strong inductive bias and flexibility, and are particularly advantageous in data-scarce healthcare settings. However, excessive latent states and interactions from mechanistic models can lead to training inefficiency and over-fitting, limiting practical effectiveness of hybrid neural ODEs. In response, we propose a new hybrid pipeline for automatic state selection and structure optimization in mechanistic neural ODEs, combining domain-informed graph modifications with data-driven regularization to sparsify the model for improving predictive performance and stability while retaining mechanistic plausibility. Experiments on synthetic and real-world data show improved predictive performance and robustness with desired sparsity, establishing an effective solution for hybrid model reduction in healthcare applications.

[LG-107] FedSKC: Federated Learning with Non-IID Data via Structural Knowledge Collaboration

链接: https://arxiv.org/abs/2505.18981
作者: Huan Wang,Haoran Li,Huaming Chen,Jun Yan,Lijuan Wang,Jiahua Shi,Shiping Chen,Jun Shen
类目: Machine Learning (cs.LG)
*备注: 11 pages, International Conference on Web Services (ICWS) 2025

点击查看摘要

Abstract:With the advancement of edge computing, federated learning (FL) displays a bright promise as a privacy-preserving collaborative learning paradigm. However, one major challenge for FL is the data heterogeneity issue, which refers to the biased labeling preferences among multiple clients, negatively impacting convergence and model performance. Most previous FL methods attempt to tackle the data heterogeneity issue locally or globally, neglecting underlying class-wise structure information contained in each client. In this paper, we first study how data heterogeneity affects the divergence of the model and decompose it into local, global, and sampling drift sub-problems. To explore the potential of using intra-client class-wise structural knowledge in handling these drifts, we thus propose Federated Learning with Structural Knowledge Collaboration (FedSKC). The key idea of FedSKC is to extract and transfer domain preferences from inter-client data distributions, offering diverse class-relevant knowledge and a fair convergent signal. FedSKC comprises three components: i) local contrastive learning, to prevent weight divergence resulting from local training; ii) global discrepancy aggregation, which addresses the parameter deviation between the server and clients; iii) global period review, correcting for the sampling drift introduced by the server randomly selecting devices. We have theoretically analyzed FedSKC under non-convex objectives and empirically validated its superiority through extensive experimental results.

[LG-108] GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization

链接: https://arxiv.org/abs/2505.18979
作者: Zixuan Chen,Hao Lin,Ke Xu,Xinghao Jiang,Tanfeng Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text-to-image (T2I) generation models can inadvertently produce not-safe-for-work (NSFW) content, prompting the integration of text and image safety filters. Recent advances employ large language models (LLMs) for semantic-level detection, rendering traditional token-level perturbation attacks largely ineffective. However, our evaluation shows that existing jailbreak methods are ineffective against these modern filters. We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) Dynamic Optimization, an iterative process that guides a large language model (LLM) using feedback from text safety filters and CLIP similarity scores to generate semantically aligned adversarial prompts; and (ii) Adaptive Safety Indicator Injection, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. GhostPrompt achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% (Sneakyprompt) to 99.0%, improving CLIP score from 0.2637 to 0.2762, and reducing the time cost by 4.2 \times . Moreover, it generalizes to unseen filters including GPT-4.1 and successfully jailbreaks DALLE 3 to generate NSFW images in our evaluation, revealing systemic vulnerabilities in current multimodal defenses. To support further research on AI safety and red-teaming, we will release code and adversarial prompts under a controlled-access protocol.

[LG-109] Online Knowledge Distillation with Reward Guidance

链接: https://arxiv.org/abs/2505.18952
作者: Chen Jia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work studies knowledge distillation (KD) for large language models (LLMs) through preference optimization. We propose a reward-guided imitation learning framework for sequential KD, formulating a min-max optimization problem between the policy and reward model (RM) to minimize the performance gap between the student and teacher policies. Specifically, the reward optimization is constrained to achieve near-optimality within a confidence set for preference alignment. For preference data construction, we explore both offline and online preference-based KD. Additionally, we reformulate the RM using the Q -value function and extend the framework to white-box KD, where the teacher policy’s predicted probabilities are accessible. Theoretical analysis and empirical results demonstrate the effectiveness of the proposed framework.

[LG-110] Exact Expressive Power of Transformers with Padding

链接: https://arxiv.org/abs/2505.18948
作者: William Merrill,Ashish Sabharwal
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL)
*备注:

点击查看摘要

Abstract:Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer’s expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding converge to precisely the class \mathsfTC^0 of extremely parallelizable problems. While the \mathsfTC^0 upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with O(\log^d n) looping on inputs of length n recognize exactly the class \mathsfTC^d of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers’ expressive power: with polylogarithmic looping, padded transformers converge to the class \mathsfNC , the best that could be expected without losing parallelism (unless \mathsfNC = \mathsfP ). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought.

[LG-111] Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time

链接: https://arxiv.org/abs/2505.18926
作者: Jingxuan Xu,Hong Huang,Chuhang Zou,Manolis Savva,Yunchao Wei,Wuyang Chen
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We propose a neural physics system for real-time, interactive fluid simulations. Traditional physics-based methods, while accurate, are computationally intensive and suffer from latency issues. Recent machine-learning methods reduce computational costs while preserving fidelity; yet most still fail to satisfy the latency constraints for real-time use and lack support for interactive applications. To bridge this gap, we introduce a novel hybrid method that integrates numerical simulation, neural physics, and generative control. Our neural physics jointly pursues low-latency simulation and high physical fidelity by employing a fallback safeguard to classical numerical solvers. Furthermore, we develop a diffusion-based controller that is trained using a reverse modeling strategy to generate external dynamic force fields for fluid manipulation. Our system demonstrates robust performance across diverse 2D/3D scenarios, material types, and obstacle interactions, achieving real-time simulations at high frame rates (11~29% latency) while enabling fluid control guided by user-friendly freehand sketches. We present a significant step towards practical, controllable, and physically plausible fluid simulations for real-time interactive applications. We promise to release both models and data upon acceptance.

[LG-112] Graph-Based Operator Learning from Limited Data on Irregular Domains

链接: https://arxiv.org/abs/2505.18923
作者: Yile Li,Shandian Zhe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Operator learning seeks to approximate mappings from input functions to output solutions, particularly in the context of partial differential equations (PDEs). While recent advances such as DeepONet and Fourier Neural Operator (FNO) have demonstrated strong performance, they often rely on regular grid discretizations, limiting their applicability to complex or irregular domains. In this work, we propose a Graph-based Operator Learning with Attention (GOLA) framework that addresses this limitation by constructing graphs from irregularly sampled spatial points and leveraging attention-enhanced Graph Neural Netwoks (GNNs) to model spatial dependencies with global information. To improve the expressive capacity, we introduce a Fourier-based encoder that projects input functions into a frequency space using learnable complex coefficients, allowing for flexible embeddings even with sparse or nonuniform samples. We evaluated our approach across a range of 2D PDEs, including Darcy Flow, Advection, Eikonal, and Nonlinear Diffusion, under varying sampling densities. Our method consistently outperforms baselines, particularly in data-scarce regimes, demonstrating strong generalization and efficiency on irregular domains.

[LG-113] Conformal Prediction for Uncertainty Estimation in Drug-Target Interaction Prediction

链接: https://arxiv.org/abs/2505.18890
作者: Morteza Rakhshaninejad,Mira Jurgens,Nicolas Dewolf,Willem Waegeman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate drug-target interaction (DTI) prediction with machine learning models is essential for drug discovery. Such models should also provide a credible representation of their uncertainty, but applying classical marginal conformal prediction (CP) in DTI prediction often overlooks variability across drug and protein subgroups. In this work, we analyze three cluster-conditioned CP methods for DTI prediction, and compare them with marginal and group-conditioned CP. Clusterings are obtained via nonconformity scores, feature similarity, and nearest neighbors, respectively. Experiments on the KIBA dataset using four data-splitting strategies show that nonconformity-based clustering yields the tightest intervals and most reliable subgroup coverage, especially in random and fully unseen drug-protein splits. Group-conditioned CP works well when one entity is familiar, but residual-driven clustering provides robust uncertainty estimates even in sparse or novel scenarios. These results highlight the potential of cluster-based CP for improving DTI prediction under uncertainty.

[LG-114] KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning NEURIPS

链接: https://arxiv.org/abs/2505.18886
作者: Zhendong Mi,Qitao Tan,Xiaodong Yu,Zining Zhu,Geng Yuan,Shaoyi Huang
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, Neurips

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative, as it circumvents the need for backpropagation by estimating gradients solely through forward passes–making it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly hinders convergence speed. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating this bias and improving optimization stability. KerZOO achieves comparable or superior performance to existing ZO baselines in both full-parameter and parameter-efficient fine-tuning settings of LLMs, while significantly reducing the number of iterations required to reach convergence. For example, KerZOO reduces total GPU training hours by as much as 74% and 44% on WSC and MultiRC datasets in fine-tuning OPT-2.7B model and can exceed the MeZO baseline by 2.9% and 2.6% in accuracy. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.

[LG-115] Partition Generative Modeling: Masked Modeling Without Masks

链接: https://arxiv.org/abs/2505.18883
作者: Justin Deschenaux,Lan Tran,Caglar Gulcehre
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce ``Partition Generative Models’’ (PGMs), a novel approach to masked generative modeling (MGMs), particularly effective for masked diffusion language modeling (MDLMs). PGM divides tokens into two distinct groups and employs sparse attention patterns to prevent cross-group information exchange. Hence, the model is trained to predict tokens in one group based solely on information from the other group. This partitioning strategy eliminates the need for MASK tokens entirely. While traditional MGMs inefficiently process MASK tokens during generation, PGMs achieve greater computational efficiency by operating exclusively on unmasked tokens. Our experiments on OpenWebText with a context length of 1024 tokens demonstrate that PGMs deliver at least 5x improvements in both latency and throughput compared to MDLM when using the same number of sampling steps, while generating samples with better generative perplexity than MDLM. Finally, we show that PGMs can be distilled with Self-Distillation Through Time (SDTT), a method originally devised for MDLM, in order to achieve further inference gains.

[LG-116] RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models

链接: https://arxiv.org/abs/2505.18877
作者: Yilang Zhang,Bingcong Li,Georgios B. Giannakis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) lowers the computational and memory overhead of fine-tuning large models by updating a low-dimensional subspace of the pre-trained weight matrix. Albeit efficient, LoRA exhibits suboptimal convergence and noticeable performance degradation, due to inconsistent and imbalanced weight updates induced by its nonunique low-rank factorizations. To overcome these limitations, this article identifies the optimal low-rank factorization per step that minimizes an upper bound on the loss. The resultant refactored low-rank adaptation (RefLoRA) method promotes a flatter loss landscape, along with consistent and balanced weight updates, thus speeding up stable convergence. Extensive experiments evaluate RefLoRA on natural language understanding, and commonsense reasoning tasks with popular large language models including DeBERTaV3, LLaMA-7B, LLaMA2-7B and LLaMA3-8B. The numerical tests corroborate that RefLoRA converges faster, outperforms various benchmarks, and enjoys negligible computational overhead compared to state-of-the-art LoRA variants.

[LG-117] Distribution-Aware Mobility-Assisted Decentralized Federated Learning

链接: https://arxiv.org/abs/2505.18866
作者: Md Farhamdur Reza,Reza Jahani,Richeng Jin,Huaiyu Dai
类目: Machine Learning (cs.LG)
*备注: Under review for possible publication in IEEE GLOBECOM 2025

点击查看摘要

Abstract:Decentralized federated learning (DFL) has attracted significant attention due to its scalability and independence from a central server. In practice, some participating clients can be mobile, yet the impact of user mobility on DFL performance remains largely unexplored, despite its potential to facilitate communication and model convergence. In this work, we demonstrate that introducing a small fraction of mobile clients, even with random movement, can significantly improve the accuracy of DFL by facilitating information flow. To further enhance performance, we propose novel distribution-aware mobility patterns, where mobile clients strategically navigate the network, leveraging knowledge of data distributions and static client locations. The proposed moving strategies mitigate the impact of data heterogeneity and boost learning convergence. Extensive experiments validate the effectiveness of induced mobility in DFL and demonstrate the superiority of our proposed mobility patterns over random movement.

[LG-118] Guided by Guardrails: Control Barrier Functions as Safety Instructors for Robotic Learning

链接: https://arxiv.org/abs/2505.18858
作者: Maeva Guerrier,Karthik Soma,Hassan Fouad,Giovanni Beltrame
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safety stands as the primary obstacle preventing the widespread adoption of learning-based robotic systems in our daily lives. While reinforcement learning (RL) shows promise as an effective robot learning paradigm, conventional RL frameworks often model safety by using single scalar negative rewards with immediate episode termination, failing to capture the temporal consequences of unsafe actions (e.g., sustained collision damage). In this work, we introduce a novel approach that simulates these temporal effects by applying continuous negative rewards without episode termination. Our experiments reveal that standard RL methods struggle with this model, as the accumulated negative values in unsafe zones create learning barriers. To address this challenge, we demonstrate how Control Barrier Functions (CBFs), with their proven safety guarantees, effectively help robots avoid catastrophic regions while enhancing learning outcomes. We present three CBF-based approaches, each integrating traditional RL methods with Control Barrier Functions, guiding the agent to learn safe behavior. Our empirical analysis, conducted in both simulated environments and real-world settings using a four-wheel differential drive robot, explores the possibilities of employing these approaches for safe robotic learning.

[LG-119] Improved Regret and Contextual Linear Extension for Pandoras Box and Prophet Inequality

链接: https://arxiv.org/abs/2505.18828
作者: Junyan Liu,Ziyun Chen,Kun Wang,Haipeng Luo,Lillian J. Ratliff
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We study the Pandora’s Box problem in an online learning setting with semi-bandit feedback. In each round, the learner sequentially pays to open up to n boxes with unknown reward distributions, observes rewards upon opening, and decides when to stop. The utility of the learner is the maximum observed reward minus the cumulative cost of opened boxes, and the goal is to minimize regret defined as the gap between the cumulative expected utility and that of the optimal policy. We propose a new algorithm that achieves \widetildeO(\sqrtnT) regret after T rounds, which improves the \widetildeO(n\sqrtT) bound of Agarwal et al. [2024] and matches the known lower bound up to logarithmic factors. To better capture real-life applications, we then extend our results to a natural but challenging contextual linear setting, where each box’s expected reward is linear in some known but time-varying d -dimensional context and the noise distribution is fixed over time. We design an algorithm that learns both the linear function and the noise distributions, achieving \widetildeO(nd\sqrtT) regret. Finally, we show that our techniques also apply to the online Prophet Inequality problem, where the learner must decide immediately whether or not to accept a revealed reward. In both non-contextual and contextual settings, our approach achieves similar improvements and regret bounds.

[LG-120] Governing Equation Discovery from Data Based on Differential Invariants

链接: https://arxiv.org/abs/2505.18798
作者: Lexiang Hu,Yikang Li,Zhouchen Lin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The explicit governing equation is one of the simplest and most intuitive forms for characterizing physical laws. However, directly discovering partial differential equations (PDEs) from data poses significant challenges, primarily in determining relevant terms from a vast search space. Symmetry, as a crucial prior knowledge in scientific fields, has been widely applied in tasks such as designing equivariant networks and guiding neural PDE solvers. In this paper, we propose a pipeline for governing equation discovery based on differential invariants, which can losslessly reduce the search space of existing equation discovery methods while strictly adhering to symmetry. Specifically, we compute the set of differential invariants corresponding to the infinitesimal generators of the symmetry group and select them as the relevant terms for equation discovery. Taking DI-SINDy (SINDy based on Differential Invariants) as an example, we demonstrate that its success rate and accuracy in PDE discovery surpass those of other symmetry-informed governing equation discovery methods across a series of PDEs.

[LG-121] Leverag ing Per-Instance Privacy for Machine Unlearning

链接: https://arxiv.org/abs/2505.18786
作者: Nazanin Mohammadi Sepahvand,Anvith Thudi,Berivan Isik,Ashmita Bhattacharyya,Nicolas Papernot,Eleni Triantafillou,Daniel M. Roy,Gintare Karolina Dziugaite
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility-unlearning tradeoff by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Renyi) divergence to retraining without an individual data point. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are born out both for Stochastic Gradient Langevin Dynamics (SGLD) as well as for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. All together, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.

[LG-122] Geometry Aware Operator Transformer as an Efficient and Accurate Neural Surrogate for PDEs on Arbitrary Domains

链接: https://arxiv.org/abs/2505.18781
作者: Shizheng Wen,Arsh Kumbhat,Levi Lingsch,Sepehr Mousavi,Praveen Chandrashekar,Siddhartha Mishra
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The very challenging task of learning solution operators of PDEs on arbitrary domains accurately and efficiently is of vital importance to engineering and industrial simulations. Despite the existence of many operator learning algorithms to approximate such PDEs, we find that accurate models are not necessarily computationally efficient and vice versa. We address this issue by proposing a geometry aware operator transformer (GAOT) for learning PDEs on arbitrary domains. GAOT combines novel multiscale attentional graph neural operator encoders and decoders, together with geometry embeddings and (vision) transformer processors to accurately map information about the domain and the inputs into a robust approximation of the PDE solution. Multiple innovations in the implementation of GAOT also ensure computational efficiency and scalability. We demonstrate this significant gain in both accuracy and efficiency of GAOT over several baselines on a large number of learning tasks from a diverse set of PDEs, including achieving state of the art performance on a large scale three-dimensional industrial CFD dataset.

[LG-123] One Policy but Many Worlds: A Scalable Unified Policy for Versatile Humanoid Locomotion

链接: https://arxiv.org/abs/2505.18780
作者: Yahao Fan,Tianxiang Gui,Kaiyang Ji,Shutong Ding,Chixuan Zhang,Jiayuan Gu,Jingyi Yu,Jingya Wang,Ye Shi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Humanoid locomotion faces a critical scalability challenge: traditional reinforcement learning (RL) methods require task-specific rewards and struggle to leverage growing datasets, even as more training terrains are introduced. We propose DreamPolicy, a unified framework that enables a single policy to master diverse terrains and generalize zero-shot to unseen scenarios by systematically integrating offline data and diffusion-driven motion synthesis. At its core, DreamPolicy introduces Humanoid Motion Imagery (HMI) - future state predictions synthesized through an autoregressive terrain-aware diffusion planner curated by aggregating rollouts from specialized policies across various distinct terrains. Unlike human motion datasets requiring laborious retargeting, our data directly captures humanoid kinematics, enabling the diffusion planner to synthesize “dreamed” trajectories that encode terrain-specific physical constraints. These trajectories act as dynamic objectives for our HMI-conditioned policy, bypassing manual reward engineering and enabling cross-terrain generalization. DreamPolicy addresses the scalability limitations of prior methods: while traditional RL fails to exploit growing datasets, our framework scales seamlessly with more offline data. As the dataset expands, the diffusion prior learns richer locomotion skills, which the policy leverages to master new terrains without retraining. Experiments demonstrate that DreamPolicy achieves average 90% success rates in training environments and an average of 20% higher success on unseen terrains than the prevalent method. It also generalizes to perturbed and composite scenarios where prior approaches collapse. By unifying offline data, diffusion-based trajectory synthesis, and policy optimization, DreamPolicy overcomes the “one task, one policy” bottleneck, establishing a paradigm for scalable, data-driven humanoid control.

[LG-124] Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization UAI2025

链接: https://arxiv.org/abs/2505.18765
作者: Dai Hai Nguyen,Hiroshi Mamitsuka,Atsuyoshi Nakamura
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to UAI 2025 (Oral Presentation)

点击查看摘要

Abstract:We address the optimization problem of simultaneously minimizing multiple objective functionals over a family of probability distributions. This type of Multi-Objective Distributional Optimization commonly arises in machine learning and statistics, with applications in areas such as multiple target sampling, multi-task learning, and multi-objective generative modeling. To solve this problem, we propose an iterative particle-based algorithm, which we call Muliple Wasserstein Gradient Descent (MWGraD), which constructs a flow of intermediate empirical distributions, each being represented by a set of particles, which gradually minimize the multiple objective functionals simultaneously. Specifically, MWGraD consists of two key steps at each iteration. First, it estimates the Wasserstein gradient for each objective functional based on the current particles. Then, it aggregates these gradients into a single Wasserstein gradient using dynamically adjusted weights and updates the particles accordingly. In addition, we provide theoretical analysis and present experimental results on both synthetic and real-world datasets, demonstrating the effectiveness of MWGraD.

[LG-125] GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

链接: https://arxiv.org/abs/2505.18763
作者: Shutong Ding,Ke Hu,Shan Zhong,Haoyang Luo,Weinan Zhang,Jingya Wang,Jun Wang,Ye Shi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO’s superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.

[LG-126] Reducing Storag e of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding

链接: https://arxiv.org/abs/2505.18758
作者: Alexander Conzelmann,Robert Bamler
类目: Machine Learning (cs.LG)
*备注: 9 pages + 5 pages of appendix

点击查看摘要

Abstract:The ever-growing size of neural networks poses serious challenges on resource-constrained devices, such as embedded sensors. Compression algorithms that reduce their size can mitigate these problems, provided that model performance stays close to the original. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding by (1) extending the well-known layer-wise loss by a quadratic rate estimation, and (2) providing locally exact solutions to this modified objective following the Optimal Brain Surgeon (OBS) method. Our method allows for very fast decoding and is compatible with arbitrary quantization grids. We verify our results empirically by testing on various computer-vision networks, achieving a 20-40% decrease in bit rate at the same performance as the popular compression algorithm NNCodec. Our code is available at this https URL.

[LG-127] AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping

链接: https://arxiv.org/abs/2505.18738
作者: Haonan Dong,Wenhao Zhu,Guojie Song,Liang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method validated across NLP and CV domains. However, LoRA faces an inherent low-rank bottleneck: narrowing its performance gap with full finetuning requires increasing the rank of its parameter matrix, resulting in significant parameter overhead. Recent linear LoRA variants have attempted to enhance expressiveness by introducing additional linear mappings; however, their composition remains inherently linear and fails to fundamentally improve LoRA’s representational capacity. To address this limitation, we propose AuroRA, which incorporates an Adaptive Nonlinear Layer (ANL) between two linear projectors to capture fixed and learnable nonlinearities. This combination forms an MLP-like structure with a compressed rank, enabling flexible and precise approximation of diverse target functions while theoretically guaranteeing lower approximation errors and bounded gradients. Extensive experiments on 22 datasets and 6 pretrained models demonstrate that AuroRA: (I) not only matches or surpasses full fine-tuning performance with only 6.18% ~ 25% of LoRA’s parameters but also (II) outperforms state-of-the-art PEFT methods by up to 10.88% in both NLP and CV tasks, and (III) exhibits robust performance across various rank configurations.

[LG-128] MADCAT: Combating Malware Detection Under Concept Drift with Test-Time Adaptation

链接: https://arxiv.org/abs/2505.18734
作者: Eunjin Roh,Yigitcan Kaya,Christopher Kruegel,Giovanni Vigna,Sanghyun Hong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Pre-print; 4 pages

点击查看摘要

Abstract:We present MADCAT, a self-supervised approach designed to address the concept drift problem in malware detection. MADCAT employs an encoder-decoder architecture and works by test-time training of the encoder on a small, balanced subset of the test-time data using a self-supervised objective. During test-time training, the model learns features that are useful for detecting both previously seen (old) data and newly arriving samples. We demonstrate the effectiveness of MADCAT in continuous Android malware detection settings. MADCAT consistently outperforms baseline methods in detection performance at test time. We also show the synergy between MADCAT and prior approaches in addressing concept drift in malware detection

[LG-129] Reward-Driven Interaction: Enhancing Proactive Dialogue Agents through User Satisfaction Prediction

链接: https://arxiv.org/abs/2505.18731
作者: Wei Shen,Xiaonan He,Chuheng Zhang,Xuyun Zhang,Xiaolong Xu,Wanchun Dou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward-driven proactive dialogue agents require precise estimation of user satisfaction as an intrinsic reward signal to determine optimal interaction strategies. Specifically, this framework triggers clarification questions when detecting potential user dissatisfaction during interactions in the industrial dialogue system. Traditional works typically rely on training a neural network model based on weak labels which are generated by a simple model trained on user actions after current turn. However, existing methods suffer from two critical limitations in real-world scenarios: (1) Noisy Reward Supervision, dependence on weak labels derived from post-hoc user actions introduces bias, particularly failing to capture satisfaction signals in ASR-error-induced utterances; (2) Long-Tail Feedback Sparsity, the power-law distribution of user queries causes reward prediction accuracy to drop in low-frequency domains. The noise in the weak labels and a power-law distribution of user utterances results in that the model is hard to learn good representation of user utterances and sessions. To address these limitations, we propose two auxiliary tasks to improve the representation learning of user utterances and sessions that enhance user satisfaction prediction. The first one is a contrastive self-supervised learning task, which helps the model learn the representation of rare user utterances and identify ASR errors. The second one is a domain-intent classification task, which aids the model in learning the representation of user sessions from long-tailed domains and improving the model’s performance on such domains. The proposed method is evaluated on DuerOS, demonstrating significant improvements in the accuracy of error recognition on rare user utterances and long-tailed domains.

[LG-130] Audio Geolocation: A Natural Sounds Benchmark

链接: https://arxiv.org/abs/2505.18726
作者: Mustafa Chasmai,Wuao Liu,Subhransu Maji,Grant Van Horn
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Can we determine someone’s geographic location purely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? We tackle the challenge of global-scale audio geolocation, formalize the problem, and conduct an in-depth analysis with wildlife audio from the iNatSounds dataset. Adopting a vision-inspired approach, we convert audio recordings to spectrograms and benchmark existing image geolocation techniques. We hypothesize that species vocalizations offer strong geolocation cues due to their defined geographic ranges and propose an approach that integrates species range prediction with retrieval-based geolocation. We further evaluate whether geolocation improves when analyzing species-rich recordings or when aggregating across spatiotemporal neighborhoods. Finally, we introduce case studies from movies to explore multimodal geolocation using both audio and visual content. Our work highlights the advantages of integrating audio and visual cues, and sets the stage for future research in audio geolocation.

[LG-131] MonarchAttention: Zero-Shot Conversion to Fast Hardware-Aware Structured Attention

链接: https://arxiv.org/abs/2505.18698
作者: Can Yaras,Alec S. Xu,Pierre Abillama,Changwoo Lee,Laura Balzano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention – a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with \Theta(N\sqrtN d) computational complexity and \Theta(Nd) memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: 1.4\times for shorter sequences (N=256) , 4.5\times for medium-length sequences (N=4K) , and 8.2\times for longer sequences (N=16K) . We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at this https URL.

[LG-132] Simultaneous Optimization of Efficiency and Degradation in Tunable HTL-Free Perovskite Solar Cells with MWCNT-Integrated Back Contact Using a Machine Learning-Derived Polynomial Regressor

链接: https://arxiv.org/abs/2505.18693
作者: Ihtesham Ibn Malek,Hafiz Imtiaz,Samia Subrina
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: Under review in Elsevier Renewable Energy

点击查看摘要

Abstract:Perovskite solar cells (PSCs) without a hole transport layer (HTL) offer a cost-effective and stable alternative to conventional architectures, utilizing only an absorber layer and an electron transport layer (ETL). This study presents a machine learning (ML)-driven framework to optimize the efficiency and stability of HTL-free PSCs by integrating experimental validation with numerical simulations. Excellent agreement is achieved between a fabricated device and its simulated counterpart at a molar fraction ( x = 68.7% ) in (\mathrmMAPb_1-x\mathrmSb_2x/3\mathrmI_3), where MA is methylammonium. A dataset of 1650 samples is generated by varying molar fraction, absorber defect density, thickness, and ETL doping, with corresponding efficiency and 50-hour degradation as targets. A fourth-degree polynomial regressor (PR-4) shows the best performance, achieving RMSEs of 0.0179 and 0.0117, and ( R^2 ) scores of 1 and 0.999 for efficiency and degradation, respectively. The derived model generalizes beyond the training range and is used in an L-BFGS-B optimization algorithm with a weighted objective function to maximize efficiency and minimize degradation. This improves device efficiency from 13.7% to 16.84% and reduces degradation from 6.61% to 2.39% over 1000 hours. Finally, the dataset is labeled into superior and inferior classes, and a multilayer perceptron (MLP) classifier achieves 100% accuracy, successfully identifying optimal configurations.

[LG-133] Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?

链接: https://arxiv.org/abs/2505.18672
作者: Hongzheng Yang,Yongqiang Chen,Zeyu Qin,Tongliang Liu,Chaowei Xiao,Kun Zhang,Bo Han
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Hongzheng and Yongqiang contributed equally; project page: this https URL

点击查看摘要

Abstract:Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refractors the training data with an explicit reasoning process, which firstly identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates, and meanwhile maintaining strong performance on regular tasks such as math and code generation.

[LG-134] Self-Supervised Evolution Operator Learning for High-Dimensional Dynamical Systems

链接: https://arxiv.org/abs/2505.18671
作者: Giacomo Turri,Luigi Bonati,Kai Zhu,Massimiliano Pontil,Pietro Novelli
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:We introduce an encoder-only approach to learn the evolution operators of large-scale non-linear dynamical systems, such as those describing complex natural phenomena. Evolution operators are particularly well-suited for analyzing systems that exhibit complex spatio-temporal patterns and have become a key analytical tool across various scientific communities. As terabyte-scale weather datasets and simulation tools capable of running millions of molecular dynamics steps per day are becoming commodities, our approach provides an effective tool to make sense of them from a data-driven perspective. The core of it lies in a remarkable connection between self-supervised representation learning methods and the recently established learning theory of evolution operators. To show the usefulness of the proposed method, we test it across multiple scientific domains: explaining the folding dynamics of small proteins, the binding process of drug-like molecules in host sites, and autonomously finding patterns in climate data. Code and data to reproduce the experiments are made available open source.

[LG-135] LLM -QFL: Distilling Large Language Model for Quantum Federated Learning

链接: https://arxiv.org/abs/2505.18656
作者: Dev Gurung,Shiva Raj Pokhrel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inspired by the power of large language models (LLMs), our research adapts them to quantum federated learning (QFL) to boost efficiency and performance. We propose a federated fine-tuning method that distills an LLM within QFL, allowing each client to locally adapt the model to its own data while preserving privacy and reducing unnecessary global updates. The fine-tuned LLM also acts as a reinforcement agent, optimizing QFL by adjusting optimizer steps, cutting down communication rounds, and intelligently selecting clients. Experiments show significant efficiency gains. We pioneer a synergy between LLM and QFL, offering: i) practical efficiency: Reduced communication costs and faster convergence. ii) theoretical rigor: Provable guarantees for adaptive federated optimization. iii) scalability: PEFT methods (LoRA, QLoRA) enable deployment on resource-constrained quantum devices. Code implementation is available here 1.

[LG-136] Asymmetric Duos: Sidekicks Improve Uncertainty

链接: https://arxiv.org/abs/2505.18636
作者: Tim G. Zhou,Evan Shelhamer,Geoff Pleiss
类目: Machine Learning (cs.LG)
*备注: 24 pages, 14 figures

点击查看摘要

Abstract:The go-to strategy to apply deep networks in settings where uncertainty informs decisions–ensembling multiple training runs with random initializations–is ill-suited for the extremely large-scale models and practical fine-tuning workflows of today. We introduce a new cost-effective strategy for improving the uncertainty quantification and downstream decisions of a large model (e.g. a fine-tuned ViT-B): coupling it with a less accurate but much smaller “sidekick” (e.g. a fine-tuned ResNet-34) with a fraction of the computational cost. We propose aggregating the predictions of this \emphAsymmetric Duo by simple learned weighted averaging. Surprisingly, despite their inherent asymmetry, the sidekick model almost never harms the performance of the larger model. In fact, across five image classification benchmarks and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only \sim10-20% more computation.

[LG-137] hink Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

链接: https://arxiv.org/abs/2505.18629
作者: Yixuan Wang,Yijun Liu,Shiyu ji,Yuzhuang Xu,Yang Xu,Qingfu Zhu,Wanxiang Che
类目: Machine Learning (cs.LG)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. However, existing verification methods rely heavily on distributional consistency while overlooking semantic correctness, thereby limiting the potential speedup of speculative decoding. While some methods employ additional models for relaxed verification of draft tokens, they often fail to generalize effectively to more diverse or open-domain settings. In this work, we propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency. Specifically, we leverage the inherent reflective capacity of LLMs to semantically assess the correctness of draft tokens in parallel during verification. Using prompt-based probing, we obtain both the original and reflective distributions of draft tokens in a single forward pass. The fusion of these distributions enables semantic-level verification of draft tokens that incorporates both consistency and correctness. Experiments across multiple domain benchmarks and model scales demonstrate that our method significantly increases the acceptance length of draft tokens without compromising model performance. Furthermore, we find that the proposed Reflective Verification is orthogonal to existing statistical verification methods, and their combination yields additional 5 \sim 15% improvements in decoding speed.

[LG-138] MLRan: A Behavioural Dataset for Ransomware Analysis and Detection

链接: https://arxiv.org/abs/2505.18613
作者: Faithful Chiagoziem Onwuegbuche,Adelodun Olaoluwa,Anca Delia Jurcut,Liliana Pasquale
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often have limited sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision and recall of up to 98.7%, 98.9%, 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are available publicly to support replicability and encourage future research, which can be found at this https URL.

[LG-139] Exemplar-Free Continual Learning for State Space Models

链接: https://arxiv.org/abs/2505.18604
作者: Isaac Ning Lee,Leila Mahmoodi,Trung Le,Mehrtash Harandi
类目: Machine Learning (cs.LG)
*备注: 41 pages, 4 figures

点击查看摘要

Abstract:State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose challenges in adapting them under Continual Learning (CL). This is particularly difficult in exemplar-free settings, where the absence of prior data leaves updates to the dynamic SSM states unconstrained, resulting in catastrophic forgetting. To address this, we propose Inf-SSM, a novel and simple geometry-aware regularization method that utilizes the geometry of the infinite-dimensional Grassmannian to constrain state evolution during CL. Unlike classical continual learning methods that constrain weight updates, Inf-SSM regularizes the infinite-horizon evolution of SSMs encoded in their extended observability subspace. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs \mathcalO(n^3) complexity. We develop a \mathcalO(n^2) solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks, including ImageNet-R and Caltech-256, demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.

[LG-140] Bayesian Meta-Reinforcement Learning with Laplace Variational Recurrent Networks

链接: https://arxiv.org/abs/2505.18591
作者: Joery A. de Vries,Jinke He,Mathijs M. de Weerdt,Matthijs T.J. Spaan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Meta-reinforcement learning trains a single reinforcement learning agent on a distribution of tasks to quickly generalize to new tasks outside of the training set at test time. From a Bayesian perspective, one can interpret this as performing amortized variational inference on the posterior distribution over training tasks. Among the various meta-reinforcement learning approaches, a common method is to represent this distribution with a point-estimate using a recurrent neural network. We show how one can augment this point estimate to give full distributions through the Laplace approximation, either at the start of, during, or after learning, without modifying the base model architecture. With our approximation, we are able to estimate distribution statistics (e.g., the entropy) of non-Bayesian agents and observe that point-estimate based methods produce overconfident estimators while not satisfying consistency. Furthermore, when comparing our approach to full-distribution based learning of the task posterior, our method performs on par with variational baselines while having much fewer parameters.

[LG-141] Mechanical in-sensor computing: a programmable meta-sensor for structural damage classification without external electronic power

链接: https://arxiv.org/abs/2505.18579
作者: Tingpeng Zhang,Xuzhang Peng,Mingyuan Zhou,Guobiao Hu,Zhilu Lai
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Structural health monitoring (SHM) involves sensor deployment, data acquisition, and data interpretation, commonly implemented via a tedious wired system. The information processing in current practice majorly depends on electronic computers, albeit with universal applications, delivering challenges such as high energy consumption and low throughput due to the nature of digital units. In recent years, there has been a renaissance interest in shifting computations from electronic computing units to the use of real physical systems, a concept known as physical computation. This approach provides the possibility of thinking out of the box for SHM, seamlessly integrating sensing and computing into a pure-physical entity, without relying on external electronic power supplies, thereby properly coping with resource-restricted scenarios. The latest advances of metamaterials (MM) hold great promise for this proactive idea. In this paper, we introduce a programmable metamaterial-based sensor (termed as MM-sensor) for physically processing structural vibration information to perform specific SHM tasks, such as structural damage warning (binary classification) in this initiation, without the need for further information processing or resource-consuming, that is, the data collection and analysis are completed in-situ at the sensor level. We adopt the configuration of a locally resonant metamaterial plate (LRMP) to achieve the first fabrication of the MM-sensor. We take advantage of the bandgap properties of LRMP to physically differentiate the dynamic behavior of structures before and after damage. By inversely designing the geometric parameters, our current approach allows for adjustments to the bandgap features. This is effective for engineering systems with a first natural frequency ranging from 9.54 Hz to 81.86 Hz.

[LG-142] VISTA: Vision-Language Inference for Training-Free Stock Time-Series Analysis

链接: https://arxiv.org/abs/2505.18570
作者: Tina Khezresmaeilzadeh,Parsa Razmara,Seyedarmin Azizi,Mohammad Erfan Sadeghi,Erfan Baghaei Portaghloo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stock price prediction remains a complex and high-stakes task in financial analysis, traditionally addressed using statistical models or, more recently, language models. In this work, we introduce VISTA (Vision-Language Inference for Stock Time-series Analysis), a novel, training-free framework that leverages Vision-Language Models (VLMs) for multi-modal stock forecasting. VISTA prompts a VLM with both textual representations of historical stock prices and their corresponding line charts to predict future price values. By combining numerical and visual modalities in a zero-shot setting and using carefully designed chain-of-thought prompts, VISTA captures complementary patterns that unimodal approaches often miss. We benchmark VISTA against standard baselines, including ARIMA and text-only LLM-based prompting methods. Experimental results show that VISTA outperforms these baselines by up to 89.83%, demonstrating the effectiveness of multi-modal inference for stock time-series analysis and highlighting the potential of VLMs in financial forecasting tasks without requiring task-specific training.

[LG-143] Learning Fluid-Structure Interaction Dynamics with Physics-Informed Neural Networks and Immersed Boundary Methods

链接: https://arxiv.org/abs/2505.18565
作者: Afrah Farea,Saiful Khan,Reza Daryani,Emre Cenk Ersan,Mustafa Serdar Celebi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We introduce neural network architectures that combine physics-informed neural networks (PINNs) with the immersed boundary method (IBM) to solve fluid-structure interaction (FSI) problems. Our approach features two distinct architectures: a Single-FSI network with a unified parameter space, and an innovative Eulerian-Lagrangian network that maintains separate parameter spaces for fluid and structure domains. We study each architecture using standard Tanh and adaptive B-spline activation functions. Empirical studies on a 2D cavity flow problem involving a moving solid structure show that the Eulerian-Lagrangian architecture performs significantly better. The adaptive B-spline activation further enhances accuracy by providing locality-aware representation near boundaries. While our methodology shows promising results in predicting the velocity field, pressure recovery remains challenging due to the absence of explicit force-coupling constraints in the current formulation. Our findings underscore the importance of domain-specific architectural design and adaptive activation functions for modeling FSI problems within the PINN framework.

[LG-144] Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning ICML2018

链接: https://arxiv.org/abs/2505.18558
作者: Wenbo He,Zhijian Ou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2018 submission. arXiv admin note: text overlap with arXiv:1808.01630

点击查看摘要

Abstract:Our examination of existing deep generative models (DGMs), including VAEs and GANs, reveals two problems. First, their capability in handling discrete observations and latent codes is unsatisfactory, though there are interesting efforts. Second, both VAEs and GANs optimize some criteria that are indirectly related to the data likelihood. To address these problems, we formally present Joint-stochastic-approximation (JSA) autoencoders - a new family of algorithms for building deep directed generative models, with application to semi-supervised learning. The JSA learning algorithm directly maximizes the data log-likelihood and simultaneously minimizes the inclusive KL divergence the between the posteriori and the inference model. We provide theoretical results and conduct a series of experiments to show its superiority such as being robust to structure mismatch between encoder and decoder, consistent handling of both discrete and continuous variables. Particularly we empirically show that JSA autoencoders with discrete latent space achieve comparable performance to other state-of-the-art DGMs with continuous latent space in semi-supervised tasks over the widely adopted datasets - MNIST and SVHN. To the best of our knowledge, this is the first demonstration that discrete latent variable models are successfully applied in the challenging semi-supervised tasks.

[LG-145] LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis

链接: https://arxiv.org/abs/2505.18551
作者: Md Ahsanul Haque,Ismail Hossain,Md Mahmuduzzaman Kamol,Md Jahangir Alam,Suresh Kumar Amalapuram,Sajedul Talukder,Mohammad Saidur Rahman
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 31 pages, 21 figures, and 16 tables

点击查看摘要

Abstract:Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift – distributional shifts in benign and malicious samples, leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection. To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013-2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA’s utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges. The dataset and code are available at: this https URL. Comments: 31 pages, 21 figures, and 16 tables Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2505.18551 [cs.CR] (or arXiv:2505.18551v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.18551 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-146] Benchmarking Poisoning Attacks against Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2505.18543
作者: Baolei Zhang,Haoran Xin,Jiatong Li,Dongzhe Zhang,Minghong Fang,Zhuqing Liu,Lihai Nie,Zheli Liu
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has proven effective in mitigating hallucinations in large language models by incorporating external knowledge during inference. However, this integration introduces new security vulnerabilities, particularly to poisoning attacks. Although prior work has explored various poisoning strategies, a thorough assessment of their practical threat to RAG systems remains missing. To address this gap, we propose the first comprehensive benchmark framework for evaluating poisoning attacks on RAG. Our benchmark covers 5 standard question answering (QA) datasets and 10 expanded variants, along with 13 poisoning attack methods and 7 defense mechanisms, representing a broad spectrum of existing techniques. Using this benchmark, we conduct a comprehensive evaluation of all included attacks and defenses across the full dataset spectrum. Our findings show that while existing attacks perform well on standard QA datasets, their effectiveness drops significantly on the expanded versions. Moreover, our results demonstrate that various advanced RAG architectures, such as sequential, branching, conditional, and loop RAG, as well as multi-turn conversational RAG, multimodal RAG systems, and RAG-based LLM agent systems, remain susceptible to poisoning attacks. Notably, current defense techniques fail to provide robust protection, underscoring the pressing need for more resilient and generalizable defense strategies.

[LG-147] Convergence Sticking and Escape: Stochastic Dynamics Near Critical Points in SGD

链接: https://arxiv.org/abs/2505.18535
作者: Dmitry Dudukalov,Artem Logachov,Vladimir Lotov,Timofei Prasolov,Evgeny Prokopenko,Anton Tarasenko
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the convergence properties and escape dynamics of Stochastic Gradient Descent (SGD) in one-dimensional landscapes, separately considering infinite- and finite-variance noise. Our main focus is to identify the time scales on which SGD reliably moves from an initial point to the local minimum in the same ‘‘basin’’. Under suitable conditions on the noise distribution, we prove that SGD converges to the basin’s minimum unless the initial point lies too close to a local maximum. In that near-maximum scenario, we show that SGD can linger for a long time in its neighborhood. For initial points near a ‘‘sharp’’ maximum, we show that SGD does not remain stuck there, and we provide results to estimate the probability that it will reach each of the two neighboring minima. Overall, our findings present a nuanced view of SGD’s transitions between local maxima and minima, influenced by both noise characteristics and the underlying function geometry.

[LG-148] Preserving AUC Fairness in Learning with Noisy Protected Groups

链接: https://arxiv.org/abs/2505.18532
作者: Mingyang Wu,Li Lin,Wenbin Zhang,Xin Wang,Zhenhuan Yang,Shu Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. This leads to fairness in AUC optimization becoming crucial as biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with fairness theoretical guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in this https URL.

[LG-149] Enhancing Training Data Attribution with Representational Optimization

链接: https://arxiv.org/abs/2505.18513
作者: Weiwei Sun,Haokun Liu,Nikhil Kandpal,Colin Raffel,Yiming Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training data attribution (TDA) methods aim to measure how training data impacts a model’s predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at this https URL.

[LG-150] SPDEBench: An Extensive Benchmark for Learning Regular and Singular Stochastic PDEs

链接: https://arxiv.org/abs/2505.18511
作者: Zheyan Li,Yuantu Zhu,Hao Ni,Siran Li,Bingguang Chen,Qi Meng
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Stochastic Partial Differential Equations (SPDEs) driven by random noise play a central role in modelling physical processes whose spatio-temporal dynamics can be rough, such as turbulence flows, superconductors, and quantum dynamics. To efficiently model these processes and make predictions, machine learning (ML)-based surrogate models are proposed, with their network architectures incorporating the spatio-temporal roughness in their design. However, it lacks an extensive and unified datasets for SPDE learning; especially, existing datasets do not account for the computational error introduced by noise sampling and the necessary renormalization required for handling singular SPDEs. We thus introduce SPDEBench, which is designed to solve typical SPDEs of physical significance (e.g., the \Phi^4_d , wave, incompressible Navier–Stokes, and KdV equations) on 1D or 2D tori driven by white noise via ML methods. New datasets for singular SPDEs based on the renormalization process have been constructed, and novel ML models achieving the best results to date have been proposed. In particular, we investigate the impact of computational error introduced by noise sampling and renormalization on the performance comparison of ML models and highlight the importance of selecting high-quality test data for accurate evaluation. Results are benchmarked with traditional numerical solvers and ML-based models, including FNO, NSPDE and DLR-Net, etc. It is shown that, for singular SPDEs, naively applying ML models on data without specifying the numerical schemes can lead to significant errors and misleading conclusions. Our SPDEBench provides an open-source codebase that ensures full reproducibility of benchmarking across a variety of SPDE datasets while offering the flexibility to incorporate new datasets and machine learning baselines, making it a valuable resource for the community.

[LG-151] Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking

链接: https://arxiv.org/abs/2505.18495
作者: Chen-Hao Chao,Wei-Fang Sun,Hanwen Liang,Chun-Yi Lee,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.

[LG-152] he Prompt is Mightier than the Example

链接: https://arxiv.org/abs/2505.18485
作者: Shengzhe Xu,Nikhil Muralidhar,Naren Ramakrishnan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerous recent prompt optimization approaches like chain-of-thought, have been demonstrated to significantly improve the quality of content generated by large language models (LLMs). In-context learning (ICL), a recent paradigm where a few representative examples guide content generation has also led to strong improvements in generation quality of LLM generated content. This idea has been applied to great effect in synthetic tabular data generation, where LLMs, through effective use of ICL and prompt optimization, can generate data that approximate samples from complex, heterogeneous distributions based on representative examples. However, ensuring high-fidelity synthetic data often requires a very large number of ICL examples which may be unavailable or costly to obtain. At the same time, as LLMs get larger and larger, their in-built prior knowledge becomes vast and can potentially substitute for specific data examples. In this paper, we introduce Knowledge-Guided Prompting (KGP) as a new knob in prompt optimization and explore the ability of KGP-based prompt optimization to offset the cost of ICL. Specifically, we explore the question `how many examples can a prompt substitute for?’ and explore knowledge-guided prompting (KGP) where domain knowledge, either inferred or available, is explicitly injected into the prompt, reducing dependence on ICL examples. Our experiments systematically explore the trade-off between ICL and KGP, revealing an empirical scaling law that quantifies how quality of generated synthetic data varies with increasing domain knowledge and decreasing example count. Our results demonstrate that knowledge-guided prompting can be a scalable alternative, or addition, to in-context examples, unlocking new approaches to synthetic data generation.

[LG-153] Pessimism Principle Can Be Effective: Towards a Framework for Zero-Shot Transfer Reinforcement Learning ICML2025

链接: https://arxiv.org/abs/2505.18447
作者: Chi Zhang,Ziying Jia,George K. Atia,Sihong He,Yue Wang
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:Transfer reinforcement learning aims to derive a near-optimal policy for a target environment with limited data by leveraging abundant data from related source domains. However, it faces two key challenges: the lack of performance guarantees for the transferred policy, which can lead to undesired actions, and the risk of negative transfer when multiple source domains are involved. We propose a novel framework based on the pessimism principle, which constructs and optimizes a conservative estimation of the target domain’s performance. Our framework effectively addresses the two challenges by providing an optimized lower bound on target performance, ensuring safe and reliable decisions, and by exhibiting monotonic improvement with respect to the quality of the source domains, thereby avoiding negative transfer. We construct two types of conservative estimations, rigorously characterize their effectiveness, and develop efficient distributed algorithms with convergence guarantees. Our framework provides a theoretically sound and practically robust solution for transfer learning in reinforcement learning.

[LG-154] DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

链接: https://arxiv.org/abs/2505.18441
作者: Romeo Valentin,Sydney M. Katz,Vincent Vanhoucke,Mykel J. Kochenderfer
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS); Applications (stat.AP)
*备注: 9 pages + 4 pages appendix

点击查看摘要

Abstract:Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings, however, requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this structure is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling embeddings of the Gemma-2-2B model and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) that traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We provide an implementation of DB-KSVD at this https URL.

[LG-155] Finite-Time Global Optimality Convergence in Deep Neural Actor-Critic Methods for Decentralized Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2505.18433
作者: Zhiyao Zhang,Myeung Suk Oh,FNU Hairi,Ziyue Luo,Alvaro Velasquez,Jia Liu
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Actor-critic methods for decentralized multi-agent reinforcement learning (MARL) facilitate collaborative optimal decision making without centralized coordination, thus enabling a wide range of applications in practice. To date, however, most theoretical convergence studies for existing actor-critic decentralized MARL methods are limited to the guarantee of a stationary solution under the linear function approximation. This leaves a significant gap between the highly successful use of deep neural actor-critic for decentralized MARL in practice and the current theoretical understanding. To bridge this gap, in this paper, we make the first attempt to develop a deep neural actor-critic method for decentralized MARL, where both the actor and critic components are inherently non-linear. We show that our proposed method enjoys a global optimality guarantee with a finite-time convergence rate of O(1/T), where T is the total iteration times. This marks the first global convergence result for deep neural actor-critic methods in the MARL literature. We also conduct extensive numerical experiments, which verify our theoretical results.

[LG-156] Development of Interactive Nomograms for Predicting Short-Term Survival in ICU Patients with Aplastic Anemia

链接: https://arxiv.org/abs/2505.18421
作者: Junyi Fan,Shuheng Chen,Li Sun,Yong Si,Elham Pishgar,Kamiar Alaei,Greg Placencia,Maryam Pishgar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aplastic anemia is a rare, life-threatening hematologic disorder characterized by pancytopenia and bone marrow failure. ICU admission in these patients often signals critical complications or disease progression, making early risk assessment crucial for clinical decision-making and resource allocation. In this study, we used the MIMIC-IV database to identify ICU patients diagnosed with aplastic anemia and extracted clinical features from five domains: demographics, synthetic indicators, laboratory results, comorbidities, and medications. Over 400 variables were reduced to seven key predictors through machine learning-based feature selection. Logistic regression and Cox regression models were constructed to predict 7-, 14-, and 28-day mortality, and their performance was evaluated using AUROC. External validation was conducted using the eICU Collaborative Research Database to assess model generalizability. Among 1,662 included patients, the logistic regression model demonstrated superior performance, with AUROC values of 0.8227, 0.8311, and 0.8298 for 7-, 14-, and 28-day mortality, respectively, compared to the Cox model. External validation yielded AUROCs of 0.7391, 0.7119, and 0.7093. Interactive nomograms were developed based on the logistic regression model to visually estimate individual patient risk. In conclusion, we identified a concise set of seven predictors, led by APS III, to build validated and generalizable nomograms that accurately estimate short-term mortality in ICU patients with aplastic anemia. These tools may aid clinicians in personalized risk stratification and decision-making at the point of care.

[LG-157] A Dual Basis Approach for Structured Robust Euclidean Distance Geometry

链接: https://arxiv.org/abs/2505.18414
作者: Chandra Kundu,Abiy Tasissa,HanQin Cai
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Euclidean Distance Matrix (EDM), which consists of pairwise squared Euclidean distances of a given point configuration, finds many applications in modern machine learning. This paper considers the setting where only a set of anchor nodes is used to collect the distances between themselves and the rest. In the presence of potential outliers, it results in a structured partial observation on EDM with partial corruptions. Note that an EDM can be connected to a positive semi-definite Gram matrix via a non-orthogonal dual basis. Inspired by recent development of non-orthogonal dual basis in optimization, we propose a novel algorithmic framework, dubbed Robust Euclidean Distance Geometry via Dual Basis (RoDEoDB), for recovering the Euclidean distance geometry, i.e., the underlying point configuration. The exact recovery guarantees have been established in terms of both the Gram matrix and point configuration, under some mild conditions. Empirical experiments show superior performance of RoDEoDB on sensor localization and molecular conformation datasets.

[LG-158] X-MethaneWet: A Cross-scale Global Wetland Methane Emission Benchmark Dataset for Advancing Science Discovery with AI

链接: https://arxiv.org/abs/2505.18355
作者: Yiming Sun,Shuo Chen,Shengyu Chen,Chonghao Qiu,Licheng Liu,Youmi Oh,Sparkle L. Malone,Gavin McNicol,Qianlai Zhuang,Chris Smith,Yiqun Xie,Xiaowei Jia
类目: Machine Learning (cs.LG)
*备注: 8 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Methane (CH _4 ) is the second most powerful greenhouse gas after carbon dioxide and plays a crucial role in climate change due to its high global warming potential. Accurately modeling CH _4 fluxes across the globe and at fine temporal scales is essential for understanding its spatial and temporal variability and developing effective mitigation strategies. In this work, we introduce the first-of-its-kind cross-scale global wetland methane benchmark dataset (X-MethaneWet), which synthesizes physics-based model simulation data from TEM-MDM and the real-world observation data from FLUXNET-CH _4 . This dataset can offer opportunities for improving global wetland CH _4 modeling and science discovery with new AI algorithms. To set up AI model baselines for methane flux prediction, we evaluate the performance of various sequential deep learning models on X-MethaneWet. Furthermore, we explore four different transfer learning techniques to leverage simulated data from TEM-MDM to improve the generalization of deep learning models on real-world FLUXNET-CH _4 observations. Our extensive experiments demonstrate the effectiveness of these approaches, highlighting their potential for advancing methane emission modeling and contributing to the development of more accurate and scalable AI-driven climate models.

[LG-159] Diffusion Self-Weighted Guidance for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2505.18345
作者: Augusto Tagle,Javier Ruiz-del-Solar,Felipe Tobar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:Offline reinforcement learning (RL) recovers the optimal policy \pi given historical observations of an agent. In practice, \pi is modeled as a weighted version of the agent’s behavior policy \mu , using a weight function w working as a critic of the agent’s behavior. Though recent approaches to offline RL based on diffusion models have exhibited promising results, the computation of the required scores is challenging due to their dependence on the unknown w . In this work, we alleviate this issue by constructing a diffusion over both the actions and the weights. With the proposed setting, the required scores are directly obtained from the diffusion model without learning extra networks. Our main conceptual contribution is a novel guidance method, where guidance (which is a function of w ) comes from the same diffusion model, therefore, our proposal is termed Self-Weighted Guidance (SWG). We show that SWG generates samples from the desired distribution on toy examples and performs on par with state-of-the-art methods on D4RL’s challenging environments, while maintaining a streamlined training pipeline. We further validate SWG through ablation studies on weight formulations and scalability.

[LG-160] An Attack to Break Permutation-Based Private Third-Party Inference Schemes for LLM s ICML2025

链接: https://arxiv.org/abs/2505.18332
作者: Rahul Thomas,Louai Zahran,Erica Choi,Akilesh Potti,Micah Goldblum,Arka Pal
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To be published in ICML 2025 Main Proceedings as “Hidden No More: Attacking and Defending Private Third-Party LLM Inference”

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have led to the widespread adoption of third-party inference services, raising critical privacy concerns. Existing methods of performing private third-party inference, such as Secure Multiparty Computation (SMPC), often rely on cryptographic methods. However, these methods are thousands of times slower than standard unencrypted inference, and fail to scale to large modern LLMs. Therefore, recent lines of work have explored the replacement of expensive encrypted nonlinear computations in SMPC with statistical obfuscation methods - in particular, revealing permuted hidden states to the third parties, with accompanying strong claims of the difficulty of reversal into the unpermuted states. In this work, we begin by introducing a novel reconstruction technique that can recover original prompts from hidden states with nearly perfect accuracy across multiple state-of-the-art LLMs. We then show that extensions of our attack are nearly perfectly effective in reversing permuted hidden states of LLMs, demonstrating the insecurity of three recently proposed privacy schemes. We further dissect the shortcomings of prior theoretical `proofs’ of permuation security which allow our attack to succeed. Our findings highlight the importance of rigorous security analysis in privacy-preserving LLM inference.

[LG-161] PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training

链接: https://arxiv.org/abs/2505.18313
作者: Matan Haroush,Daniel Soudry
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models LLMs with billions of parameters. Existing low rank gradient estimators such as GaLoRE and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank r subspace, enabling LLM training on consumer hardware. Yet, these methods are either biased or subject to high estimator variance. Moreover, the optimizer state based on the first and second moments estimates expressed in the previous subspace becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop in replacement for existing low rank gradient estimators. It does not introduce new hyperparameters beyond the chosen rank r and the update interval. In addition, we resolve optimizer state misalignment issues to prevent spurious weight updates and enhance training stability. We empirically demonstrate that PLUMAGE shrinks the full rank optimization’s gap over the pre training evaluation loss by 33% on average across models and the average training loss across the GLUE benchmark by 28% within a similar computational and memory footprint as GaloRE.

[LG-162] Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs ICML2025

链接: https://arxiv.org/abs/2505.18300
作者: Jie Hu,Yi-Ting Ma,Do Young Eun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICML 2025 (Spotlight)

点击查看摘要

Abstract:We propose a history-driven target (HDT) framework in Markov Chain Monte Carlo (MCMC) to improve any random walk algorithm on discrete state spaces, such as general undirected graphs, for efficient sampling from target distribution \boldsymbol\mu . With broad applications in network science and distributed optimization, recent innovations like the self-repellent random walk (SRRW) achieve near-zero variance by prioritizing under-sampled states through transition kernel modifications based on past visit frequencies. However, SRRW’s reliance on explicit computation of transition probabilities for all neighbors at each step introduces substantial computational overhead, while its strict dependence on time-reversible Markov chains excludes advanced non-reversible MCMC methods. To overcome these limitations, instead of direct modification of transition kernel, HDT introduces a history-dependent target distribution \boldsymbol\pi[\mathbfx] to replace the original target \boldsymbol\mu in any graph sampler, where \mathbfx represents the empirical measure of past visits. This design preserves lightweight implementation by requiring only local information between the current and proposed states and achieves compatibility with both reversible and non-reversible MCMC samplers, while retaining unbiased samples with target distribution \boldsymbol\mu and near-zero variance performance. Extensive experiments in graph sampling demonstrate consistent performance gains, and a memory-efficient Least Recently Used (LRU) cache ensures scalability to large general graphs.

[LG-163] A deep solver for backward stochastic Volterra integral equations

链接: https://arxiv.org/abs/2505.18297
作者: Kristoffer Andersson,Alessandro Gnoatto,Camilo Andrés García Trillos
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR); Mathematical Finance (q-fin.MF)
*备注: 25 pages, 10 figures

点击查看摘要

Abstract:We present the first deep-learning solver for backward stochastic Volterra integral equations (BSVIEs) and their fully-coupled forward-backward variants. The method trains a neural network to approximate the two solution fields in a single stage, avoiding the use of nested time-stepping cycles that limit classical algorithms. For the decoupled case we prove a non-asymptotic error bound composed of an a posteriori residual plus the familiar square root dependence on the time step. Numerical experiments confirm this rate and reveal two key properties: \emphscalability, in the sense that accuracy remains stable from low dimension up to 500 spatial variables while GPU batching keeps wall-clock time nearly constant; and \emphgenerality, since the same method handles coupled systems whose forward dynamics depend on the backward solution. These results open practical access to a family of high-dimensional, path-dependent problems in stochastic control and quantitative finance.

[LG-164] Convexified Message-Passing Graph Neural Networks

链接: https://arxiv.org/abs/2505.18289
作者: Saar Cohen,Noa Agmon,Uri Shaham
类目: Machine Learning (cs.LG)
*备注: 31 pages, 4 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become prominent methods for graph representation learning, demonstrating strong empirical results on diverse graph prediction tasks. In this paper, we introduce Convexified Message Passing Graph Neural Networks (CGNNs), a novel and general framework that combines the power of message-passing GNNs with the tractability of convex optimization. By mapping their nonlinear filters into a reproducing kernel Hilbert space, CGNNs transform training into a convex optimization problem, which can be solved efficiently and optimally by projected gradient methods. This convexity further allows the statistical properties of CGNNs to be analyzed accurately and rigorously. For two-layer CGNNs, we establish rigorous generalization guarantees, showing convergence to the performance of the optimal GNN. To scale to deeper architectures, we adopt a principled layer-wise training strategy. Experiments on benchmark datasets show that CGNNs significantly exceed the performance of leading GNN models, achieving 10 to 40 percent higher accuracy in most cases, underscoring their promise as a powerful and principled method with strong theoretical foundations. In rare cases where improvements are not quantitatively substantial, the convex models either slightly exceed or match the baselines, stressing their robustness and wide applicability. Though over-parameterization is often employed to enhance performance in nonconvex models, we show that our CGNNs framework yields shallow convex models that can surpass these models in both accuracy and resource efficiency.

[LG-165] Representative Action Selection for Large Action-Space Meta-Bandits

链接: https://arxiv.org/abs/2505.18269
作者: Quan Zhou,Mark Kozdoba,Shie Mannor
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of achieving performance nearly matching that of using the full action space. We assume that similar actions tend to have related payoffs, modeled by a Gaussian process. To exploit this structure, we propose a simple epsilon-net algorithm to select a representative subset. We provide theoretical guarantees for its performance and compare it empirically to Thompson Sampling and Upper Confidence Bound.

[LG-166] Decomposition of Water Demand Patterns Using Skewed Gaussian Distributions for Behavioral Insights and Operational Planning

链接: https://arxiv.org/abs/2505.18245
作者: Roy Elkayam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study presents a novel approach for decomposing urban water demand patterns using Skewed Gaussian Distributions (SGD) to derive behavioral insights and support operational planning. Hourly demand profiles contain critical information for both long-term infrastructure design and daily operations, influencing network pressures, water quality, energy consumption, and overall reliability. By breaking down each daily demand curve into a baseline component and distinct peak components, the proposed SGD method characterizes each peak with interpretable parameters, including peak amplitude, timing (mean), spread (duration), and skewness (asymmetry), thereby reconstructing the observed pattern and uncovering latent usage dynamics. This detailed peak-level decomposition enables both operational applications, e.g. anomaly and leakage detection, real-time demand management, and strategic analyses, e.g. identifying behavioral shifts, seasonal influences, or policy impacts on consumption patterns. Unlike traditional symmetric Gaussian or purely statistical time-series models, SGDs explicitly capture asymmetric peak shapes such as sharp morning surges followed by gradual declines, improving the fidelity of synthetic pattern generation and enhancing the detection of irregular consumption behavior. The method is demonstrated on several real-world datasets, showing that SGD outperforms symmetric Gaussian models in reconstruction accuracy, reducing root-mean-square error by over 50% on average, while maintaining physical interpretability. The SGD framework can also be used to construct synthetic demand scenarios by designing daily peak profiles with chosen characteristics. All implementation code is publicly available at: this https URL

[LG-167] BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text

链接: https://arxiv.org/abs/2505.18207
作者: Ibrahim Al Azher,Miftahul Jannat Mokarrama,Zhishuai Guo,Sagnik Ray Choudhury,Hamed Alhoori
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:In scientific research, limitations refer to the shortcomings, constraints, or weaknesses within a study. Transparent reporting of such limitations can enhance the quality and reproducibility of research and improve public trust in science. However, authors often a) underreport them in the paper text and b) use hedging strategies to satisfy editorial requirements at the cost of readers’ clarity and confidence. This underreporting behavior, along with an explosion in the number of publications, has created a pressing need to automatically extract or generate such limitations from scholarly papers. In this direction, we present a complete architecture for the computational analysis of research limitations. Specifically, we create a dataset of limitations in ACL, NeurIPS, and PeerJ papers by extracting them from papers’ text and integrating them with external reviews; we propose methods to automatically generate them using a novel Retrieval Augmented Generation (RAG) technique; we create a fine-grained evaluation framework for generated limitations; and we provide a meta-evaluation for the proposed evaluation techniques.

[LG-168] Reinforcement Twinning for Hybrid Control of Flapping-Wing Drones

链接: https://arxiv.org/abs/2505.18201
作者: Romain Poletti,Lorenzo Schena,Lilla Koloszar,Joris Degroote,Miguel Alfonso Mendez
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Controlling the flight of flapping-wing drones requires versatile controllers that handle their time-varying, nonlinear, and underactuated dynamics from incomplete and noisy sensor data. Model-based methods struggle with accurate modeling, while model-free approaches falter in efficiently navigating very high-dimensional and nonlinear control objective landscapes. This article presents a novel hybrid model-free/model-based approach to flight control based on the recently proposed reinforcement twinning algorithm. The model-based (MB) approach relies on an adjoint formulation using an adaptive digital twin, continuously identified from live trajectories, while the model-free (MF) approach relies on reinforcement learning. The two agents collaborate through transfer learning, imitation learning, and experience sharing using the real environment, the digital twin and a referee. The latter selects the best agent to interact with the real environment based on performance within the digital twin and a real-to-virtual environment consistency ratio. The algorithm is evaluated for controlling the longitudinal dynamics of a flapping-wing drone, with the environment simulated as a nonlinear, time-varying dynamical system under the influence of quasi-steady aerodynamic forces. The hybrid control learning approach is tested with three types of initialization of the adaptive model: (1) offline identification using previously available data, (2) random initialization with full online identification, and (3) offline pre-training with an estimation bias, followed by online adaptation. In all three scenarios, the proposed hybrid learning approach demonstrates superior performance compared to purely model-free and model-based methods.

[LG-169] Riemannian Flow Matching for Brain Connectivity Matrices via Pullback Geometry

链接: https://arxiv.org/abs/2505.18193
作者: Antoine Collas,Ce Ju,Nicolas Salvy,Bertrand Thirion
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generating realistic brain connectivity matrices is key to analyzing population heterogeneity in brain organization, understanding disease, and augmenting data in challenging classification problems. Functional connectivity matrices lie in constrained spaces–such as the set of symmetric positive definite or correlation matrices–that can be modeled as Riemannian manifolds. However, using Riemannian tools typically requires redefining core operations (geodesics, norms, integration), making generative modeling computationally inefficient. In this work, we propose DiffeoCFM, an approach that enables conditional flow matching (CFM) on matrix manifolds by exploiting pullback metrics induced by global diffeomorphisms on Euclidean spaces. We show that Riemannian CFM with such metrics is equivalent to applying standard CFM after data transformation. This equivalence allows efficient vector field learning, and fast sampling with standard ODE solvers. We instantiate DiffeoCFM with two different settings: the matrix logarithm for covariance matrices and the normalized Cholesky decomposition for correlation matrices. We evaluate DiffeoCFM on three large-scale fMRI datasets with more than 4600 scans from 2800 subjects (ADNI, ABIDE, OASIS-3) and two EEG motor imagery datasets with over 30000 trials from 26 subjects (BNCI2014-002 and BNCI2015-001). It enables fast training and achieves state-of-the-art performance, all while preserving manifold constraints.

[LG-170] Discovering Interpretable Concepts in Large Generative Music Models

链接: https://arxiv.org/abs/2505.18186
作者: Nikhil Singh,Manuel Cherep,Pattie Maes
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:The fidelity with which neural networks can now generate content such as music presents a scientific opportunity: these systems appear to have learned implicit theories of the structure of such content through statistical learning alone. This could offer a novel lens on theories of human-generated media. Where these representations align with traditional constructs (e.g. chord progressions in music), they demonstrate how these can be inferred from statistical regularities. Where they diverge, they highlight potential limits in our theoretical frameworks – patterns that we may have overlooked but that nonetheless hold significant explanatory power. In this paper, we focus on the specific case of music generators. We introduce a method to discover musical concepts using sparse autoencoders (SAEs), extracting interpretable features from the residual stream activations of a transformer model. We evaluate this approach by extracting a large set of features and producing an automatic labeling and evaluation pipeline for them. Our results reveal both familiar musical concepts and counterintuitive patterns that lack clear counterparts in existing theories or natural language altogether. Beyond improving model transparency, our work provides a new empirical tool that might help discover organizing principles in ways that have eluded traditional methods of analysis and synthesis.

[LG-171] Clustering scientific publications: lessons learned through experiments with a real citation network

链接: https://arxiv.org/abs/2505.18180
作者: Vu Thi Huong,Thorsten Koch
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering scientific publications can reveal underlying research structures within bibliographic databases. Graph-based clustering methods, such as spectral, Louvain, and Leiden algorithms, are frequently utilized due to their capacity to effectively model citation networks. However, their performance may degrade when applied to real-world data. This study evaluates the performance of these clustering algorithms on a citation graph comprising approx. 700,000 papers and 4.6 million citations extracted from Web of Science. The results show that while scalable methods like Louvain and Leiden perform efficiently, their default settings often yield poor partitioning. Meaningful outcomes require careful parameter tuning, especially for large networks with uneven structures, including a dense core and loosely connected papers. These findings highlight practical lessons about the challenges of large-scale data, method selection and tuning based on specific structures of bibliometric clustering tasks.

[LG-172] Should We Simultaneously Calibrate Multiple Computer Models?

链接: https://arxiv.org/abs/2505.18176
作者: Jonathan Tammer Eweis-Labolle,Tyler Johnson,Xiangyu Sun,Ramin Bostanabad
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In an increasing number of applications designers have access to multiple computer models which typically have different levels of fidelity and cost. Traditionally, designers calibrate these models one at a time against some high-fidelity data (e.g., experiments). In this paper, we question this tradition and assess the potential of calibrating multiple computer models at the same time. To this end, we develop a probabilistic framework that is founded on customized neural networks (NNs) that are designed to calibrate an arbitrary number of computer models. In our approach, we (1) consider the fact that most computer models are multi-response and that the number and nature of calibration parameters may change across the models, and (2) learn a unique probability distribution for each calibration parameter of each computer model, (3) develop a loss function that enables our NN to emulate all data sources while calibrating the computer models, and (4) aim to learn a visualizable latent space where model-form errors can be identified. We test the performance of our approach on analytic and engineering problems to understand the potential advantages and pitfalls in simultaneous calibration of multiple computer models. Our method can improve predictive accuracy, however, it is prone to non-identifiability issues in higher-dimensional input spaces that are normally constrained by underlying physics.

[LG-173] GenAI Security: Outsmarting the Bots with a Proactive Testing Framework

链接: https://arxiv.org/abs/2505.18172
作者: Sunil Kumar Jang Bahadur,Gopala Dhar,Lavi Nigam
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: IEEE CAI 2025

点击查看摘要

Abstract:The increasing sophistication and integration of Generative AI (GenAI) models into diverse applications introduce new security challenges that traditional methods struggle to address. This research explores the critical need for proactive security measures to mitigate the risks associated with malicious exploitation of GenAI systems. We present a framework encompassing key approaches, tools, and strategies designed to outmaneuver even advanced adversarial attacks, emphasizing the importance of securing GenAI innovation against potential liabilities. We also empirically prove the effectiveness of the said framework by testing it against the SPML Chatbot Prompt Injection Dataset. This work highlights the shift from reactive to proactive security practices essential for the safe and responsible deployment of GenAI technologies

[LG-174] Robust Knowledge Graph Embedding via Denoising

链接: https://arxiv.org/abs/2505.18171
作者: Tengwei Song,Xudong Ma,Yang Liu,Jie Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We focus on obtaining robust knowledge graph embedding under perturbation in the embedding space. To address these challenges, we introduce a novel framework, Robust Knowledge Graph Embedding via Denoising, which enhances the robustness of KGE models on noisy triples. By treating KGE methods as energy-based models, we leverage the established connection between denoising and score matching, enabling the training of a robust denoising KGE model. Furthermore, we propose certified robustness evaluation metrics for KGE methods based on the concept of randomized smoothing. Through comprehensive experiments on benchmark datasets, our framework consistently shows superior performance compared to existing state-of-the-art KGE methods when faced with perturbed entity embedding.

[LG-175] Interpretable Multi-Task PINN for Emotion Recognition and EDA Prediction

链接: https://arxiv.org/abs/2505.18169
作者: Nischal Mandal
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Understanding and predicting human emotional and physiological states using wearable sensors has important applications in stress monitoring, mental health assessment, and affective computing. This study presents a novel Multi-Task Physics-Informed Neural Network (PINN) that performs Electrodermal Activity (EDA) prediction and emotion classification simultaneously, using the publicly available WESAD dataset. The model integrates psychological self-report features (PANAS and SAM) with a physics-inspired differential equation representing EDA dynamics, enforcing biophysically grounded constraints through a custom loss function. This loss combines EDA regression, emotion classification, and a physics residual term for improved interpretability. The architecture supports dual outputs for both tasks and is trained under a unified multi-task framework. Evaluated using 5-fold cross-validation, the model achieves an average EDA RMSE of 0.0362, Pearson correlation of 0.9919, and F1-score of 94.08 percent. These results outperform classical models such as SVR and XGBoost, as well as ablated variants like emotion-only and EDA-only models. In addition, the learned physical parameters including decay rate (alpha_0), emotional sensitivity (beta), and time scaling (gamma) are interpretable and stable across folds, aligning with known principles of human physiology. This work is the first to introduce a multi-task PINN framework for wearable emotion recognition, offering improved performance, generalizability, and model transparency. The proposed system provides a foundation for future interpretable and multimodal applications in healthcare and human-computer interaction. Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP) Cite as: arXiv:2505.18169 [cs.LG] (or arXiv:2505.18169v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.18169 Focus to learn more arXiv-issued DOI via DataCite

[LG-176] Emotion Knowledge Enhancement for Vision Large Language Models : A Self-Verification Approach for High-Quality Emotion Instruction Data Generation

链接: https://arxiv.org/abs/2505.18168
作者: Feifan Wang,Tengfei Song,Minggui He,Chang Su,Zhanglin Wu,Hao Yang,Wenming Zheng,Osamu Yoshie
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Facial emotion perception in the vision large language model (VLLM) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which generates high-quality instruction data for multi-grained emotion analysis cost-effectively using closed-source VLLM. This approach integrates prior human knowledge to VLLM inference, guided by the inherent correlations between three grained levels of emotion descriptions, i.e., discrete expression, valence-arousal, and action unit, to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is further embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. Consequently, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure the VLLM’s corresponding ability. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.

[LG-177] Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression

链接: https://arxiv.org/abs/2505.18166
作者: Jacob Sander,David Moe,Achraf Cohen,Brent Venable,Venkat Dasari,Brian Jalaian
类目: Machine Learning (cs.LG)
*备注: 9 Pages, 2 Figures

点击查看摘要

Abstract:Modern foundational models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) Fine-tuning with Cross- Entropy (L2PFT), which requires labeled data, versus (ii) Self-Distillation with KL-divergence, which leverages only teacher logits (no labels) (L2PSD). We evaluate both pipelines on the OLMo2- 7B-SFT model for CommonsenseQA suitable for intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test accuracy, demonstrating that, even with a basic MLP-only pruning, the choice of loss function materially affects compressed model recovery in resource-constrained environments.

[LG-178] Improved and Oracle-Efficient Online ell_1-Multicalibration ICML2025

链接: https://arxiv.org/abs/2505.17365
作者: Rohan Ghuge,Vidya Muthukumar,Sahil Singla
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:We study \emphonline multicalibration, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across T rounds. Although online calibration is typically studied in the \ell_1 norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as \ell_2 and \ell_\infty ) and then transferred these guarantees to \ell_1 at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of \widetilde\mathcalO(T^-1/3) and \widetilde\mathcalO(T^-1/4) respectively, for online \ell_1 -multicalibration. Our key insight is a novel reduction of online (\ell_1)-multicalibration to an online learning problem with product-based rewards, which we refer to as \emphonline linear-product optimization ( \mathttOLPO ). To obtain the improved rate of \widetilde\mathcalO(T^-1/3) , we introduce a linearization of \mathttOLPO and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it becomes computationally expensive when the group family (\mathcalH) is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to \mathttOLPO that makes only a polynomial number of calls to an offline optimization (\emphmulticalibration evaluation) oracle, resulting in \emphoracle-efficient online (\ell_1)-multicalibration with a rate of \widetilde\mathcalO(T^-1/4) . Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a 1 -Lipschitz property of the (\ell_1)-multicalibration error with respect to (\mathcalH). Comments: Accepted to ICML 2025 Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2505.17365 [cs.LG] (or arXiv:2505.17365v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.17365 Focus to learn more arXiv-issued DOI via DataCite

[LG-179] Lorentz Local Canonicalization: How to Make Any Network Lorentz-Equivariant

链接: https://arxiv.org/abs/2505.20280
作者: Jonas Spinner,Luigi Favaro,Peter Lippmann,Sebastian Pitz,Gerrit Gerhartz,Tilman Plehn,Fred A. Hamprecht
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 22 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Lorentz-equivariant neural networks are becoming the leading architectures for high-energy physics. Current implementations rely on specialized layers, limiting architectural choices. We introduce Lorentz Local Canonicalization (LLoCa), a general framework that renders any backbone network exactly Lorentz-equivariant. Using equivariantly predicted local reference frames, we construct LLoCa-transformers and graph networks. We adapt a recent approach to geometric message passing to the non-compact Lorentz group, allowing propagation of space-time tensorial features. Data augmentation emerges from LLoCa as a special choice of reference frame. Our models surpass state-of-the-art accuracy on relevant particle physics tasks, while being 4\times faster and using 5 - 100\times fewer FLOPs.

[LG-180] New Perspectives on the Polyak Stepsize: Surrogate Functions and Negative Results

链接: https://arxiv.org/abs/2505.20219
作者: Francesco Orabona,Ryan D’Orazio
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Polyak stepsize has been proven to be a fundamental stepsize in convex optimization, giving near optimal gradient descent rates across a wide range of assumptions. The universality of the Polyak stepsize has also inspired many stochastic variants, with theoretical guarantees and strong empirical performance. Despite the many theoretical results, our understanding of the convergence properties and shortcomings of the Polyak stepsize or its variants is both incomplete and fractured across different analyses. We propose a new, unified, and simple perspective for the Polyak stepsize and its variants as gradient descent on a surrogate loss. We show that each variant is equivalent to minimize a surrogate function with stepsizes that adapt to a guaranteed local curvature. Our general surrogate loss perspective is then used to provide a unified analysis of existing variants across different assumptions. Moreover, we show a number of negative results proving that the non-convergence results in some of the upper bounds is indeed real.

[LG-181] No Free Lunch: Non-Asymptotic Analysis of Prediction-Powered Inference

链接: https://arxiv.org/abs/2505.20178
作者: Pranav Mani,Peng Xu,Zachary C. Lipton,Michael Oberst
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prediction-Powered Inference (PPI) is a popular strategy for combining gold-standard and possibly noisy pseudo-labels to perform statistical estimation. Prior work has shown an asymptotic “free lunch” for PPI++, an adaptive form of PPI, showing that the asymptotic variance of PPI++ is always less than or equal to the variance obtained from using gold-standard labels alone. Notably, this result holds regardless of the quality of the pseudo-labels. In this work, we demystify this result by conducting an exact finite-sample analysis of the estimation error of PPI++ on the mean estimation problem. We give a “no free lunch” result, characterizing the settings (and sample sizes) where PPI++ has provably worse estimation error than using gold-standard labels alone. Specifically, PPI++ will outperform if and only if the correlation between pseudo- and gold-standard is above a certain level that depends on the number of labeled samples ( n ). In some cases our results simplify considerably: For Gaussian data, the correlation must be at least 1/\sqrtn - 2 in order to see improvement, and a similar result holds for binary labels. In experiments, we illustrate that our theoretical findings hold on real-world datasets, and give insights into trade-offs between single-sample and sample-splitting variants of PPI++.

[LG-182] A fast sound power prediction tool for genset noise using machine learning

链接: https://arxiv.org/abs/2505.20079
作者: Saurabh Pargal,Abhijit A. Sane
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the application of machine learning regression algorithms Kernel Ridge Regression (KRR), Huber Regressor (HR), and Gaussian Process Regression (GPR) for predicting sound power levels of gensets, offering significant value for marketing and sales teams during the early bidding process. When engine sizes and genset enclosure dimensions are tentative, and measured noise data is unavailable, these algorithms enable reliable noise level estimation for unbuilt gensets. The study utilizes high fidelity datasets from over 100 experiments conducted at Cummins Acoustics Technology Center (ATC) in a hemi-anechoic chamber, adhering to ISO 3744 standards. By using readily available information from the bidding and initial design stages, KRR predicts sound power with an average accuracy of within 5 dBA. While HR and GPR show slightly higher prediction errors, all models effectively capture the overall noise trends across various genset configurations. These findings present a promising method for early-stage noise estimation in genset design.

[LG-183] Linear Bandits with Non-i.i.d. Noise

链接: https://arxiv.org/abs/2505.20017
作者: Baptiste Abélès,Eugenio Clerico,Hamish Flynn,Gergely Neu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the linear stochastic bandit problem, relaxing the standard i.i.d. assumption on the observation noise. As an alternative to this restrictive assumption, we allow the noise terms across rounds to be sub-Gaussian but interdependent, with dependencies that decay over time. To address this setting, we develop new confidence sequences using a recently introduced reduction scheme to sequential probability assignment, and use these to derive a bandit algorithm based on the principle of optimism in the face of uncertainty. We provide regret bounds for the resulting algorithm, expressed in terms of the decay rate of the strength of dependence between observations. Among other results, we show that our bounds recover the standard rates up to a factor of the mixing time for geometrically mixing observation noise.

[LG-184] Cellwise and Casewise Robust Covariance in High Dimensions

链接: https://arxiv.org/abs/2505.19925
作者: Fabio Centofanti,Mia Hubert,Peter J. Rousseeuw
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently some robust covariance estimators have been developed that can handle both types of outliers, but their computation is only feasible up to at most 20 dimensions. To remedy this we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish some theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. Furthermore, its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.

[LG-185] Efficient Deconvolution in Populational Inverse Problems

链接: https://arxiv.org/abs/2505.19841
作者: Arnaud Vadeboncoeur,Mark Girolami,Andrew M. Stuart
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This work is focussed on the inversion task of inferring the distribution over parameters of interest leading to multiple sets of observations. The potential to solve such distributional inversion problems is driven by increasing availability of data, but a major roadblock is blind deconvolution, arising when the observational noise distribution is unknown. However, when data originates from collections of physical systems, a population, it is possible to leverage this information to perform deconvolution. To this end, we propose a methodology leveraging large data sets of observations, collected from different instantiations of the same physical processes, to simultaneously deconvolve the data corrupting noise distribution, and to identify the distribution over model parameters defining the physical processes. A parameter-dependent mathematical model of the physical process is employed. A loss function characterizing the match between the observed data and the output of the mathematical model is defined; it is minimized as a function of the both the parameter inputs to the model of the physics and the parameterized observational noise. This coupled problem is addressed with a modified gradient descent algorithm that leverages specific structure in the noise model. Furthermore, a new active learning scheme is proposed, based on adaptive empirical measures, to train a surrogate model to be accurate in parameter regions of interest; this approach accelerates computation and enables automatic differentiation of black-box, potentially nondifferentiable, code computing parameter-to-solution maps. The proposed methodology is demonstrated on porous medium flow, damped elastodynamics, and simplified models of atmospheric dynamics.

[LG-186] Weighted Leave-One-Out Cross Validation

链接: https://arxiv.org/abs/2505.19737
作者: Luc Pronzato(RT-UQ),Maria-João Rendas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We present a weighted version of Leave-One-Out (LOO) cross-validation for estimating the Integrated Squared Error (ISE) when approximating an unknown function by a predictor that depends linearly on evaluations of the function over a finite collection of sites. The method relies on the construction of the best linear estimator of the squared prediction error at an arbitrary unsampled site based on squared LOO residuals, assuming that the function is a realization of a Gaussian Process (GP). A theoretical analysis of performance of the ISE estimator is presented, and robustness with respect to the choice of the GP kernel is investigated first analytically, then through numerical examples. Overall, the estimation of ISE is significantly more precise than with classical, unweighted, LOO cross validation. Application to model selection is briefly considered through examples.

[LG-187] Accelerating Nash Learning from Human Feedback via Mirror Prox

链接: https://arxiv.org/abs/2505.19731
作者: Daniil Tiapkin,Daniele Calandriello,Denis Belomestny,Eric Moulines,Alexey Naumov,Kashif Rasul,Michal Valko,Pierre Menard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. In this work, we introduce Nash Mirror Prox ( \mathttNash-MP ), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the \beta -regularized Nash equilibrium. Specifically, we prove that the KL-divergence to the optimal policy decreases at a rate of order (1+2\beta)^-N/2 , where N is a number of preference queries. We further demonstrate last-iterate linear convergence for the exploitability gap and uniformly for the span semi-norm of log-probabilities, with all these rates being independent of the size of the action space. Furthermore, we propose and analyze an approximate version of Nash-MP where proximal steps are estimated using stochastic policy gradients, making the algorithm closer to applications. Finally, we detail a practical implementation strategy for fine-tuning large language models and present experiments that demonstrate its competitive performance and compatibility with existing methods.

[LG-188] A Structured Tour of Optimization with Finite Differences

链接: https://arxiv.org/abs/2505.19720
作者: Marco Rando,Cesare Molinari,Lorenzo Rosasco,Silvia Villa
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 24 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Finite-difference methods are widely used for zeroth-order optimization in settings where gradient information is unavailable or expensive to compute. These procedures mimic first-order strategies by approximating gradients through function evaluations along a set of random directions. From a theoretical perspective, recent studies indicate that imposing structure (such as orthogonality) on the chosen directions allows for the derivation of convergence rates comparable to those achieved with unstructured random directions (i.e., directions sampled independently from a distribution). Empirically, although structured directions are expected to enhance performance, they often introduce additional computational costs, which can limit their applicability in high-dimensional settings. In this work, we examine the impact of structured direction selection in finite-difference methods. We review and extend several strategies for constructing structured direction matrices and compare them with unstructured approaches in terms of computational cost, gradient approximation quality, and convergence behavior. Our evaluation spans both synthetic tasks and real-world applications such as adversarial perturbation. The results demonstrate that structured directions can be generated with computational costs comparable to unstructured ones while significantly improving gradient estimation accuracy and optimization performance.

[LG-189] Information-theoretic Generalization Analysis for VQ-VAEs: A Role of Latent Variables

链接: https://arxiv.org/abs/2505.19470
作者: Futoshi Futami,Masahiro Fujisawa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Latent variables (LVs) play a crucial role in encoder-decoder models by enabling effective data compression, prediction, and generation. Although their theoretical properties, such as generalization, have been extensively studied in supervised learning, similar analyses for unsupervised models such as variational autoencoders (VAEs) remain insufficiently underexplored. In this work, we extend information-theoretic generalization analysis to vector-quantized (VQ) VAEs with discrete latent spaces, introducing a novel data-dependent prior to rigorously analyze the relationship among LVs, generalization, and data generation. We derive a novel generalization error bound of the reconstruction loss of VQ-VAEs, which depends solely on the complexity of LVs and the encoder, independent of the decoder. Additionally, we provide the upper bound of the 2-Wasserstein distance between the distributions of the true data and the generated data, explaining how the regularization of the LVs contributes to the data generation performance.

[LG-190] Uniform convergence of the smooth calibration error and its relationship with functional gradient

链接: https://arxiv.org/abs/2505.19396
作者: Futoshi Futami,Atsushi Nitanda
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Calibration is a critical requirement for reliable probabilistic prediction, especially in high-risk applications. However, the theoretical understanding of which learning algorithms can simultaneously achieve high accuracy and good calibration remains limited, and many existing studies provide empirical validation or a theoretical guarantee in restrictive settings. To address this issue, in this work, we focus on the smooth calibration error (CE) and provide a uniform convergence bound, showing that the smooth CE is bounded by the sum of the smooth CE over the training dataset and a generalization gap. We further prove that the functional gradient of the loss function can effectively control the training smooth CE. Based on this framework, we analyze three representative algorithms: gradient boosting trees, kernel boosting, and two-layer neural networks. For each, we derive conditions under which both classification and calibration performances are simultaneously guaranteed. Our results offer new theoretical insights and practical guidance for designing reliable probabilistic models with provable calibration guarantees.

[LG-191] Adaptive Diffusion Guidance via Stochastic Optimal Control

链接: https://arxiv.org/abs/2505.19367
作者: Iskander Azangulov,Peter Potaptchik,Qinyu Li,Eddie Aamari,George Deligiannidis,Judith Rousseau
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Guidance is a cornerstone of modern diffusion models, playing a pivotal role in conditional generation and enhancing the quality of unconditional samples. However, current approaches to guidance scheduling–determining the appropriate guidance weight–are largely heuristic and lack a solid theoretical foundation. This work addresses these limitations on two fronts. First, we provide a theoretical formalization that precisely characterizes the relationship between guidance strength and classifier confidence. Second, building on this insight, we introduce a stochastic optimal control framework that casts guidance scheduling as an adaptive optimization problem. In this formulation, guidance strength is not fixed but dynamically selected based on time, the current sample, and the conditioning class, either independently or in combination. By solving the resulting control problem, we establish a principled foundation for more effective guidance in diffusion models.

[LG-192] FlashMD: long-stride universal prediction of molecular dynamics

链接: https://arxiv.org/abs/2505.19350
作者: Filippo Bigi,Sanggyu Chong,Agustinus Kristiadi,Michele Ceriotti
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Molecular dynamics (MD) provides insights into atomic-scale processes by integrating over time the equations that describe the motion of atoms under the action of interatomic forces. Machine learning models have substantially accelerated MD by providing inexpensive predictions of the forces, but they remain constrained to minuscule time integration steps, which are required by the fast time scale of atomic motion. In this work, we propose FlashMD, a method to predict the evolution of positions and momenta over strides that are between one and two orders of magnitude longer than typical MD time steps. We incorporate considerations on the mathematical and physical properties of Hamiltonian dynamics in the architecture, generalize the approach to allow the simulation of any thermodynamic ensemble, and carefully assess the possible failure modes of such a long-stride MD approach. We validate FlashMD’s accuracy in reproducing equilibrium and time-dependent properties, using both system-specific and general-purpose models, extending the ability of MD simulation to reach the long time scales needed to model microscopic processes of high scientific and technological relevance.

[LG-193] PIGPVAE: Physics-Informed Gaussian Process Variational Autoencoders

链接: https://arxiv.org/abs/2505.19320
作者: Michail Spitieris,Massimiliano Ruocco,Abdulmajid Murad,Alessandro Nocente
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 23 pages, 13 figures

点击查看摘要

Abstract:Recent advances in generative AI offer promising solutions for synthetic data generation but often rely on large datasets for effective training. To address this limitation, we propose a novel generative model that learns from limited data by incorporating physical constraints to enhance performance. Specifically, we extend the VAE architecture by incorporating physical models in the generative process, enabling it to capture underlying dynamics more effectively. While physical models provide valuable insights, they struggle to capture complex temporal dependencies present in real-world data. To bridge this gap, we introduce a discrepancy term to account for unmodeled dynamics, represented within a latent Gaussian Process VAE (GPVAE). Furthermore, we apply regularization to ensure the generated data aligns closely with observed data, enhancing both the diversity and accuracy of the synthetic samples. The proposed method is applied to indoor temperature data, achieving state-of-the-art performance. Additionally, we demonstrate that PIGPVAE can produce realistic samples beyond the observed distribution, highlighting its robustness and usefulness under distribution shifts.

[LG-194] Fractional-Boundary-Regularized Deep Galerkin Method for Variational Inequalities in Mixed Optimal Stopping and Control

链接: https://arxiv.org/abs/2505.19309
作者: Yun Zhao,Harry Zheng
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 16 pages, 5 figures

点击查看摘要

Abstract:Mixed optimal stopping and stochastic control problems define variational inequalities with non-linear Hamilton-Jacobi-Bellman (HJB) operators, whose numerical solution is notoriously difficult and lack of reliable benchmarks. We first use the dual approach to transform it into a linear operator, and then introduce a Fractional-Boundary-Regularized Deep Galerkin Method (FBR-DGM) that augments the classical L^2 loss with Sobolev-Slobodeckij norms on the parabolic boundary, enforcing regularity and yielding consistent improvements in the network approximation and its derivatives. The improved accuracy allows the network to be converted back to the original solution using the dual transform. The self-consistency and stability of the network can be tested by checking the primal-dual relationship among optimal value, optimal wealth, and optimal control, offering innovative benchmarks in the absence of analytical solutions.

[LG-195] Do Large Language Models (Really) Need Statistical Foundations?

链接: https://arxiv.org/abs/2505.19145
作者: Weijie Su
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) represent a new paradigm for processing unstructured data, with applications across an unprecedented range of domains. In this paper, we address, through two arguments, whether the development and application of LLMs would genuinely benefit from foundational contributions from the statistics discipline. First, we argue affirmatively, beginning with the observation that LLMs are inherently statistical models due to their profound data dependency and stochastic generation processes, where statistical insights are naturally essential for handling variability and uncertainty. Second, we argue that the persistent black-box nature of LLMs – stemming from their immense scale, architectural complexity, and development practices often prioritizing empirical performance over theoretical interpretability – renders closed-form or purely mechanistic analyses generally intractable, thereby necessitating statistical approaches due to their flexibility and often demonstrated effectiveness. To substantiate these arguments, the paper outlines several research areas – including alignment, watermarking, uncertainty quantification, evaluation, and data mixture optimization – where statistical methodologies are critically needed and are already beginning to make valuable contributions. We conclude with a discussion suggesting that statistical research concerning LLMs will likely form a diverse ``mosaic’’ of specialized topics rather than deriving from a single unifying theory, and highlighting the importance of timely engagement by our statistics community in LLM research.

[LG-196] Uncertainty Quantification for Physics-Informed Neural Networks with Extended Fiducial Inference

链接: https://arxiv.org/abs/2505.19136
作者: Frank Shih,Zhenghao Jiang,Faming Liang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncertainty quantification (UQ) in scientific machine learning is increasingly critical as neural networks are widely adopted to tackle complex problems across diverse scientific disciplines. For physics-informed neural networks (PINNs), a prominent model in scientific machine learning, uncertainty is typically quantified using Bayesian or dropout methods. However, both approaches suffer from a fundamental limitation: the prior distribution or dropout rate required to construct honest confidence sets cannot be determined without additional information. In this paper, we propose a novel method within the framework of extended fiducial inference (EFI) to provide rigorous uncertainty quantification for PINNs. The proposed method leverages a narrow-neck hyper-network to learn the parameters of the PINN and quantify their uncertainty based on imputed random errors in the observations. This approach overcomes the limitations of Bayesian and dropout methods, enabling the construction of honest confidence sets based solely on observed data. This advancement represents a significant breakthrough for PINNs, greatly enhancing their reliability, interpretability, and applicability to real-world scientific and engineering challenges. Moreover, it establishes a new theoretical framework for EFI, extending its application to large-scale models, eliminating the need for sparse hyper-networks, and significantly improving the automaticity and robustness of statistical inference.

[LG-197] Statistical inference for Linear Stochastic Approximation with Markovian Noise

链接: https://arxiv.org/abs/2505.19102
作者: Sergey Samsonov,Marina Sheshukova,Eric Moulines,Alexey Naumov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In this paper we derive non-asymptotic Berry-Esseen bounds for Polyak-Ruppert averaged iterates of the Linear Stochastic Approximation (LSA) algorithm driven by the Markovian noise. Our analysis yields \mathcalO(n^-1/4) convergence rates to the Gaussian limit in the Kolmogorov distance. We further establish the non-asymptotic validity of a multiplier block bootstrap procedure for constructing the confidence intervals, guaranteeing consistent inference under Markovian sampling. Our work provides the first non-asymptotic guarantees on the rate of convergence of bootstrap-based confidence intervals for stochastic approximation with Markov noise. Moreover, we recover the classical rate of order \mathcalO(n^-1/8) up to logarithmic factors for estimating the asymptotic variance of the iterates of the LSA algorithm.

[LG-198] A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random

链接: https://arxiv.org/abs/2505.19093
作者: Binh H. Ho,Long Nguyen Chi,TrungTin Nguyen,Binh T. Nguyen,Van Ha Hoang,Christopher Drovandi
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.

[LG-199] Geometric Determinations Of Characteristic Redshifts From DESI-DR2 BAO and DES-SN5YR Observations: Hints For New Expansion Rate Anomalies

链接: https://arxiv.org/abs/2505.19083
作者: Purba Mukherjee,Anjan A Sen
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 21 pages, 11 figures, 5 tables. Comments are welcome

点击查看摘要

Abstract:In this work, we perform a model-agnostic reconstruction of the cosmic expansion history by combining DESI-DR2 BAO and DES-SN5YR data, with a focus on geometric determination of characteristic redshifts where notable tensions in the expansion rate are found to emerge. Employing Gaussian process regression alongside knot-based spline techniques, we reconstruct cosmic distances and their derivatives to pinpoint these characteristic redshifts and infer E(z) . Our analysis reveals significant deviations of approximately 4 to 5 \sigma from the Planck 2018 \Lambda CDM predictions, particularly pronounced in the redshift range z \sim 0.35-0.55 . These anomalies are consistently observed across both reconstruction methods and combined datasets, indicating robust late-time departures that could signal new physics beyond the standard cosmological framework. The joint use of BAO and SN probes enhances the precision of our constraints, allowing us to isolate these deviations without reliance on specific cosmological assumptions. Our findings underscore the role of characteristic redshifts as sensitive indicators of expansion rate anomalies and motivate further scrutiny with forthcoming datasets from DESI-5YR BAO, Euclid, and LSST. These future surveys will tighten constraints and help distinguish whether these late-time anomalies arise from new fundamental physics or unresolved systematics in the data.

[LG-200] When Models Dont Collapse: On the Consistency of Iterative MLE

链接: https://arxiv.org/abs/2505.19046
作者: Daniel Barzilai,Ohad Shamir
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about \emphmodel collapse: A critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.

[LG-201] Optimal Conformal Prediction under Epistemic Uncertainty

链接: https://arxiv.org/abs/2505.19033
作者: Alireza Javanmardi,Soroush H. Zargarbashi,Santo M. A. R. Thies,Willem Waegeman,Aleksandar Bojchevski,Eyke Hüllermeier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction (CP) is a popular frequentist framework for representing uncertainty by providing prediction sets that guarantee coverage of the true label with a user-adjustable probability. In most applications, CP operates on confidence scores coming from a standard (first-order) probabilistic predictor (e.g., softmax outputs). Second-order predictors, such as credal set predictors or Bayesian models, are also widely used for uncertainty quantification and are known for their ability to represent both aleatoric and epistemic uncertainty. Despite their popularity, there is still an open question on ``how they can be incorporated into CP’'. In this paper, we discuss the desiderata for CP when valid second-order predictions are available. We then introduce Bernoulli prediction sets (BPS), which produce the smallest prediction sets that ensure conditional coverage in this setting. When given first-order predictions, BPS reduces to the well-known adaptive prediction sets (APS). Furthermore, when the validity assumption on the second-order predictions is compromised, we apply conformal risk control to obtain a marginal coverage guarantee while still accounting for epistemic uncertainty.

[LG-202] ALPCAHUS: Subspace Clustering for Heteroscedastic Data

链接: https://arxiv.org/abs/2505.18918
作者: Javier Salazar Cavazos,Jeffrey A Fessler,Laura Balzano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Manuscript submitted to IEEE Transactions on Signal Processing (TSP) pending review

点击查看摘要

Abstract:Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. Various methods have been proposed to extend PCA to the union of subspace (UoS) setting for clustering data that come from multiple subspaces like K-Subspaces (KSS). However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a heteroscedastic-focused subspace clustering method, named ALPCAHUS, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace bases associated with the low-rank structure of the data. This clustering algorithm builds on K-Subspaces (KSS) principles by extending the recently proposed heteroscedastic PCA method, named LR-ALPCAH, for clusters with heteroscedastic noise in the UoS setting. Simulations and real-data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing clustering algorithms. Code available at this https URL.

[LG-203] On the Role of Label Noise in the Feature Learning Process ICML2025

链接: https://arxiv.org/abs/2505.18909
作者: Andi Han,Wei Huang,Zhanpeng Zhou,Gang Niu,Wuyang Chen,Junchi Yan,Akiko Takeda,Taiji Suzuki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise, and rigorously analyze the training dynamics of a two-layer convolutional neural network under this data setup, along with the presence of label noise. Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples (i.e., samples without label noise) while ignoring the noisy ones (i.e., samples with noisy labels). During this stage, the model learns the signal from the clean samples, which generalizes well on unseen data. In Stage II, as the training loss converges, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples. Eventually, the model memorizes the noise present in the noisy samples and degrades its generalization ability. Furthermore, our analysis provides a theoretical basis for two widely used techniques for tackling label noise: early stopping and sample selection. Experiments on both synthetic and real-world setups validate our theory.

[LG-204] Marginal Fairness: Fair Decision-Making under Risk Measures

链接: https://arxiv.org/abs/2505.18895
作者: Fei Huang,Silvana M. Pesenti
类目: Machine Learning (stat.ML); Computational Complexity (cs.CC); Computers and Society (cs.CY); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:

点击查看摘要

Abstract:This paper introduces marginal fairness, a new individual fairness notion for equitable decision-making in the presence of protected attributes such as gender, race, and religion. This criterion ensures that decisions based on generalized distortion risk measures are insensitive to distributional perturbations in protected attributes, regardless of whether these attributes are continuous, discrete, categorical, univariate, or multivariate. To operationalize this notion and reflect real-world regulatory environments (such as the EU gender-neutral pricing regulation), we model business decision-making in highly regulated industries (such as insurance and finance) as a two-step process: (i) a predictive modeling stage, in which a prediction function for the target variable (e.g., insurance losses) is estimated based on both protected and non-protected covariates; and (ii) a decision-making stage, in which a generalized distortion risk measure is applied to the target variable, conditional only on non-protected covariates, to determine the decision. In this second step, we modify the risk measure such that the decision becomes insensitive to the protected attribute, thus enforcing fairness to ensure equitable outcomes under risk-sensitive, regulatory constraints. Furthermore, by utilizing the concept of cascade sensitivity, we extend the marginal fairness framework to capture how dependencies between covariates propagate the influence of protected attributes through the modeling pipeline. A numerical study and an empirical implementation using an auto insurance dataset demonstrate how the framework can be applied in practice.

[LG-205] Non-Stationary Lipschitz Bandits

链接: https://arxiv.org/abs/2505.18871
作者: Nicolas Nguyen,Solenne Gaucher,Claire Vernade
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of non-stationary Lipschitz bandits, where the number of actions is infinite and the reward function, satisfying a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of \mathcal\widetildeO(\tildeL^1/3T^2/3) , where \tildeL is the number of significant shifts and T the horizon. This result provides the first optimal guarantee in this setting.

[LG-206] A physics-guided smoothing method for material modeling with digital image correlation (DIC) measurements

链接: https://arxiv.org/abs/2505.18784
作者: Jihong Wang,Chung-Hao Lee,William Richardson,Yue Yu
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we present a novel approach to process the DIC measurements of multiple biaxial stretching protocols. In particular, we develop a optimization-based approach, which calculates the smoothed nodal displacements using a moving least-squares algorithm subject to positive strain constraints. As such, physically consistent displacement and strain fields are obtained. Then, we further deploy a data-driven workflow to heterogeneous material modeling from these physically consistent DIC measurements, by estimating a nonlocal constitutive law together with the material microstructure. To demonstrate the applicability of our approach, we apply it in learning a material model and fiber orientation field from DIC measurements of a porcine tricuspid valve anterior leaflet. Our results demonstrate that the proposed DIC data processing approach can significantly improve the accuracy of modeling biological materials.

[LG-207] Mind Your Vision: Multimodal Estimation of Refractive Disorders Using Electrooculography and Eye Tracking

链接: https://arxiv.org/abs/2505.18538
作者: Xin Wei,Huakun Liu,Yutaro Hirao,Monica Perusquia-Hernandez,Katsutoshi Masai,Hideaki Uchiyama,Kiyoshi Kiyokawa
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Refractive errors are among the most common visual impairments globally, yet their diagnosis often relies on active user participation and clinical oversight. This study explores a passive method for estimating refractive power using two eye movement recording techniques: electrooculography (EOG) and video-based eye tracking. Using a publicly available dataset recorded under varying diopter conditions, we trained Long Short-Term Memory (LSTM) models to classify refractive power from unimodal (EOG or eye tracking) and multimodal configuration. We assess performance in both subject-dependent and subject-independent settings to evaluate model personalization and generalizability across individuals. Results show that the multimodal model consistently outperforms unimodal models, achieving the highest average accuracy in both settings: 96.207% in the subject-dependent scenario and 8.882% in the subject-independent scenario. However, generalization remains limited, with classification accuracy only marginally above chance in the subject-independent evaluations. Statistical comparisons in the subject-dependent setting confirmed that the multimodal model significantly outperformed the EOG and eye-tracking models. However, no statistically significant differences were found in the subject-independent setting. Our findings demonstrate both the potential and current limitations of eye movement data-based refractive error estimation, contributing to the development of continuous, non-invasive screening methods using EOG signals and eye-tracking data.

[LG-208] Scalable Gaussian Processes with Low-Rank Deep Kernel Decomposition

链接: https://arxiv.org/abs/2505.18526
作者: Yunqin Zhu,Henry Shaowu Yuchi,Yao Xie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kernels are key to encoding prior beliefs and data structures in Gaussian process (GP) models. The design of expressive and scalable kernels has garnered significant research attention. Deep kernel learning enhances kernel flexibility by feeding inputs through a neural network before applying a standard parametric form. However, this approach remains limited by the choice of base kernels, inherits high inference costs, and often demands sparse approximations. Drawing on Mercer’s theorem, we introduce a fully data-driven, scalable deep kernel representation where a neural network directly represents a low-rank kernel through a small set of basis functions. This construction enables highly efficient exact GP inference in linear time and memory without invoking inducing points. It also supports scalable mini-batch training based on a principled variational inference framework. We further propose a simple variance correction procedure to guard against overconfidence in uncertainty estimates. Experiments on synthetic and real-world data demonstrate the advantages of our deep kernel GP in terms of predictive accuracy, uncertainty quantification, and computational efficiency.

[LG-209] Statistical Inference under Performativity

链接: https://arxiv.org/abs/2505.18493
作者: Xiang Li,Yunai Li,Huiying Zhong,Lihua Lei,Zhun Deng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Performativity of predictions refers to the phenomena that prediction-informed decisions may influence the target they aim to predict, which is widely observed in policy-making in social sciences and economics. In this paper, we initiate the study of statistical inference under performativity. Our contribution is two-fold. First, we build a central limit theorem for estimation and inference under performativity, which enables inferential purposes in policy-making such as constructing confidence intervals or testing hypotheses. Second, we further leverage the derived central limit theorem to investigate prediction-powered inference (PPI) under performativity, which is based on a small labeled dataset and a much larger dataset of machine-learning predictions. This enables us to obtain more precise estimation and improved confidence regions for the model parameter (i.e., policy) of interest in performative prediction. We demonstrate the power of our framework by numerical experiments. To the best of our knowledge, this paper is the first one to establish statistical inference under performativity, which brings up new challenges and inference settings that we believe will add significant values to policy-making, statistics, and machine learning.

[LG-210] On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

链接: https://arxiv.org/abs/2505.18455
作者: Fanqi Yan,Huy Nguyen,Dung Le,Pedram Akbarian,Nhat Ho,Alessandro Rinaldo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Fanqi Yan, Huy Nguyen, and Dung Le contributed equally to this work

点击查看摘要

Abstract:The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for learning downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In the paper, we study the convergence rates of the maximum likelihood estimator of gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model, in the sense that we make precise by formulating a novel analytic notion of distinguishability. Under distinguishability of the pre-trained and prompt models, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower due to their dependence on the prompt convergence rate to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.

[LG-211] LocalKMeans: Convergence of Lloyds Algorithm with Distributed Local Iterations

链接: https://arxiv.org/abs/2505.18420
作者: Harsh Vardhan,Heng Zhu,Avishek Ghosh,Arya Mazumdar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we analyze the classical K -means alternating-minimization algorithm, also known as Lloyd’s algorithm (Lloyd, 1956), for a mixture of Gaussians in a data-distributed setting that incorporates local iteration steps. Assuming unlabeled data distributed across multiple machines, we propose an algorithm, LocalKMeans, that performs Lloyd’s algorithm in parallel in the machines by running its iterations on local data, synchronizing only every L of such local steps. We characterize the cost of these local iterations against the non-distributed setting, and show that the price paid for the local steps is a higher required signal-to-noise ratio. While local iterations were theoretically studied in the past for gradient-based learning methods, the analysis of unsupervised learning methods is more involved owing to the presence of latent variables, e.g. cluster identities, than that of an iterative gradient-based algorithm. To obtain our results, we adapt a virtual iterate method to work with a non-convex, non-smooth objective function, in conjunction with a tight statistical analysis of Lloyd steps.

[LG-212] Identifiability of latent causal graphical models without pure children

链接: https://arxiv.org/abs/2505.18410
作者: Seunghyun Lee,Yuqi Gu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper considers a challenging problem of identifying a causal graphical model under the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the latent causal graph. Furthermore, it is common for all observed variables to exhibit the same modality. Consequently, the existing identifiability conditions are often too stringent for complex real-world data. We consider a general nonparametric measurement model with arbitrary observed variable types and binary latent variables, and propose a double triangular graphical condition that guarantees identifiability of the entire causal graphical model. The proposed condition significantly relaxes the popular pure children condition. We also establish necessary conditions for identifiability and provide valuable insights into fundamental limits of identifiability. Simulation studies verify that latent structures satisfying our conditions can be accurately estimated from data.

[LG-213] On the Mechanisms of Weak-to-Strong Generalization: A Theoretical Perspective

链接: https://arxiv.org/abs/2505.18346
作者: Behrad Moniri,Hamed Hassani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Weak-to-strong generalization, where a student model trained on imperfect labels generated by a weaker teacher nonetheless surpasses that teacher, has been widely observed but the mechanisms that enable it have remained poorly understood. In this paper, through a theoretical analysis of simple models, we uncover three core mechanisms that can drive this phenomenon. First, by analyzing ridge regression, we study the interplay between the teacher and student regularization and prove that a student can compensate for a teacher’s under-regularization and achieve lower test error. We also analyze the role of the parameterization regime of the models. Second, by analyzing weighted ridge regression, we show that a student model with a regularization structure more aligned to the target, can outperform its teacher. Third, in a nonlinear multi-index setting, we demonstrate that a student can learn easy, task-specific features from the teacher while leveraging its own broader pre-training to learn hard-to-learn features that the teacher cannot capture.

[LG-214] Online Statistical Inference of Constrained Stochastic Optimization via Random Scaling

链接: https://arxiv.org/abs/2505.18327
作者: Xinchen Du,Wanrong Zhu,Wei Biao Wu,Sen Na
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Statistics Theory (math.ST); Computation (stat.CO)
*备注: 43 pages, 1 figure, 8 tables

点击查看摘要

Abstract:Constrained stochastic nonlinear optimization problems have attracted significant attention for their ability to model complex real-world scenarios in physics, economics, and biology. As datasets continue to grow, online inference methods have become crucial for enabling real-time decision-making without the need to store historical data. In this work, we develop an online inference procedure for constrained stochastic optimization by leveraging a method called Sketched Stochastic Sequential Quadratic Programming (SSQP). As a direct generalization of sketched Newton methods, SSQP approximates the objective with a quadratic model and the constraints with a linear model at each step, then applies a sketching solver to inexactly solve the resulting subproblem. Building on this design, we propose a new online inference procedure called random scaling. In particular, we construct a test statistic based on SSQP iterates whose limiting distribution is free of any unknown parameters. Compared to existing online inference procedures, our approach offers two key advantages: (i) it enables the construction of asymptotically valid confidence intervals; and (ii) it is matrix-free, i.e. the computation involves only primal-dual SSQP iterates (\boldsymbolx_t, \boldsymbol\lambda_t) without requiring any matrix inversions. We validate our theory through numerical experiments on nonlinearly constrained regression problems and demonstrate the superior performance of our random scaling method over existing inference procedures.

[LG-215] Operator Learning for Schrödinger Equation: Unitarity Error Bounds and Time Generalization

链接: https://arxiv.org/abs/2505.18288
作者: Yash Patel,Unique Subedi,Ambuj Tewari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:We consider the problem of learning the evolution operator for the time-dependent Schrödinger equation, where the Hamiltonian may vary with time. Existing neural network-based surrogates often ignore fundamental properties of the Schrödinger equation, such as linearity and unitarity, and lack theoretical guarantees on prediction error or time generalization. To address this, we introduce a linear estimator for the evolution operator that preserves a weak form of unitarity. We establish both upper and lower bounds on the prediction error that hold uniformly over all sufficiently smooth initial wave functions. Additionally, we derive time generalization bounds that quantify how the estimator extrapolates beyond the time points seen during training. Experiments across real-world Hamiltonians – including hydrogen atoms, ion traps for qubit design, and optical lattices – show that our estimator achieves relative errors 10^-2 to 10^-3 times smaller than state-of-the-art methods such as the Fourier Neural Operator and DeepONet.

[LG-216] Preconditioned Langevin Dynamics with Score-Based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems

链接: https://arxiv.org/abs/2505.18276
作者: Lorenzo Baldassari,Josselin Garnier,Knut Solna,Maarten V. de Hoop
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing algorithms for solving high-dimensional Bayesian inverse problems directly in infinite-dimensional function spaces - where such problems are naturally formulated - is crucial to ensure stability and convergence as the discretization of the underlying problem is refined. In this paper, we contribute to this line of work by analyzing a widely used sampler for linear inverse problems: Langevin dynamics driven by score-based generative models (SGMs) acting as priors, formulated directly in function space. Building on the theoretical framework for SGMs in Hilbert spaces, we give a rigorous definition of this sampler in the infinite-dimensional setting and derive, for the first time, error estimates that explicitly depend on the approximation error of the score. As a consequence, we obtain sufficient conditions for global convergence in Kullback-Leibler divergence on the underlying function space. Preventing numerical instabilities requires preconditioning of the Langevin algorithm and we prove the existence and the form of an optimal preconditioner. The preconditioner depends on both the score error and the forward operator and guarantees a uniform convergence rate across all posterior modes. Our analysis applies to both Gaussian and a general class of non-Gaussian priors. Finally, we present examples that illustrate and validate our theoretical findings.

[LG-217] Acoustic and Machine Learning Methods for Speech-Based Suicide Risk Assessment: A Systematic Review

链接: https://arxiv.org/abs/2505.18195
作者: Ambre Marie,Marine Garnier,Thomas Bertin,Laura Machart,Guillaume Dardenne,Gwenolé Quellec,Sofian Berrouiguet
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Preprint version of a manuscript submitted to Computers in Biology and Medicine. 25 pages, 7 figures, 8 tables

点击查看摘要

Abstract:Suicide remains a public health challenge, necessitating improved detection methods to facilitate timely intervention and treatment. This systematic review evaluates the role of Artificial Intelligence (AI) and Machine Learning (ML) in assessing suicide risk through acoustic analysis of speech. Following PRISMA guidelines, we analyzed 33 articles selected from PubMed, Cochrane, Scopus, and Web of Science databases. These studies primarily explored acoustic differences between individuals at risk of suicide (RS) and those not at risk (NRS), and evaluated ML classifier performance. Findings consistently showed significant acoustic feature variations between RS and NRS populations, particularly involving jitter, fundamental frequency (F0), Mel-frequency cepstral coefficients (MFCC), and power spectral density (PSD). Classifier effectiveness varied based on algorithms, modalities, and speech elicitation methods, with multimodal approaches integrating acoustic, linguistic, and metadata features demonstrating superior performance. However, limitations such as methodological variability, small sample sizes, lack of longitudinal data, and limited linguistic and demographic diversity restrict generalizability. Future research should focus on standardizing methods, expanding multimodal analyses, and utilizing larger, diverse datasets to support AI integration in clinical suicide risk assessment.

[LG-218] Generating Realistic Multi-Beat ECG Signals

链接: https://arxiv.org/abs/2505.18189
作者: Paul Pöhl,Viktor Schlegel,Hao Li,Anil Bharath
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: DSP 2025

点击查看摘要

Abstract:Generating synthetic ECG data has numerous applications in healthcare, from educational purposes to simulating scenarios and forecasting trends. While recent diffusion models excel at generating short ECG segments, they struggle with longer sequences needed for many clinical applications. This paper proposes a novel three-layer synthesis framework for generating realistic long-form ECG signals. We first generate high-fidelity single beats using a diffusion model, then synthesize inter-beat features preserving critical temporal dependencies, and finally assemble beats into coherent long sequences using feature-guided matching. Our comprehensive evaluation demonstrates that the resulting synthetic ECGs maintain both beat-level morphological fidelity and clinically relevant inter-beat relationships. In arrhythmia classification tasks, our long-form synthetic ECGs significantly outperform end-to-end long-form ECG generation using the diffusion model, highlighting their potential for increasing utility for downstream applications. The approach enables generation of unprecedented multi-minute ECG sequences while preserving essential diagnostic characteristics.

[LG-219] BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

链接: https://arxiv.org/abs/2505.18185
作者: Qinfan Xiao,Ziyun Cui,Chi Zhang,Siqi Chen,Wen Wu,Andrew Thwaites,Alexandra Woolgar,Bowen Zhou,Chao Zhang
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) and magnetoencephalography (MEG) measure neural activity non-invasively by capturing electromagnetic fields generated by dendritic currents. Although rooted in the same biophysics, EEG and MEG exhibit distinct signal patterns, further complicated by variations in sensor configurations across modalities and recording devices. Existing approaches typically rely on separate, modality- and dataset-specific models, which limits the performance and cross-domain scalability. This paper proposes BrainOmni, the first brain foundation model that generalises across heterogeneous EEG and MEG recordings. To unify diverse data sources, we introduce BrainTokenizer,the first tokenizer that quantises spatiotemporal brain activity into discrete representations. Central to BrainTokenizer is a novel Sensor Encoder that encodes sensor properties such as spatial layout, orientation, and type, enabling compatibility across devices and modalities. Building upon the discrete representations, BrainOmni learns unified semantic embeddings of brain signals by self-supervised pretraining. To the best of our knowledge, it is the first foundation model to support both EEG and MEG signals, as well as the first to incorporate large-scale MEG pretraining. A total of 1,997 hours of EEG and 656 hours of MEG data are curated and standardised from publicly available sources for pretraining. Experiments show that BrainOmni outperforms both existing foundation models and state-of-the-art task-specific models on a range of downstream tasks. It also demonstrates strong generalisation to unseen EEG and MEG devices. Further analysis reveals that joint EEG-MEG (EMEG) training yields consistent improvements across both modalities. Code and model checkpoints will be released upon acceptance.

[LG-220] FRAME-C: A knowledge-augmented deep learning pipeline for classifying multi-electrode array electrophysiological signals

链接: https://arxiv.org/abs/2505.18183
作者: Nisal Ranasinghe,Dzung Do-Ha,Simon Maksour,Tamasha Malepathirana,Sachith Seneviratne,Lezanne Ooi,Saman Halgamuge
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disorder characterized by motor neuron degeneration, with alterations in neural excitability serving as key indicators. Recent advancements in induced pluripotent stem cell (iPSC) technology have enabled the generation of human iPSC-derived neuronal cultures, which, when combined with multi-electrode array (MEA) electrophysiology, provide rich spatial and temporal electrophysiological data. Traditionally, MEA data is analyzed using handcrafted features based on potentially imperfect domain knowledge, which while useful may not fully capture all useful characteristics inherent in the data. Machine learning, particularly deep learning, has the potential to automatically learn relevant characteristics from raw data without solely relying on handcrafted feature extraction. However, handcrafted features remain critical for encoding domain knowledge and improving interpretability, especially with limited or noisy data. This study introduces FRAME-C, a knowledge-augmented machine learning pipeline that combines domain knowledge, raw spike waveform data, and deep learning techniques to classify MEA signals and identify ALS-specific phenotypes. FRAME-C leverages deep learning to learn important features from spike waveforms while incorporating handcrafted features such as spike amplitude, inter-spike interval, and spike duration, preserving key spatial and temporal information. We validate FRAME-C on both simulated and real MEA data from human iPSC-derived neuronal cultures, demonstrating superior performance over existing classification methods. FRAME-C shows over 11% improvement on real data and up to 25% on simulated data. We also show FRAME-C can evaluate handcrafted feature importance, providing insights into ALS phenotypes.

[LG-221] Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025)

链接: https://arxiv.org/abs/2505.18182
作者: Damilare Emmanuel Olatunji,Julius Dona Zannu,Carine Pierrette Mukamakuza,Godbright Nixon Uiso,Mona Mamoun Mubarak Aman,John Bosco Thuo,Chol Buol,Nchofon Tagha Ghogomu,Evelyne Umubyeyi
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: To conduct a systematic assessment of machine learning applications that utilize electrocardiogram (ECG) and heart sound data in the development of cost-effective detection tools for rheumatic heart disease (RHD) from the year 2015 to 2025, thereby supporting the World Heart Federation’s “25 by 25” mortality reduction objective through the creation of alternatives to echocardiography in underserved regions. Methods: Following PRISMA-ScR guidelines, we conducted a comprehensive search across PubMed, IEEE Xplore, Scopus, and Embase for peer-reviewed literature focusing on ML-based ECG/PCG analysis for RHD detection. Two independent reviewers screened studies, and data extraction focused on methodology, validation approaches, and performance metrics. Results: Analysis of 37 relevant studies revealed that convolutional neural networks (CNNs) have become the predominant technology in post-2020 implementations, achieving a median accuracy of 93.7%. However, 73% of studies relied on single-center datasets, only 10.8% incorporated external validation, and none addressed cost-effectiveness. Performance varied markedly across different valvular lesions, and despite 44% of studies originating from endemic regions, significant gaps persisted in implementation science and demographic diversity. Conclusion: While ML-based ECG/PCG analysis shows promise for RHD detection, substantial methodological limitations hinder clinical translation. Future research must prioritize standardized benchmarking frameworks, multimodal architectures, cost-effectiveness assessments, and prospective trials in endemic settings. Significance: This review provides a critical roadmap for developing accessible ML-based RHD screening tools to help bridge the diagnostic gap in resourceconstrained settings where conventional auscultation misses up to 90% of cases and echocardiography remains inaccessible.

[LG-222] Load Forecasting in the Era of Smart Grids: Opportunities and Advanced Machine Learning Models

链接: https://arxiv.org/abs/2505.18170
作者: Aurausp Maneshni
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electric energy is difficult to store, requiring stricter control over its generation, transmission, and distribution. A persistent challenge in power systems is maintaining real-time equilibrium between electricity demand and supply. Oversupply contributes to resource wastage, while undersupply can strain the grid, increase operational costs, and potentially impact service reliability. To maintain grid stability, load forecasting is needed. Accurate load forecasting balances generation and demand by striving to predict future electricity consumption. This thesis examines and evaluates four machine learning frameworks for short term load forecasting, including gradient boosting decision tree methods such as Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM). A hybrid framework is also developed. In addition, two recurrent neural network architectures, Long Short Term Memory (LSTM) networks and Gated Recurrent Units (GRU), are designed and implemented. Pearson Correlation Coefficient is applied to assess the relationships between electricity demand and exogenous variables. The experimental results show that, for the specific dataset and forecasting task in this study, machine learning-based models achieved improved forecasting performance compared to a classical ARIMA baseline.

[LG-223] Dim and Small Target Detection for Drone Broadcast Frames Based on Time-Frequency Analysis

链接: https://arxiv.org/abs/2505.18167
作者: Jie Li,Jing Li,Zhanyu Ju,Fengkui Gong,Lu Lv
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a dim and small target detection algorithm for drone broadcast frames based on the time-frequency analysis of communication protocol. Specifically, by analyzing modulation parameters and frame structures, the prior knowledge of transmission frequency, signal bandwidth, Zadoff-Chu (ZC) sequences, and frame length of drone broadcast frames is established. The RF signals are processed through the designed filter banks, and the frequency domain parameters of bounding boxes generated by the detector are corrected with transmission frequency and signal bandwidth. Given the remarkable correlation characteristics of ZC sequences, the frequency domain parameters of bounding boxes with low confidence scores are corrected based on ZC sequences and frame length, which improves the detection accuracy of dim targets under low signal-to noise ratio (SNR) situations. Besides, a segmented energy refinement method is applied to mitigate the deviation caused by interference signals with high energy strength, which ulteriorly corrects the time domain detection parameters for dim targets. As the sampling duration increases, the detection speed improves while the detection accuracy of broadcast frames termed as small targets decreases. The trade-off between detection accuracy and speed versus sampling duration is established, which helps to meet different drone regulation requirements. Simulation results demonstrate that the proposed algorithm improves the average intersection over union, precision, and recall by 3%, 1.4%, and 2.4%, respectively, compared to existing algorithms. The proposed algorithm also performs strong robustness under varying flight distances, diverse types of environment noise, and different flight visual environment.

[LG-224] Accelerating Battery Material Optimization through iterative Machine Learning

链接: https://arxiv.org/abs/2505.18162
作者: Seon-Hwa Lee,Insoo Ye,Changhwan Lee,Jieun Kim,Geunho Choi,Sang-Cheol Nam,Inchul Park
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 25 pages, 5 figures

点击查看摘要

Abstract:The performance of battery materials is determined by their composition and the processing conditions employed during commercial-scale fabrication, where raw materials undergo complex processing steps with various additives to yield final products. As the complexity of these parameters expands with the development of industry, conventional one-factor-at-a-time (OFAT) experiment becomes old fashioned. While domain expertise aids in parameter optimization, this traditional approach becomes increasingly vulnerable to cognitive limitations and anthropogenic biases as the complexity of factors grows. Herein, we introduce an iterative machine learning (ML) framework that integrates active learning to guide targeted experimentation and facilitate incremental model refinement. This method systematically leverages comprehensive experimental observations, including both successful and unsuccessful results, effectively mitigating human-induced biases and alleviating data scarcity. Consequently, it significantly accelerates exploration within the high-dimensional design space. Our results demonstrate that active-learning-driven experimentation markedly reduces the total number of experimental cycles necessary, underscoring the transformative potential of ML-based strategies in expediting battery material optimization.

信息检索

[IR-0] Measure Domains Gap: A Similar Domain Selection Principle for Multi-Domain Recommendation KDD2025

链接: https://arxiv.org/abs/2505.20227
作者: Yi Wen,Yue Liu,Derong Xu,Huishi Luo,Pengyue Jia,Yiqing Wu,Siwei Wang,Ke Liang,Maolin Wang,Yiqi Wang,Fuzhen Zhuang,Xiangyu Zhao
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD 2025

点击查看摘要

Abstract:Multi-Domain Recommendation (MDR) achieves the desirable recommendation performance by effectively utilizing the transfer information across different domains. Despite the great success, most existing MDR methods adopt a single structure to transfer complex domain-shared knowledge. However, the beneficial transferring information should vary across different domains. When there is knowledge conflict between domains or a domain is of poor quality, unselectively leveraging information from all domains will lead to a serious Negative Transfer Problem (NTP). Therefore, how to effectively model the complex transfer relationships between domains to avoid NTP is still a direction worth exploring. To address these issues, we propose a simple and dynamic Similar Domain Selection Principle (SDSP) for multi-domain recommendation in this paper. SDSP presents the initial exploration of selecting suitable domain knowledge for each domain to alleviate NTP. Specifically, we propose a novel prototype-based domain distance measure to effectively model the complexity relationship between domains. Thereafter, the proposed SDSP can dynamically find similar domains for each domain based on the supervised signals of the domain metrics and the unsupervised distance measure from the learned domain prototype. We emphasize that SDSP is a lightweight method that can be incorporated with existing MDR methods for better performance while not introducing excessive time overheads. To the best of our knowledge, it is the first solution that can explicitly measure domain-level gaps and dynamically select appropriate domains in the MDR field. Extensive experiments on three datasets demonstrate the effectiveness of our proposed method.

[IR-1] HIT Model: A Hierarchical Interaction-Enhanced Two-Tower Model for Pre-Ranking Systems

链接: https://arxiv.org/abs/2505.19849
作者: Haoqiang Yang,Congde Yuan,Kun Bai,Mengzhuo Guo,Wei Yang,Chao Zhou
类目: Information Retrieval (cs.IR)
*备注: 7 pages

点击查看摘要

Abstract:Online display advertising platforms rely on pre-ranking systems to efficiently filter and prioritize candidate ads from large corpora, balancing relevance to users with strict computational constraints. The prevailing two-tower architecture, though highly efficient due to its decoupled design and pre-caching, suffers from cross-domain interaction and coarse similarity metrics, undermining its capacity to model complex user-ad relationships. In this study, we propose the Hierarchical Interaction-Enhanced Two-Tower (HIT) model, a new architecture that augments the two-tower paradigm with two key components: \textitgenerators that pre-generate holistic vectors incorporating coarse-grained user-ad interactions through a dual-generator framework with a cosine-similarity-based generation loss as the training objective, and \textitmulti-head representers that project embeddings into multiple latent subspaces to capture fine-grained, multi-faceted user interests and multi-dimensional ad attributes. This design enhances modeling effectiveness without compromising inference efficiency. Extensive experiments on public datasets and large-scale online A/B testing on Tencent’s advertising platform demonstrate that HIT significantly outperforms several baselines in relevance metrics, yielding a 1.66% increase in Gross Merchandise Volume and a 1.55% improvement in Return on Investment, alongside similar serving latency to the vanilla two-tower models. The HIT model has been successfully deployed in Tencent’s online display advertising system, serving billions of impressions daily. The code is available at this https URL.

[IR-2] Light distillation for Incremental Graph Convolution Collaborative Filtering

链接: https://arxiv.org/abs/2505.19810
作者: X Fan,F Mo,C Chen,H Yamana
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems presently utilize vast amounts of data and play a pivotal role in enhancing user experiences. Graph Convolution Networks (GCNs) have surfaced as highly efficient models within the realm of recommender systems due to their ability to capture extensive relational information. The continuously expanding volume of data may render the training of GCNs excessively costly. To tackle this problem, incrementally training GCNs as new data blocks come in has become a vital research direction. Knowledge distillation techniques have been explored as a general paradigm to train GCNs incrementally and alleviate the catastrophic forgetting problem that typically occurs in incremental settings. However, we argue that current methods based on knowledge distillation introduce additional parameters and have a high model complexity, which results in unrealistic training time consumption in an incremental setting and thus difficult to actually deploy in the real world. In this work, we propose a light preference-driven distillation method to distill the preference score of a user for an item directly from historical interactions, which reduces the training time consumption in the incremental setting significantly without noticeable loss in performance. The experimental result on two general datasets shows that the proposed method can save training time from 1.5x to 9.5x compared to the existing methods and improves Recall@20 by 5.41% and 10.64% from the fine-tune method.

[IR-3] One Model to Rank Them All: Unifying Online Advertising with End-to-End Learning

链接: https://arxiv.org/abs/2505.19755
作者: Junyan Qiu,Ze Wang,Fan Zhang,Zuowu Zheng,Jile Zhu,Jiangke Fan,Teng Zhang,Haitao Wang,Xingxing Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modern industrial advertising systems commonly employ Multi-stage Cascading Architectures (MCA) to balance computational efficiency with ranking accuracy. However, this approach presents two fundamental challenges: (1) performance inconsistencies arising from divergent optimization targets and capability differences between stages, and (2) failure to account for advertisement externalities - the complex interactions between candidate ads during ranking. These limitations ultimately compromise system effectiveness and reduce platform profitability. In this paper, we present UniROM, an end-to-end generative architecture that Unifies online advertising Ranking as One Model. UniROM replaces cascaded stages with a single model to directly generate optimal ad sequences from the full candidate ad corpus in location-based services (LBS). The primary challenges associated with this approach stem from high costs of feature processing and computational bottlenecks in modeling externalities of large-scale candidate pools. To address these challenges, UniROM introduces an algorithm and engine co-designed hybrid feature service to decouple user and ad feature processing, reducing latency while preserving expressiveness. To efficiently extract intra- and cross-sequence mutual information, we propose RecFormer with an innovative cluster-attention mechanism as its core architectural component. Furthermore, we propose a bi-stage training strategy that integrates pre-training with reinforcement learning-based post-training to meet sophisticated platform and advertising objectives. Extensive offline evaluations on public benchmarks and large-scale online A/B testing on industrial advertising platform have demonstrated the superior performance of UniROM over state-of-the-art MCAs.

[IR-4] Improving Recommendation Fairness without Sensitive Attributes Using Multi-Persona LLM s

链接: https://arxiv.org/abs/2505.19473
作者: Haoran Xin,Ying Sun,Chao Wang,Yanke Yu,Weijia Zhang,Hui Xiong
类目: Information Retrieval (cs.IR)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Despite the success of recommender systems in alleviating information overload, fairness issues have raised concerns in recent years, potentially leading to unequal treatment for certain user groups. While efforts have been made to improve recommendation fairness, they often assume that users’ sensitive attributes are available during model training. However, collecting sensitive information can be difficult, especially on platforms that involve no personal information disclosure. Therefore, we aim to improve recommendation fairness without any access to sensitive attributes. However, this is a non-trivial task because uncovering latent sensitive patterns from complicated user behaviors without explicit sensitive attributes can be difficult. Consequently, suboptimal estimates of sensitive distributions can hinder the fairness training process. To address these challenges, leveraging the remarkable reasoning abilities of Large Language Models (LLMs), we propose a novel LLM-enhanced framework for Fair recommendation withOut Sensitive Attributes (LLMFOSA). A Multi-Persona Sensitive Information Inference module employs LLMs with distinct personas that mimic diverse human perceptions to infer and distill sensitive information. Furthermore, a Confusion-Aware Sensitive Representation Learning module incorporates inference results and rationales to develop robust sensitive representations, considering the mislabeling confusion and collective consensus among agents. The model is then optimized by a formulated mutual information objective. Extensive experiments on two public datasets validate the effectiveness of LLMFOSA in improving fairness.

[IR-5] LLM s as Better Recommenders with Natural Language Collaborative Signals: A Self-Assessing Retrieval Approach

链接: https://arxiv.org/abs/2505.19464
作者: Haoran Xin,Ying Sun,Chao Wang,Weijia Zhang,Hui Xiong
类目: Information Retrieval (cs.IR)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Incorporating collaborative information (CI) effectively is crucial for leveraging LLMs in recommendation tasks. Existing approaches often encode CI using soft tokens or abstract identifiers, which introduces a semantic misalignment with the LLM’s natural language pretraining and hampers knowledge integration. To address this, we propose expressing CI directly in natural language to better align with LLMs’ semantic space. We achieve this by retrieving a curated set of the most relevant user behaviors in natural language form. However, identifying informative CI is challenging due to the complexity of similarity and utility assessment. To tackle this, we introduce a Self-assessing COllaborative REtrieval framework (SCORE) following the retrieve-rerank paradigm. First, a Collaborative Retriever (CAR) is developed to consider both collaborative patterns and semantic similarity. Then, a Self-assessing Reranker (SARE) leverages LLMs’ own reasoning to assess and prioritize retrieved behaviors. Finally, the selected behaviors are prepended to the LLM prompt as natural-language CI to guide recommendation. Extensive experiments on two public datasets validate the effectiveness of SCORE in improving LLM-based recommendation.

[IR-6] DocMMIR: A Framework for Document Multi-modal Information Retrieval

链接: https://arxiv.org/abs/2505.19312
作者: Zirui Li,Siwei Wu,Xingyu Wang,Yi Zhou,Yizhi Li,Chenghua Lin
类目: Information Retrieval (cs.IR)
*备注: Comments: 13 pages, 7 figures. Code and data publicly available at this https URL

点击查看摘要

Abstract:The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This results in a +31% improvement in MRR@10 compared to the zero-shot baseline. All our data and code are released in this https URL.

[IR-7] Aligning Web Query Generation with Ranking Objectives via Direct Preference Optimization SIGIR2025

链接: https://arxiv.org/abs/2505.19307
作者: João Coelho,Bruno Martins,João Magalhães,Chenyan Xiong
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR 2025

点击查看摘要

Abstract:Neural retrieval models excel in Web search, but their training requires substantial amounts of labeled query-document pairs, which are costly to obtain. With the widespread availability of Web document collections like ClueWeb22, synthetic queries generated by large language models offer a scalable alternative. Still, synthetic training queries often vary in quality, which leads to suboptimal downstream retrieval performance. Existing methods typically filter out noisy query-document pairs based on signals from an external re-ranker. In contrast, we propose a framework that leverages Direct Preference Optimization (DPO) to integrate ranking signals into the query generation process, aiming to directly optimize the model towards generating high-quality queries that maximize downstream retrieval effectiveness. Experiments show higher ranker-assessed relevance between query-document pairs after DPO, leading to stronger downstream performance on the MS~MARCO benchmark when compared to baseline models trained with synthetic data.

[IR-8] RankLLM : A Python Package for Reranking with LLM s SIGIR2025

链接: https://arxiv.org/abs/2505.19284
作者: Sahel Sharifymoghaddam,Ronak Pradeep,Andre Slavescu,Ryan Nguyen,Andrew Xu,Zijian Chen,Yilin Zhang,Yidi Chen,Jasper Xian,Jimmy Lin
类目: Information Retrieval (cs.IR)
*备注: SIGIR 2025

点击查看摘要

Abstract:The adoption of large language models (LLMs) as rerankers in multi-stage retrieval systems has gained significant traction in academia and industry. These models refine a candidate list of retrieved documents, often through carefully designed prompts, and are typically used in applications built on retrieval-augmented generation (RAG). This paper introduces RankLLM, an open-source Python package for reranking that is modular, highly configurable, and supports both proprietary and open-source LLMs in customized reranking workflows. To improve usability, RankLLM features optional integration with Pyserini for retrieval and provides integrated evaluation for multi-stage pipelines. Additionally, RankLLM includes a module for detailed analysis of input prompts and LLM responses, addressing reliability concerns with LLM APIs and non-deterministic behavior in Mixture-of-Experts (MoE) models. This paper presents the architecture of RankLLM, along with a detailed step-by-step guide and sample code. We reproduce results from RankGPT, LRL, RankVicuna, RankZephyr, and other recent models. RankLLM integrates with common inference frameworks and a wide range of LLMs. This compatibility allows for quick reproduction of reported results, helping to speed up both research and real-world applications. The complete repository is available at this http URL, and the package can be installed via PyPI.

[IR-9] Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data DATE

链接: https://arxiv.org/abs/2505.19274
作者: Manveer Singh Tamber,Suleman Kazi,Vivek Sourabh,Jimmy Lin
类目: Information Retrieval (cs.IR)
*备注: updated version of arXiv:2502.19712

点击查看摘要

Abstract:We investigate improving the retrieval effectiveness of embedding models through the lens of corpus-specific fine-tuning. Prior work has shown that fine-tuning with queries generated using a dataset’s retrieval corpus can boost retrieval effectiveness for the dataset. However, we find that surprisingly, fine-tuning using the conventional InfoNCE contrastive loss often reduces effectiveness in state-of-the-art models. To overcome this, we revisit cross-encoder listwise distillation and demonstrate that, unlike using contrastive learning alone, listwise distillation can help more consistently improve retrieval effectiveness across multiple datasets. Additionally, we show that synthesizing more training data using diverse query types (such as claims, keywords, and questions) yields greater effectiveness than using any single query type alone, regardless of the query type used in evaluation. Our findings further indicate that synthetic queries offer comparable utility to human-written queries for training. We use our approach to train an embedding model that achieves state-of-the-art effectiveness among BERT embedding models. We release our model and both query generation and training code to facilitate further research.

[IR-10] DeepResearchGym: A Free Transparent and Reproducible Evaluation Sandbox for Deep Research

链接: https://arxiv.org/abs/2505.19253
作者: João Coelho,Jingjie Ning,Jingyuan He,Kangrui Mao,Abhijay Paladugu,Pranav Setlur,Jiahe Jin,Jamie Callan,João Magalhães,Bruno Martins,Chenyan Xiong
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports to complex queries. However, most existing frameworks rely on dynamic commercial search APIs, which pose reproducibility and transparency challenges in addition to their cost. To address these limitations, we introduce DeepResearchGym, an open-source sandbox that combines a reproducible search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is freely available for research use. To evaluate deep research systems’ outputs, we extend the Researchy Questions benchmark with automatic metrics through LLM-as-a-judge assessments to measure alignment with users’ information needs, retrieval faithfulness, and report quality. Experimental results show that systems integrated with DeepResearchGym achieve performance comparable to those using commercial APIs, with performance rankings remaining consistent across evaluation metrics. A human evaluation study further confirms that our automatic protocol aligns with human preferences, validating the framework’s ability to help support controlled assessment of deep research systems. Our code and API documentation are available at this https URL.

[IR-11] POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval ICML2025

链接: https://arxiv.org/abs/2505.19189
作者: Yaoyang Liu,Junlin Li,Yinjun Wu,Zhen Chen
类目: Information Retrieval (cs.IR); Databases (cs.DB)
*备注: Published in ICML 2025

点击查看摘要

Abstract:Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance highly depends on how to decompose queries into smaller pieces, say phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, say RAG systems could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternatively optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy. POQD is available at this https URL.

[IR-12] DLF: Enhancing Explicit-Implicit Interaction via Dynamic Low-Order-Aware Fusion for CTR Prediction SIGIR SIGIR’25

链接: https://arxiv.org/abs/2505.19182
作者: Kefan Wang,Hao Wang,Wei Guo,Yong Liu,Jianghao Lin,Defu Lian,Enhong Chen
类目: Information Retrieval (cs.IR)
*备注: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25)

点击查看摘要

Abstract:Click-through rate (CTR) prediction is a critical task in online advertising and recommender systems, relying on effective modeling of feature interactions. Explicit interactions capture predefined relationships, such as inner products, but often suffer from data sparsity, while implicit interactions excel at learning complex patterns through non-linear transformations but lack inductive biases for efficient low-order modeling. Existing two-stream architectures integrate these paradigms but face challenges such as limited information sharing, gradient imbalance, and difficulty preserving low-order signals in sparse CTR data. We propose a novel framework, Dynamic Low-Order-Aware Fusion (DLF), which addresses these limitations through two key components: a Residual-Aware Low-Order Interaction Network (RLI) and a Network-Aware Attention Fusion Module (NAF). RLI explicitly preserves low-order signals while mitigating redundancy from residual connections, and NAF dynamically integrates explicit and implicit representations at each layer, enhancing information sharing and alleviating gradient imbalance. Together, these innovations balance low-order and high-order interactions, improving model expressiveness. Extensive experiments on public datasets demonstrate that DLF achieves state-of-the-art performance in CTR prediction, addressing key limitations of existing models. The implementation is publicly available at this https URL.

[IR-13] Semantic-enhanced Co-attention Prompt Learning for Non-overlapping Cross-Domain Recommendation

链接: https://arxiv.org/abs/2505.19085
作者: Lei Guo,Chenlong Song,Feng Guo,Xiaohui Han,Xiaojun Chang,Lei Zhu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Non-overlapping Cross-domain Sequential Recommendation (NCSR) is the task that focuses on domain knowledge transfer without overlapping entities. Compared with traditional Cross-domain Sequential Recommendation (CSR), NCSR poses several challenges: 1) NCSR methods often rely on explicit item IDs, overlooking semantic information among entities. 2) Existing CSR mainly relies on domain alignment for knowledge transfer, risking semantic loss during alignment. 3) Most previous studies do not consider the many-to-one characteristic, which is challenging because of the utilization of multiple source domains. Given the above challenges, we introduce the prompt learning technique for Many-to-one Non-overlapping Cross-domain Sequential Recommendation (MNCSR) and propose a Text-enhanced Co-attention Prompt Learning Paradigm (TCPLP). Specifically, we capture semantic meanings by representing items through text rather than IDs, leveraging natural language universality to facilitate cross-domain knowledge transfer. Unlike prior works that need to conduct domain alignment, we directly learn transferable domain information, where two types of prompts, i.e., domain-shared and domain-specific prompts, are devised, with a co-attention-based network for prompt encoding. Then, we develop a two-stage learning strategy, i.e., pre-train prompt-tuning paradigm, for domain knowledge pre-learning and transferring, respectively. We conduct extensive experiments on three datasets and the experimental results demonstrate the superiority of our TCPLP. Our source codes have been publicly released.

[IR-14] Lightweight Embeddings with Graph Rewiring for Collaborative Filtering

链接: https://arxiv.org/abs/2505.18999
作者: Xurong Liang,Tong Chen,Wei Yuan,Hongzhi Yin
类目: Information Retrieval (cs.IR)
*备注: Accepted by TOIS’25

点击查看摘要

Abstract:As recommendation services scale rapidly and their deployment now commonly involves resource-constrained edge devices, GNN-based recommender systems face significant challenges, including high embedding storage costs and runtime latency from graph propagations. Our previous work, LEGCF, effectively reduced embedding storage costs but struggled to maintain recommendation performance under stricter storage limits. Additionally, LEGCF did not address the extensive runtime computation costs associated with graph propagation, which involves heavy multiplication and accumulation operations (MACs). These challenges consequently hinder effective training and inference on resource-constrained edge devices. To address these limitations, we propose Lightweight Embeddings with Rewired Graph for Graph Collaborative Filtering (LERG), an improved extension of LEGCF. LERG retains LEGCFs compositional codebook structure but introduces quantization techniques to reduce the storage cost, enabling the inclusion of more meta-embeddings within the same storage. To optimize graph propagation, we pretrain the quantized compositional embedding table using the full interaction graph on resource-rich servers, after which a fine-tuning stage is engaged to identify and prune low-contribution entities via a gradient-free binary integer programming approach, constructing a rewired graph that excludes these entities (i.e., user/item nodes) from propagating signals. The quantized compositional embedding table with selective embedding participation and sparse rewired graph are transferred to edge devices which significantly reduce computation memory and inference time. Experiments on three public benchmark datasets, including an industry-scale dataset, demonstrate that LERG achieves superior recommendation performance while dramatically reducing storage and computation costs for graph-based recommendation services.

[IR-15] Enhancing LLM s Reasoning -Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning

链接: https://arxiv.org/abs/2505.18831
作者: Jinzheng Li,Sibo Ju,Yanzhou Su,Hongguang Li,Yiqing Shen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Existing large language models (LLMs) driven search agents typically rely on prompt engineering to decouple the user queries into search plans, limiting their effectiveness in complex scenarios requiring reasoning. Furthermore, they suffer from excessive token consumption due to Python-based search plan representations and inadequate integration of multimedia elements for both input processing and response generation. To address these challenges, we introduce SearchExpert, a training method for LLMs to improve their multimedia search capabilities in response to complex search queries. Firstly, we reformulate the search plan in an efficient natural language representation to reduce token consumption. Then, we propose the supervised fine-tuning for searching (SFTS) to fine-tune LLM to adapt to these representations, together with an automated dataset construction pipeline. Secondly, to improve reasoning-intensive search capabilities, we propose the reinforcement learning from search feedback (RLSF) that takes the search results planned by LLM as the reward signals. Thirdly, we propose a multimedia understanding and generation agent that enables the fine-tuned LLM to process visual input and produce visual output during inference. Finally, we establish an automated benchmark construction pipeline and a human evaluation framework. Our resulting benchmark, SearchExpertBench-25, comprises 200 multiple-choice questions spanning financial and international news scenarios that require reasoning in searching. Experiments demonstrate that SearchExpert outperforms the commercial LLM search method (Perplexity Pro) by 36.60% on the existing FinSearchBench-24 benchmark and 54.54% on our proposed SearchExpertBench-25. Human evaluations further confirm the superior readability.

[IR-16] MTGR: Industrial-Scale Generative Recommendation Framework in Meituan

链接: https://arxiv.org/abs/2505.18654
作者: Ruidong Han,Bin Yin,Shangyu Chen,He Jiang,Fei Jiang,Xiang Li,Chi Ma,Mincong Huang,Xiaoguang Li,Chunzhen Jing,Yueming Han,Menglei Zhou,Lei Yu,Chuan Liu,Wei Lin
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Scaling law has been extensively validated in many domains such as natural language processing and computer vision. In the recommendation system, recent work has adopted generative recommendations to achieve scalability, but their generative approaches require abandoning the carefully constructed cross features of traditional recommendation models. We found that this approach significantly degrades model performance, and scaling up cannot compensate for it at all. In this paper, we propose MTGR (Meituan Generative Recommendation) to address this issue. MTGR is modeling based on the HSTU architecture and can retain the original deep learning recommendation model (DLRM) features, including cross features. Additionally, MTGR achieves training and inference acceleration through user-level compression to ensure efficient scaling. We also propose Group-Layer Normalization (GLN) to enhance the performance of encoding within different semantic spaces and the dynamic masking strategy to avoid information leakage. We further optimize the training frameworks, enabling support for our models with 10 to 100 times computational complexity compared to the DLRM, without significant cost increases. MTGR achieved 65x FLOPs for single-sample forward inference compared to the DLRM model, resulting in the largest gain in nearly two years both offline and online. This breakthrough was successfully deployed on Meituan, the world’s largest food delivery platform, where it has been handling the main traffic.

[IR-17] he Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems ACL25

链接: https://arxiv.org/abs/2505.18583
作者: Hongru Song,Yu-an Liu,Ruqing Zhang,Jiafeng Guo,Jianming Lv,Maarten de Rijke,Xueqi Cheng
类目: Information Retrieval (cs.IR)
*备注: 18 pages,accepted by ACL25 findings

点击查看摘要

Abstract:We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. We focus on generating human-imperceptible adversarial examples and introduce a novel imperceptible retrieve-to-generate attack against RAG. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top- k candidate set, in order to influence the final answer generation. To address this task, we propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG and continuously refines attack strategies based on relevance-generation-naturalness rewards. Experiments on newly constructed factual and non-factual question-answering benchmarks demonstrate that ReGENT significantly outperforms existing attack methods in misleading RAG systems with small imperceptible text perturbations.

附件下载

点击下载今日全部论文列表